Automatic builder discovery with BeautifulSoup
BeautifulSoup is a very popular HTML parsing library for Python. It's about as popular for HTML parsing as requests is for HTTP requests. BeautifulSoup allows you to provide some HTML markup as a string, and then interrogate and mutate it however you need. Whether you need to find specific tags, add new ones, or extract just the text, BeautifulSoup does it all.
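from bs4 import BeautifulSoup

# html_doc is a string of HTML markup containing the links printed below
soup = BeautifulSoup(html_doc, 'html.parser')
for link in soup.find_all('a'):
    print(link.get('href'))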
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie
Anyone who has used BeautifulSoup knows the main entry point is the self-titled BeautifulSoup class, whose constructor is usually called with 2 arguments. The first, fairly obviously, is the HTML data itself. The second, however, is more interesting: it tells BeautifulSoup what "builder" implementation to use for parsing and manipulating the underlying HTML. At the time of writing, it supports 3:
- html.parser - Built into Python's standard library. Probably fine for most people
- html5lib - Slightly more featureful
- lxml - Bindings for libxml2, written in C. Very fast and memory efficient
I personally lean towards using lxml where I can, precisely because of the performance improvements. For quick and dirty uses, html.parser is absolutely fine. It's not like it's slow, and it nicely avoids an extra dependency for what is otherwise a tiny script.
And that's where the problems start: what happens when you don't know which builder you want to use, because you're a sensible library maintainer who doesn't want to impose restrictions on your downstream users? Sure, you could just specify html.parser and continue writing actual features rather than worrying about backing implementations for your HTML parser - but where's the fun in that?
If you omit the second argument, BeautifulSoup complains, but then still chooses one anyway based on what's available:
>>> BeautifulSoup("<p></p>")
test.py:3: GuessedAtParserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.
The code that caused this warning is on line 3 of the file test.py. To get rid of this warning, pass the additional argument 'features="html.parser"' to the BeautifulSoup constructor.
<note>
If you try this in a REPL, you won't get the warning. The BeautifulSoup developers don't want to annoy you too much.
</note>
Whilst it may not seem it, this is a good thing. Ambiguity in production applications is a bad thing, and developers need the ability to ensure their libraries and tools work reliably, no matter the platform or environment. But let's say we know what we're doing, and our usage of BeautifulSoup is simple and robust enough to work with all builders, but we want to use html5lib or lxml if they're available, and fall back to html.parser if not - how might that work?
This is possible. But to understand how, we need to take a closer look at the BeautifulSoup class, specifically its constructor:
def __init__(self, markup="", features=None, builder=None,
             parse_only=None, from_encoding=None, exclude_encodings=None,
             **kwargs):
Notice the second argument, features. What you specify isn't actually the name of a builder, but a feature, or more specifically a list of features.
When BeautifulSoup is first loaded (imported), it checks whether html5lib and lxml are available, and conditionally loads their backing implementation. Each builder declares a list of "features" which it implements, and one of these features is always its own name - which is why specifying just html.parser works as expected.
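To see those feature lists for yourself, here is a small sketch that prints every builder BeautifulSoup registered at import time. It assumes that bs4.builder exposes a builder_registry object with a builders list, and that each builder class carries NAME and features attributes. That matches the bs4 source as far as I'm aware, but it's an internal detail rather than a documented API. The installed_builders helper is my own name, not part of BeautifulSoup.

```python
from bs4.builder import builder_registry


def installed_builders():
    """Map each registered builder's name to the features it advertises.

    Assumption: builder_registry.builders holds the builder classes that
    were successfully loaded when bs4 was imported.
    """
    return {cls.NAME: sorted(cls.features) for cls in builder_registry.builders}


# Output depends on which optional parsers are installed in this environment.
for name, features in installed_builders().items():
    print(f"{name}: {features}")
```

On a system with only the standard library parser, this should list just html.parser. Installing html5lib or lxml adds further entries.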
Try it for yourself. If you just specify "html", BeautifulSoup doesn't complain, and happily loads a builder. In this case it's html.parser, because it's the only installed builder which implements the html feature. The builder attribute shows which builder is being used, and from the module name we can see which underlying library is being used.
>>> from bs4 import BeautifulSoup
>>> BeautifulSoup("<p></p>", "html")
<p></p>
>>> BeautifulSoup("<p></p>", "html").builder
<bs4.builder._htmlparser.HTMLParserTreeBuilder object at 0x7fad9e1218d0>
Back to the problem at hand - what if we want to optionally use lxml when it's available? Well, then we can additionally specify the "fast" feature (yes, that's really what it's called). When specified, BeautifulSoup will first find all the backends which implement HTML (which is all of them), and then filter those down to the ones which are "fast". If there aren't any fast ones (because lxml isn't available), it simply ignores that feature request, as if you had only asked for html.
>>> BeautifulSoup("<p></p>", ["html", "fast"]).builder
<bs4.builder._htmlparser.HTMLParserTreeBuilder object at 0x7fad9e1cddb0>
But if lxml is installed, you get the lxml backend:
>>> BeautifulSoup("<p></p>", ["html", "fast"]).builder
<bs4.builder._lxml.LXMLTreeBuilder object at 0x7fd9aa356140>
If, instead of lxml, I had installed html5lib, BeautifulSoup would select that, because it prefers html5lib over html.parser.
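For a library, this could be wrapped up in a small helper so the feature list lives in one place. This is only a sketch: make_soup and PREFERRED_FEATURES are hypothetical names of my own, and the only BeautifulSoup behaviour it relies on is passing a list of features, as in the examples above.

```python
from bs4 import BeautifulSoup

# Hypothetical preference list: any HTML builder, ideally a "fast" one.
PREFERRED_FEATURES = ["html", "fast"]


def make_soup(markup):
    """Parse markup with whichever installed builder best matches the features."""
    return BeautifulSoup(markup, features=PREFERRED_FEATURES)


soup = make_soup("<p>Hello</p>")
print(soup.p.get_text())
# Which builder class this prints depends on what is installed.
print(type(soup.builder).__name__)
```

Downstream users then get lxml automatically if they have it installed, without the library having to depend on it.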
These are the features BeautifulSoup implements natively, but there's no reason you couldn't implement your own using an alternative builder if you wanted to.
Feature | html.parser | html5lib | lxml |
---|---|---|---|
fast | ❌ | ❌ | ✅ |
permissive | ❌ | ✅ | ✅ |
strict | ✅ | ❌ | ❌ |
xml | ❌ | ❌ | ✅ |
html | ✅ | ✅ | ✅ |
html5 | ❌ | ✅ | ❌ |
<reminder>
HTML is not XML. If you need to parse XML, you'll need to use lxml - even if html.parser works "well enough" for you.
</reminder>
This does raise a problem, though one that only applies in some specific cases. Using feature discovery to select the backend means your code no longer knows for certain which backend it's going to get. This is exactly what we wanted, but it means behaviour may differ between environments if the installed dependencies differ. If for some reason there's a niche bug in only the html.parser implementation, it can be difficult to diagnose if you have lxml installed, as you'd never see it (a Heisenbug). In reality, this shouldn't really happen, as you should be defining the dependencies of your application specifically, and ensuring they're the same between environments. If, however, you're a library maintainer, chances are the only option is improved test coverage.
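One way that extra coverage might look is to run the same check against every HTML builder that happens to be installed. Again, this is a sketch under assumptions: it leans on bs4.builder.builder_registry and its builders_for_feature mapping, which are internal details of bs4 rather than documented API, and extract_links and html_builder_names are hypothetical helpers standing in for your library's real code.

```python
from bs4 import BeautifulSoup
from bs4.builder import builder_registry

MARKUP = '<p><a href="http://example.com/elsie">Elsie</a></p>'


def extract_links(markup, parser_name):
    """The behaviour under test: collect every href, using a named builder."""
    soup = BeautifulSoup(markup, features=parser_name)
    return [a.get("href") for a in soup.find_all("a")]


def html_builder_names():
    """Names of all installed builders which advertise the "html" feature."""
    return [cls.NAME for cls in builder_registry.builders_for_feature.get("html", [])]


# Every installed HTML builder should produce the same result.
for name in html_builder_names():
    assert extract_links(MARKUP, name) == ["http://example.com/elsie"], name
```

Running this in CI with and without the optional parsers installed gives each backend a chance to show its own quirks.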
If you're not pinning your dependencies, then you'll have much bigger problems besides just which backend your HTML parser is using.