BeautifulSoup is a very popular HTML parsing library for Python. It's about as popular for HTML parsing as requests is for HTTP requests. BeautifulSoup allows you to provide some HTML markup as a string, and then interrogate and mutate it however you need. Whether you need to find specific tags, add new ones, or extract just the text, BeautifulSoup does it all.
    from bs4 import BeautifulSoup

    # html_doc is a string containing the HTML markup to parse
    soup = BeautifulSoup(html_doc, 'html.parser')

    for link in soup.find_all('a'):
        print(link.get('href'))
    # http://example.com/elsie
    # http://example.com/lacie
    # http://example.com/tillie
Anyone who has used BeautifulSoup knows the main entry point is the self-titled BeautifulSoup class, whose constructor is usually given 2 arguments. The first, fairly obviously, is the HTML data itself. The second, however, is more interesting: it tells BeautifulSoup what "builder" implementation to use for parsing and manipulating the underlying HTML. At the time of writing, it supports 3:
html.parser - Built into Python's standard library. Probably fine for most people
html5lib - Slightly more featureful: it parses pages the way a web browser does, at the cost of speed
lxml - Bindings for libxml2, written in C. Very fast and memory efficient
I personally lean towards using lxml where I can, precisely because of the performance improvements. For quick and dirty uses, html.parser is absolutely fine: it's not like it's slow, and it nicely avoids an extra dependency for what is otherwise a tiny script.
And that's where the problems start: what happens when you don't know which builder you want to use, because you're a sensible library maintainer who doesn't want to impose restrictions on your downstream users? Sure, you could just specify html.parser and continue writing actual features rather than worrying about backing implementations for your HTML parser - but where's the fun in that?
If you omit the second argument, BeautifulSoup complains, but then still chooses one anyway based on what's available:
    >>> BeautifulSoup("<p></p>")
    test.py:3: GuessedAtParserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

    The code that caused this warning is on line 3 of the file test.py. To get rid of this warning, pass the additional argument 'features="html.parser"' to the BeautifulSoup constructor.
If you try this in a REPL, you won't get the warning. The BeautifulSoup developers don't want to annoy you too much.
Whilst it may not seem it, this is a good thing. Ambiguity in production applications is a bad thing, and developers need the ability to ensure their libraries and tools work reliably, no matter the platform or environment. But let's say we know what we're doing, and our usage of BeautifulSoup is simple and robust enough to work with all builders, but we want to use lxml if it's available, and fall back to html.parser if not - how might that work?
This is possible. But to understand how, we need to take a closer look at the BeautifulSoup class, specifically its constructor:
    def __init__(self, markup="", features=None, builder=None,
                 parse_only=None, from_encoding=None, exclude_encodings=None,
                 **kwargs):
Notice the second argument, features. What you specify isn't actually the name of a builder, but its features, or more specifically a list of features.
When BeautifulSoup is first loaded (imported), it checks whether html5lib and lxml are available, and conditionally loads their backing implementations. Each builder has a described list of "features" which it implements, and one of these features is always its own name - which is why specifying just html.parser works as expected.
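The registry that gets populated at import time can be inspected directly. This is a minimal sketch, assuming that `builder_registry` in `bs4.builder`, and the `NAME` and `features` attributes on each builder class, behave as they do in current bs4 4.x releases. They aren't part of the documented public API, so treat this as a way to explore rather than something to depend on. The `describe_builders` helper is hypothetical.

```python
from bs4.builder import builder_registry


def describe_builders():
    """Map each registered builder's name to the features it advertises."""
    return {
        builder_class.NAME: list(builder_class.features)
        for builder_class in builder_registry.builders
    }


# Which builders show up depends on which optional libraries are installed.
for name, features in describe_builders().items():
    print(f"{name}: {features}")
```

With only the standard library available, html.parser should be the only HTML builder listed, and its own name should appear among its features.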
Try it for yourself. If you just specify "html" in a REPL, BeautifulSoup doesn't complain, and happily loads a builder. Here, with neither lxml nor html5lib installed, that's html.parser, because it's the only available builder which implements the html feature. The builder attribute shows which builder is being used, and from the module name we can see which underlying library is behind it.
    >>> from bs4 import BeautifulSoup
    >>> BeautifulSoup("<p></p>", "html")
    <p></p>
    >>> BeautifulSoup("<p></p>", "html").builder
    <bs4.builder._htmlparser.HTMLParserTreeBuilder object at 0x7fad9e1218d0>
Back to the problem at hand - what if we want to optionally use lxml when it's available? Well, then we can additionally specify the "fast" feature (yes, that's really what it's called). When specified, BeautifulSoup will first find all the backends which implement HTML (which is all of them), and then filter those by ones which are "fast". If there aren't any fast ones (because lxml isn't available), it simply ignores that feature request, as if you had only asked for "html":
    >>> BeautifulSoup("<p></p>", ["html", "fast"]).builder
    <bs4.builder._htmlparser.HTMLParserTreeBuilder object at 0x7fad9e1cddb0>
If lxml is installed, you get the lxml builder instead:
    >>> BeautifulSoup("<p></p>", ["html", "fast"]).builder
    <bs4.builder._lxml.LXMLTreeBuilder object at 0x7fd9aa356140>
If instead of lxml, I had installed html5lib, BeautifulSoup would select that instead, because BeautifulSoup prefers it over html.parser.
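The selection behaviour described here can be modelled in a few lines of plain Python. This is a toy sketch, not bs4's actual code: the `select_builder` function, the trimmed-down feature sets and the priority order are all assumptions made for illustration, based only on the behaviour shown in the examples.

```python
# A toy model of feature-based selection. The names, feature sets and
# priority order below are illustrative assumptions, not bs4's real code.
KNOWN_BUILDERS = {
    "html.parser": {"html.parser", "html"},
    "html5lib": {"html5lib", "html"},
    "lxml": {"lxml", "html", "fast"},
}
# Lowest priority first; later entries are preferred when several match.
PRIORITY = ["html.parser", "html5lib", "lxml"]


def select_builder(installed, features):
    """Return the preferred installed builder for the requested features."""
    # Installed builders, most preferred first.
    available = [name for name in reversed(PRIORITY) if name in installed]
    candidates = None
    for feature in features:
        implementers = [n for n in available if feature in KNOWN_BUILDERS[n]]
        if not implementers:
            continue  # nothing installed offers this feature, so ignore it
        if candidates is None:
            candidates = implementers
        else:
            # Keep only candidates that also offer this feature.
            candidates = [n for n in candidates if n in implementers]
    return candidates[0] if candidates else None


print(select_builder({"html.parser"}, ["html", "fast"]))              # html.parser
print(select_builder({"html.parser", "lxml"}, ["html", "fast"]))      # lxml
print(select_builder({"html.parser", "html5lib"}, ["html", "fast"]))  # html5lib
```

The important detail is the `continue`: a feature that no installed builder offers is skipped rather than treated as a failure, which is what makes "fast" work as an optional preference.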
These are the features BeautifulSoup implements natively, but there's no reason you couldn't implement your own using an alternative builder if you wanted to.
HTML is not XML. If you need to parse XML, you'll need to use lxml - even if html.parser works "well enough" for you.
This does raise a problem, but one that's only applicable in some specific cases. Using feature discovery to select the backend means your code no longer knows for certain which backend it's going to get. This is exactly what we wanted, but it means behaviour may differ between environments if the available dependencies differ. If for some reason there's a niche bug in only the html.parser implementation, it can be difficult to diagnose if you have lxml installed, as you'd never see it (a Heisenbug). In reality, this shouldn't really happen, as you should be defining the dependencies of your application specifically, and ensuring they're the same between environments. If, however, you're a library maintainer, chances are the only option is improved test coverage.
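One way a library maintainer might approach that coverage is to run the same behaviour against every HTML builder that happens to be installed and compare the results. This is only a sketch: `extract_links` is a hypothetical stand-in for your own code, and it again assumes `builder_registry` and the `NAME` and `features` attributes are available as in current bs4 4.x releases.

```python
from bs4 import BeautifulSoup
from bs4.builder import builder_registry

MARKUP = '<p><a href="/one">1</a> <a href="/two">2</a></p>'


def installed_html_builders():
    """Names of every installed builder that advertises the "html" feature."""
    return [b.NAME for b in builder_registry.builders if "html" in b.features]


def extract_links(markup, parser):
    """The behaviour under test; a hypothetical stand-in for library code."""
    soup = BeautifulSoup(markup, parser)
    return [a.get("href") for a in soup.find_all("a")]


# Run the same extraction once per installed builder, named explicitly.
results = {name: extract_links(MARKUP, name) for name in installed_html_builders()}
for name, links in results.items():
    print(name, links)
```

Running this in a CI matrix with and without lxml and html5lib installed exercises each backend in turn, so a bug specific to one builder has a chance of showing up before your users find it.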
If you're not pinning your dependencies, then you'll have much bigger problems besides just which backend your HTML parser is using.