Automatic builder discovery with BeautifulSoup

2023-04-14

4 minutes

#programming #python

BeautifulSoup is a very popular HTML parsing library for Python. It's about as popular for HTML parsing as requests is for HTTP requests. BeautifulSoup allows you to provide some HTML markup as a string, and then interrogate and mutate it however you need. Whether to find specific tags, add new ones, or extract just the text - BeautifulSoup does it all.

Python

from bs4 import BeautifulSoup

# html_doc is assumed to hold the HTML to parse, as a string
soup = BeautifulSoup(html_doc, 'html.parser')
for link in soup.find_all('a'):
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie

Anyone who has used BeautifulSoup knows the main entry point is the self-titled BeautifulSoup class, whose constructor is usually given 2 arguments. The first, fairly obviously, is the HTML data itself. The second is more interesting: it tells BeautifulSoup which "builder" implementation to use for parsing and manipulating the underlying HTML. At the time of writing, it supports 3:

html.parser - Built into Python's standard library. Probably fine for most people
html5lib - Implements the HTML5 parsing algorithm, so it handles messy markup the way a browser would, at the cost of speed
lxml - Bindings for libxml2, written in C. Very fast and memory efficient

I personally lean towards using lxml where I can - precisely because of the performance improvements. For quick and dirty uses, html.parser is absolutely fine: it's not like it's slow, and it nicely avoids an extra dependency for what is otherwise a tiny script.

And that's where the problems start: what happens when you don't know which builder you want to use, because you're a sensible library maintainer who doesn't want to impose restrictions on your downstream users? Sure, you could just specify html.parser and continue writing actual features rather than worrying about backing implementations for your HTML parser - but where's the fun in that?

If you omit the second argument, BeautifulSoup complains, but then still chooses one anyway based on what's available:

Python

>>> BeautifulSoup("<p></p>")
test.py:3: GuessedAtParserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

The code that caused this warning is on line 3 of the file test.py. To get rid of this warning, pass the additional argument 'features="html.parser"' to the BeautifulSoup constructor.

<note>

If you try this in a REPL, you won't get the warning. The BeautifulSoup developers don't want to annoy you too much.

</note>

Whilst it may not seem like it, this warning is a good thing. Ambiguity in production applications is a bad thing, and developers need the ability to ensure their libraries and tools work reliably, no matter the platform or environment. But let's say we know what we're doing, and our usage of BeautifulSoup is simple and robust enough to work with all builders. We want to use html5lib or lxml if they're available, and fall back to html.parser if not - how might that work?

This is possible. But to understand how, we need to take a closer look at the BeautifulSoup class, specifically its constructor:

bs4/__init__.py

def __init__(self, markup="", features=None, builder=None,
                 parse_only=None, from_encoding=None, exclude_encodings=None,
                 **kwargs):

Notice the second argument, features. What you specify isn't actually the name of a builder, but a feature the builder must provide, or more specifically a list of features (a single string is treated as a one-item list).

When BeautifulSoup is first loaded (imported), it checks whether html5lib and lxml are available, and conditionally loads their backing implementation. Each builder declares a list of "features" which it implements, and one of these features is always its own name - which is why specifying just html.parser works as expected.

The builder attribute shows which builder is being used, and from the module name we can see which underlying library is being used.

Python

>>> from bs4 import BeautifulSoup
>>> BeautifulSoup("<p></p>", "html")
<p></p>
>>> BeautifulSoup("<p></p>", "html").builder
<bs4.builder._htmlparser.HTMLParserTreeBuilder object at 0x7fad9e1218d0>

Back to the problem at hand - what if we want to optionally use lxml when it's available? Well, then we can additionally specify the "fast" feature (yes, that's really what it's called). When specified, BeautifulSoup will first find all the backends which implement HTML (which is all of them), and then filter those by ones which are "fast". If there aren't any fast ones (because lxml isn't available), it simply ignores that feature request, as if you had only asked for html.

Python

>>> BeautifulSoup("<p></p>", ["html", "fast"]).builder
<bs4.builder._htmlparser.HTMLParserTreeBuilder object at 0x7fad9e1cddb0>

But if lxml is installed, you get the lxml backend:

Python

>>> BeautifulSoup("<p></p>", ["html", "fast"]).builder
<bs4.builder._lxml.LXMLTreeBuilder object at 0x7fd9aa356140>

If instead of lxml, I had installed html5lib, BeautifulSoup would select that: html5lib isn't "fast" either, so the feature is still ignored, but BeautifulSoup prefers html5lib over html.parser.
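To make the selection rule concrete, here is a simplified model of it in plain Python. It is a sketch, not BeautifulSoup's actual code: `FEATURES` and `lookup` are names made up for illustration, the feature sets are copied from the table below, and the preference order (lxml, then html5lib, then html.parser) is an assumption based on the behaviour described above.

```python
# Simplified model of feature-based builder selection (NOT bs4's real code).
# Keys are ordered from most to least preferred (assumed order).
FEATURES = {
    "lxml": {"lxml", "html", "xml", "fast", "permissive"},
    "html5lib": {"html5lib", "html", "html5", "permissive"},
    "html.parser": {"html.parser", "html", "strict"},
}


def lookup(features, installed):
    """Return the most preferred installed builder matching the features.

    A feature that no installed builder offers is skipped rather than
    treated as a failure, which is how "fast" gets ignored without lxml.
    Returns None when nothing matches.
    """
    available = [name for name in FEATURES if name in installed]
    candidates = None
    for feature in features:
        matching = [name for name in available if feature in FEATURES[name]]
        if not matching:
            continue  # nobody installed offers this feature: ignore it
        if candidates is None:
            candidates = matching
        else:
            # Keep only builders that also provide this feature.
            candidates = [name for name in candidates if name in matching]
    return candidates[0] if candidates else None


print(lookup(["html", "fast"], {"html.parser"}))              # html.parser
print(lookup(["html", "fast"], {"html.parser", "lxml"}))      # lxml
print(lookup(["html", "fast"], {"html.parser", "html5lib"}))  # html5lib
```

The three calls mirror the three situations described above: only the standard library parser, lxml installed, and html5lib installed.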

These are the features BeautifulSoup implements natively, but there's no reason you couldn't implement your own using an alternative builder if you wanted to.

Features	html.parser	html5lib	lxml
`fast`	❌	❌	✅
`permissive`	❌	✅	✅
`strict`	✅	❌	❌
`xml`	❌	❌	✅
`html`	✅	✅	✅
`html5`	❌	✅	❌

<reminder>

HTML is not XML. If you need to parse XML, you'll need to use lxml - even if html.parser works "well enough" for you.

</reminder>

Now, about that warning. Sadly, no matter what you do, if you want BeautifulSoup to guess the correct builder, it'll warn you it's a bad idea (which it is). However, if you're like me, and/or are reading this, you probably want to hide the warning. To achieve that, you can add a filter for the specific warning to ignore it.

Python

import warnings
from bs4 import GuessedAtParserWarning

# Ignore parser warnings if they originated from a module starting `website.`
warnings.filterwarnings("ignore", category=GuessedAtParserWarning, module=r"^website\..+")
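The filter above applies for the rest of the process. If you would rather silence the warning only around the calls expected to trigger it, `warnings.catch_warnings()` from the standard library can scope the filter to a block. A minimal sketch: `ignoring`, `DemoParserWarning` and `noisy_parse` are hypothetical names, and the demo warning class stands in for `GuessedAtParserWarning` so the example runs without bs4 installed.

```python
import warnings
from contextlib import contextmanager


@contextmanager
def ignoring(category):
    """Ignore one warning category inside the block; restore filters after."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", category=category)
        yield


# Stand-in for bs4's GuessedAtParserWarning (assumption, for illustration).
class DemoParserWarning(UserWarning):
    pass


def noisy_parse():
    """Hypothetical function that warns the way a guessed parser would."""
    warnings.warn("no parser specified, guessing", DemoParserWarning)
    return "parsed"


with ignoring(DemoParserWarning):
    result = noisy_parse()  # the warning is suppressed only inside this block
```

In real code you would pass `GuessedAtParserWarning` to `ignoring` and construct the `BeautifulSoup` object inside the `with` block.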

This does raise a problem, although only in some specific cases. Using feature discovery to select the backend means your code no longer knows for certain which backend it's going to get. This is exactly what we wanted, but it means behaviour can differ between environments whenever the installed dependencies differ. If there's a niche bug in only the html.parser implementation, it can be difficult to diagnose on a machine with lxml installed, as you'd never see it there (a Heisenbug). In reality, this shouldn't really happen, as you should be defining the dependencies of your application explicitly, and ensuring they're the same between environments. If, however, you're a library maintainer, chances are the only option is improved test coverage, ideally with the tests run against each backend.

If you're not pinning your dependencies, then you'll have much bigger problems besides just which backend your HTML parser is using.
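If the warning or the ambiguity bothers you, an alternative is to do the discovery yourself and pass BeautifulSoup an exact parser name, which should avoid the guess warning because nothing is being guessed. A minimal sketch using only the standard library: `pick_parser` is a hypothetical helper, and the preference order is just this example's default.

```python
from importlib.util import find_spec


def pick_parser(preferred=("lxml", "html5lib"), fallback="html.parser"):
    """Return the first importable parser name, or the fallback."""
    for name in preferred:
        # find_spec returns None when a top-level package isn't installed.
        if find_spec(name) is not None:
            return name
    return fallback


parser_name = pick_parser()
print(parser_name)  # one of: lxml, html5lib, html.parser
# Intended use (requires bs4): BeautifulSoup(markup, parser_name)
```

Because the chosen name is an ordinary string, it is also easy to log, which helps when tracking down a backend-specific bug.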
