If you need to make HTTP requests in Python, you're probably using the fantastically simple and ergonomic
requests library (or
httpx in the async world). However, it's another dependency to manage and keep up-to-date. If you want to make HTTP requests without an additional dependency, or another library author has made that decision for you,
urllib is the standard library tool for you.
urllib is built in to Python, so you can use it in any Python application. Its interface is significantly simpler and more manual than
requests, but it gets the job done:
>>> import urllib.request
>>> with urllib.request.urlopen('http://www.python.org/') as f:
...     print(f.read(100).decode('utf-8'))
...
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
It being built-in to Python presents an unfortunate issue.
urllib is nice enough to add a User Agent header by default to all requests, but because of its prevalence it's usually blocked by any self-respecting Web Application Firewall (WAF).
requests is probably also blocked by firewalls for the same reason. If you are building something which reaches out to external APIs, set a nice user agent!
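If you're curious what that default actually looks like, here's a minimal sketch that inspects a fresh opener without making any requests. The variable name is mine, and the version suffix depends on the Python you're running:

```python
import urllib.request

# A fresh opener carries urllib's default headers; nothing is sent over the network here.
default_headers = urllib.request.build_opener().addheaders

# Expect a single ('User-agent', 'Python-urllib/<major>.<minor>') entry.
print(default_headers)
```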
With requests, it's very simple to add custom headers or change what's there. With
urllib it's more manual, and in some cases not directly possible. The main API (
urllib.request.urlopen) is simply a function which takes a URL, optional POST body and a few configuration parameters for SSL - nothing around headers. If you want to change the request, you can pass a
Request object as the URL parameter instead, which supports custom headers.
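As a rough sketch of that approach (the URL, user agent string and `fetch` helper are placeholders of mine, not anything from a real project), building a Request with a custom header might look like this:

```python
import urllib.request

# Hypothetical URL and user agent string, purely for illustration.
request = urllib.request.Request(
    "https://example.com/",
    headers={"User-Agent": "my application"},
)

# Request normalises header names with str.capitalize(), so it's stored as "User-agent".
assert request.get_header("User-agent") == "my application"


def fetch(req):
    """Send the prepared Request; urlopen accepts it in place of a URL string."""
    with urllib.request.urlopen(req) as response:
        return response.read()
```

Calling fetch(request) would then send the request with the custom header attached. Nothing touches the network until that call.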
However, in some cases this API isn't usable. For example, the
mkdocs.utils.cache.download_and_cache_url function explicitly requires a
str for the
url argument, so I can't use a
Request to change the
User-Agent header. Additionally, this user agent has to be set at each invocation - what if I want to change it globally?
Under the hood,
urlopen uses an "opener director" (or "opener") to perform the actual request. This is where much of the magic happens, and is where the defaults like which headers to send are handled.
The API for openers may seem familiar if you're used to using a
requests.Session object. You create one, and then call
.open much like you would
urlopen. Openers also handle functionality such as basic auth, proxy handling, cookies and more. Fortunately, since there's a class we can modify, we can poke some of its internal attributes, like
addheaders:
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
And there it is, as promised, a custom user agent with
urllib. But we're not quite done yet.
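To make the "call .open much like you would urlopen" part concrete, here's a small sketch. It assumes a data: URL purely so it runs offline; the custom header only matters for real HTTP(S) requests, where you'd pass a normal URL instead:

```python
import urllib.request

opener = urllib.request.build_opener()
opener.addheaders = [("User-Agent", "my application")]

# A data: URL is handled entirely in-process, so no network access is needed.
# Headers aren't sent anywhere here; this only demonstrates the .open call itself.
with opener.open("data:text/plain,hello") as response:
    body = response.read()

print(body)
```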
We've changed the user agent in our own code, but what about someone else's? Sure, we could be good internet citizens and open a PR to add a custom user agent, but the maintainers may not be receptive, or contactable. What if it's already set to a value, but we need to change it to something else?
Wouldn't it be nice if we could mark an opener as being the global opener, and the one used by the global
urlopen function? Well, we can!
To "install" an opener, we use the helpfully-named
install_opener function. This replaces the default opener with one we define, so all future
urlopen calls will use it, regardless of where they're made.
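A minimal sketch of installing an opener, assuming nothing else in the process has installed one already (the user agent string is a placeholder, and the data: URL is only there so the example runs offline):

```python
import urllib.request

opener = urllib.request.build_opener()
opener.addheaders = [("User-Agent", "my application")]

# From here on, plain urllib.request.urlopen() calls in this process go through `opener`.
urllib.request.install_opener(opener)

# urlopen is used exactly as before; callers don't need to know about the opener.
with urllib.request.urlopen("data:text/plain,hello") as response:
    body = response.read()
```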
As with all global variables, this presents a problem - what if someone else has already modified the global opener? Well, we can retrieve the current one from
urllib.request._opener, mutate it as we need and the changes will take effect everywhere.
Who doesn't love a bit of mutable, undocumented shared state?
Presumably to save memory and startup cost,
_opener is
None by default, and is only created the first time it's needed by
urlopen. If we need to override it first, we need to handle this case:
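from urllib.request import build_opener, install_opener, _opener
if _opener is None:
    # No global opener exists yet, so create one and install it
    opener = build_opener()
    install_opener(opener)
else:
    # Reuse (and mutate) whichever opener is already installed
    opener = _opener
opener.addheaders = [('User-Agent', 'my application')]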
And there we have it, a fully-formed method to change the
User-Agent header for
urllib. Now I can return to writing dependency-free scripts which don't get flagged by firewalls!
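If you'd rather wrap this up for reuse, one possible shape is below. The set_global_user_agent name is hypothetical, and unlike the snippet above it keeps any other headers already on the opener rather than replacing the whole list. It still leans on the private _opener attribute, so treat it as a sketch that could break in a future Python release:

```python
import urllib.request


def set_global_user_agent(user_agent):
    """Hypothetical helper: set the User-Agent on urllib's global opener."""
    # _opener is private and undocumented; it is None until an opener is first needed.
    opener = urllib.request._opener
    if opener is None:
        opener = urllib.request.build_opener()
        urllib.request.install_opener(opener)
    # Keep any other default headers and replace only the User-Agent entry.
    kept = [(name, value) for name, value in opener.addheaders
            if name.lower() != "user-agent"]
    opener.addheaders = kept + [("User-Agent", user_agent)]
    return opener
```

Calling set_global_user_agent("my application") once at startup should then cover every later urlopen call, including ones made inside third-party code.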