" Identity layers " 

Changing the user-agent of urllib

2023-11-28
3 minutes

If you need to make HTTP requests in Python, you're probably using the fantastically simple and ergonomic requests library (or httpx in the async world). However, it's another dependency to manage and keep up-to-date. If you want to make HTTP requests without an additional dependency, or another library author has made that decision for you, urllib is the standard library tool for you.

urllib is built in to Python, so you can use it in any Python application. Its interface is significantly simpler and more manual than requests, but it gets the job done:

Python
>>> import urllib.request
>>> with urllib.request.urlopen('http://www.python.org/') as f:
...    print(f.read(100).decode('utf-8'))

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtm

Being built in to Python presents an unfortunate issue. urllib is nice enough to add a User-Agent header to every request by default (Python-urllib/ followed by the Python version), but because that value is so prevalent, it's often blocked by any self-respecting Web Application Firewall (WAF).
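To check which value a given Python install sends, the default can be read without making a request. This is a minimal sketch, relying on the fact that urllib keeps its default headers on an "opener" object (covered in the Openers section below); the version suffix will vary with the Python version running it.

```python
import urllib.request

# A freshly built opener carries urllib's default headers in addheaders.
default_headers = urllib.request.build_opener().addheaders

# Each entry is a (name, value) tuple; the value looks like "Python-urllib/3.x".
for name, value in default_headers:
    print(f"{name}: {value}")
```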

<note>

requests is probably also blocked by firewalls for the same reasons. If you are building something which reaches out to external APIs, set a nice user agent!

</note>

With requests, it's very simple to add custom headers or change what's there. With urllib it's more manual, and in some cases not directly possible. The main API (urllib.request.urlopen) is simply a function which takes a URL, optional POST body and a few configuration parameters for SSL - nothing around headers. If you want to change the request, you can pass a Request object as the URL parameter instead, which supports custom headers.
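As a rough sketch of the Request approach described above: the user agent string here is a made-up placeholder, and the actual network call is left commented out so the snippet runs offline.

```python
import urllib.request

# Hypothetical user agent string, chosen for illustration only.
USER_AGENT = "my-application/1.0"

# Headers passed to Request apply to this one request only.
request = urllib.request.Request(
    "http://www.example.com/",
    headers={"User-Agent": USER_AGENT},
)

# Request normalises header names with str.capitalize(), hence "User-agent".
print(request.get_header("User-agent"))

# The Request is then passed in place of the URL string (needs network):
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```

Note that the header lives on the Request object, so every call site has to build its own.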

However, in some cases this API isn't usable. For example, mkdocs's mkdocs.utils.cache.download_and_cache_url method explicitly requires a str for the url argument, so I can't use a Request to change the User-Agent header. Additionally, this user agent has to be set at each invocation - what if I want to change it globally?

#Openers

Under the hood, urlopen uses an "opener director" (or "opener") to perform the actual request. This is where much of the magic happens, and is where the defaults like which headers to send are handled.

The API for openers may seem familiar if you're used to using a requests.Session object. You create one, and then call .open much like you would urlopen. Openers also handle functionality such as basic auth, proxy handling, cookies and more. Fortunately, since there's a class we can modify, we can poke some of its internal attributes, like addheaders:

Python
import urllib.request
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
opener.open('http://www.example.com/')

And there it is, as promised, a custom user agent with urllib. But we're not quite done yet.

#Global opener

We've changed the user agent in our own code, but what about someone else's? Sure, we could be good internet citizens and open a PR to add a custom user agent, but the maintainers may not be receptive, or contactable. What if it's already set to a value, but we need to change it to something else?

Wouldn't it be nice if we could mark an opener as being the global opener, and the one used by the global urlopen function? Well, we can!

To "install" an opener, we use the helpfully-named install_opener function. This replaces the default opener with one we define, so all future urlopen calls will use it, regardless of where they're made.
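Installing an opener might look something like the following sketch. The user agent string is a placeholder, and the request itself is commented out so nothing touches the network.

```python
import urllib.request

# Hypothetical user agent string, chosen for illustration only.
opener = urllib.request.build_opener()
opener.addheaders = [("User-agent", "my-application/1.0")]

# From here on, urllib.request.urlopen() routes through this opener,
# including calls made from third-party code in the same process.
urllib.request.install_opener(opener)

# The call site mentions neither the opener nor any headers (needs network):
# with urllib.request.urlopen("http://www.example.com/") as response:
#     print(response.status)
```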

As with all global variables, this presents a problem - what if someone else has already modified the global opener? Well, we can retrieve the current one from urllib.request._opener, mutate it as we need and the changes will take effect everywhere.

<aside>

Who doesn't love a bit of mutable, undocumented shared state?

</aside>

Presumably to save memory and startup cost, _opener is None by default, and is only created the first time urlopen is called. If we need to override it before then, we need to handle this case:

Python
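import urllib.request

# _opener is a private, undocumented module attribute; read it through the
# module so we see its current value rather than a copy bound at import time.
opener = urllib.request._opener

if opener is None:
    # Nothing has been installed yet, so create and install our own opener.
    opener = urllib.request.build_opener()
    urllib.request.install_opener(opener)

# Replaces every default header on the opener, not just the user agent.
opener.addheaders = [('User-Agent', 'my application')]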
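One caveat with the snippet above is that assigning a new list to addheaders discards any other default headers that someone else may have put on the shared opener. A possible variation, wrapped in a hypothetical helper named set_global_user_agent (not part of urllib), swaps out only the user agent entry and leaves the rest alone. It still relies on the private _opener attribute, so treat it as a sketch that could break in a future Python release.

```python
import urllib.request


def set_global_user_agent(user_agent):
    """Set the User-Agent on the global opener, keeping other default headers."""
    # _opener is private and undocumented, so fall back to None if it's missing.
    opener = getattr(urllib.request, "_opener", None)
    if opener is None:
        opener = urllib.request.build_opener()
        urllib.request.install_opener(opener)

    # Drop any existing user agent entry (names compared case-insensitively),
    # then append ours so the other entries in addheaders survive.
    kept = [(name, value) for name, value in opener.addheaders
            if name.lower() != "user-agent"]
    opener.addheaders = kept + [("User-agent", user_agent)]
    return opener


# Hypothetical user agent string, chosen for illustration only.
opener = set_global_user_agent("my-application/1.0")
print(opener.addheaders)
```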

And there we have it, a fully-formed method to change the User-Agent of urllib. Now I can return to writing dependency-free scripts which don't get flagged by firewalls!
