If you need to make HTTP requests in Python, you're probably using the fantastically simple and ergonomic requests library (or httpx in the async world). However, it's another dependency to manage and keep up-to-date. If you want to make HTTP requests without an additional dependency, or another library author has made that decision for you, urllib is the standard library tool for you.
urllib is built in to Python, so you can use it in any Python application. Its interface is significantly more bare-bones and manual than that of requests, but it gets the job done:
>>> import urllib.request
>>> with urllib.request.urlopen('http://www.python.org/') as f:
...     print(f.read(100).decode('utf-8'))
...
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtm
Being built in to Python presents an unfortunate issue, though. urllib is nice enough to add a User-Agent header by default to all requests, but because that default value is so prevalent, it's often blocked by any self-respecting Web Application Firewall (WAF).
<note>
requests is probably also blocked by firewalls for the same reason. If you are building something which reaches out to external APIs, set a nice user agent!
</note>
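To see the default value for yourself, you can inspect a freshly built opener without making any network request. This is only a small sketch; the variable names are my own, and the exact version suffix depends on the Python version you run it with.

```python
import urllib.request

# Build a fresh opener (no network access needed) and read its default headers.
opener = urllib.request.build_opener()
default_headers = dict(opener.addheaders)

# The default value has the form "Python-urllib/<major>.<minor>".
print(default_headers["User-agent"])
```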
With requests, it's very simple to add custom headers or change what's there. With urllib it's more manual, and in some cases not directly possible. The main API (urllib.request.urlopen) is simply a function which takes a URL, an optional POST body, a timeout and a few configuration parameters for SSL - nothing around headers. If you want to change the request, you can pass a Request object as the URL parameter instead, which supports custom headers.
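As a rough sketch of the Request approach: the URL, the user agent string and the fetch helper below are placeholders I've made up for illustration, not anything prescribed by urllib.

```python
import urllib.request

# Hypothetical URL and user agent string, chosen for illustration only.
url = "http://www.example.com/"
request = urllib.request.Request(url, headers={"User-Agent": "my-application/1.0"})

# Request normalises header names with str.capitalize(),
# so the header is stored under the key "User-agent".
print(request.get_header("User-agent"))


def fetch(req: urllib.request.Request) -> bytes:
    """Perform the request; urlopen accepts a Request in place of a URL string."""
    with urllib.request.urlopen(req) as response:
        return response.read()
```

Calling fetch(request) would send the request with the custom header. Note that the header lives on this one Request object, so every call site has to build its own.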
However, in some cases this API isn't usable. For example, mkdocs's mkdocs.utils.cache.download_and_cache_url function explicitly requires a str for the url argument, so I can't use a Request to change the User-Agent header. Additionally, this user agent has to be set at each invocation - what if I want to change it globally?
#Openers
Under the hood, urlopen uses an "opener director" (or "opener") to perform the actual request. This is where much of the magic happens, and is where defaults like which headers to send are handled.
The API for openers may seem familiar if you're used to using a requests.Session object. You create one, and then call .open much like you would urlopen. Openers also handle functionality such as basic auth, proxy handling, cookies and more. Fortunately, since an opener is just an object we can modify, we can poke some of its attributes, like addheaders:
import urllib.request
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
opener.open('http://www.example.com/')
And there it is, as promised, a custom user agent with urllib. But we're not quite done yet.
#Global opener
We've changed the user agent in our own code, but what about someone else's? Sure, we could be good internet citizens and open a PR to add a custom user agent, but the maintainers may not be receptive, or contactable. What if it's already set to a value, but we need to change it to something else?
Wouldn't it be nice if we could mark an opener as being the global opener, and the one used by the global urlopen function? Well, we can!
To "install" an opener, we use the helpfully-named install_opener function. This replaces the default opener with one we define, so all future urlopen calls will use it, regardless of where they're made.
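A minimal sketch of installing an opener might look like this. The user agent string and the fetch helper are made up for illustration; fetch stands in for third-party code that only accepts a plain URL string.

```python
import urllib.request

# Build an opener with a custom user agent (the value is a placeholder).
opener = urllib.request.build_opener()
opener.addheaders = [("User-Agent", "my-application/1.0")]

# Make it the opener that urllib.request.urlopen uses from now on.
urllib.request.install_opener(opener)


def fetch(url: str) -> bytes:
    """Stand-in for library code that only takes a URL string."""
    # No Request object here, yet the installed opener's headers still apply.
    with urllib.request.urlopen(url) as response:
        return response.read()
```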
As with all global variables, this presents a problem - what if someone else has already modified the global opener? Well, we can retrieve the current one from urllib.request._opener, mutate it as we need and the changes will take effect everywhere.
<aside>
Who doesn't love a bit of mutable, undocumented shared state?
</aside>
Presumably to save memory and startup cost, _opener is None by default, and only gets created the first time urlopen is called. If we need to override it before then, we need to handle this case:
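from urllib.request import build_opener, install_opener, _opener

# _opener is a private module attribute; it stays None until urlopen is first used
if _opener is None:
    # Nothing installed yet: create an opener and make it the global one
    opener = build_opener()
    install_opener(opener)
else:
    # Reuse whatever opener has already been installed
    opener = _opener

# Replace the default headers on the global opener
opener.addheaders = [('User-Agent', 'my application')]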
And there we have it, a fully-formed method to change the User-Agent of urllib. Now I can return to writing dependency-free scripts which don't get flagged by firewalls!