urllib3 is an HTTP client written in Python. We just reached 7 billion downloads, 1 million GitHub repositories depend on us, and popular libraries like requests and boto3 use us to perform secure HTTP requests.
We’re currently working on a v2 that is more modern and secure. It has been in the works for two years now, depending on how you count. We've made great progress in the past months thanks to our bounty program. Suddenly we as maintainers became the bottleneck, so we decided to spend one week working full time with Seth Michael Larson to finally get 2.0.0 out, starting with an alpha version.
Sponsors
Again, I can't emphasize enough how important the money we receive through sponsors is. It enables us to pay contributors and maintainers (and even buy stickers!) to advance the project in a way that would not have been possible even a few years ago. Everything is tracked transparently on Open Collective so that you can see how we spend our money. For example, here's the invoice for my week of development. Thank you!
40 hours of work
So what did I do? From November 7th to November 11th, I was hoping to:
- write the migration guide for urllib3 v2.0
- merge #2712, improving the behavior of HTTPResponse.read()
- merge #2331, which incorporates the requests-toolbelt multipart/form-data encoder and decoder into urllib3.
I started with what initially seemed to be an easy task, HTTPResponse.read(), but it ended up taking the whole week (more on this later). This means I did not get to merge #2331, but that's OK: it is not a breaking change, so we can get to it later.
I also got some minor improvements merged:
- #2764 removed a useless pylint pragma.
- #2771 fixed the macOS CI that had been plaguing us for weeks. Fixing #2770 would be the proper fix though.
- #2776 removed HTTPConnection.auto_open, which we realized was unnecessary. This is a great example of collaboration between Seth and me during the week. He mentioned this during a meeting we had, I offered to help, and a few hours later he had one less thing to worry about!
- #2783 removed some unreachable code that was still covered by unit tests
- #2785 configured more projects in pyproject.toml as two contributors incorrectly configured options there, not realizing we still used setup.cfg.
- #2786 fixed a dead image link
- #2797 fixed the build by removing a botocore patch that was merged upstream.
Deprecation warnings
Thanks to the involvement of Thomas Grainger, we had a separate track about deprecation warnings. urllib3 has a long history and is often an early adopter of features, but that means we get a lot of warnings when those features get deprecated.
Now that urllib3 dropped Python 2.7 support, there are many cases where we can get rid of those deprecations. One such case is the "There is no current event loop" Python 3.10 warning that was issued with our usage of Tornado, our test server. To avoid this warning, we had to switch to using asyncio.run, introduced in Python 3.7. I opened #2772, and Thomas fixed it by significantly refactoring our code for launching Tornado.
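As a rough illustration of the asyncio.run pattern, here is a minimal sketch that runs a coroutine in a background thread, the way a test server might be launched. It uses only the standard library: handle_requests and run_in_thread are hypothetical names standing in for the real server code, and this is not the actual change made in urllib3.

```python
import asyncio
import threading


async def handle_requests(count: int) -> int:
    """Hypothetical stand-in for a test server's main coroutine."""
    handled = 0
    for _ in range(count):
        await asyncio.sleep(0)  # yield to the event loop, as real I/O would
        handled += 1
    return handled


def run_in_thread(count: int) -> int:
    """Run the coroutine in a background thread and return its result."""
    results = []

    def target() -> None:
        # asyncio.run creates a fresh event loop, runs the coroutine to
        # completion, then closes the loop. No call to get_event_loop()
        # is needed, so no "There is no current event loop" warning.
        results.append(asyncio.run(handle_requests(count)))

    thread = threading.Thread(target=target)
    thread.start()
    thread.join()
    return results[0]


handled = run_in_thread(3)
print(handled)
```

Because the loop is created and closed inside asyncio.run, the thread never relies on an implicit "current" event loop.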
Thomas then opened #2790, configuring filterwarnings in pytest to error on deprecation warnings. This forces us to build an allow list in the pytest configuration, making sure that we notice new deprecation warnings. It also gives us a list of existing warnings that we can fix one by one.
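pytest's filterwarnings option builds on Python's warnings filters, so the "error by default, plus an allow list" idea can be sketched with the standard warnings module alone. The two functions below are hypothetical examples, and this is not the configuration from #2790.

```python
import warnings


def old_api() -> str:
    """Hypothetical function emitting a deprecation warning we have not allowed."""
    warnings.warn("old_api() is deprecated", DeprecationWarning, stacklevel=2)
    return "old"


def known_noisy_api() -> str:
    """Hypothetical function emitting a warning that is on the allow list."""
    warnings.warn("There is no current event loop", DeprecationWarning, stacklevel=2)
    return "noisy"


def run_with_strict_warnings():
    with warnings.catch_warnings():
        # Turn every DeprecationWarning into an exception...
        warnings.simplefilter("error", DeprecationWarning)
        # ...except the ones on the allow list. Filters added later are
        # checked first, so this entry wins over the "error" rule above.
        warnings.filterwarnings(
            "ignore",
            message="There is no current event loop",
            category=DeprecationWarning,
        )
        allowed = known_noisy_api()  # allowed: no exception
        try:
            old_api()  # not on the allow list: raises
        except DeprecationWarning as exc:
            raised = str(exc)
        else:
            raised = None
    return allowed, raised


allowed, raised = run_with_strict_warnings()
print(allowed, raised)
```

Each entry removed from the allow list then corresponds to one existing warning that has been fixed for good.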
HTTPResponse.read(X)
OK, so I finally get to talk about the main thing I worked on! In urllib3 1.26.x, what do you think the following code prints?
import urllib3

http = urllib3.PoolManager()
r = http.request(
    "GET",
    "https://quentin.pradet.me/",
    preload_content=False,
    headers={"Accept-Encoding": "gzip"},
)
assert isinstance(r, urllib3.HTTPResponse)
print(r.read(20))
It prints... b'' as the read() call did not return any bytes. What happens is that urllib3 read 20 bytes from the gzip stream, and that was not enough to produce any uncompressed bytes. This is very confusing, but we never touched it because it would have been a breaking change. But that's what 2.0 is about: removing various paper cuts even if doing so is technically breaking. We also wondered in #2769 if the existing behavior should still be possible in 2.0, and concluded that it should not, given that we could not think of a valid use case. Additionally, users could support it themselves by decompressing outside of urllib3.
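The difference between the two behaviors can be sketched with the standard library only. The naive approach decodes a fixed number of compressed bytes and returns whatever comes out, which may be very little or nothing. The looping approach keeps feeding compressed chunks to the decoder until the requested number of decoded bytes is available. The read_decoded helper, the payload, and the 20-byte chunk size are assumptions made for this illustration, not urllib3's implementation, and the sketch leaves out flushing the decoder at EOF.

```python
import gzip
import io
import zlib


def read_decoded(raw, decoder, buffer, amt, chunk_size=20):
    """Return up to `amt` decoded bytes, reading more compressed input as needed."""
    while len(buffer) < amt:
        chunk = raw.read(chunk_size)
        if not chunk:  # compressed stream exhausted
            break
        buffer += decoder.decompress(chunk)
    out = bytes(buffer[:amt])
    del buffer[:amt]  # keep any surplus decoded bytes for the next call
    return out


payload = b"".join(b"line %d of the response body\n" % i for i in range(500))
compressed = gzip.compress(payload)

# Naive approach: decode only the first 20 compressed bytes.
# 16 + zlib.MAX_WBITS tells zlib to expect gzip framing.
naive = zlib.decompressobj(16 + zlib.MAX_WBITS).decompress(compressed[:20])

# Looping approach: keep reading until 20 decoded bytes are available.
raw = io.BytesIO(compressed)
decoder = zlib.decompressobj(16 + zlib.MAX_WBITS)
buffer = bytearray()
first = read_decoded(raw, decoder, buffer, 20)
second = read_decoded(raw, decoder, buffer, 20)

print(len(naive), first, second)
```

With the looping approach, the amount passed to the read call describes decoded bytes, which is what a caller most likely expects.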
The only thing left to do was to review #2712 from Franek Magiera. It raised a few questions that deserved their own issues:
- Do we flush the decoder when reaching EOF in partial reads? #2799
- Should we prevent read(decode_content=True) followed by read(decode_content=False)? #2800
- Should BaseHTTPResponse.readable() return False? #2765
But then, I checked one last thing. Franek mentioned in passing that he had to disable one test on Windows that complained about a MemoryError. Having to skip certain tests in certain cases happens all the time, so I looked into it. And it turns out that the test was consuming 8GB of memory just to send 2GB of zeros! This was a pretty bad regression that we nearly missed.
In #2787, I experimented to see how I could minimize memory consumption when decoding data, and finally settled on a buffer made of a queue of bytes. Using the excellent memray package, I managed to fix the memory consumption issue and even added tests to make sure it never comes back.
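To give an idea of what "a buffer made of a queue of bytes" can look like, here is a minimal sketch built on collections.deque. The BytesQueue class and its put/get methods are hypothetical names for this illustration and should not be read as the code merged into urllib3. The point is that decoded chunks are stored as they arrive instead of being concatenated into one ever-growing bytes object, so a read only copies the bytes it returns plus, at most, the remainder of one chunk.

```python
import collections
import io


class BytesQueue:
    """Hypothetical FIFO buffer of byte chunks with partial reads."""

    def __init__(self) -> None:
        self._chunks = collections.deque()
        self._size = 0

    def __len__(self) -> int:
        return self._size

    def put(self, data: bytes) -> None:
        """Append a chunk without copying what is already buffered."""
        if data:
            self._chunks.append(data)
            self._size += len(data)

    def get(self, n: int) -> bytes:
        """Remove and return up to `n` bytes from the front of the queue."""
        if n < 0:
            raise ValueError("n must be non-negative")
        out = io.BytesIO()
        remaining = min(n, self._size)
        while remaining > 0:
            chunk = self._chunks.popleft()
            if len(chunk) > remaining:
                # Split the chunk: return the head, push the tail back.
                out.write(chunk[:remaining])
                self._chunks.appendleft(chunk[remaining:])
                self._size -= remaining
                remaining = 0
            else:
                out.write(chunk)
                self._size -= len(chunk)
                remaining -= len(chunk)
        return out.getvalue()


queue = BytesQueue()
queue.put(b"abc")
queue.put(b"defg")
head = queue.get(5)
tail = queue.get(10)
print(head, tail, len(queue))
```

A memory profiler such as memray can then confirm that peak usage stays close to the size of the data actually buffered.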
The final result was #2798, which borrowed heavily from #2712 and was finished at the very end of my week.
Thankfully, Seth wrote the migration guide, helping me out as I would not have had enough time. Thanks Seth :D