I got paid to work on Open Source #3

urllib3 is an HTTP client written in Python. We just reached 7 billion downloads, 1 million GitHub repositories depend on us, and popular libraries like requests and boto3 use us to perform secure HTTP requests.

We’re currently working on a more modern and secure v2. It has been in the works for two years now, depending on how you count. We've made great progress in the past months thanks to our bounty program. Suddenly we as maintainers became the bottleneck, so we decided to spend one week working full-time with Seth Michael Larson to finally get 2.0.0 out, starting with an alpha version.

Sponsors

Again, I can't emphasize enough how important the money we receive through sponsors is. It enables us to pay contributors and maintainers (and even buy stickers!) to advance the project in a way that would not have been possible even a few years ago. Everything is tracked transparently on Open Collective so that you can see how we spend the money. For example, here's the invoice for my week of development. Thank you!

40 hours of work

So what did I do? From November 7th to November 11th, I was hoping to:

  • write the migration guide for urllib3 v2.0
  • merge #2712, improving the behavior of HTTPResponse.read()
  • merge #2331, which incorporates the requests-toolbelt multipart/form-data encoder and decoder into urllib3.

I started with what initially seemed to be an easy task, HTTPResponse.read(), but it ended up taking the whole week (more on this later). This means I did not get to merge #2331, but that's OK: it is not a breaking change, so we can get to it later.

I also got some minor improvements merged:

  • #2764 removed a useless pylint pragma.
  • #2771 fixed the macOS CI failures that had been plaguing us for weeks, though fixing #2770 would be the proper solution.
  • #2776 removed HTTPConnection.auto_open which we realized was unnecessary. This is a great example of collaboration between Seth and me during the week. He mentioned this during a meeting we had, I offered to help, and a few hours later he had one less thing to worry about!
  • #2783 removed some unreachable code that was still covered by unit tests.
  • #2785 configured more tools in pyproject.toml, as two contributors had incorrectly configured options there, not realizing we still used setup.cfg.
  • #2786 fixed a dead image link.
  • #2797 fixed the build by removing a botocore patch that was merged upstream.

Deprecation warnings

Thanks to the involvement of Thomas Grainger, we had a separate track about deprecation warnings. urllib3 has a long history and is often an early adopter of features, but that means we get a lot of warnings when those features get deprecated.

Now that urllib3 has dropped Python 2.7 support, there are many cases where we can get rid of those deprecations. One such case is the "There is no current event loop" warning that Python 3.10 raised for our usage of Tornado, our test server. To avoid this warning, we had to switch to asyncio.run, introduced in Python 3.7. I opened #2772, and Thomas fixed it by significantly refactoring our code for launching Tornado.
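The shape of that change is roughly as follows (a simplified standalone sketch, not the actual Tornado launcher code, which is more involved):

```python
import asyncio


async def start_server():
    # Placeholder coroutine standing in for starting Tornado.
    return "server started"


# Before: manually fetching the event loop, which Python 3.10 warns
# about with "There is no current event loop" when none is running:
#
#     loop = asyncio.get_event_loop()
#     result = loop.run_until_complete(start_server())
#
# After: asyncio.run() (Python 3.7+) creates a fresh event loop, runs
# the coroutine to completion, and closes the loop when done.
result = asyncio.run(start_server())
print(result)  # server started
```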

Thomas then opened #2790, configuring filterwarnings in pytest to error on deprecation warnings. This forces us to build an allow list in the pytest configuration, making sure that we notice new deprecation warnings. It also gives us a list of existing warnings that we can fix one by one.
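The configuration looks something like this (a minimal sketch in setup.cfg style; the `ignore` entry is a hypothetical example, the real allow list lives in urllib3's repository):

```ini
[tool:pytest]
filterwarnings =
    error
    ignore:ssl.PROTOCOL_TLSv1 is deprecated:DeprecationWarning
```

The first `error` line turns every warning into a test failure; each subsequent `ignore` line allow-lists one known warning until it can be fixed.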

HTTPResponse.read(X)

OK, so I finally get to talk about the main thing I worked on! In urllib3 1.26.x, what do you think the following code prints?

import urllib3

http = urllib3.PoolManager()
r = http.request(
    "GET",
    "https://quentin.pradet.me/",
    preload_content=False,
    headers={"Accept-Encoding": "gzip"}
)
assert isinstance(r, urllib3.HTTPResponse)
print(r.read(20))

It prints... b'', as the read() call did not return any bytes. What happens is that urllib3 reads 20 bytes from the gzip stream, and that is not enough to produce any uncompressed bytes. This is very confusing, but we never touched it because fixing it would have been a breaking change. But that's what 2.0 is about: removing various paper cuts even if doing so is technically breaking. In #2769 we also wondered whether the existing behavior should remain possible in 2.0, and concluded that it should not, as we could not think of a valid use case. Additionally, users can get the old behavior themselves by decompressing outside of urllib3.
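The mechanism is easy to reproduce without urllib3: feed only the first few bytes of a gzip stream to a decompressor and you get nothing back, because those bytes are consumed by the gzip header and the start of the compressed block (a standalone sketch of the behavior, not urllib3's code):

```python
import gzip
import zlib

# Compress some data, as a server sending Content-Encoding: gzip would.
compressed = gzip.compress(b"hello world, " * 1000)

# wbits=31 tells zlib to expect the gzip format.
decompressor = zlib.decompressobj(31)

# Decompressing only the first 20 bytes yields no output: they cover
# the 10-byte gzip header and the beginning of the Huffman tables.
print(decompressor.decompress(compressed[:20]))  # b''

# Feeding the rest of the stream produces the actual data.
rest = decompressor.decompress(compressed[20:])
print(rest[:12])  # b'hello world,'
```

This is exactly what `r.read(20)` did in 1.26.x: 20 bytes off the wire, zero bytes for the caller. Users who want the raw compressed bytes can pass decode_content=False and run a decompressor like the one above themselves.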

The only thing left to do was to review #2712 from Franek Magiera. It raised a few questions that deserved their own issues:

  • Do we flush the decoder when reaching EOF in partial reads? #2799
  • Should we prevent read(decode_content=True) followed by read(decode_content=False)? #2800
  • Should BaseHTTPResponse.readable() return False? #2765

But then, I checked one last thing. Franek had mentioned in passing that he had to disable one test on Windows because it failed with a MemoryError. Having to skip certain tests in certain cases happens all the time, so I looked into it. It turns out that the test was consuming 8GB of memory just to send 2GB of zeros! This was a pretty bad regression that we nearly missed.

In #2787, I experimented to see how I could minimize memory consumption when decoding data, and finally settled on a buffer implemented as a queue of bytes. Using the excellent memray package, I managed to fix the memory consumption issue and even added tests to make sure it never comes back.
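The idea can be sketched like this (a minimal illustration of a queue-of-bytes buffer, not urllib3's actual implementation): decoded chunks are stored as-is, and only the chunk at the front is ever sliced, so memory usage stays close to the amount of data actually buffered.

```python
from collections import deque


class BytesQueueBuffer:
    """Buffer chunks of bytes and serve exactly `n` bytes per read,
    without repeatedly concatenating one ever-growing bytes object."""

    def __init__(self) -> None:
        self.buffer: deque[bytes] = deque()
        self._size = 0

    def __len__(self) -> int:
        return self._size

    def put(self, data: bytes) -> None:
        self.buffer.append(data)
        self._size += len(data)

    def get(self, n: int) -> bytes:
        out = bytearray()
        while len(out) < n and self.buffer:
            chunk = self.buffer.popleft()
            needed = n - len(out)
            if len(chunk) > needed:
                # Split only the front chunk; push the leftover back.
                out += chunk[:needed]
                self.buffer.appendleft(chunk[needed:])
            else:
                out += chunk
        self._size -= len(out)
        return bytes(out)


buf = BytesQueueBuffer()
buf.put(b"abc")
buf.put(b"defgh")
print(buf.get(4))   # b'abcd'
print(buf.get(10))  # b'efgh' (only what's left)
```

Each `get(n)` copies at most `n` bytes plus one chunk split, instead of rebuilding the whole buffer on every read.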

The final result was #2798, which borrowed heavily from #2712 and was finished at the very end of my week.

Thankfully, Seth wrote the migration guide himself, helping me out as I did not have enough time. Thanks Seth :D

I'm on Mastodon!

Comments