I got paid to work on Open Source #2

urllib3 is a HTTP client written in Python. We get millions of downloads every day, 500k GitHub repositories depend on us, and popular libraries like requests and boto3 use us for performing secure HTTP requests. In short, urllib3 is Python’s most depended-on HTTP client library.

We’re currently working on a v2 version that will be more modern and secure. An important part of security is our TLS support. We had already made progress here:

  • Seth, our lead maintainer, made TLS 1.2+ mandatory by default,
  • Hasan, another urllib3 maintainer, started removing support for common names, a TLS feature that was deprecated 20 years ago.

However, another feature was still missing: using Python’s SSLContext to match hostnames whenever possible. Indeed, Python only added hostname matching in 2011, when Python 2.5 was still supported. Python’s ssl module has improved a lot since then, and we want to rely on it for increased security.

This is a hard problem to tackle, as urllib3 supports many TLS features. We have three TLS backend (Python’s ssl, pyOpenSSL and SecureTransport), support alternate hostnames, fingerprints, SNI, HTTPS proxies and TLS-in-TLS. And most of those features were built when old Python 2 versions were supported, which makes the code complicated. Changing anything here requires a lot of time and energy in order to avoid breaking things. I don’t usually have that time and energy: as a working parent, being able to actually focus on urllib3 for hours is really rare.

Sponsors

I can’t understate how important the sponsoring we receive is. As you'll see below, our sponsors have a very real effect on urllib3. Thank you!

Last week, I had a great opportunity to work on urllib3 as I was teaching a NLP course part-time in the university I graduated from. This left me with about 20 hours to focus on urllib3, for which I will get paid $800 pre-taxes. Note that I'm talking a lot about myself here, but none of that would have been possible without the reviews of Seth M. Larson, our lead maintainer. Thank you, Seth!

100% coverage

One small but important thing that I did was to maintain 100% coverage (#2167, #2174). 100% coverage is underrated, and is often thought to be more effort than it’s worth. While it certainly takes discipline and requires marking specific lines as uncovered, the main benefit is that when a line is not covered, you know that it’s a bug. This helped us find bugs in the past, which is why when coverage is not at 100% I stop what I’m doing and fix it.

I also contacted GitHub Support, which is something that I meant to do for weeks. They helped us understand that our PR checks were all messed up because we had required the codecov check. They don't know yet why this is causing issues, but it certainly was preventing us from merging PRs correctly, as most checks were just showing as "expected" most of the time. Fixing that was instrumental in getting the PRs below merged efficiently.

Preparatory work

With the coverage out of the way, the rest of the focused time I had allowed me to gradually work towards the SSLContext goal:

  • I first improved the documentation of two options in #2169 (documenting things requires understanding them, so that took a lot of time but helped a lot with the other steps)
  • I finished the simplification of check_hostname (started in #2159, finished in #2170). This also was crucial to understand that SSLContext.check_hostname was now always available and that we were just setting it to False
  • I simplified the server_hostname handling code thanks to a Python 3.5 improvement (#2176)
  • I removed an old OpenSSL 0.9.8 check (#2173)

CVE-2021-28363

Note that at that point, I had not achieved much. However, not only this simplified our code a lot, but I also understood it much better than before. This clarity about our code and its limitations allowed me to find CVE-2021-28363, a severe vulnerability for TLS-in-TLS proxy support where we failed to match hostnames in the default case. I would never have found that before because the way we matched hostnames was unclear to me until now.

Jorge, the urllib3 maintainer that added TLS-in-TLS support to urllib3, fixed that quickly, and 1.26.4 is now released with the fix, and already represents 90% of our downloads.

Leaning on SSLContext

Armed with that clarity, I then added the last missing pieces for SSLContext support:

  • I moved one SNI feature to our TLS backends in #2177 (I closed the PR because those commits were merged as part of #2178)
  • I actually relied on SSLContext to match hostnames for us, in the cases where it’s possible, which was difficult since there are four cases where it’s not! (#2178)

I also realized that common name removal wasn’t complete, which I fixed in #2186.

Python's ssl bug

At this point, something interesting happened. We merged #2186 and #2178 separately, as all tests passed for those two PRs. But after the merges, the main branch was failing! Indeed, we were now relying on Python’s hostname_checks_common_name flag to remove support for common names on Python 3.7+. But that did not work! It took some time to realise that I had encountered a bug in Python’s ssl module. This was quite surprising: finding a Python ssl bug is a bit like finding a compiler bug, something that is not supposed to happen!

So that bug is https://bugs.python.org/issue43522, which is going to be fixed for Python 3.10 and will probably going to lead to a fix in OpenSSL itself. On the one hand, this means we won’t be able to lean on SSLContext as much as we would have liked, but on the other hand one of our goal for v2 was increased security, and my 20 hours of work will ultimately result in increasing security for urllib3, Python and OpenSSL. Not bad!

Time and time again, we find that investing in open source is unreasonably effective. Consider investing yourself by sponsoring urllib3 on GitHub Sponsors!

I'm on Mastodon!

Comments