Tracking down an old OkHttp regression
While it’s pretty easy to look up bugs and the corresponding bug-fix for open source projects, you rarely see the process that went into it.
I think it’s in the benefit of everyone that we also share the how, not just the what, because at the worst, you just read something you already knew, but at the best, you learn about new approaches/tools/thinking.
I found some spare time to maintain one of my side projects, Graticule.
After bringing the dependencies up to date, I started testing the new build on some older Android phones that I keep laying around (an approach that has caught more bugs than just limiting myself to emulators).
On one of them, specifically an Alcatel OT 918 (Android 2.3.6 / API 10), the SSL handshake failed.
Most likely, it had to do with upgrading OkHttp (which is a really awesome HTTP client, BTW) to the latest version.
Understandably, it’s a pretty old device, and while I run modern cryptography server-side (A+ on Qualys), I haven’t modified the ciphers for a while.
So since it worked with the older version, it didn’t make sense that it would be strictly a device related issue.
Finding the last working version
I reverted git to my previous stable project release and checked the device again to make sure it wasn’t something that I had changed earlier.
Nope, it worked fine with OkHttp 2.2.0.
So that’s one option, leaving it as it is, but that way we can’t take advantage of HTTP/2.0 support introduced in 2.3.0, good to keep as a fail-safe option, but I’d rather upgrade.
Next step was trying to find out if any others have had this issue.
Some relevant matches were found, including issue #1543 on the project page.
Lots of information, but no real solution.
I decided to log a bug-report (#2694) with the project, as I figured they know more about it than I do, or at least they could point me in the right direction.
It gave me a bit better understanding of the problem, but still no solution.
I tried the
ConnectionSpec approach, but that didn’t really work and that made sense, because I don’t have RC4 enabled on my server, since it was now considered a security risk.
At least I had some much needed reading on TLS/ciphers/protocols.
Diving into the project code
I downloaded the source code for 2.2.0 (good) and 2.3.0 (bad) and ran kdiff3 (I’m not a KDE user, but that tool is really great) on the folders to find out what change might have caused this issue.
Phew… well… time to dig in.
Some cipher changes, mostly RC4 dropped (as mentioned in the changelog), nothing that was immediately obvious or helpful.
<a few hours of trying stuff, no luck>
By now I’ve spent half a day on this issue.
I started wondering if it’s really worth the effort in keeping compatibility with such an old version of Android, as it accounted for barely 2% of Android users.
But yeah, for Graticule it does make sense, since it’s a privacy conscious location sharing app, it’s free, it’s not tied to any google services – it makes sense to be able to run it on a cheap secondary phone (one that might not be logged in to a Google account) that you put in your car or whatever, so I wasn’t ready to give up yet.
But it was getting late and I wasn’t being coherent anymore, so it’s time to disconnect, get some sleep and approach it again tomorrow.
After a good night’s sleep, I remembered that git had a wonderful feature called bisect
[…] uses a binary search algorithm to find which commit in your project’s history introduced a bug. You use it by first telling it a “bad” commit that is known to contain the bug, and a “good” commit that is known to be before the bug was introduced. Then git bisect picks a commit between those two endpoints and asks you whether the selected commit is “good” or “bad”. It continues narrowing down the range until it finds the exact commit that introduced the change.
So here we go:
After 20 minutes of building the source code for okhttp, rebuilding Graticule, installing it on the phone, seeing if it works, git bisect good/bad, we finally have something that looks relevant to our issue.
(Another reason to keep your commits atomic, in the sense that it should build/work whatever commit from
master you checkout, because otherwise I wouldn’t have been able to use git bisect)
The commit that changed the behavior was f695ea7, one and a half year ago.
Now there’s a lot less code to compare between the working and the non-working state.
It still wasn’t clear to me which change in that commit, so it’s time for some chocolate.
Chocolate gives me this incredible mental boost.
(Anyone else? I’m curious how common this is; it makes sense since the brain runs on carbs, but I never heard anyone talk about it)
But if I abuse it, I get fat, so I limit my use to when I really need it.
Anyway, 5 minutes later we have the offending change:
It might not look like much, but here’s the difference:
And the server supports:
Keen viewers will notice that
TLS_DHE_RSA_WITH_AES_256_CBC_SHA is the common ground, which is distinctively missing from the enabled ciphers, but present in the supported ciphers.
With a temporary solution in place, I could now open a new bug report (#2698) to start a discussion on what would be a more appropriate permanent fix.
Deep diving into bugs is really useful when there’s sufficient time, as I now have a better understanding of what actually happens in the lower layers of TLS connections.
Although it wasn’t the case for this personal project, I now have something to point people to when they ask: Why don’t project estimates work?
Thanks for reading.
Stack traces for search engines
SSL handshake aborted: ssl=0xffe30: Failure in SSL library, usually a protocol error org.apache.harmony.xnet.provider.jsse.NativeCrypto.SSL_do_handshake(Native Method) org.apache.harmony.xnet.provider.jsse.OpenSSLSocketImpl.startHandshake(OpenSSLSocketImpl.java:505) org.apache.harmony.xnet.provider.jsse.OpenSSLSocketImpl.startHandshake(OpenSSLSocketImpl.java:359) com.squareup.okhttp.Connection.upgradeToTls(Connection.java:242) com.squareup.okhttp.Connection.connect(Connection.java:159) com.squareup.okhttp.Connection.connectAndSetOwner(Connection.java:175) com.squareup.okhttp.OkHttpClient$1.connectAndSetOwner(OkHttpClient.java:120) com.squareup.okhttp.internal.http.HttpEngine.nextConnection(HttpEngine.java:330) com.squareup.okhttp.internal.http.HttpEngine.connect(HttpEngine.java:319) com.squareup.okhttp.internal.http.HttpEngine.sendRequest(HttpEngine.java:241) com.squareup.okhttp.Call.getResponse(Call.java:271) com.squareup.okhttp.Call$ApplicationInterceptorChain.proceed(Call.java:228) com.squareup.okhttp.Call.getResponseWithInterceptorChain(Call.java:199) com.squareup.okhttp.Call.execute(Call.java:79)
D/NativeCrypto: ssl=0x1a5a50 NativeCrypto_SSL_do_handshake fd=0x40631340 shc=0x4075d6d0 timeout=10000 client_mode=1 D/NativeCrypto: doing handshake ++ D/NativeCrypto: doing handshake -- ret=-1 D/NativeCrypto: ssl=0x1a5a50 NativeCrypto_SSL_do_handshake ret=-1 errno=11 sslError=2 timeout=10000 D/NativeCrypto: sslSelect type=READ fd=49 appData=0x1834c8 timeout=10000 D/NativeCrypto: sslSelect READ fd=49 appData=0x1834c8 timeout=10000 => 1 D/NativeCrypto: doing handshake ++ D/NativeCrypto: doing handshake -- ret=-1 D/NativeCrypto: ssl=0x1a5a50 NativeCrypto_SSL_do_handshake ret=-1 errno=0 sslError=1 timeout=10000
javax.net.ssl.SSLProtocolException: SSL handshake aborted: ssl=0x103f40: Failure in SSL library, usually a protocol error error:14077410:SSL routines:SSL23_GET_SERVER_HELLO:sslv3 alert handshake failure (external/openssl/ssl/s23_clnt.c:690 0xad1297ef:0x00000000) org.apache.harmony.xnet.provider.jsse.NativeCrypto.SSL_do_handshake(Native Method) org.apache.harmony.xnet.provider.jsse.OpenSSLSocketImpl.startHandshake(OpenSSLSocketImpl.java:505) org.apache.harmony.xnet.provider.jsse.OpenSSLSocketImpl.startHandshake(OpenSSLSocketImpl.java:359) com.squareup.okhttp.internal.io.RealConnection.connectTls(RealConnection.java:192) com.squareup.okhttp.internal.io.RealConnection.connectSocket(RealConnection.java:149) com.squareup.okhttp.internal.io.RealConnection.connect(RealConnection.java:112) com.squareup.okhttp.internal.http.StreamAllocation.findConnection(StreamAllocation.java:184) com.squareup.okhttp.internal.http.StreamAllocation.findHealthyConnection(StreamAllocation.java:126) com.squareup.okhttp.internal.http.StreamAllocation.newStream(StreamAllocation.java:95) com.squareup.okhttp.internal.http.HttpEngine.connect(HttpEngine.java:281) com.squareup.okhttp.internal.http.HttpEngine.sendRequest(HttpEngine.java:224) com.squareup.okhttp.Call.getResponse(Call.java:286) com.squareup.okhttp.Call$ApplicationInterceptorChain.proceed(Call.java:243) com.squareup.okhttp.Call.getResponseWithInterceptorChain(Call.java:205) com.squareup.okhttp.Call.execute(Call.java:80)