Monday, May 29, 2023

Anatomy of panic over site instability

This morning I had a bit of a sitemageddon, where the MyFlightbook website and service were only barely running for almost two hours.  Woke up, was about to sit down for breakfast at about 7:30 when "ping", I have a text from Google telling me that the site is down.

First step: look at the site monitoring.  Hmm, CPU is fine; that's usually the culprit.  And I can ping the server, so the server is up and running.  But when I try to browse to the site, I get a 503 error ("service unavailable") - yikes.  So I remote desktop in.

Let me back up just a bit: my stack is C# code targeting ASP.NET (with MySQL) on top of IIS (Internet Information Services - the web server) on top of Windows.  And when I remote desktop in to the server, I can see that Windows is up and running fine, but IIS has stopped.  I reset IIS, the site comes back and runs great, but within about 2 minutes, it's dead again.  Lather, rinse, repeat.  Something is wrong here.  

So I reboot the server.  Same behavior. Oh, shit, this is bad.  And - I assume - how can this be me?  IIS is stopping.  If it's my fault, that's like a bug in Excel causing a blue-screen of death in Windows.  Very bad.  And the CPU is spiking (80-100%, when it should be around 10-30%)  One process taking a lot of CPU is Windows Error Reporting, so I do some Googling to find out how to read its error reports and it says...just that the IIS process was being killed.

Yikes.  Meanwhile, I'm still resetting IIS every couple of minutes, and lots of you were getting 503 errors.

Look through EventViewer - I'm seeing two errors, both of which should be totally benign.  One is a bug in the MySQL code that I've been seeing for weeks; it's minor - it fails the first time, succeeds the next time.  The other is an exception in FlightStats.  (I'll come back to that one).  I haven't changed FlightStats code in years, so I ignore that and downgrade the version of MySQL to one that didn't generate the bug, and deploy that change.  That particular bug goes away but...I'm still resetting IIS every 2-3 minutes.

At this point I'm hungry and I haven't had any coffee, and I'm stressed - what on earth is going on?  So I take a deeper look at the FlightStats error.  It's an exception that's being generated because a dictionary lookup that shouldn't fail is failing.  This is code I've had (essentially unchanged) for years - why is it crashing now, and why doesn't it crash on my development machine?

So here is where it finally starts to come together.  Some additional Googling (while I'm still resetting IIS every 2-3 minutes!) reveals that IIS has a (news to me) "Rapid Fail Protection" feature to protect the system wherein if it has 5 exceptions in 5 minutes, it kills the process.  A bit more searching and I find that I can configure this, so I set it to 40 exceptions within 5 minutes and...lo and behold the site is staying up!  

Phew, after 2 hours of manually recycling IIS every 2-3 minutes, I can get some food and coffee into my belly.

But...what happened?  Exceptions are actually a pretty normal thing.  If I "catch" (handle) the exception, usually that's when you see an error message and it's benign, normal, and expected.  Sometimes, an exception arises that I don't catch.  When that happens, I have code that captures the exception information and emails it to me so that I can figure out the issue.  Not uncommon to have a few of these a day (generally all also from perfectly benign scenarios).  But alas, FlightStats is being computed in a background thread, and when the exception happens there, it wasn't being handled, and thus I was bumping into more than 5 exceptions in 5 minutes, and IIS was self-destructing.

That's a lot of background, but it matters here.  What is FlightStats and what was the issue?  On the home page and in a few other places, I show fun stats like "in the last 7 days, MyFlightbook users recorded X flights in Y aircraft to airports in Z countries" and so forth.  But these are slow to compute, so what I do is when they are first requested, I quickly return empty stats but I kick off a background process (that can take a few minutes) to gather the relevant stats, which I then cache for 30 minutes.  It's been this way - flawlessly - for literally years.  But today, that computation on the background thread was crashing, and thus the next time the home page was hit, it would kick off the process again, which would crash, and then...5 crashes in 5 minutes = kill the IIS thread.  And since it's triggered on the home page, that's enough to bump up the crash rate above the threshold.  40 crashes is enough to give me some breathing room, but it's still restarting the process with every crash, which is going to screw some user experiences.

But this worked fine for years, and worked fine on my development machine.  So I point my development machine to the production database (something I almost never do - that's insanely dangerous!) and I can isolate the bug.  

The issue is that part of the stats I retrieve are airport stats, including longest flights and similar, and so I need to dedupe airports.  E.g., "OGG" and "PHOG" are both Maui (IATA and ICAO, respectively), and because the two airports in the databases may come from different sources, the latitude/longitudes might differ slightly (usually only a few meters apart).  To dedupe them, I put them into a dictionary (a lookup structure) where the key is a hash of the latitude/longitude.  That way, two airports that have almost identical locations will have the same key, and thus coalesce.  The key I'm using is a simple concatenation of the latitude and longitude - both rounded to 2 digits.  

And finally, this is where the bug is.  There's a user-defined airport that somebody logged in the last 30 days that has a latitude/longitude of 53.9522222222, -1.1749999999999545.  So the dictionary key for this is (remember, rounded to 2 places) "ALa53.95Lo-1.17".  ("A" for airport, "La" for latitude, "Lo" for longitude).  I put the airport into the dictionary with that key, and (in theory), I should be able to read it back with the same key.

But alas, the code was crashing because when I tried to read it back with the same key, it wasn't in the dictionary and threw an exception.  Notice all of those 9's" in the longitude?  Rounding that to 2 digits properly rounds to -1.17, but there were some circumstances (I still needed to debug) where it would round in the key as -1.18.  Alas, if you put something into the dictionary with a key of "ALa53.95Lo-1.17" and try to read it back with "ALa53.95Lo-1.18", it fails.

The very quick fix was to adjust the airport to use a longitude of -1.17498.  Wahoo!  No more IIS recycling; I can reset the Rapid Fail Protection to 5 errors in 5 minutes.

The next band-aid was to deploy a fix to production that checks for whether an item is in the dictionary before attempting to read it and doing something innocuous if it's not there.  So more rounding errors can't cause this again.  At this point, it's been almost 4 hours since the site started acting up, so that was good for me for the day and I went for my walk.

My next task - which I can do at a more leisurely pace (thankfully!) - is to figure out why -1.1749999999999545 is somehow becoming -1.18.  My best hypothesis - totally unconfirmed as yet - is that -1.1749999999999545 is being rounded somewhere to -1.175, and *that* (correctly) rounds to -1.18.  I am tracking that bug and will look into it later this week.

So in the end, it's a data issue!  (Well, my failure to handle it was obviously the bigger issue, but it was something that could be fixed with a simple data change!!)

UPDATE: 

I was correct that somewhere something was getting rounded from -1.174999… to -1.175

Here’s what’s happening: The FlightStats process is using each airport to create an “AirportStats” object.  When it does this, the AirportStats does a deep copy of the airport, using a utility that just blindly enumerates the airport’s read/write properties and writes them to the AirportStats object.  So far so good

But for old legacy reasons, the airport object has a Latitude (String) and Longitude (String) property (in addition to a decimal LatLong property), and these use 8 digits of precision; getting one of these values returns the underlying decimal value as a string formatted to 8 decimal places, and if you set it, it sets the underlying decimal value to the converted string.  I don’t use these properties anywhere, but the deep copy sees them and dutifully reads from the source and writes to the destination.  While the order of copying is undefined, it looks like it happens to be copying the string values after copying the underlying decimal values, and thus it overwrites.  And that is where -1.1749999999999545 becomes -1.175.

I am storing into the dictionary using the key generated on the source airport (-1.174999...), but later when I retrieve the value I am doing so based on the AirportStats object's key.  But alas, the AirportStats object – due to this copy functionality – now has “-1.175”.  And that – quite correctly – rounds to -1.18.  Later in the code, when I’m reading from the dictionary, I’m using the AirportStats’ key, which is now different.

Trivial fix: since Latitude (String) and Longitude (String) are legacy/deprecated, I’m simply commenting out the code to set them.   You can still read display strings of the underlying latitude/longitude (extracted from the latlong object), but writing it is a no-op.  I could *probably* remove these properties altogether, but there’s a slight possibility of breaking a mobile app, so I’m going to leave them there.