Anatomy of a disaster (plus new features!)

Today we experienced the biggest failure of SymbolSource ever, but hopefully from now on, no other issue will even qualify to compete with it.

In keeping with the spirit of openness in the NuGet ecosystem, as exhibited by the NuGet Team in their blog post about their own meltdown (The NuGet Gallery Outage on March 9th), we'd like to write a few words about what happened today.

The positive

But first, let's highlight a few positive things we did recently.

First of all, you might have noticed that we moved our blog comments to Disqus. This was also inspired by the NuGet Team and their blog, and allowed us to get rid of our website's database entirely. Posts are simply served from disk, whereas all other data (users, packages, etc.) is handled by our API, for which the website is merely a frontend. The API is public for anyone to use, by the way, although not documented at the moment. The folks from MyGet have been using it for some time now to enable seamless integration of our services. Let me thank them here for putting in the effort! So far MyGet is the only package repository host that provides symbol and source support through our proven service (well, perhaps not so proven today, but please bear with us).

The second, larger change, after eliminating 50% of our databases, was getting rid of the package processing daemon, further simplifying our architecture. All packages are now processed synchronously during upload, which gives you instant feedback in the pushing client on whether the package failed or not. You can also have a look at the updated status page, which now offers detailed error reports for logged-in package owners. This should largely eliminate the need to run Fiddler to see submission details, which we have been suggesting as a diagnostic for submission failures. Synchronous processing has a downside, however: push times are obviously longer. This will manifest itself as occasional timeouts reported by nuget.exe. A timeout does not mean that the package failed - check the status page. Rest assured that we will work together with the NuGet Team on this matter, perhaps to introduce a client-server progress reporting mechanism of some sort. Together with this work, we'll also try to make errors as comprehensible as possible. Symbol packages are inspected and processed, not treated as-is like regular ones, so submitting them will never be totally pain-free, but we can definitely do better than now. In any case, all the inconveniences of the submission process are there only to ensure that if an upload succeeds, your users are guaranteed an issue-free debugging experience.
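To illustrate the point about timeouts, here is a minimal Python sketch of how a build script might treat them. Everything in it is an assumption made for illustration: `push` stands for whatever invokes nuget.exe in your setup, and `check_status` for however you look the package up on the status page. Neither is part of our API or of nuget.exe.

```python
import time


def push_and_confirm(push, check_status, retries=5, delay=0.0):
    """Push a package; on a client-side timeout, ask the server for the outcome.

    push         -- hypothetical callable that uploads the package and raises
                    TimeoutError when the client gives up waiting
    check_status -- hypothetical callable returning "succeeded", "failed",
                    or None while the package is still being processed
    """
    try:
        push()
        return "succeeded"
    except TimeoutError:
        # The client stopped waiting, but the server may still be processing.
        pass

    for _ in range(retries):
        status = check_status()
        if status is not None:
            return status
        time.sleep(delay)

    # Still no verdict: report that explicitly instead of assuming failure.
    return "unknown"
```

The only design choice worth noting is that a timeout is reported as "unknown" until the server says otherwise, never as a failure, so a script does not blindly push the same package again.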

The negative

The changes I described so far were necessary steps to make the transition to Azure easier, and as this plan matured over the last few days, I will call out the wickedness of inanimate matter, and our server in particular, which somehow sensed its retirement closing in and refused to work today.

In an inexplicable way, the routine process of rotating (i.e. deleting) Elmah error files forced the entire server to come to a grinding halt today. And I am talking about a full-scale doesn't-matter-that-it's-run-under-a-hypervisor meltdown. Each time we started the SymbolSource virtual machine, everything came to a halt, even other virtual machines on that hardware node. Somehow the Linux client-host pair managed to create what we believe was a massive IO storm that slowed everything down to pocket calculator speed.

Since we were in the middle of the Azure migration anyway, we decided to speed it up a bit, and after a really intensive workday it is largely finished. Packages are flowing in again, and we'll do our best to keep it that way. Azure is a great platform that makes that almost painless.

This also ends our experiment, more than two years old, in hosting a major service on Linux and Mono. When we started, Azure wasn't there yet, at least not for the general public. Thinking about hosting our own farm of low-end servers for SymbolSource, we decided to go with the cheapest solution available at the time: Linux with Mono. In hindsight, this was probably a decision that caused us more pain than it gave us benefits, but that's perhaps a topic for a separate post.

The where-we-go-from-now part

We'll focus on stabilizing our migration over the next few days, and then on enhancing nuget.exe as I described earlier.

We are committed to maintaining SymbolSource as THE symbol and source host for NuGet packages, and we are really proud that we started the trend of publishing symbols and sources in a debug-friendly way, which is really catching on lately. Many of the major .NET OSS projects, like Castle and NHibernate, are now pushing to SymbolSource on their own, and we are committed to helping anyone who would like to start doing that too.

We also have some nice new features planned for SymbolSource that we will announce in a future post.

So for now, let me emphasize how sorry we are for failing to deliver a quality service today, and promise that better times lie ahead.

Best regards,

The SymbolSource Team

Posted by Marcin Mikołajczak (TripleEmcoder) on Thursday, April 26, 2012
