Outage Postmortem: February 20

johncs · February 20, 2020, 10:18pm

Shmeppy went down today for around 12 hours, most of them late at night. Unfortunately I didn’t know it was down for almost all of that time. Once I found out, I was able to fix it within a few minutes.

So let’s talk about two things: monitoring (which would let me know when Shmeppy goes down) and why Shmeppy crashed.

Monitoring

Unfortunately I have almost no systems for automatically detecting when Shmeppy crashes and becomes unavailable. If I did, those systems could have sent loud messages to my phone and I could have resolved the issue immediately.

Despite this outage, I’m not going to rush to add such systems immediately. I want to focus on the things I’ve prioritized in the February 2020 Roadmap. However, some basic monitoring would be easy enough to build, so I’ll try to work that into the roadmap for next month.

Until then, please ping me by name on Discord if you see that Shmeppy has gone down. That will send an alert to my phone. You can also email support@shmeppy.com, which goes directly to my phone.

Thank you to Wobble Waffle in Discord who posted that Shmeppy was down. It would have taken me even longer to notice if they hadn’t done that.
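
For readers wondering what "basic monitoring" might involve, here is a minimal sketch in Python. It is an illustration under assumptions, not a description of anything Shmeppy runs: the `DowntimeMonitor` class, the `is_up` helper, the three-failure threshold, and the idea of polling a URL are all hypothetical choices. The `alert` callback stands in for whatever would actually reach a phone (a Discord ping, an email, an SMS service).

```python
import http.client
import time
import urllib.request


def is_up(url, timeout=10.0):
    """Return True if the URL answers with a 2xx/3xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (OSError, http.client.HTTPException):
        # Covers connection refused, DNS failure, timeouts, and HTTP 4xx/5xx
        # (urllib raises HTTPError, an OSError subclass, for error statuses).
        return False


class DowntimeMonitor:
    """Calls `alert` once after several consecutive failed checks (hypothetical design)."""

    def __init__(self, check, alert, failures_before_alert=3):
        self.check = check    # callable returning True when the site is healthy
        self.alert = alert    # callable taking a message string
        self.failures_before_alert = failures_before_alert
        self.consecutive_failures = 0
        self.alerted = False

    def poll_once(self):
        if self.check():
            # Healthy again: reset so a future outage triggers a fresh alert.
            self.consecutive_failures = 0
            self.alerted = False
            return
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failures_before_alert and not self.alerted:
            self.alert(f"Site looks down: {self.consecutive_failures} failed checks in a row")
            self.alerted = True


def run_forever(monitor, interval_seconds=60):
    """Poll on a fixed interval; meant to run on a machine other than the server itself."""
    while True:
        monitor.poll_once()
        time.sleep(interval_seconds)
```

To try it, one could build `DowntimeMonitor(check=lambda: is_up(SITE_URL), alert=print)` with `SITE_URL` set to the page to watch, then pass it to `run_forever`. Requiring several failures in a row before alerting avoids a page for a single dropped request, and running the check from a separate machine matters because a monitor on the same server can go down with it.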

What made Shmeppy crash?

This was a simple bug, and I should be able to fix it easily.

The issue was that an automated process on Shmeppy’s server shut down the database and brought it back up again. That hasn’t happened before, and I didn’t know it could happen, so I hadn’t programmed Shmeppy to recover from this situation. When the database connection dropped, Shmeppy crashed.

I’ll add handling for this kind of error now, so the next time the automated process messes with the database it won’t take down Shmeppy.
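
As a rough illustration of this kind of error handling, the sketch below shows one common pattern: when a database call fails because the connection dropped, discard the connection, wait briefly, reconnect, and try again. This is not Shmeppy's actual code. The `ReconnectingClient` class, the `connect` callable, and the use of Python's built-in `ConnectionError` as the "database went away" signal are assumptions made for the example; a real database driver raises its own exception types.

```python
import time


class ReconnectingClient:
    """Runs database operations, reconnecting with backoff if the connection drops.

    Hypothetical sketch: `connect` is any callable that returns a connection
    object, and ConnectionError stands in for the driver's disconnect error.
    """

    def __init__(self, connect, max_attempts=5, base_delay=0.5, sleep=time.sleep):
        self._connect = connect
        self._max_attempts = max_attempts
        self._base_delay = base_delay
        self._sleep = sleep      # injectable so tests do not actually wait
        self._conn = None

    def run(self, operation):
        """Call operation(connection), retrying on ConnectionError."""
        last_error = None
        for attempt in range(self._max_attempts):
            try:
                if self._conn is None:
                    self._conn = self._connect()
                return operation(self._conn)
            except ConnectionError as error:
                last_error = error
                self._conn = None    # force a fresh connection on the next attempt
                if attempt < self._max_attempts - 1:
                    # Exponential backoff: 0.5s, 1s, 2s, ... gives the database
                    # time to finish restarting.
                    self._sleep(self._base_delay * (2 ** attempt))
        # Every attempt failed; surface the last error instead of crashing silently.
        raise last_error
```

One caveat with this pattern: retrying is only safe for operations that can be repeated without harm, since a write might have been applied just before the connection dropped.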