Getting Reliability Right

It’s an exciting time in the shipping industry. Shipowners are increasingly turning to high-frequency data to inform their decisions. Of course, they don’t have the infrastructure to support this influx of new data – why should they? Their core competency is managing a fleet, not a fleet of VMs.

This has opened up a niche market for maritime SaaS providers. These companies, Nautilus Labs included, want to help shipowners manage their data, perform analytics, and generally make better decisions. When it works, it works!

I’ve noticed, however, that many SaaS platforms built for shipowners are being built by industry players, not tech companies. As such, they’re missing out on a wealth of hard-won best practices, and they’re not terribly concerned with reliability, availability, or uptime. Nautilus wants to bring these practices into shipping, to provide a better, more reliable, seamless experience for everyone involved.

How Nautilus gets Reliability right

Sometimes a website doesn’t work right, and we call that downtime. Downtime can come from a number of places: software bugs, software updates, servers restarting, unexpected network hiccups, cosmic rays, and any number of other sources. It might seem obvious that we’d want to avoid downtime as much as possible, but that’s not always the case.

One of the biggest points in the SRE Book is the need to embrace risk. At Nautilus, we embrace occasional downtime. It’s a dirty secret, but our product is down for a few seconds every time we deploy new code. We accept that risk in exchange for deploying code every week, sometimes several times a week, because frequent deploys let us fix problems rapidly and deliver value to our clients as quickly as possible. We’ve engineered our deployment process to be quick and painless, and in the near future we’ll probably spend more time minimizing the downtime we take on each week. Until then, we embrace the fact that a few seconds every week is better than a few weeks of downtime every year (and yes, a shipowner we work with experienced that with a different software provider).

You have to measure downtime

In order to know that we experience a few seconds of downtime a week, I had to measure that downtime. Before I could even do that, I had to define downtime. What is downtime? Is it when a webserver does not respond to a ping? When a /health/ endpoint doesn’t say OK?

This is what I settled on: Downtime is when a user cannot log in to our system and view some data from the ships they care about.

I use Pingdom transaction monitoring to tell me this. I like it so much that I have it perform this check in our prod environment, our staging environment, and again in prod but from a European datacenter. This is one of the few alerts I treat as an emergency – if this check is failing, then a user can’t use our website.

[Image: “What happened here?”]
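
The check itself boils down to something like the sketch below. BASE_URL, the endpoints, and the probe credentials are hypothetical placeholders rather than our real API; Pingdom runs the equivalent steps as a scripted transaction. The idea is the same: if either step fails, we’re down.

```python
# A rough sketch of the "can a user log in and see ship data?" check.
# BASE_URL, the endpoints, and the probe credentials are hypothetical
# placeholders; Pingdom runs the equivalent steps as a scripted transaction.
import os
import sys

import requests

BASE_URL = "https://app.example.com"  # placeholder, not our real domain
PROBE_CREDS = {
    "email": "probe@example.com",              # a dedicated monitoring user
    "password": os.environ["PROBE_PASSWORD"],  # never hard-code credentials
}


def check_user_journey(timeout=10):
    session = requests.Session()
    # Step 1: the probe user must be able to log in.
    resp = session.post(f"{BASE_URL}/login", json=PROBE_CREDS, timeout=timeout)
    resp.raise_for_status()
    # Step 2: the probe user must be able to see data for ships they care about.
    resp = session.get(f"{BASE_URL}/api/vessels", timeout=timeout)
    resp.raise_for_status()
    if not resp.json():
        raise RuntimeError("login succeeded but no vessel data came back")


if __name__ == "__main__":
    try:
        check_user_journey()
    except Exception as exc:  # any failure counts as downtime for this check
        print(f"DOWN: {exc}")
        sys.exit(1)
    print("UP")
```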

Defining what’s acceptable

There are two classes of problems: problems and Problems. Little-p problems are transient issues that well-designed systems handle and recover from. Big-P Problems are the ones that catch you by surprise: you wake up to half a dozen email alerts and spend the next few hours or days fixing systems, writing summaries, recovering data, and figuring out how you’ll prevent it from happening again. For any given system, your rate of little-p problems should improve over time, but as long as a problem can fix itself quickly, it’s not worth losing sleep over.

My organizational goal is that little-p problems should affect users no more than 0.5% of the time. That is, a user should experience at least 99.5% uptime. My goal for the next quarter is to cut that downtime in half with faster, blue-green deployments, better metrics, and better test environments.
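
The arithmetic behind those targets is worth writing down, if only because “a few seconds a week” and “99.5%” sound very different until you do the math. This is just the budget calculation, not a promise about any particular week:

```python
# Back-of-the-envelope error-budget math for a 99.5% uptime target,
# and for the halved target mentioned above.
WEEK_SECONDS = 7 * 24 * 3600  # 604,800 seconds in a week


def allowed_downtime_minutes(uptime_target, period_seconds=WEEK_SECONDS):
    """Minutes of downtime a period can absorb at a given uptime target."""
    return (1 - uptime_target) * period_seconds / 60


print(allowed_downtime_minutes(0.995))   # ~50.4 minutes per week at 99.5%
print(allowed_downtime_minutes(0.9975))  # ~25.2 minutes per week if we halve it
```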

Every failure is an opportunity to learn

Now for a couple war stories:

A brunch outage

One Saturday morning, I woke up to half a dozen email alerts about our product not working. I discovered that our product couldn’t communicate with the database because it had exhausted the database’s available connections. The short-term fix was to restart the web servers, and the long-term fix was a quick code change to make a new feature close its connections when it was done with them.
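
The fix itself was boring, which is exactly what you want. In spirit it looked something like the sketch below; the driver, function, and table names are stand-ins rather than our actual code:

```python
# A sketch of the long-term fix: any code path that opens a database
# connection must also close it. psycopg2 and the table name are stand-ins
# for whatever the real feature code uses.
from contextlib import closing

import psycopg2


def fetch_latest_readings(dsn, vessel_id):
    # closing() guarantees conn.close() runs even if the query raises, so a
    # busy feature can no longer eat the database's available connection slots.
    with closing(psycopg2.connect(dsn)) as conn, conn.cursor() as cur:
        cur.execute(
            "SELECT * FROM sensor_readings"
            " WHERE vessel_id = %s ORDER BY recorded_at DESC LIMIT 100",
            (vessel_id,),
        )
        return cur.fetchall()
```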

We used this as an opportunity to learn in a few ways: first, I set up a new alarm around database connection count that would fire off if the connection count grew to about half the maximum number. Second, we all learned a valuable lesson about how our database handles stale connections, and about ensuring we close connections when we don’t need to hang on to them. Third, this incident sparked a number of discussions about how we should organize our team to handle future incidents like this.
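
That connection-count alarm doesn’t have to be fancy. Assuming a Postgres database and the psycopg2 driver (assumptions for the sake of the sketch, not a statement about our stack), it reduces to comparing two numbers:

```python
# A sketch of a connection-count alarm: fire when the connections in use
# reach about half of the database's maximum. Postgres and psycopg2 are
# assumptions here.
from contextlib import closing

import psycopg2

ALERT_FRACTION = 0.5  # alert at roughly half of max_connections


def connection_usage(dsn):
    with closing(psycopg2.connect(dsn)) as conn, conn.cursor() as cur:
        cur.execute("SHOW max_connections;")
        max_conns = int(cur.fetchone()[0])
        cur.execute("SELECT count(*) FROM pg_stat_activity;")
        in_use = cur.fetchone()[0]
    return in_use, max_conns


def should_alert(dsn):
    in_use, max_conns = connection_usage(dsn)
    return in_use >= ALERT_FRACTION * max_conns
```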

Those lessons came into play over the summer – a new engineer wrote a one-off tool that accidentally opened a bunch of database connections all at once. Instead of our product going down in the middle of a workday, we noticed the problem before it became an emergency, looked at the engineer’s code, and helped him make his tooling better.

An old fiend

Back in the early, early, early days of Nautilus, before we had any paying customers, we hit a problem one afternoon where the webserver would run out of memory and restart itself. We had two webservers running side by side, so it was hard to notice the problem unless they both went down simultaneously. That eventually happened; we diagnosed the memory leak, fixed it, and, most importantly, set up monitoring to catch web server restarts that would otherwise have gone unnoticed.
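
Restart monitoring doesn’t have to be elaborate, either. One way to approximate it, assuming a Linux host, the psutil library, and a placeholder process name, is to flag server processes that are suspiciously young:

```python
# A sketch of restart monitoring: flag web server processes that have only
# been alive for a few minutes. psutil and the process name are assumptions.
import time

import psutil

PROCESS_NAME = "gunicorn"  # placeholder for whatever serves the web app
RESTART_WINDOW = 300       # "recently restarted" = started in the last 5 minutes


def recently_restarted():
    now = time.time()
    return [
        proc.pid
        for proc in psutil.process_iter(["name", "create_time"])
        if proc.info["name"] == PROCESS_NAME
        and now - proc.info["create_time"] < RESTART_WINDOW
    ]
```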

Just a few weeks ago on my bus ride into work, I got a series of alerts that our webservers were restarting. They weren’t restarting all at once (so there was no actual downtime), but I was concerned. When I got in, I discovered that an internal user was pulling down a year of data at a time. Even this wasn’t enough to restart the server by itself, but some set of conditions around all this data extraction was enough to cause the server to run out of memory maybe one out of every ten times.

Another engineer and I profiled our webservers and identified a trouble spot in the code. He then went off to fix the problem, and managed to reduce memory utilization by over 50%. I thanked the user who nearly caused this problem to become a Problem; if it weren’t for him, we would have had this bug lurking, ready to bite us.
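
If you’ve never gone hunting for this kind of hot spot, Python’s standard-library tracemalloc is one place to start. This isn’t necessarily the profiler we reached for, and the operation being profiled below is a hypothetical stand-in for the export our internal user was running:

```python
# One way to find a memory hot spot: snapshot allocations before and after
# the suspect operation and diff them. The operation itself is hypothetical.
import tracemalloc


def profile_memory(operation, *args, top_n=10):
    tracemalloc.start()
    before = tracemalloc.take_snapshot()
    operation(*args)  # e.g. export a year of data for one vessel
    after = tracemalloc.take_snapshot()
    tracemalloc.stop()
    # Show which source lines allocated the most memory during the operation.
    for stat in after.compare_to(before, "lineno")[:top_n]:
        print(stat)
```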

Summary

At Nautilus Labs, we work hard to learn from the problems we experience and improve ourselves every day. We have an operations culture that believes experiencing a small amount of regularly scheduled downtime is much better than large amounts of unexpected downtime. When we experience big-P Problems, we turn them into little-p problems and make sure we’re better off for having experienced them.

My hope is that we can take these practices from the tech industry and help apply them to maritime.

