Using command line tools to analyze a crash

At Nautilus Labs, we’re advancing the efficiency of maritime transportation by collecting data and recognizing patterns. An interesting side effect of this is that we have insight into a ships' actions, not just day-by-day, but second-by-second. It’s not as simple as dots on a map either; engine power, wind, rate-of-turn, it’s all important.

We worked with one of our clients to take advantage of this recently. A few days ago, one of their ships struck a jetty during a berthing maneuver.

No one was hurt, but they need to gather data for their insurers. They asked for a high-resolution dump of data from that day so they could begin investigating. That’s a straightforward request, but we keep raw data encrypted in AWS S3 buckets. Pulling it down is easy (aws s3 sync ...), but it has to be decrypted and parsed. Fortunately, we’ve written the software to do that. Here’s how we used common tools to visualize the data.

Please understand, this is not an accident investigation. I’m purposefully obscuring the crash location and date, and I’m intentionally avoiding making any statements about cause or fault. My intention here is to write about command line tools.

Decrypting and parsing the data is all a matter of internal tooling, so I’ll skip these steps. Suffice to say I’ve got a binary called client_parser that outputs these data points in a comma separated value format.

client_parser | ( head -n 1 && tail -n +2 | sort ) > ship-data.csv

The data coming out of client_parser is not necessarily ordered, so we want to fix that. The first line is always the CSV header, so we pull that off of the output stream with head -n 1. Lines 2 and greater are selected with tail -n +2, and passed through sort. Fortunately, we chose to use RFC 3339 timestamps as the first column in that file, so sort puts the lines passed to it in order from earliest to latest. Finally, we’re writing the headers and the sorted data to ship-data.csv.

This file now contains a few hundred thousand lines, each with over a hundred values. To make the data set easier to understand, we want to narrow it down some. The client told us the crash occurred on, say, the night of January 15th, so I’m going to limit the dataset to that.

$ grep -n 2018-01-15T12:00:00 ship-data.csv
318995:2018-01-15T12:00:00Z,...
$
$ grep -n 2018-01-16T12:00:00 ship-data.csv
321875:2018-01-16T12:00:00Z,...

Remember, we’re using RFC 3339 timestamps as the first column in our data. I know that 2018-01-15T12:00:00 is going to appear exactly once in that dataset, so I figure out which line it’s on using grep -n. I can do the same thing for January 16th, and determine that I only care about lines 318995 through 321875. Let’s make an easier to handle file that only has those.

sed -n '1p;318995,321875p' ship-data.csv > ship-jan-15.csv

This prints the first line, and lines 318995-321875. The output is redirected to a new file. This is much better. Already I can load up that file in a spreadsheet program and begin to poke around. An interesting visualization I found was the absolute value of the ship’s change in speed every 30 seconds.

That big spike happened at 18:35:00, so let’s see what was happening then.

$ grep -n T18:34:30 ship-data-day.csv
791:2018-01-15T18:35:00Z
$
$ sed -n '1p;786,796p' ship-jan-15.csv

You might be tempted to pipe the result of that sed to less, but the result is an impenetrable wall of text. We need a way to manipulate CSV data on the command line. Enter CSVKit.

Let’s pipe that through CSVCut instead, to show us just a couple columns. To start, let’s go for timestamp and speed through water.

$ sed -n '1p;786,796p' ship-jan-15.csv | csvcut -c "Timestamp,Speed TW"
Timestamp,Speed TW
2018-01-15T18:32:00Z,3.200000047683716
2018-01-15T18:32:31Z,3.299999952316284
2018-01-15T18:33:00Z,3.200000047683716
2018-01-15T18:33:30Z,3.0999999046325684
2018-01-15T18:34:00Z,3
2018-01-15T18:34:30Z,2.200000047683716
2018-01-15T18:35:00Z,1.2999999523162842
2018-01-15T18:35:30Z,0.20000000298023224
2018-01-15T18:36:00Z,-0.5
2018-01-15T18:36:30Z,-0.20000000298023224
2018-01-15T18:37:00Z,0.10000000149011612

Between 18:34:00 and 18:36:00, we see a sharp drop in speed. The biggest change is, exactly as the chart predicted, between 18:35:00 and 18:35:30.

I’m confident enough to try putting this on a map. Let’s pick about 100 points around 18:35:00 and generate a CSV.

sed -n '1p;736,836p' ship-jan-15.csv | csvcut -c "Timestamp,GPS Latitude,GPS Longitude,Speed TW" > ship-jan-15-map.csv

I learned that Google MyMaps can understand CSV data if you tell it which columns are latitude and longitude, so I was able to put these points on a map quickly.

This is pretty clearly the incident in question. I added a few more columns (heading, RPM, engine power, wind speed, wind direction, etc.) and shared the map with the client. He loved it, and asked if we could compare this incident with a berthing the same ship made a few days prior. I went through most of the same steps and generated this:

That’s obviously a very different profile. I’m not an accident investigator or a mariner, so I’m hesitant to say what I think was the reason for the difference between the two approaches.

Nevertheless, I enjoyed the exercise. Working with these data sets is always interesting. Nobody’s happy this ship is out of service, but I believe we can collect the data and effect the changes necessary to make incidents like these a thing of the past.