I don’t work for Backstop any more (but you should), but back when I did, we discovered we had no idea how long it took our users to load our tools. Hell, we couldn’t even figure out how many clicks per month we had! I was working with New Relic trying to get a quote from them, and the best I could do was 2 million plus or minus 800 thousand. We had insight into how long it took our servers to respond to requests, but no thought was given to DNS, loading outside resources, rendering DOM, etc.
Colin found a tool called Yahoo Boomerang which seemed to do what we needed. It would collect metrics on how long it took the page, and various resources on the page, to load, and it would report those data back to a server over here. Simple, right?
I took two issues with Boomerang. First, it again broke our testing. Second, it used some ridiculous scheme of a GET request with URL parameters inside of an invisible iframe to get data back to the server without any cross-site request issues. This seemed entirely too complicated, and simplicity was one of my overarching goals. I decided to roll my own.
window.performance.Timing, and thus Uluru was born.
The name, incidentally, came from back when I was experimenting with Boomerang. I needed a server to throw Boomerang data at, so I picked the famous Australian sandstone formation Uluru.
- Light weight: When minified, Uluru.js squishes down under 500 bytes.
- Speed: Uluru.js has no dependencies on other JS libraries.
- No hacks: Uluru.js makes a single POST request to a remote server. It does not e.g. cram metrics into query parameters and make a series of requests GETting hidden images or iframes.
Uluru is a function that, when called, gets the time since navigation started and some other metrics, and sends those to an endpoint. We’ve set it up such that
uluru(endpoint) is called when
window.onload fires, under the assumption that most of our product is useful by that point.
Uluru collects the following data:
- url: the
- connectionTime: the time (ms since UNIX epoch) the connection was initiated.
- connectionDelta: the time spent establishing a connection to the server
- firstByte: the time spent waiting for the server to respond with the first byte of data
- responseDelta: the time the server spent sending a response to the client
- loadTime: time between the
It also calls
window.performance.getEntries() to provide specific timing on every resource (script, image, stylesheet, etc.) included in the page load.
Uluru rolls all the data it collects up into one PUT request. The easiest way to collect Uluru data is to include a data collection REST endpoint in your existent web application (or, if you’re feeling clever, make your web proxy route Uluru requests to a specific Uluru logging server.
Alternatively, you could set the
Access-Control-Allow-Origin header to allow the browser to make POST requests to a separate server you’ve spun up for this purpose.
Either way, what we’ve wound up doing is writing the Uluru data out to Splunk, and aggregating it there in interesting ways.
Here’s some of the data we were able to collect!
Page loads and Appdexes⌗
One neat thing we didn’t have before is an easy way to count how many page loads our system saw, how many unique users we had actively using our system, or how many clients were logged in on a given day. When you start making changes to system performance, you start thinking about things differently when you consider the 1ms wait you just removed will be multiplied by 73,000 over the course of a day.
Appdex is an interesting measure of system speed. You select a goal speed (2 seconds on the left, 4 seconds on the right), and you count page loads. Every page load that took under the goal speed counts for 1. Every load that took greater than the goal speed but under twice the goal speed counts for .5, and the rest count for 0. You divide the counted page loads by the total page loads, and you get a number between 0 and 1 that gives you a decent metric for how happy or sad your users are.
One really interesting thing we saw from this, we found by looking at individual clients. One of our clients had a significantly lower Appdex than the rest. We did some digging, and found that a condition in their data made every JSP load a huge number of records from the database to generate a list they never used. We disabled that for them, and they got much happier without ever realizing they were sad!
This is a bit more boring, but it lets us know how many page loads are fast, acceptable, slow, or super slow. Comparing, e.g., last hour with last 7 days lets us see any system wide problems before our clients call in.
Page load histograms⌗
These let us know at a glance how our page loads are distributed. We could, for example, notice a bi-modal distribution. It wouldn’t affect our average or median scores, but it would be indicative of something fishy.
I once noticed a particular user sending back negative page load speeds. I investigated it for a little bit and decided it was just one of those weird things that happen when you rely on multiple sources of time.
This report gives us some neat insight into what’s been taking a long time to load. You look at each one, make sure it makes sense, and then use this graph to help prioritize your projects.
Likewise, just counting the total amount of time sunk into a resource is helpful. You might (much like we do) have a page that clients hit all the time. It might load in under two seconds, but if you can shave 10 % off of that, you’re saving 20 hours per week aggregated across all of your users.
You can get a lot of really interesting data from visualizations like this. Use these to help justify speed projects.
Currently, the Uluru project includes some stub code for a generic REST endpoint. I want to flush this out into something that writes logs for splunkd or logstash, or forwards data to services like Sensu or Riemann.