A long while back, I experimented with making a Go web server behave differently
based on its heap utilization. I used
runtime.ReadMemStats, but I didn’t run
ReadMemStats on every request because it stops the world.
More recently, I got curious about logging memory utilization for a production system,
where we found it was important to see the memory in use for each and every web
request. I thought back to my old friend ReadMemStats, but I was frightened of
stopping the world.
I also found Go’s
pprof.Lookup("heap"), which, I think, tells you some
interesting information about the state of the heap the last time the garbage
collector ran. I thought it would be interesting to compare the two.
Benchmark in isolation
The first thing I did was write a naïve benchmark of the two methods, available here: https://github.com/agocs/goMemProfileTester/blob/master/mem_test.go
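The linked file has the real thing; the shape of the two benchmarks is roughly this (a sketch, with a main that uses testing.Benchmark so it runs outside go test):

```go
package main

import (
	"bytes"
	"fmt"
	"runtime"
	"runtime/pprof"
	"testing"
)

// BenchmarkMemstats times a bare ReadMemStats call.
func BenchmarkMemstats(b *testing.B) {
	var m runtime.MemStats
	for i := 0; i < b.N; i++ {
		runtime.ReadMemStats(&m)
	}
}

// BenchmarkPProfHeap times writing the heap profile to a buffer.
func BenchmarkPProfHeap(b *testing.B) {
	for i := 0; i < b.N; i++ {
		var buf bytes.Buffer
		if err := pprof.Lookup("heap").WriteTo(&buf, 1); err != nil {
			b.Fatal(err)
		}
	}
}

func main() {
	fmt.Println("ReadMemStats:", testing.Benchmark(BenchmarkMemstats))
	fmt.Println("pprof.Lookup:", testing.Benchmark(BenchmarkPProfHeap))
}
```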
It should be noted that my Go version is go1.11.5 darwin/amd64
The results surprised me.
go test -bench=.
goos: darwin
goarch: amd64
pkg: github.com/agocs/goMemProfileTester
BenchmarkMemstats-8     100000      13966 ns/op
BenchmarkPProfHeap-8     10000    1453765 ns/op
PASS
ok      github.com/agocs/goMemProfileTester     16.125s
runtime.ReadMemStats stabilized at about 14μs per operation, while
pprof.Lookup("heap") stabilized at 1.4ms, just about 100x slower!
However, I realized that an isolated benchmark doesn't reflect the complexities of multiple goroutines stopping the world on top of one another. I'd need to write a webserver to simulate this properly.
Heavily loaded web server
So I wrote a webserver that Does Stuff: https://github.com/agocs/goMemProfileTester/blob/master/main.go
It fills a slice with some number of randomly generated float64s, and then sorts them. I figured that was a good combination of memory intensive and CPU intensive work.
I used the Vegeta tool for load testing. The exact command was:
echo "GET http://localhost:8080/readMem$N/" | vegeta attack -duration 10s | vegeta report
$N is one of 0, 1, or 2, corresponding to Control, ReadMemStats, or pprof.Lookup.
Vegeta launches 50 requests per second against the endpoint for 10 seconds, and then generates a report on the results.
Preliminary results were promising!
First pass results, array length 1024
Control:
Requests      [total, rate]            500, 50.10
Duration      [total, attack, wait]    9.981021439s, 9.980178389s, 843.05µs
Latencies     [mean, 50, 95, 99, max]  1.076732ms, 991.45µs, 1.565569ms, 1.955077ms, 7.243589ms
Bytes In      [total, mean]            9866887, 19733.77
Bytes Out     [total, mean]            0, 0.00
Success       [ratio]                  100.00%
Status Codes  [code:count]             200:500
Error Set:

ReadMemStats:
Requests      [total, rate]            500, 50.09
Duration      [total, attack, wait]    9.982931537s, 9.982067082s, 864.455µs
Latencies     [mean, 50, 95, 99, max]  1.077519ms, 966.665µs, 1.586833ms, 1.97482ms, 6.859381ms
Bytes In      [total, mean]            9866533, 19733.07
Bytes Out     [total, mean]            0, 0.00
Success       [ratio]                  100.00%
Status Codes  [code:count]             200:500
Error Set:

pprof.Lookup:
Requests      [total, rate]            500, 50.09
Duration      [total, attack, wait]    9.983083109s, 9.981228359s, 1.85475ms
Latencies     [mean, 50, 95, 99, max]  1.637835ms, 1.557003ms, 2.705042ms, 3.478394ms, 7.55065ms
Bytes In      [total, mean]            9866530, 19733.06
Bytes Out     [total, mean]            0, 0.00
Success       [ratio]                  100.00%
Status Codes  [code:count]             200:500
Error Set:
The relevant (to me) information is in the Latencies, both mean and 95th percentile. It appears that the difference between Control and ReadMemStats is roughly negligible, and pprof.Lookup is ever so slightly slower.
Effect of work size
However, in playing around with the size of the array, I noticed that these results vary with the array size in interesting ways. I ran the same test across a range of array sizes.
Anything higher than 409600 and requests started failing. I noticed that at around 204800, requests were arriving slightly faster than my webserver could respond to them, which meant that ReadMemStats's stop-the-world behavior started to have an effect on multiple simultaneous requests.
My results are as follows:
At low load, it appears that the difference between Control, ReadMemStats, and pprof.Lookup is roughly negligible. ReadMemStats is slightly better than pprof.Lookup. However, as load increases and the server starts handling concurrent requests, pprof.Lookup starts to surpass ReadMemStats. This makes sense in retrospect; ReadMemStats’s STW behavior affects multiple running goroutines, and each request might have to pause for several successive STWs.
A Million Little Requests
I was also curious to see how ReadMemStats vs pprof.Lookup would behave in a scenario where the webserver was trying to field a bunch of very small requests. To do this, I set arraysize to 2 and started ramping up the Vegeta rate.
At 500 requests per second, I noticed that Control and ReadMemStats were both running at about 20% of a CPU, while pprof.Lookup was running at about 300% CPU. At 5000 requests per second, pprof.Lookup requests started timing out with CPU utilization above 600%. Control and ReadMemStats were chugging along at 50% CPU and 80% CPU respectively.
At 20000 requests per second, both Control and ReadMemStats started dropping requests. Control dropped 3 requests total, while ReadMemStats dropped 70% of requests.
At a very rapid rate of very small requests, the benchmark results quickly rear their head. pprof.Lookup shows itself to be a resource hog early on, and only gets worse. ReadMemStats performs admirably, but is definitely worse than no memory analytics.
I was surprised at how little ReadMemStats affected concurrent requests at 10000 requests per second.
You might notice that my load was very CPU and memory intensive, but there were not a lot of I/O pauses. I’d be curious to test how ReadMemStats and pprof.Lookup compare when faced with I/O pauses as opposed to heavy computation.
The difference between ReadMemStats and pprof.Lookup seems pretty negligible, and neither is much worse than the control. I'm going to go with ReadMemStats for now, because I don't have to parse the results.
I’d be curious to know if anyone has any thoughts on the matter!