A long while back, I experimented with making a Go web server behave differently based on its heap utilization. I used runtime.ReadMemStats, but I didn’t run ReadMemStats on every request because it stops the world.

The M/V Nordic Odin, in port in Gdansk, Poland

More recently, I started getting curious about logging memory utilization, against a production system, where we found that it was important to see the memory in use for each and every web request. I thought back to my old friend ReadMemStats, but I was frightened of stopping the world.

I also found Go’s pprof.Lookup("heap"), which, I think, tells you some interesting information about the state of the heap the last time the garbage collector ran. I thought it would be interesting to compare the two.

Benchmark in Isolation

The first thing I did was write a naïve benchmark of the two methods, available here: https://github.com/agocs/goMemProfileTester/blob/master/mem_test.go

It should be noted that my Go version is go1.11.5 darwin/amd64

The results surprised me.

go test -bench=.
goos: darwin
goarch: amd64
pkg: github.com/agocs/goMemProfileTester
BenchmarkMemstats-8    	  100000	     13966 ns/op
BenchmarkPProfHeap-8   	   10000	   1453765 ns/op
PASS
ok  	github.com/agocs/goMemProfileTester	16.125s

runtime.ReadMemStats stabilized at about 14μs per operation, while pprof.Lookup("heap") stabilized at 1.4ms, just about 100x slower!

However, I realize that a benchmark doesn’t reflect the complexities of multiple goroutines stopping the world on top of one another. I’d need to write a webserver to simulate this properly.

#Heavily Loadded Webserver

So I wrote a webserver that Does Stuff: https://github.com/agocs/goMemProfileTester/blob/master/main.go

It fills a slice with some number of randomly generated float64s, and then sorts them. I figured that was a good combination of memory intensive and CPU intensive work.

I used the Vegeta tool for load testing. The exact command was:

echo "GET http://localhost:8080/readMem$N/" | vegeta attack -duration 10s | vegeta report

where $N is one of 0, 1, or 2, corresponding to Control, ReadMemStats, or pprof.Lookup.

Vegeta launches 50 requests per second against the endpoint for 10 seconds, and then generates a report on the results.

Preliminary results were promising!

First pass results, array length 1024

##Control

Requests      [total, rate]            500, 50.10
Duration      [total, attack, wait]    9.981021439s, 9.980178389s, 843.05µs
Latencies     [mean, 50, 95, 99, max]  1.076732ms, 991.45µs, 1.565569ms, 1.955077ms, 7.243589ms
Bytes In      [total, mean]            9866887, 19733.77
Bytes Out     [total, mean]            0, 0.00
Success       [ratio]                  100.00%
Status Codes  [code:count]             200:500
Error Set:

ReadMemStats

Requests      [total, rate]            500, 50.09
Duration      [total, attack, wait]    9.982931537s, 9.982067082s, 864.455µs
Latencies     [mean, 50, 95, 99, max]  1.077519ms, 966.665µs, 1.586833ms, 1.97482ms, 6.859381ms
Bytes In      [total, mean]            9866533, 19733.07
Bytes Out     [total, mean]            0, 0.00
Success       [ratio]                  100.00%
Status Codes  [code:count]             200:500
Error Set:

pprof.Lookup

Requests      [total, rate]            500, 50.09
Duration      [total, attack, wait]    9.983083109s, 9.981228359s, 1.85475ms
Latencies     [mean, 50, 95, 99, max]  1.637835ms, 1.557003ms, 2.705042ms, 3.478394ms, 7.55065ms
Bytes In      [total, mean]            9866530, 19733.06
Bytes Out     [total, mean]            0, 0.00
Success       [ratio]                  100.00%
Status Codes  [code:count]             200:500
Error Set:

The relevant (to me) information is in the Latencies, both mean and 95th percentile. It appears that the difference between Control and ReadMemStats is roughly negligible, and pprof.Lookup is ever so slightly slower.

Affect of work size

However, in playing around with the size of the array, I noticed that these results vary with the arraysize in interesting ways. I ran the same test with the following array sizes:

  • 1024
  • 10240
  • 102400
  • 204800
  • 409600

Anything higher than 409600 and requests started failing. I noticed that at around 204800, requests were coming in just slightly faster than it took my webserver to respond. That means that ReadMemStats Stop The World behavior started to have an effect on multiple simultaneous requests.

My results are as follows:

Graph showing little difference between the methods

At low load, it appears that the difference between Control, ReadMemStats, and pprof.Lookup is roughly negligible. ReadMemStats is slightly better than pprof.Lookup. However, as load increases and the server starts handling concurrent requests, pprof.Lookup starts to surpass ReadMemStats. This makes sense in retrospect; ReadMemStats’s STW behavior affects multiple running goroutines, and each request might have to pause for several successive STWs.

A Million Little Requests

I was also curious to see how ReadMemStats vs pprof.Lookup would behave in a scenario where the webserver was trying to field a bunch of very small requests. To do this, I set arraysize to 2 and started ramping up the Vegeta rate.

At 500 requests per second, I noticed that Control and ReadMemStats were both running at about 20% of a CPU, while pprof.Lookup was running at about 300% cpu. At 5000 requests per second, pprof.Lookup requests started timing out with CPU utilization above 600%. Control and ReadMemStats were both chugging along at 50% cpu and 80% cpu respectiely.

At 20000 requests per second, both Control and ReadMemStats started dropping requests. Control dropped 3 requests total, while ReadMemStats dropped 70% of requests.

Graph showing that ReadMemStats is better than pprof

At a very rapid rate of very small requests, the benchmark results quickly rear their head. pprof.Lookup shows itself to be a resource hog early on, and only gets worse. ReadMemStats performs admirably, but is definitely worse than no memory analytics.

I was surprised at how little ReadMemStats affected concurrent requests at 10000 requests per second.

I/O Wait

You might notice that my load was very CPU and memory intensive, but there were not a lot of I/O pauses. I’d be curious to test how ReadMemStats and pprof.Lookup compare when faced with I/O pauses as opposed to heavy computation.

Conclusion

The difference between the ReadMemStats and pprof.Lookup seems pretty negligible, and is not much worse than the control. I’m going to go with ReadMemStats for now, because I don’t have to parse the results.

I’d be curious to know if anyone has any thoughts on the matter!