Go is often used for writing distributed systems, advanced data stores, and microservices. Performance is key in these domains.
In this tutorial, you’ll learn how to profile your programs to make them lightning fast (utilize the CPU better) or feather light (use less memory). I will cover CPU and memory profiling, using the pprof (the Go profiler), visualizing the profiles, and even flame graphs.
Profiling is measuring the performance of your program in various dimensions. Go comes with great support for profiling and can profile the following dimensions out of the box:
- a sampling of CPU time per function AND instruction
- a sampling of all heap allocations
- stack traces of all current goroutines
- stack traces that led to the creation of new OS threads
- stack traces that led to blocking on synchronization primitives
- stack traces of holders of contended mutexes
You can even create custom profiles if you want. Go profiling involves creating a profile file and then analyzing it using the pprof
go tool.
How to Create Profile Files
There are several ways to create a profile file.
Using “go test” to Generate Profile Files
The easiest way is to use go test
. It has several flags that allow you to create profile files. Here is how to generate both a CPU profile file and a memory profile file for the test in the current directory: go test -cpuprofile cpu.prof -memprofile mem.prof -bench .
Download Live Profile Data From a Long-Running Service
If you want to profile a long-running web service, you can use the built-in HTTP interface for providing profile data. Add somewhere the following import statement:
import _ "net/http/pprof"
Now, you can download live profile data from the /debug/pprof/
URL. More info is available in the net/http/pprof package documentation.
Profiling in Code
You can also add direct profiling into your code for complete control. First you need to import runtime/pprof
. CPU profiling is controlled by two calls:
pprof.StartCPUProfile()
pprof.StopCPUProfile()
Memory profiling is done by calling runtime.GC()
followed by pprof.WriteHeapProfile()
.
All the profiling functions accept a file handle that you are responsible for opening and closing appropriately.
The Sample Program
To see the profiler in action, I’ll use a program that solves Project Euler’s Problem 8. The problem is: given a 1,000-digit number, find the 13 adjacent digits within this number that have the largest product.
Here is a trivial solution that iterates over all the sequences of 13 digits, and for each such sequence it multiplies all 13 digits and returns the result. The largest result is stored and finally returned:
package trivial import ( "strings" ) func calcProduct(series string) int64 { digits := make([]int64, len(series)) for i, c := range series { digits[i] = int64(c) - 48 } product := int64(1) for i := 0; i < len(digits); i++ { product *= digits[i] } return product } func FindLargestProduct(text string) int64 { text = strings.Replace(text, "n", "", -1) largestProduct := int64(0) for i := 0; i < len(text); i++ { end := i + 13 if end > len(text) { end = len(text) } series := text[i:end] result := calcProduct(series) if result > largestProduct { largestProduct = result } } return largestProduct }
Later, after profiling, we will see some ways to improve the performance with another solution.
CPU Profiling
Let’s profile the CPU of our program. I’ll use the go test method using this test:
import ( "testing" ) const text = ` 73167176531330624919225119674426574742355349194934 96983520312774506326239578318016984801869478851843 85861560789112949495459501737958331952853208805511 12540698747158523863050715693290963295227443043557 66896648950445244523161731856403098711121722383113 62229893423380308135336276614282806444486645238749 30358907296290491560440772390713810515859307960866 70172427121883998797908792274921901699720888093776 65727333001053367881220235421809751254540594752243 52584907711670556013604839586446706324415722155397 53697817977846174064955149290862569321978468622482 83972241375657056057490261407972968652414535100474 82166370484403199890008895243450658541227588666881 16427171479924442928230863465674813919123162824586 17866458359124566529476545682848912883142607690042 24219022671055626321111109370544217506941658960408 07198403850962455444362981230987879927244284909188 84580156166097919133875499200524063689912560717606 05886116467109405077541002256983155200055935729725 71636269561882670428252483600823257530420752963450 ` func TestFindLargestProduct(t *testing.T) { for i := 0; i < 100000; i++ { res := FindLargestProduct(text) expected := int64(23514624000) if res != expected { t.Errorf("Wrong!") } } }
Note that I run the test 100,000 times because the go profiler is a sampling profiler that needs the code to actually spend some significant time (several milliseconds cumulative) on each line of code. Here is the command to prepare the profile:
go test -cpuprofile cpu.prof -bench . ok _/github.com/the-gigi/project-euler/8/go/trivial 13.243s
It took a little over 13 seconds (for 100,000 iterations). Now, to view the profile, use the pprof go tool to get into the interactive prompt. There are many commands and options. The most basic command is topN; with the -cum option it shows the top N functions that took the most cumulative time to execute (so a function that takes very little time to execute, but is called many times, can be at the top). This is usually what I start with.
> go tool pprof cpu.prof Type: cpu Time: Oct 23, 2017 at 8:05am (PDT) Duration: 13.22s, Total samples = 13.10s (99.06%) Entering interactive mode (type "help" for commands) (pprof) top5 -cum Showing nodes accounting for 1.23s, 9.39% of 13.10s total Dropped 76 nodes (cum <= 0.07s) Showing top 5 nodes out of 53 flat flat% sum% cum cum% 0.07s 0.53% 0.53% 10.64s 81.22% FindLargestProduct 0 0% 0.53% 10.64s 81.22% TestFindLargestProduct 0 0% 0.53% 10.64s 81.22% testing.tRunner 1.07s 8.17% 8.70% 10.54s 80.46% trivial.calcProduct 0.09s 0.69% 9.39% 9.47s 72.29% runtime.makeslice
Let's understand the output. Each row represents a function. I elided the path to each function due to space constraints, but it will show in the real output as the last column.
Flat means the time (or percentage) spent within the function, and Cum stands for cumulative—the time spent within the function and all the functions it calls. In this case, testing.tRunner
actually calls TestFindLargestProduct()
, which calls FindLargestProduct()
, but since virtually no time is spent there, the sampling profiler counts their flat time as 0.
Memory Profiling
Memory profiling is similar, except that you create a memory profile:
go test -memprofile mem.prof -bench . PASS ok _/github.com/the-gigi/project-euler/8/go/trivial
You can analyze your memory usage using the same tool.
Using pprof to Optimize Your Program's Speed
Let's see what we can do to solve the problem faster. Looking at the profile, we see that calcProduct()
takes 8.17% of the flat runtime, but makeSlice()
, which is called from calcProduct()
, is taking 72% (cumulative because it calls other functions). This gives a pretty good indication of what we need to optimize. What does the code do? For each sequence of 13 adjacent numbers, it allocates a slice:
func calcProduct(series string) int64 { digits := make([]int64, len(series)) ...
That's almost 1,000 times per run, and we run 100,000 times. Memory allocations are slow. In this case, there is really no need to allocate a new slice every time. Actually, there is no need to allocate any slice at all. We can just scan the input array.
The following code snippet shows how to compute the running product by simply dividing by the first digit of the previous sequence and multiplying by the cur
digit.
if cur == 1 { currProduct /= old continue } if old == 1 { currProduct *= cur } else { currProduct = currProduct / old * cur } if currProduct > largestProduct { largestProduct = currProduct }
Here is a short list of some of the algorithmic optimizations:
- Computing a running product. Suppose we computed the product at index N... N+13 and call it P(N). Now we need to compute the product at index N+1..N+13. P(N+1) equals to P(N) except that the first number at index N is gone and we need to take into account the new number at index N+14T. This can be done by dividing the previous product by its first number and multiplying by the new number.
- Not computing any sequence of 13 numbers that contain 0 (the product will always be zero).
- Avoiding division or multiplication by 1.
The complete program is here. There is some thorny logic to work around the zeros, but other than that it is pretty straightforward. The main thing is that we only allocate one array of 1000 bytes at the beginning, and we pass it by pointer (so no copy) to the findLargestProductInSeries()
function with a range of indexes.
package scan func findLargestProductInSeries(digits *[1000]byte, start, end int) int64 { if (end - start) < 13 { return -1 } largestProduct := int64((*digits)[start]) for i := 1; i < 13 ; i++ { d := int64((*digits)[start + i]) if d == 1 { continue } largestProduct *= d } currProduct := largestProduct for ii := start + 13; ii < end; ii++ { old := int64((*digits)[ii-13]) cur := int64((*digits)[ii]) if old == cur { continue } if cur == 1 { currProduct /= old continue } if old == 1 { currProduct *= cur } else { currProduct = currProduct / old * cur } if currProduct > largestProduct { largestProduct = currProduct } } return largestProduct } func FindLargestProduct(text string) int64 { var digits [1000]byte digIndex := 0 for _, c := range text { if c == 10 { continue } digits[digIndex] = byte(c) - 48 digIndex++ } start := -1 end := -1 findStart := true var largestProduct int64 for ii := 0; ii < len(digits) - 13; ii++ { if findStart { if digits[ii] == 0 { continue } else { start = ii findStart = false } } if digits[ii] == 0 { end = ii result := findLargestProductInSeries(&digits, start, end) if result > largestProduct { largestProduct = result } findStart = true } } return largestProduct }
The test is the same. Let's see how we did with the profile:
> go test -cpuprofile cpu.prof -bench . PASS ok _/github.com/the-gigi/project-euler/8/go/scan 0.816s
Right off the bat, we can see that the runtime dropped from more than 13 seconds to less than one second. That's pretty good. Time to peek inside. Let's use just top10
, which sorts by flat time.
(pprof) top10 Showing nodes accounting for 560ms, 100% of 560ms total flat flat% sum% cum cum% 290ms 51.79% 51.79% 290ms 51.79% findLargestProductInSeries 250ms 44.64% 96.43% 540ms 96.43% FindLargestProduct 20ms 3.57% 100% 20ms 3.57% runtime.usleep 0 0% 100% 540ms 96.43% TestFindLargestProduct 0 0% 100% 20ms 3.57% runtime.mstart 0 0% 100% 20ms 3.57% runtime.mstart1 0 0% 100% 20ms 3.57% runtime.sysmon 0 0% 100% 540ms 96.43% testing.tRunner
This is great. Pretty much the entire runtime is spent inside our code. No memory allocations at all. We can dive deeper and look at the statement level with the list command:
(pprof) list FindLargestProduct Total: 560ms ROUTINE ======================== scan.FindLargestProduct 250ms 540ms (flat, cum) 96.43% of Total . . 44: . . 45: . . 46:func FindLargestProduct(t string) int64 { . . 47: var digits [1000]byte . . 48: digIndex := 0 70ms 70ms 49: for _, c := range text { . . 50: if c == 10 { . . 51: continue . . 52: } . . 53: digits[digIndex] = byte(c) - 48 10ms 10ms 54: digIndex++ . . 55: } . . 56: . . 57: start := -1 . . 58: end := -1 . . 59: findStart := true . . 60: var largestProduct int64 . . 61: for ii := 0; ii < len(digits)-13; ii++ { 10ms 10ms 62: if findStart { . . 63: if digits[ii] == 0 { . . 64: continue . . 65: } else { . . 66: start = ii . . 67: findStart = false . . 68: } . . 69: } . . 70: 70ms 70ms 71: if digits[ii] == 0 { . . 72: end = ii 20ms 310ms 73: result := f(&digits,start,end) 70ms 70ms 74: if result > largestProduct { . . 75: largestProduct = result . . 76: } . . 77: findStart = true . . 78: } . . 79: }
This is pretty amazing. You get a statement by statement timing of all the important points. Note that the call on line 73 to function f()
is actually a call to findLargestProductInSeries()
, which I renamed in the profile due to space limitations. This call takes 20 ms. Maybe, by embedding the function code in place, we can save the function call (including allocating stack and copying arguments) and save those 20 ms. There may be other worthwhile optimizations that this view can help pinpoint.
Visualization
Looking at these text profiles can be difficult for large programs. Go gives you a lot of visualization options. You'll need to install Graphviz for the next section.
The pprof tool can generate output in many formats. One of the easiest ways (svg output) is simply to type 'web' from the pprof interactive prompt, and your browser will display a nice graph with the hot path marked in pink.
Flame Graphs
The built-in graphs are nice and helpful, but with large programs, even these graphs can be difficult to explore. One of the most popular tools for visualizing performance results is the flame graph. The pprof tool doesn't support it out of the box yet, but you can play with flame graphs already using Uber's go-torch tool. There is ongoing work to add built-in support for flame graphs to pprof.
Conclusion
Go is a systems programming language that is used to build high-performance distributed systems and data stores. Go comes with excellent support that keeps getting better for profiling your programs, analyzing their performance, and visualizing the results.
There is a lot of emphasis by the Go team and the community on improving the tooling around performance. The full source code with three different algorithms can be found on GitHub.
Powered by WPeMatico