GC impact on throughput and latency

January 15, 2014 by Nikita Salnikov-Tarnovski

One type of the problems each and every Java application out there has to wrestle with is related to garbage collection. When the garbage collector works, it represents a wonderful invention. When it does not – or when the way GC is doing its housekeeping becomes unpredictable – then you have a friend who has turned into a foe.

This post is about garbage collection pause times. Or more precisely – why should you care about the pauses.

Some posts ago, I explained throughput and latency via mr Apple CEO Tim Cook planning for the iPad demand and building factories. I will stick to the same illustrative story:

  • We have a factory line producing one iPad per second. Each second, every second. So the throughput of the line is 86,400 iPads/day
  • It takes four hours to complete an iPad from the start where the casing is molded to the finish when the acceptance tests on the iPad have been concluded. So the latency of the line is four hours.

The system above and the calculations are based on the assumption that the factory line is operational 24 hours a day, each day, every day. But all factory lines tend to need maintenance which is equivalent to garbage collection running inside the JVM.

As an example – lets take small maintenance tasks, which can be handled without much interruptions. Examples could involve adding oil to the machinery or picking up excess trash from the floor next to the molding equipment. Those operations are similar to minor GC’s within the JVM – it is maintenance you have to deal with, but the implementation is so clever that the performance of the system is not affected.

But in the very same factory mr. Tim Cook is going to face long-lasting maintenance tasks as well. Those tasks involve stopping the whole production line and are equivalent to the Full GC runs, where the JVM needs to stop servicing the threads in order to do some important housekeeping tasks.

Now, lets assume that after months of uninterrupted service, our hypothetical factory line gets jammed and the tech team takes four hours to resolve the issue. During this period the line is stopped. How do we measure the effect? As always, the impact can be measured by two different means:

  • Impact on throughput. The four-hour stop means we have 14,400 seconds during which no iPads are completed. Throughput-wise it means we have reduced the system’s capacity in this particular day from 86,400 to 72,000. Which means approximately ~16.5% loss in throughput.
  • Impact on latency. Now, if we took an iPad which was still on the line when the interruption occurred, it took not four but eight hours to complete. This represents a 100% increase in worst-case latency.

If you recall then mr. Cook did not care about latency. What was important for him was the overall throughput during a longer period, so mr. Cook would decide to optimize his processes in a way that the impact on throughput would be minimized.

Similar decisions need to be made in software development as well. If you have a Java EE application responsible for order processing, then a GC pause spanning four seconds would definitely reduce the throughput of your system. But for most of us it will not be a major issue. On the other hand, the users who were trying to accomplish things during the four-second stop-the-world-i-have-cleaning-to-do pause would get a perception that our systems are sluggish. And operating a service which is perceived by users as sluggish is a darn good way to go out of business.

The morale of the story? Pick your goals wisely and make sure you do not confuse throughput with latency. Then make sure you understand how GC can affect either of those by monitoring your GC logs, looking for unexpected Full GCs and tuning your application and/or GC to minimize their impact.

If you managed to reach this section here, then I might have had an interesting story to tell. Check out our older posts and consider following Plumbr in Twitter.

Can't figure out what causes your OutOfMemoryError? Read more

ADD COMMENT

Can't figure out what causes your OutOfMemoryError? Read more

Latest
Recommended
You cannot predict the way you die
When debugging a situation where systems are failing due to the lack of resources, you can no longer count on anything. Seemingly unrelated changes can trigger completely different messages and control flows within the JVM. Read more
Tuning GC - it does not have to be that hard
Solving GC pauses is a complex task. If you do not believe our words, check out the recent LinkedIn experience in garbage collection optimization. It is a complex and tedious task, so we are glad to report we have a whole lot simpler solution in mind Read more
Building a nirvana
We have invested a lot into our continuous integration / delivery infrastructure. As of now we can say that the Jenkins-orchestrated gang consisting of Ansible, Vagrant, Gradle, LiveRebel and TestNG is something an engineer can call a nirvana. Read more
Creative way to handle OutOfMemoryError
Wish to spend a day troubleshooting? Or make enemies among sysops? Registering pkill java to OutOfMemoryError events is one darn good way to achieve those goals. Read more