Slow death – the cause and the remedy

October 9, 2012 by Vladimir Šor

Another day in the office. Except that you have just realized that the application at your guard seems to take forever to respond to requests. Operations which usually take just few hundred milliseconds to complete are now slower than snails on Xanax. This weird situation gradually worsens throughout the next hours.

You manage to discover that the CPU usage is unusually high. Before you have time to dig further you have to surrender to the boss breathing down your neck. You restart the application on his demand. Only to discover the Xanax guys to return within half an hour. This time you decide to stand your ground and find out the cause.

Monitoring tools show high CPU usage and quick growth in memory consumption. In conjunction with quickly decreasing throughput. You plot the data on a graph and see the following trend taking shape:

Slow Death

It’s quite apparent that your application has not achieved much after the first 20 minutes of it’s lifetime. For the last 2 hours and 10 minutes it has only served ~2,000 requests as opposed to the 40,000 it managed to crunch during the first 20 minutes. 130x loss in throughput. Something must have gone horribly wrong.

You also see that at the very same time when the throughput falls the application heap usage has reached close to the maximum heap size allowed by the -Xmx512m option given at the startup script. And from the same moment the variance in heap usage starts to diminish.

Third symptom is not visible from the graph, but checking from your CPU monitoring scripts you also see that all cores become completely utilized right about the same time when the throughput falls and memory usage first reaches close to the maximum available memory. The process utilizing the cores is your Java application. But for some reason it seems to spend the time in system or kernel space.

Those symptoms altogether give us a pretty clear picture about what happened with our Xanax patient:

  • First, your application is started and the users start to drive traffic. For the first 20 minutes application throughput is just fine.
  • Then something happens – memory available for your Java process runs out.
  • This triggers something that is not completely clear – somehow it the CPU usage is also affected and all your available cores become completely utilised.
  • After this, memory usage still seems to grow, but the superlinear growth in used memory has been replaced by sublinear.
  • And the throughput is 130x worse. Great.

Now, if you just could figure out what your expensive multi-core processors are doing when the available memory is close to exhaustion you might just have found cure for the disease slowly killing your application.

Here now comes the first truly useful hint in this blog post: if you face a situation with aforementioned symptoms, most likely your application is busy in the garbage collection cycle. To verify this assumption, you need to switch on garbage collection logs. You add

-XX:-PrintGCDetails -XX:-PrintGCTimeStamps -Xloggc:gc.log

to the startup scripts. After enabling it, you can see output similar to the following in the gc.log file:


194.145: [GC 83769K(96192K), 0.0086283 secs]
194.217: [GC 86494K(96192K), 0.0071360 secs]
194.790: [Full GC 96152K->79469K(96192K), 0.1673717 secs]

Now you can see the log file growing as the users continue bombarding the application. You let them continue until you see frequency of the Full GC increasing. Now its time to dig into the log files to verify whether we have been too busy in collecting garbage. To do this we use a tool called GC Viewer to visualize and summarize the data in the logs. Following screenshot illustrates the results from the log (the test was ran on a different environment than the first, thus the difference in heap size and timings):

GC Viewer

There is definitely something interesting going on. The red area represents the available heap size. Blue chart stands for used memory. Green and black areas stand for time spent on minor GCs and full GCs respectively. From this we can conclude that the application exhausted most of the available memory approximately in three and a half minutes after the launch and then started spending time in GC cycles.

Another interesting data is hidden in the pause times. Accumulated pauses are 77.24 seconds in total, out of which full GC pauses took 97.6%. Considering the GC kicked in from the third minute your application spent more than 50% time trying to get rid of garbage after that. It should raise a red flag – if your application is spending so much time in GC it is definitely a hint that either you have not allocated enough memory to your application, or you have a memory leak in your application.

In fact, the slow death resulting in poor throughput of the applications has been large enough problem that JVM vendors have even built in checks for failing faster – for example  parallel / concurrent collector will throw an OutOfMemoryError with explanation “GC Overhead limit exceeded” if too much time is being spent in garbage collection: if more than 98% of the total time is spent in garbage collection and less than 2% of the heap is recovered. This feature is designed to prevent applications from running for an extended period of time while making little or no progress because the heap is too small. If necessary, this feature can be disabled by adding the option -XX:-UseGCOverheadLimit to the command line.

To summarize the story – if you see your application throughput suddenly declining and discover that the CPU usage peaks and memory is exhausted, then the most likely you are spending too much time in garbage collection. To verify this we have given you the hints to follow in your troubleshooting quests.

After this you can start finding the root cause of why all your precious multicores are spending their time in collecting garbage. Whether its just that you truly need to add more heap or you have a memory leak to solve. With the tool of your choice – either you are preferring an APM solution, memory profiler or you are one of the happy users of Plumbr.

Enjoyed the post? We have a lot more under our belt. Subscribe to either our RSS feed or Twitter stream and enjoy.

Can't figure out what causes your OutOfMemoryError? Read more

ADD COMMENT

Can't figure out what causes your OutOfMemoryError? Read more

Latest
Recommended
You cannot predict the way you die
When debugging a situation where systems are failing due to the lack of resources, you can no longer count on anything. Seemingly unrelated changes can trigger completely different messages and control flows within the JVM. Read more
Tuning GC - it does not have to be that hard
Solving GC pauses is a complex task. If you do not believe our words, check out the recent LinkedIn experience in garbage collection optimization. It is a complex and tedious task, so we are glad to report we have a whole lot simpler solution in mind Read more
Building a nirvana
We have invested a lot into our continuous integration / delivery infrastructure. As of now we can say that the Jenkins-orchestrated gang consisting of Ansible, Vagrant, Gradle, LiveRebel and TestNG is something an engineer can call a nirvana. Read more
Creative way to handle OutOfMemoryError
Wish to spend a day troubleshooting? Or make enemies among sysops? Registering pkill java to OutOfMemoryError events is one darn good way to achieve those goals. Read more