Increasing heap size – beware of the Cobra Effect

October 17, 2012 by Nikita Salnikov-Tarnovski

The term ‘Cobra effect’ stems from an anecdote set at the time of British rule of colonial India. The British government was concerned about the number of venomous cobra snakes. The Government therefore offered a reward for every dead snake. Initially this was a successful strategy as large numbers of snakes were killed for the reward. Eventually however Indians began to breed cobras for the income.

When this was realized the reward was canceled, but the cobra breeders set the snakes free and the wild cobras consequently multiplied. The apparent solution for the problem made the situation even worse.

So how is Java heap size related with Colonial India and poisonous snakes? Bear with me and I’ll guide you through the analogy using a story from a real life as a reference.

You have created an amazing application. So amazing that it becomes truly popular and the sheer amount of traffic to your new service starts to push your application to its knees. Digging through the performance metrics you decide that the amount of heap available for your application will soon become a bottleneck.

So you take the time to launch new infrastructure with six times the original heap. You test your application to verify that it works. You then launch it on the new infrastructure. And immediately complaints start flowing in – your application has become less responsive than with your original tiny 2GB heap. Some of your users face delays in length of minutes when waiting for your application to respond. What has just happened?

There can be numerous reasons of course. But let’s focus on the most likely suspect – heap size change. This has several possible side effects like extended caching warmup times, problems with fragmentation, etc. But from the symptoms experienced you are most likely facing latency problems in your application during full GC runs.

What this means is – as Java is a garbage collected language – your heap used is regularly being garbage collected by JVM internal processes. And as one might expect – if you have a larger room to clean then it tends to take more time for the janitor to clean the room. The very same applies to cleaning unused objects from memory.

When running applications on small heaps (below 4GB) you often do not need to think about GC internals. But when increasing heap sizes to tens of gigabytes you should definitely be aware of the potential stop-the-world pauses induced by the full GC. The very same pauses did also exist with small heap sizes, but their length was significantly shorter – your pauses that now last for more than a minute might have originally spanned only a few hundred milliseconds.

So what can you do in cases when you really need more heap for your application?

  • The first option would be to consider scaling horizontally instead of vertically. What this means for our current case is – if your application is either stateless or easily partitionable then just add more small nodes and balance the load between them. In this case you could stick with 32bit-architectures which also imposes smaller memory footprint.
  • If horizontal scaling is not possible then you should focus on your GC configuration. If latency is what you are after, then you should forget about the throughput oriented stop-the-world GCs and start looking for alternatives. Which you will soon find to be limited to Concurrent Mark and Sweep (CMS) or Garbage-First (G1) collectors. The saddest news being that your best choice between those two collector types and other heap configuration parameters can only be found by experimenting. So do not make choices just by reading something, go out there and try it out with your actual production load.

But be aware of their limitations as well – both of those collectors pose throughput overhead on your application – especially G1 tends to show worse throughput numbers than the stop-the-world alternatives. And when the CMS garbage collector is not fast enough to finish operation before the tenured generation is full, it falls back to the standard stop-the-world GC. So you can still face 30 or more second pauses for heaps of size 16 GB and beyond.

  • If you cannot scale horizontally or fail to achieve the required latency results on garbage collectors shipping with Oracle’s JVM, then you might also look into Zing JVM built by Azul Systems. One of the features making Zing to stand out is the pauseless garbage collector (C4), which might be exactly what you are looking for. Full disclosure though – we haven’t yet tried C4 in practice. But it does sound cool.
  • Your last option is something for the true hardcore guys out there. You can allocate memory outside the heap. Those allocations obviously aren’t visible to the garbage collector and thus will not be collected. It might sound scary, but already from Java 1.4 we have access to the java.nio.ByteBuffer class which provides us a method allocateDirect() for off-heap memory allocations. This allows us to create very large data structures without bumping into multi-second GC pauses. This solution is not too uncommon – many BigMemory implementations are using ByteBuffers under the hood. Terracotta BigMemory and Apache DirectMemory for example.

To conclude – even when making changes backed with good intentions, be aware of both the alternatives and the consequences. Just as the Government of India back in the days publishing rewards for dead cobras.

[Shameless ad] – Wish to go to Devoxx this year? Do not have a ticket? Get one from us – check the conditions from this post.

Can't figure out what causes your OutOfMemoryError? Read more

ADD COMMENT

COMMENTS

typo: “push your application to itu2019s knees” — should be possessive “its” (no apostrophe)

SI Hayakawa

Thanks for notifying, fixed!

Ivo Mägi

I think your Apache Directmemory link also takes me to BigMemory.

gregorsam

Your link for Apache DirectMemory links to Terracotta instead

Foo

Thanks for pointing it out, fixed!

Ivo Mägi

Hi,nvery nice article! I don’t understand how off-heap allocations would be beneficial on 32-bit architecture. There is still a per-process limitation on OS level. Am I right?nnI would say that off-heap allocations have more usage when running on 64 bits architecture – you can access & use insane amount of memory without paying a performance penalty due GC overhead.

Jaromir Hamala

Thank’s for the comment. You are right, I will correct the text.

iNikem

I can vouch for every word in this article. We ran an application that went through the exact growth outlined and had the exact same experiences. I would add that we had a Sun consultant actually state to us “Don’t run your application above 12GB” which kind of shocked us, but it was true. Scaling horizontally was far better for performance then increasing heap. We also looked into Azul and Terracotta as alternatives, but never had to go down those paths.nnCaveat emptor! Java GC tuning is a complex dark art. After 6 months of production monitoring GC activity, we found our optimal GC settings using about 15 parameters, and even then, there were pauses that “shouldn’t have happened” (i.e. logging did not reveal the source/reason of the pause). G1 was supposed to be the golden bullet, but it seems to solve some of the CMS problems at the cost of performance.

Alan Smithee

Always glad to hear if we have managed to publish something that can be verified with the real world experience!

Ivo Mägi

Can't figure out what causes your OutOfMemoryError? Read more

Latest
Recommended
You cannot predict the way you die
When debugging a situation where systems are failing due to the lack of resources, you can no longer count on anything. Seemingly unrelated changes can trigger completely different messages and control flows within the JVM. Read more
Tuning GC - it does not have to be that hard
Solving GC pauses is a complex task. If you do not believe our words, check out the recent LinkedIn experience in garbage collection optimization. It is a complex and tedious task, so we are glad to report we have a whole lot simpler solution in mind Read more
Building a nirvana
We have invested a lot into our continuous integration / delivery infrastructure. As of now we can say that the Jenkins-orchestrated gang consisting of Ansible, Vagrant, Gradle, LiveRebel and TestNG is something an engineer can call a nirvana. Read more
Creative way to handle OutOfMemoryError
Wish to spend a day troubleshooting? Or make enemies among sysops? Registering pkill java to OutOfMemoryError events is one darn good way to achieve those goals. Read more