
Tuning GC – it does not have to be that hard

April 10, 2014 by Ivo Mägi Filed under: Garbage Collection

We really do not like complexity. Memory leaks, thread locks and GC tuning have historically been a pain to deal with. Performance issues caused by those three evil guys have been the toughest to reproduce, which in turn makes fixing such issues a nightmare. If you do not believe us, check out a great post about LinkedIn’s recent experience with GC tuning.

Even though it offers amazingly good insight into a performance tuning job, the post also serves as an excellent example of the complexity of the domain. LinkedIn engineers used the following set of options when tweaking the GC to improve throughput and latency:

-server -Xms40g -Xmx40g -XX:MaxDirectMemorySize=4096m -XX:PermSize=256m -XX:MaxPermSize=256m 
-XX:NewSize=6g -XX:MaxNewSize=6g -XX:+UseParNewGC -XX:MaxTenuringThreshold=2 
-XX:SurvivorRatio=8 -XX:+UnlockDiagnosticVMOptions -XX:ParGCCardsPerStrideChunk=32768 
-XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:+ParallelRefProcEnabled 
-XX:+CMSClassUnloadingEnabled -XX:CMSInitiatingOccupancyFraction=80 -XX:+UseCMSInitiatingOccupancyOnly 
-XX:+AlwaysPreTouch -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps 
-XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime -XX:-OmitStackTraceInFastThrow

Raise your hand if you know what those parameters do. Or even half of them. I mean, really – to reach the configuration finally satisfying the throughput and latency criteria for LinkedIn, the process must have been something similar to:

  • Gathering insight into the current situation. Before tuning anything, you need to understand what the underlying problem is. In this particular case, it seems to have boiled down to long and frequent GC pauses, but it could just as easily have been locks or leaks. Or anything else for that matter. But let’s assume that by sheer luck you picked the right area to optimize.
  • Next, you need to know how to gather data about GC pauses. Say hello to the first set of configuration parameters (-XX:+PrintGCDetails -XX:+PrintGCTimeStamps), which make the JVM log each collection along with its duration.
  • Now you need to be able to interpret this data. I mean, without prior experience, going through hundreds of pages of the following might not be too encouraging:
0.167: [Full GC [PSYoungGen: 3071K->0K(3584K)] [ParOldGen: 8191K->227K(7168K)] 11263K->227K(10752K) [PSPermGen: 2544K->2544K(21504K)], 0.0064320 secs] [Times: user=0.00 sys=0.00, real=0.01 secs]
0.173: [GC [PSYoungGen: 0K->0K(3584K)] 227K->227K(11776K), 0.0004670 secs] [Times: user=0.01 sys=0.00, real=0.00 secs]
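
To give a feel for what "interpreting" means in practice, here is a minimal sketch that pulls the pause duration out of lines shaped like the two above. It assumes exactly that line format, which varies between collectors and JVM versions, and the class and method names (GcLogPauses, parse, totalPauseSeconds) are made up for this illustration rather than taken from any tool.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GcLogPauses {

    /** One parsed collection event from a -XX:+PrintGCDetails style line. */
    public static final class Event {
        public final double uptimeSeconds; // JVM uptime when the collection started
        public final boolean fullGc;       // true for "Full GC", false for a young "GC"
        public final double pauseSeconds;  // the "<n> secs" figure reported for the event

        Event(double uptimeSeconds, boolean fullGc, double pauseSeconds) {
            this.uptimeSeconds = uptimeSeconds;
            this.fullGc = fullGc;
            this.pauseSeconds = pauseSeconds;
        }
    }

    // Assumed layout: "<uptime>: [GC ...|Full GC ..., <pause> secs] [Times: ...]".
    private static final Pattern LINE = Pattern.compile(
            "^(\\d+\\.\\d+): \\[(Full GC|GC) .*, (\\d+\\.\\d+) secs\\].*$");

    /** Parses one log line; returns empty for anything that does not match the assumed layout. */
    public static Optional<Event> parse(String line) {
        Matcher m = LINE.matcher(line.trim());
        if (!m.matches()) {
            return Optional.empty();
        }
        return Optional.of(new Event(
                Double.parseDouble(m.group(1)),
                m.group(2).equals("Full GC"),
                Double.parseDouble(m.group(3))));
    }

    /** Sums the reported pause durations of all recognizable lines. */
    public static double totalPauseSeconds(List<String> lines) {
        double total = 0.0;
        for (String line : lines) {
            Optional<Event> event = parse(line);
            if (event.isPresent()) {
                total += event.get().pauseSeconds;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList(
                "0.167: [Full GC [PSYoungGen: 3071K->0K(3584K)] [ParOldGen: 8191K->227K(7168K)] 11263K->227K(10752K) [PSPermGen: 2544K->2544K(21504K)], 0.0064320 secs] [Times: user=0.00 sys=0.00, real=0.01 secs]",
                "0.173: [GC [PSYoungGen: 0K->0K(3584K)] 227K->227K(11776K), 0.0004670 secs] [Times: user=0.01 sys=0.00, real=0.00 secs]");
        System.out.println("Total pause seconds: " + totalPauseSeconds(log));
    }
}
```

Even this toy version only answers "how long did we pause in total". Telling a harmless young collection from a worrying full GC pattern still takes the experience the list above is talking about.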

But again, let’s assume that you had a magic wand to highlight the problems in this log. Now you must know the available garbage collectors well enough to decide which one to use where (-XX:+UseParNewGC for the young generation vs -XX:+UseConcMarkSweepGC for the old generation) and understand the difference between eden and survivor spaces to size them appropriately (-XX:NewSize=6g -XX:MaxNewSize=6g -XX:MaxTenuringThreshold=2 -XX:SurvivorRatio=8).
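
To make the sizing flags a little less abstract, here is a rough sketch of the arithmetic behind -XX:SurvivorRatio. The ratio is the size of eden relative to one survivor space, and the young generation holds eden plus two survivor spaces. The class and method names are ours, invented for illustration, and the real JVM rounds these sizes to its own alignment, so treat the results as approximations.

```java
public class YoungGenSizing {

    /**
     * Approximate size of ONE survivor space.
     * young = eden + 2 * survivor and eden = survivorRatio * survivor,
     * hence survivor = young / (survivorRatio + 2).
     */
    public static long survivorBytes(long youngGenBytes, int survivorRatio) {
        return youngGenBytes / (survivorRatio + 2);
    }

    /** Approximate eden size: whatever is left after the two survivor spaces. */
    public static long edenBytes(long youngGenBytes, int survivorRatio) {
        return youngGenBytes - 2 * survivorBytes(youngGenBytes, survivorRatio);
    }

    public static void main(String[] args) {
        long young = 6L * 1024 * 1024 * 1024; // -XX:NewSize=6g -XX:MaxNewSize=6g
        int ratio = 8;                        // -XX:SurvivorRatio=8

        double mib = 1024.0 * 1024.0;
        System.out.printf("eden     ~ %.1f MiB%n", edenBytes(young, ratio) / mib);
        System.out.printf("survivor ~ %.1f MiB each%n", survivorBytes(young, ratio) / mib);
    }
}
```

With the values from the configuration above, roughly 80% of the 6g young generation ends up as eden (about 4.8g) and each of the two survivor spaces gets about 10% (about 0.6g).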


And this is only the beginning – if you go through the original post, you will see that the final configuration included tuning the GC worker threads, reducing heap fragmentation and even dealing with the OS swapping pages out. I mean, really, raise your hand if you have heard of the -XX:ParGCCardsPerStrideChunk configuration option.

Now, this post is not just about whining about unnecessary complexity. If the above feels like a mess, you might be interested in what we have been brewing in our labs for a while and are now happy to hand out to a wider user base. During the past month, select customers have had access to a Plumbr version which analyzes GC behaviour and recommends a better-suited configuration, tailored to your application.

If you wish to get rid of your long GC pauses or want to make sure you do not have problems with throughput or latency – leave us your email and we will gladly make the beta functionality of GC optimization available to you.


Comments

G1 in this case with no options other than xmx and xms is superior to either of your examples

9.5 per second and less than 1 second of accumulated GC time.

Also, that LinkedIn article you pointed out has some errors in it, but they don’t allow comments on their blog so it’s difficult to point them out. (They gave G1 horrible options, and abandoned it too soon in favor of CMS.)

Ryan Gardner

Use xmx and xms 4g and -XX:+UseG1GC (tested with both 1.7_60 and 1.8_20)

Lower latency and higher throughput, and no crazy tuning was needed.

Ryan Gardner

Woops. Wrong post.

Ryan Gardner