
GC tuning in practice

June 11, 2015 by Nikita Salnikov-Tarnovski Filed under: Garbage Collection

Tuning Garbage Collection is no different from any other performance-tuning activity.

Instead of giving in to the temptation to tweak random parts of the application, you need to make sure you understand the current situation and the desired outcome. In general, it is as easy as following this process:

  1. State your performance goals
  2. Run tests
  3. Measure
  4. Compare to goals
  5. Make a change and go back to running tests

It is important that the goals can be set and measured in three dimensions, all relevant to performance tuning: latency, throughput and capacity. To understand them, I recommend taking a look at the corresponding chapter in the Garbage Collection Handbook.

Let's see what setting and hitting such goals looks like in practice. For this purpose, let's take a look at some example code:

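import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class Producer implements Runnable {

  private static ScheduledExecutorService executorService = Executors.newScheduledThreadPool(2);

  private Deque<byte[]> deque;
  private int objectSize;
  private int queueSize;

  public Producer(int objectSize, int ttl) {
    this.deque = new ArrayDeque<byte[]>();
    this.objectSize = objectSize;
    // 1,000 objects are added per second, so each one is kept for about ttl seconds
    this.queueSize = ttl * 1000;
  }

  @Override
  public void run() {
    // Allocate 100 objects per invocation; once the deque is full, drop the oldest one
    for (int i = 0; i < 100; i++) {
      deque.add(new byte[objectSize]);
      if (deque.size() > queueSize) {
        deque.poll();
      }
    }
  }

  public static void main(String[] args) throws InterruptedException {
    // Two jobs with different object sizes and lifespans, each triggered every 100 ms
    executorService.scheduleAtFixedRate(new Producer(200 * 1024 * 1024 / 1000, 5), 0, 100, TimeUnit.MILLISECONDS);
    executorService.scheduleAtFixedRate(new Producer(50 * 1024 * 1024 / 1000, 120), 0, 100, TimeUnit.MILLISECONDS);
    // Let the workload run for 10 minutes, then stop the jobs
    TimeUnit.MINUTES.sleep(10);
    executorService.shutdownNow();
  }
}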
The code submits two jobs, each of which runs every 100 ms. Each job emulates objects with a specific lifespan: it creates objects, lets them live for a predetermined amount of time and then forgets about them, allowing GC to reclaim the memory.
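To get a feel for the load these two jobs generate, the constructor arguments can be turned into rough allocation and retention figures. The sketch below is a back-of-the-envelope calculation, not a measurement: the class name ProducerMath and its methods are made up for illustration, array headers are ignored, and it assumes every run finishes within its 100 ms period.

```java
import java.util.Locale;

// Back-of-the-envelope figures for the two Producer jobs; a sketch, not a measurement.
final class ProducerMath {

  static final int OBJECTS_PER_RUN = 100; // loop bound in Producer.run()
  static final int PERIOD_MS = 100;       // scheduling period used in main()

  // Objects allocated per second by a single job.
  static long objectsPerSecond() {
    return OBJECTS_PER_RUN * (1000L / PERIOD_MS);
  }

  // Payload bytes allocated per second (array headers are ignored).
  static long bytesPerSecond(int objectSize) {
    return objectsPerSecond() * objectSize;
  }

  // Payload bytes held once the deque is full (queueSize = ttl * 1000 objects).
  static long retainedBytes(int objectSize, int ttl) {
    return ttl * 1000L * objectSize;
  }

  public static void main(String[] args) {
    // Same constructor arguments as in Producer.main(): {objectSize, ttl}
    int[][] jobs = { {200 * 1024 * 1024 / 1000, 5}, {50 * 1024 * 1024 / 1000, 120} };
    double mb = 1024.0 * 1024.0;
    for (int[] job : jobs) {
      System.out.println(String.format(Locale.ROOT,
          "objectSize=%d B, ttl=%d s: %.1f MB/s allocated, %.1f MB retained",
          job[0], job[1], bytesPerSecond(job[0]) / mb, retainedBytes(job[0], job[1]) / mb));
    }
  }
}
```

Under these assumptions the first job allocates roughly 200 MB per second and keeps about 1 GB alive, while the second allocates roughly 50 MB per second and keeps about 6 GB alive once its queue has filled up after two minutes.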

When running the example with GC logging turned on with the following parameters

-XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps

we start seeing the impact of GC immediately in the log files, similar to the following:

2015-06-04T13:34:16.119-0200: 1.723: [GC (Allocation Failure) [PSYoungGen: 114016K->73191K(234496K)] 421540K->421269K(745984K), 0.0858176 secs] [Times: user=0.04 sys=0.06, real=0.09 secs] 
2015-06-04T13:34:16.738-0200: 2.342: [GC (Allocation Failure) [PSYoungGen: 234462K->93677K(254976K)] 582540K->593275K(766464K), 0.2357086 secs] [Times: user=0.11 sys=0.14, real=0.24 secs] 
2015-06-04T13:34:16.974-0200: 2.578: [Full GC (Ergonomics) [PSYoungGen: 93677K->70109K(254976K)] [ParOldGen: 499597K->511230K(761856K)] 593275K->581339K(1016832K), [Metaspace: 2936K->2936K(1056768K)], 0.0713174 secs] [Times: user=0.21 sys=0.02, real=0.07 secs] 
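Reading pause durations off such lines by hand gets tedious quickly. As a minimal sketch, assuming the log format shown above (Parallel GC output with these logging flags), the pause of each collection can be pulled out with a regular expression. The class GcPauseParser and its methods are hypothetical helpers, and other collectors or JVM versions format their logs differently.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Extracts the pause duration from log lines in the format shown above (a sketch).
final class GcPauseParser {

  // The collection's duration is the "<seconds> secs]" value that follows ", ";
  // the "real=... secs]" value in the [Times: ...] block does not match.
  private static final Pattern PAUSE = Pattern.compile(", (\\d+\\.\\d+) secs\\]");

  static double pauseSeconds(String logLine) {
    Matcher m = PAUSE.matcher(logLine);
    if (!m.find()) {
      throw new IllegalArgumentException("No pause duration found in: " + logLine);
    }
    return Double.parseDouble(m.group(1));
  }

  static boolean isFullGc(String logLine) {
    return logLine.contains("[Full GC");
  }

  public static void main(String[] args) {
    // The first and last entries of the log excerpt above
    String[] lines = {
      "2015-06-04T13:34:16.119-0200: 1.723: [GC (Allocation Failure) [PSYoungGen: 114016K->73191K(234496K)] 421540K->421269K(745984K), 0.0858176 secs] [Times: user=0.04 sys=0.06, real=0.09 secs]",
      "2015-06-04T13:34:16.974-0200: 2.578: [Full GC (Ergonomics) [PSYoungGen: 93677K->70109K(254976K)] [ParOldGen: 499597K->511230K(761856K)] 593275K->581339K(1016832K), [Metaspace: 2936K->2936K(1056768K)], 0.0713174 secs] [Times: user=0.21 sys=0.02, real=0.07 secs]"
    };
    for (String line : lines) {
      System.out.println((isFullGc(line) ? "full" : "minor") + " GC, pause " + pauseSeconds(line) + " s");
    }
  }
}
```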

Based on the information in the log, we can start improving the situation with three different goals in mind:

  1. Making sure the worst-case GC pause does not exceed a predetermined threshold
  2. Making sure the total time during which application threads are stopped does not exceed a predetermined threshold
  3. Reducing infrastructure costs while making sure we can still achieve reasonable latency and/or throughput targets.
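The first two goals can be checked mechanically once the pause durations are known. The following sketch assumes that "useful work" means the share of wall-clock time not spent in GC pauses; the class GcGoals, the 3-second observation window and both thresholds are invented for illustration, while the pause values are the three durations from the log above rounded to milliseconds. The third goal, capacity, is a matter of repeating the measurement on cheaper configurations and is not covered here.

```java
// Checks a latency goal and a throughput goal against observed GC pauses (a sketch).
final class GcGoals {

  // Worst-case pause, the figure the latency goal is about.
  static long longestPauseMs(long[] pausesMs) {
    long longest = 0;
    for (long pause : pausesMs) {
      longest = Math.max(longest, pause);
    }
    return longest;
  }

  // Total time application threads were stopped.
  static long totalPauseMs(long[] pausesMs) {
    long total = 0;
    for (long pause : pausesMs) {
      total += pause;
    }
    return total;
  }

  // Assumed definition: percentage of the run not spent in GC pauses.
  static double usefulWorkPercent(long[] pausesMs, long runMs) {
    return 100.0 * (runMs - totalPauseMs(pausesMs)) / runMs;
  }

  public static void main(String[] args) {
    long[] pausesMs = {86, 236, 71}; // pauses from the log excerpt, rounded to ms
    long runMs = 3_000;              // hypothetical observation window
    long maxPauseGoalMs = 200;       // hypothetical latency goal
    double usefulWorkGoal = 90.0;    // hypothetical throughput goal, in percent

    System.out.println("Latency goal met: " + (longestPauseMs(pausesMs) <= maxPauseGoalMs));
    System.out.println("Throughput goal met: " + (usefulWorkPercent(pausesMs, runMs) >= usefulWorkGoal));
  }
}
```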

For this, the code above was run for 10 minutes on three different configurations, with three very different outcomes summarized in the following table:

Heap      GC Algorithm              Useful work   Longest pause
-Xmx12g   -XX:+UseConcMarkSweepGC   89.8%         560 ms
-Xmx12g   -XX:+UseParallelGC        91.5%         1,104 ms
-Xmx8g    -XX:+UseConcMarkSweepGC   66.3%         1,610 ms

The experiment ran the same code with different GC algorithms and different heap sizes to measure the duration of garbage collection pauses with regard to latency and throughput. Details of the experiments and an interpretation of the results are presented in our Garbage Collection Handbook. Take a look at the handbook for examples of how simple changes in configuration make the example behave completely differently with regard to latency, throughput or capacity.

Note that in order to keep the example as simple as possible, only a limited number of input parameters were changed. For example, the experiments do not test different numbers of cores or a different heap layout.