Two leak discoveries in one case study

August 27, 2013 by Ivo Mägi

This is a modified guest post published with the permission of the original author, Pierre-Hugues Charbonneau. The original post, which describes the case in more detail, is available here.

We have slightly modified the content to highlight some aspects we find extremely valuable in the case study Pierre-Hugues conducted.

But enough of the introduction, let's start with the case itself. To give Plumbr a challenge, Pierre-Hugues engineered the following memory leak:

A JAX-RS (REST) Web Service was created and exposed via the /jvmleak URI, using the attributes below.

@GET
@Path("/jvmleak")
@Produces(MediaType.APPLICATION_JSON)
public Integer jvmLeak() {}

Each invocation of /jvmleak performs the following logic:

  1. Allocate a large number of short-lived objects (no reference).
  2. Allocate a small number of long-lived objects (normal or hard references) in a static ConcurrentHashMap data structure.
  3. Return the current count of the “leaking” ConcurrentHashMap.

He also created 3 extra Java classes:

  • JVMMemoryAllocator. This class is responsible for performing the short-lived and long-lived memory allocations.
  • ShortLivedObj. This class represents a short-lived object with no reference.
  • LongLivedObj. This class represents a long-lived object with hard references.
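
The three classes and the allocation logic described above could look roughly like the following. This is a minimal sketch, not Pierre-Hugues's actual code: the payload sizes, the per-call allocation counts and the method name allocate() are assumptions made for illustration, and the JAX-RS resource itself is left out so the example runs on a plain JVM.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/** Short-lived object: nobody keeps a reference, so it is eligible for collection at once. */
class ShortLivedObj {
    private final byte[] payload = new byte[1024]; // assumed size
}

/** Long-lived object: stays reachable through a hard reference from a static map. */
class LongLivedObj {
    private final byte[] payload = new byte[1024]; // assumed size
}

class JVMMemoryAllocator {
    // The "leaking" container: entries are added on every call and never removed.
    private static final ConcurrentHashMap<Long, LongLivedObj> LEAK = new ConcurrentHashMap<>();
    private static final AtomicLong NEXT_KEY = new AtomicLong();

    static final int SHORT_LIVED_PER_CALL = 1_000; // assumed value
    static final int LONG_LIVED_PER_CALL = 5;      // assumed value

    /** Performs one invocation's worth of allocations and returns the current map size. */
    static int allocate() {
        // Step 1: many short-lived objects, no reference kept.
        for (int i = 0; i < SHORT_LIVED_PER_CALL; i++) {
            new ShortLivedObj();
        }
        // Step 2: a few long-lived objects, retained by the static map.
        for (int i = 0; i < LONG_LIVED_PER_CALL; i++) {
            LEAK.put(NEXT_KEY.getAndIncrement(), new LongLivedObj());
        }
        // Step 3: report how large the leaking map has become.
        return LEAK.size();
    }
}
```

In such a setup the jvmLeak() resource method would simply return the result of JVMMemoryAllocator.allocate(), so every request leaves a few more LongLivedObj instances behind.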

This case is a very good illustration of how Plumbr discovers a leak for you. To understand it, let me dig a bit under the hood and explain how the Plumbr internals work.

When attached to your application, Plumbr starts monitoring all your objects: it intercepts all object creations and garbage collections and keeps track of how many garbage collection cycles the instances of a particular class survive. When we have gathered enough data, we start looking for anomalies.
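
The bookkeeping described above can be pictured with a toy model. The sketch below is not Plumbr's implementation: the class SurvivalTracker and all of its methods are hypothetical, and the creation, collection and GC-cycle events are fed in by hand instead of being intercepted inside the JVM. It only shows the kind of per-class data (live instances and cycles survived) that such monitoring would gather.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model: records which tracked objects are alive and how many GC cycles each survived. */
class SurvivalTracker {
    private static final class Tracked {
        final String className;
        int cyclesSurvived;
        Tracked(String className) { this.className = className; }
    }

    private final Map<Long, Tracked> live = new HashMap<>();
    private long nextId;

    /** Called when an object is created; returns a handle for later events. */
    long created(String className) {
        long id = nextId++;
        live.put(id, new Tracked(className));
        return id;
    }

    /** Called when an object has been garbage collected. */
    void collected(long id) {
        live.remove(id);
    }

    /** Called at the end of a GC cycle: everything still alive has survived once more. */
    void gcCycleFinished() {
        for (Tracked t : live.values()) {
            t.cyclesSurvived++;
        }
    }

    /** Number of live instances of the given class. */
    int liveCount(String className) {
        int n = 0;
        for (Tracked t : live.values()) {
            if (t.className.equals(className)) n++;
        }
        return n;
    }

    /** Largest number of cycles survived by any live instance of the given class. */
    int maxCyclesSurvived(String className) {
        int max = 0;
        for (Tracked t : live.values()) {
            if (t.className.equals(className)) max = Math.max(max, t.cyclesSurvived);
        }
        return max;
    }
}
```

If each simulated round creates one ShortLivedObj that is collected before the cycle ends and one LongLivedObj that is never collected, the live count and the survival age of LongLivedObj keep growing while ShortLivedObj stays at zero. That contrast is the raw material for spotting an anomaly.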

What counts as an anomaly is application-specific. For example, applications dealing with long-lived business processes can have perfectly normal objects which live for days or even weeks. On the other hand, Plumbr might be attached to a highly transactional application where even a few hundred milliseconds is a suspiciously long lifetime. The distinction between normal and abnormal is based on machine learning algorithms trained on millions of memory snapshots from thousands of different applications.

Now, looking back at the case study, we see that Pierre-Hugues created both leaking and non-leaking objects. This gives Plumbr something to base its decision upon. If you create a synthetic test case that is too small and just fills one collection with leaking objects, Plumbr does not stand a chance: all the objects are created once and live forever, which seems normal for that particular application.

So when trying out Plumbr, you are best off attaching it to a real application or making sure your artificial test case exhibits a variety of object behaviour. But let's look at the steps Pierre-Hugues took next:

Now that the Plumbr agent is connected, it is now time to fire our memory leak simulator. In order to simulate an interesting memory leak, we will use the following load testing specifications:

  • JMeter will be configured to execute our REST Web Service URI jvmleak.
  • Our JMeter thread group will be configured to run “forever” with 20 concurrent threads.

This is the second important aspect of conducting a case study: Plumbr can only find leaks in an application which is actually being used. Otherwise there simply are no object creations and destructions to monitor, and nothing to learn about the potential villain.

The amount of time Plumbr needs to analyze a particular application is situation-dependent. We have seen cases where Plumbr reports a leak after just five minutes. On the other hand, some leaks take days before Plumbr can alert you. But for more than half of the leaks Plumbr has found, we have been able to alert you in less than 70 minutes.

In Pierre-Hugues's case, he was able to discover the leak over just a handful of major garbage collection cycles. When Plumbr was ready, it alerted Pierre-Hugues via all the channels we provide:

  • A message in the standard output logs
  • An email alert referring to a leak report in My Plumbr
  • An MBean publishing the status of Plumbr via the JMX interface

With all of those options explained in the post, the article again serves as a good case study of the ways in which you can be alerted to possible leaks.

Now let us take a look at the report Pierre-Hugues received:

[Image: Java Memory Leak]

We can see from the above report that Plumbr correctly identified the memory leak Pierre-Hugues had created. Note that the leak report is split into 4 main sections:

  • The header contains the number of leaks found along with the detail on the memory footprint occupied by the leaking objects vs. the total Java heap capacity.
  • Leaking object type: This represents the object type of the instances accumulating over time in between major collections.
  • Leak creation location: This represents the caller and Java class where the leaking objects are created.
  • Memory references: This represents the object reference tree where the leaking objects are still referenced or held.

In this case, Plumbr was able to identify the exact location of the engineered memory leak.

  • LongLivedObj is indeed the expected leaking object type.
  • JVMMemoryAllocator is definitely the Java class where the leak is created.
  • ConcurrentHashMap is the implemented “referencer” or container.

This is again a good demonstration of our strengths: we equip you with the knowledge you need to actually fix the leak. What Pierre-Hugues did not discover in his test is that in roughly 20% of cases we are already able to enhance the report with step-by-step guidelines for creating the fix. But even without those guidelines, once you know the leaking class, the line in the source code where the leaked objects were created and where the references are currently held, I bet you can fix the leak anyway.
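
To make that last point concrete: when a report points at a map that only ever grows, one common remedy is to put an upper bound on it. The sketch below is one possible approach under that assumption, not a fix taken from the case study or from Plumbr's guidelines. The class BoundedStore and its maxEntries parameter are hypothetical; the eviction hook, LinkedHashMap.removeEldestEntry, is standard JDK API.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

/** A size-capped replacement for an ever-growing map: the oldest entry is evicted first. */
class BoundedStore<K, V> {
    private final Map<K, V> entries;

    BoundedStore(final int maxEntries) {
        // accessOrder = false keeps insertion order, so the eldest entry is the oldest insert.
        // synchronizedMap makes individual operations safe to call from several threads.
        this.entries = Collections.synchronizedMap(
            new LinkedHashMap<K, V>(16, 0.75f, false) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                    // Invoked by put(): returning true drops the eldest entry.
                    return size() > maxEntries;
                }
            });
    }

    void put(K key, V value) { entries.put(key, value); }

    V get(K key) { return entries.get(key); }

    int size() { return entries.size(); }
}
```

Whether a cap, an explicit remove() at the end of a request, or weak references is the right choice depends on why the objects were retained in the first place. The report tells you where to look, and the decision remains yours.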

One additional interesting aspect of Pierre-Hugues's case study was barely mentioned, though, and I would like to stress one more benefit. At the end of the post, he notes that

Interestingly, Plumbr was also able to identify another potential leak inside WildFly 8 Alpha 3 itself…

Now, when looking through the comments section and digging into the WildFly issue tracker, we can see that indeed, there was a memory leak present in the next major release of a major application server. Or more precisely, in the reference implementation (RI) for JSR-299: Java Contexts and Dependency Injection. Looking at the chronology of the bug reports WELD-1482 and WELD-1487 we can be quite sure Plumbr was the first to discover this bug.

We are glad that, thanks to Pierre-Hugues, this bug did not slip through into production installations. It also serves as a good illustration of another benefit of Plumbr: when planning infrastructure upgrades, you can catch some truly nasty bugs before they start affecting your application in production.

So do not hesitate, go and grab your free copy and find out whether you are among the 50% of the applications which actually do leak memory!
