How do leak detectors work – Plumbr case study

July 31, 2012 by Vladimir Šor

We are often asked ‘How does Plumbr work internally?’ – ever since we made the tool publicly available. As the number of Plumbr users grows, so does the number of these questions, and we have finally reached a point where it is more efficient to write the answer down than to recite it over and over again. In the following article I will describe how memory leaks can be detected in general and which approaches Plumbr uses internally to do its job.

Background

Finding a memory leak is a troubleshooting task. To troubleshoot a problem, you always perform the following steps, in the following order:

  1. Measure something (collect data);
  2. Analyze the gathered data;
  3. Decide on what to measure and analyze next.

This is why, when you ask an experienced application performance consultant how to find a memory leak, in the majority of cases you will be recommended a profiler. I have to agree that profilers really can and will gather a lot of data to analyze. But there is just one catch – if you do not know what you are looking for, finding the needle in the haystack can prove very exhausting.

This is where specialized tools come into play – they know which parameters to observe to detect a particular problem and how to interpret the gathered data to reach a conclusion. Plumbr is fine-tuned to discover memory leaks as early as possible and to report all the details its user needs to fix the leak. As such, it gathers only the minimal amount of data required to detect and analyze memory leaks.

Science vs practice

The possible algorithms for detecting memory leaks differ in what kind of data they use for analysis. While conducting research for my PhD I investigated different techniques of memory leak detection. I found out that the topic has been pretty extensively covered by research over the years. Based on my research I divided the existing algorithms into two broad categories – scientific and practical.

The scientific algorithms, in my classification, require data that is so expensive or difficult to acquire that it makes them unusable in the real world. For example, a lot of scientific research is done with research JVMs (like Jikes RVM), which allow you to implement any wild idea directly inside the JVM. In theory this sounds nice, because it is the cheapest way to acquire JVM-internal data. But I have yet to see a product owner willing to launch their app on a research VM.

Another popular example of scientific methods relates to object staleness, i.e. how long an object has been lying around without anyone accessing it. Sure – if it sits in memory without anyone needing it, that is a good signal of a probable leak. But obtaining the last access time for any particular object implies enormous overhead: every read access also requires a write to some other memory area holding the access timestamp.

Practical methods, on the other hand, rely on data that can be acquired using standard APIs in a way that does not hinder performance (there is a method in JVMTI that lets you monitor field access, but it works only at class level, meaning it will monitor all instances of a class – not the way to go in production). Using a standard, documented API also guarantees that the method will work across all JVMs that conform to that API.

A good example of a practical method is monitoring known Collection classes for unlimited growth. That is exactly the approach used by many Application Performance Monitoring tools to power their leak finding functionality. On the plus side is implementation simplicity – this method only calls for Java bytecode instrumentation and imposes low overhead. The drawback is that it can handle only known collection classes. Should you use your own collection classes, or third-party collections like Trove, Guava or Apache Commons, you are out of luck until you reconfigure your leak detector (if that is even possible) and tell it how to handle the internal structures of your collection classes.
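The collection-growth approach can be sketched roughly as follows. This is my own illustration, not any vendor's actual code; the `CollectionWatcher` class and its `register()` hook are hypothetical names, and in a real tool the `register()` call would be injected into collection constructors via bytecode instrumentation rather than called by hand:

```java
import java.lang.ref.WeakReference;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical sketch of collection-growth monitoring. A real APM tool
// would inject the register() call into java.util collection constructors
// via bytecode instrumentation; here we call it by hand.
public class CollectionWatcher {
    private static class Tracked {
        final WeakReference<Collection<?>> ref;
        int lastSize;
        int growthStreak; // consecutive samples where the size only grew

        Tracked(Collection<?> c) {
            this.ref = new WeakReference<>(c);
            this.lastSize = c.size();
        }
    }

    private static final List<Tracked> TRACKED = new CopyOnWriteArrayList<>();

    public static void register(Collection<?> c) {
        TRACKED.add(new Tracked(c));
    }

    // Called periodically from a sampler thread: flags collections that
    // have done nothing but grow for 'threshold' samples in a row.
    public static List<Collection<?>> sample(int threshold) {
        List<Collection<?>> suspects = new ArrayList<>();
        for (Tracked t : TRACKED) {
            Collection<?> c = t.ref.get();
            if (c == null) continue; // already garbage collected - not a leak
            int size = c.size();
            t.growthStreak = (size > t.lastSize) ? t.growthStreak + 1 : 0;
            t.lastSize = size;
            if (t.growthStreak >= threshold) suspects.add(c);
        }
        return suspects;
    }
}
```

Note how the sketch already exposes the limitation from the text: only collections that pass through `register()` are watched, so a custom or third-party collection class stays invisible to the detector.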

Our way

This is why we chose a different metric to base our detection on – the age of objects. The key here is the property of objects called “infant mortality”, exploited by modern generational garbage collectors. The gory details are available in the official JVM documentation, but let us quote the most relevant part: “The most important of these observed properties is the weak generational hypothesis, which states that most objects survive for only a short period of time”. So, if the JVM’s garbage collector relies on this observation, why can’t we?

Let’s take a closer look at what infant mortality means for memory leak detection on an example of a typical Java web application. Every web app has some common objects loaded during the start up – configuration, service layer singletons in your favorite dependency injection framework, the framework itself, etc. These objects are created once in the beginning of your application’s life and they stick around until the application stops. Consequently, they are usually the oldest objects in the heap at any particular moment in time.

Next in line are some objects needed to serve incoming requests. These objects are created when an HTTP request arrives, to assemble and ship the response. As soon as you flush your HTTP response and close the stream, all these objects become garbage. If we look at heap contents at any moment, then chances are these objects are the youngest, and they tend to disappear from the heap quickly.

You may also have HTTP session objects which survive over several requests, until the end user logs out or the session expires. If you limit neither the number of sessions nor the amount of data per session, you have created the conditions for OutOfMemoryErrors to occur: the sessions start to abuse memory and show the very same symptoms a memory leak would. But if we look at the ages of correctly limited HTTP session objects in the heap, we see that they tend to remain either young or middle-aged throughout the life of the application.

Now, what does a leak look like from this viewpoint? Leaks are objects that are created again and again and left in the heap unused. If we analyze the ages of these objects, we normally find many different age groups present – old, young, and middle-aged objects of the leaking class – and the number of distinct ages grows with every new instance of the leaking class.
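As an illustration (my own minimal example, not taken from Plumbr), here is the textbook shape of such a leak: an object is created per request, parked in a long-lived structure, and never read again, so the heap accumulates instances of every age:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal illustration of a leak: one RequestStats object is created per
// request and stashed into a static map that is never cleaned up. At any
// moment the map holds objects of many different ages - old entries from
// the very first requests, middle-aged ones, and brand-new ones - which
// is exactly the age signature described above.
public class RequestAudit {
    static class RequestStats {
        final String requestId;
        final long createdAt = System.nanoTime();
        RequestStats(String requestId) { this.requestId = requestId; }
    }

    // The culprit: grows forever; entries are never removed or read.
    private static final Map<String, RequestStats> STATS = new HashMap<>();

    public static void onRequest(String requestId) {
        STATS.put(requestId, new RequestStats(requestId));
        // ... serve the request; STATS is never consulted again ...
    }

    public static int leakedCount() { return STATS.size(); }
}
```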

Now, an experienced reader will surely ask – what about caches? Let’s see:

  • An eager cache (e.g., a list of countries loaded during application startup) just looks like a bunch of old-aged objects, and definitely does not represent a memory leak.
  • A lazy cache whose size is not controlled (whether by age, by last access, or by whatever parameter of your liking)… hmm, if it grows without control, it will sooner or later fill the heap, and is therefore by definition really a leak.
  • A correctly limited cache, however, looks more like the HTTP sessions described earlier.
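The difference between an uncontrolled lazy cache and a correctly limited one can be just a few lines of code. The following sketch (my own example, built on the standard `LinkedHashMap.removeEldestEntry` eviction hook) shows a cache capped at a fixed number of entries:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// A size-limited LRU cache built on LinkedHashMap. With the eviction hook
// below it behaves like the "correctly limited" cache in the text; make
// removeEldestEntry always return false and it degenerates into the
// uncontrolled lazy cache that is, by definition, a leak.
public class BoundedCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    public BoundedCache(int maxEntries) {
        super(16, 0.75f, true); // accessOrder=true gives LRU semantics
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries; // evict the eldest once over the cap
    }
}
```

Because its size stays flat under load, the entries of such a cache remain young or middle-aged in the heap – the same age profile as the well-behaved HTTP sessions above.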

As you may note, there are quite a lot of nuances in object behavior over time. To overcome that, we collected huge amounts of statistics on how object ages behave in different applications, applied our brain-powered manual analysis and magic-powered machine learning algorithms to identify how leaks differ from normal objects, and assembled that knowledge into a memory leak detection product. Download Plumbr here.

One thing to always keep in mind when using any product that ties itself to the JVM is the overhead it introduces – and Plumbr is no exception, since we have to calculate object ages somehow. This information is readily available inside the JVM and its garbage collector, but not to external agents. To collect it, we essentially had to duplicate some of the JVM’s internal bookkeeping. We did our best to keep this overhead down – in a typical web application we see only 20–25% heap and CPU overhead. This is usually affordable in QA and production systems, and a fair price to pay for added service stability.
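To see why such bookkeeping costs memory, here is a crude approximation of age tracking from plain Java, with no JVM internals. This is my own simplification to illustrate the idea, not Plumbr’s published mechanism: tag each tracked object with its allocation time in a `WeakHashMap` (so tracking never keeps objects alive) and bucket the survivors by age. Many distinct age buckets for one class is the leak signature discussed above.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.WeakHashMap;

// Crude illustration of object-age bookkeeping: remember when each tracked
// object was allocated and bucket the survivors by age. The WeakHashMap
// keys do not prevent garbage collection, so collected objects silently
// drop out of the histogram - just as they drop out of the heap.
public class AgeTracker {
    private final Map<Object, Long> born = new WeakHashMap<>();

    public void onAllocation(Object o) {
        born.put(o, System.nanoTime());
    }

    // Histogram of surviving objects, bucketed by age (bucket width in ns).
    public TreeMap<Long, Integer> ageHistogram(long bucketNanos) {
        long now = System.nanoTime();
        TreeMap<Long, Integer> hist = new TreeMap<>();
        for (Long bornAt : born.values()) {
            long bucket = (now - bornAt) / bucketNanos;
            hist.merge(bucket, 1, Integer::sum);
        }
        return hist;
    }
}
```

Even this toy version stores one timestamp per tracked object, which hints at where the 20–25% heap overhead mentioned above comes from.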

For that, you get a leak detection solution that:

  • Doesn’t depend on any implementation of some collection framework;
  • Is able to adjust to the specifics of your particular application, taking into account also user actions and monitoring all the code all the time;
  • Is able to report the leaks as soon as it notices some deviation from normal object behavior, rather than having to wait for the memory usage to cross some thresholds;
  • Is able to report the name of the leaking class, the exact line in the source code where the leaking objects are created, and the whole reference stack of the leaking classes.

Is there anything else about detecting Java memory leaks that you would like us to write about? Let us know in the comments below or via Twitter!
