Innocent until proven guilty

June 14, 2012 by Nikita Salnikov-Tarnovski

I would like to describe a recent case from my career as troubleshooter, which again reminds me that one should not put all his eggs into one basket. And that one should look around and use every possible tool from his toolbox, not just his favorite hammer. And that wrong and unverified assumptions are the most common reasons for long debugging sessions. And mostly of all: you should never jump to conclusions too fast!

The beginning

I was developing a new feature for our product Plumbr. We are looking to add monitoring and reporting of classloader leaks in web applications. For that, naturally, I needed an application that would leak classloaders.

It is no surprise for the reader that before starting changing anything, you should fix your baseline. So, before introducing an artificial, controlled memory leak into my application, I wanted to make sure that there was no other, uncontrolled, memory leak in it.

The definition

First of all I would like to describe what particular type of memory leak I was interested in. Let us imagine a java web application. And let us further imagine that this application starts a new java.lang.Thread. Either directly via new Thread(), or indirectly via some java.util.concurrent.ExecutorService or twice indirectly via using some 3rd party library that starts new threads such as the Quartz Scheduler.

In our imaginary application, that thread is meant to run for some time. You can probably see nothing very wrong in such set up, possibly even recognize it from an application of your own. Multi-threading is the reality of today. But let’s say now that our application (or the 3rd party library) is a somewhat sloppy one. Lazy or just freshly written. It does not stop the thread when undeployed by the server. So, our hard-working thread continues to run even when the application is no longer around. Not a big deal, one little thread certainly cannot pose any threat to the server, you might say.

The devil, as always, is in the details. Every Java object, including any instance of java.lang.Thread, holds a reference to its class definition, an instance of java.lang.Class. And every class holds a reference to the classloader which was used to load and initialize that class. And the classloader, usually (like, in 99.99% of the time), holds a list of all classes that it has loaded.

The straw that kills our camel is the fact that, usually, web application servers use a separate classloader for every web application. When we add the whole thing up, we can conclude that one tiny thread left behind prevents all classes that were loaded by the web application’s classloader from being garbage collected. Consequently, that part of permanent generation (or just heap in case of JRockit) is lost to us.

The experiment

How can I see that there is no apparent classloader leak in the given application? I start a web server, Jetty in my case, and deploy Spring’s PetClininc application to it. Then I hit the redeploy button (touch webapps/petclinic.war). This forces Jetty to undeploy the application and deploy it again. I repeat this every 20 seconds, while monitoring permanent generation usage with jstat. In theory, if the application/server does not leak, memory usage should not change notably in the long run.

The research

Boom! To my great surprise, Jetty crashed with OutOfMemoryError: PermGen in a couple of minutes! What the hell? Does PetClinic leak? I started my routine investigation of memory leak: made a heap dump, opened it with Eclipse MAT and, as the problem was with permanent generation, started looking for classloaders.

I saw 12 instances of org.eclipse.jetty.webapp.WebAppClassLoader. Now I only had to right-click, select “Merge shortest paths to GC roots” and… wait a minute… uhm, what do you mean by “unreachable”? All paths end in some arbitrary classes in just 1 step and MAT says they are “unreachable”. What?

Eclipse MAT

Long story short – after 3 hours I was completely lost. That was the first case in my troubleshooting career when I had no clue to why the hell those objects were not garbage collected?! Then I decided to test the very same application using the very same technique in Tomcat. And it worked! No more memory leaks, no more OOM in two hours! From that, I concluded that there should be some bug in Jetty that would prevent the classloaders from being garbage collected. In retrospect, I owe you an apology, guys from Jetty. I was wrong.

But I did go home clueless and blaming my web server for ruining my day.

The answer

Fortunately, the next morning I was lucky enough to note something in the Tomcat’s log:

SEVERE: The web application [/petclinic] registered the JDBC driver [org.hsqldb.jdbcDriver] but failed to unregister it when the web application was stopped. To prevent a memory leak, the JDBC Driver has been forcibly unregistered.

“Aha!” I said to myself and opened the memory dump from Jetty once more. And surely, there it was. The leaking classloader was referenced from org.hsqldb.jdbcDriver class, referenced from org.hsqldb.jdbcDriver object, referenced from java.sql.DriverInfo, referenced from java.sql.DriverManager that was loaded by the system classloader.

Another screenshot of Eclipse MAT

The conclusion

Phew, problem solved. The problem was Spring’s PetClinic, which didn’t unregister the jdbc driver during undeploy. And  the problem was was between my chair and the laptop – for wasting a couple of hours digging inside the heap dump instead of verifying my assumptions and trying the same application in a different server.

Jetty’s good name is restored. And I can go on developing one more amazing feature for Plumbr, the easiest tool to find memory leaks in Java applications.

Can't figure out what causes your OutOfMemoryError? Read more

ADD COMMENT

Can't figure out what causes your OutOfMemoryError? Read more

Latest
Recommended
You cannot predict the way you die
When debugging a situation where systems are failing due to the lack of resources, you can no longer count on anything. Seemingly unrelated changes can trigger completely different messages and control flows within the JVM. Read more
Tuning GC - it does not have to be that hard
Solving GC pauses is a complex task. If you do not believe our words, check out the recent LinkedIn experience in garbage collection optimization. It is a complex and tedious task, so we are glad to report we have a whole lot simpler solution in mind Read more
Building a nirvana
We have invested a lot into our continuous integration / delivery infrastructure. As of now we can say that the Jenkins-orchestrated gang consisting of Ansible, Vagrant, Gradle, LiveRebel and TestNG is something an engineer can call a nirvana. Read more
Creative way to handle OutOfMemoryError
Wish to spend a day troubleshooting? Or make enemies among sysops? Registering pkill java to OutOfMemoryError events is one darn good way to achieve those goals. Read more