A 12-year-old bug in JDK, still out there leaking memory in our applications

December 17, 2012 by Nikita Salnikov-Tarnovski

This story goes back weeks or even decades, depending on how you mark the starting date. Anyhow, a few weeks ago one of our customers had problems interpreting a leak reported by Plumbr. Quoting his words: “It seems that Java itself is broken”.

As a matter of fact, the customer was right. Java was indeed broken. But let’s check the case and see what we can learn from it. Let’s start by looking into the report generated by Plumbr. It looked similar to the one below.

From the report we can see that the application at hand contains a classloader leak. This is a specific type of memory leak where classloaders cannot be unloaded (for example on Java EE application redeploys) and thus all the class definitions referenced by the classloader are left hanging in your permanent generation.

In this specific case there are 14,343 class definitions wasting your precious PermGen:

Plumbr report leak

Those classes are all loaded by the org.apache.catalina.loader.WebAppClassloader, which cannot be garbage collected because it is still referenced through the following chain:

  • This classloader is referenced from the contextClassLoader field of a java.lang.Thread instance.
  • The Thread preventing our classloader from being garbage collected is referenced from the keepAliveTimer field of a sun.net.www.http.KeepAliveCache instance.
  • And last in this hierarchy is sun.net.www.http.HttpClient, which does something clever: it keeps an internally used cache in its kac field.
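Sketched in code, the chain above looks roughly like this. Note that this is a simplified stand-in model of the JDK internals, not the actual sun.net.www.http sources; the stub class names are made up:

```java
// Simplified model of the reference chain Plumbr reported:
// HttpClient.kac (static) -> KeepAliveCache.keepAliveTimer (Thread)
//   -> Thread.contextClassLoader -> WebAppClassloader -> all its classes
public class LeakChainSketch {

    // Stand-in for the kac field of sun.net.www.http.HttpClient: a static
    // field stays reachable from a GC root and never lets go of the cache.
    static KeepAliveCacheStub kac = new KeepAliveCacheStub();

    static class KeepAliveCacheStub {
        Thread keepAliveTimer; // stand-in for KeepAliveCache.keepAliveTimer

        void start() {
            keepAliveTimer = new Thread(() -> { /* would clean the cache */ });
            // A freshly created Thread inherits the creating thread's context
            // classloader -- inside a webapp that is the WebAppClassloader.
            keepAliveTimer.setDaemon(true);
            keepAliveTimer.start();
        }
    }

    public static void main(String[] args) {
        kac.start();
        // As long as the static kac holds the timer thread, the thread's
        // context classloader (and every class it loaded) stays reachable.
        System.out.println("classloader pinned: "
                + (kac.keepAliveTimer.getContextClassLoader() != null));
    }
}
```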

Now we are indeed in a situation where all the symptoms of the problem at hand point to the JVM internals and not to the application code. Could this really be true?

Immediately after googling for “sun.net.www.http.HttpClient leak” I stumbled upon endless pages of references to the same problem, and roughly the same number of different workarounds found for different libraries and application servers. So it does indeed seem that the caching solution in this HttpClient class does not let go of its internal keep-alive cache, which in turn refuses to release a reference to the classloader it was created in.
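Two of the more common workarounds look roughly like the sketch below. To hedge: http.keepAlive is a documented JDK networking property, and the classloader-swap trick is the approach Tomcat’s JreMemoryLeakPreventionListener takes for several JDK-internal singletons; the class here is purely illustrative:

```java
public class KeepAliveWorkarounds {
    public static void main(String[] args) {
        // Workaround 1: disable HTTP keep-alive entirely, so the
        // Keep-Alive-Timer thread is never spawned. The price you pay is
        // losing persistent-connection reuse.
        System.setProperty("http.keepAlive", "false");

        // Workaround 2: perform the *first* HTTP request while the context
        // classloader is the system classloader, so any JDK-internal thread
        // spawned as a side effect captures a harmless loader instead of the
        // webapp's one.
        ClassLoader saved = Thread.currentThread().getContextClassLoader();
        try {
            Thread.currentThread()
                  .setContextClassLoader(ClassLoader.getSystemClassLoader());
            // ... open and close the first HttpURLConnection here ...
        } finally {
            Thread.currentThread().setContextClassLoader(saved);
        }
        System.out.println("http.keepAlive="
                + System.getProperty("http.keepAlive"));
    }
}
```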

But what is the actual cause? Most of the Stack Overflow threads and application server vendor bug reports only offered workarounds to the problem. But there has to be a real reason why this keeps happening. Some more googling revealed a possible suspect in the Oracle Java SE public bug database – issue 7008595.

Let’s look into the issue and see what we can conclude from it. First, those of you who are not familiar with what a nice bug report looks like – take another look at it and learn. This is how you should file a report: with a minimal test case to reproduce the problem and just two steps to go through when running the test. But praising aside, it seems that this problem has been present in Java at least since 1.4 was released, and was patched in a 2011 Java 7 release. That translates to at least NINE years of buggy releases and thousands (maybe even millions) of affected applications.

But now on to the code packaged along with the sample test case. It’s relatively simple. At a very general level it goes through the following steps:

  • After start, the application creates a new classloader and sets it as the context classloader of the running thread. This is done to emulate a typical web application, where the classloader of the current thread is a special classloader and not inherited from the system. So the author sets the context classloader to his own.
  • Next it loads a new class using the newly created classloader and invokes a static getConnection() method on this class.
  • The getConnection() method opens a URL, connects to it, reads the content and closes the stream. In the very same method the author also does something completely weird: he allocates a 20MB byte array that is never used. He does it solely to highlight the leak later on, so I guess we do not have to point fingers and call him mad here. Let’s be grateful instead.
  • Now all the references are set to null and System.gc() is called within the code.
  • One would now expect the ApplicationClass definition to be garbage collected, as it is no longer reachable from anywhere.
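In code, the steps above look roughly like the sketch below. It is reconstructed from the description, not copied from the bug report’s attachment: MyClassloader and ApplicationClass are placeholders for the classes shipped with the actual test case, and the "classes/" directory is an assumption, so the reflective part is wrapped in a try/catch to keep the sketch self-contained.

```java
import java.net.URL;
import java.net.URLClassLoader;

// Sketch of the reproducer described in the steps above.
public class LeakTestSketch {
    public static void main(String[] args) throws Exception {
        // 1. Install a fresh classloader as the thread's context
        //    classloader, emulating a web application.
        URLClassLoader myClassloader =
                new URLClassLoader(new URL[] { new URL("file:classes/") }, null);
        Thread.currentThread().setContextClassLoader(myClassloader);

        try {
            // 2. + 3. Load ApplicationClass through it and invoke the static
            //    getConnection() method, which opens a URL, reads the content,
            //    closes the stream -- and allocates a never-used 20MB array.
            Class<?> appClass = myClassloader.loadClass("ApplicationClass");
            appClass.getMethod("getConnection").invoke(null);
        } catch (Exception e) {
            // In this sketch neither the class file nor the remote URL need
            // to exist; the sequence of steps is what matters.
        }

        // 4. Drop all references and suggest a GC.
        Thread.currentThread().setContextClassLoader(null);
        myClassloader = null;
        System.gc();

        // 5. ApplicationClass should now be collectable -- unless the
        //    Keep-Alive-Timer thread still references MyClassloader.
        System.out.println("steps completed");
    }
}
```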

After walking you through the steps the test application takes, we are now ready to compile and run it. For this run I used the latest Java 6 build 37 available. After running the application, taking a heap dump and opening it in Eclipse MAT, we see the problem staring us right in the face:

JDK leak Eclipse MAT
JDK leak Eclipse MAT root

As we can see, our ApplicationClass with its 20MB byte array is still alive. Why? Because it is held by our custom MyClassloader, which is used as the context classloader of the Keep-Alive-Timer thread.

And if you are now thinking that you will never mess with custom classloaders and so this whole talk is not relevant to you, then think again. The vast majority of Java developers work with custom classloaders every day – most often with the classloaders your application server (like Tomcat, GlassFish or JBoss) uses for creating and loading web applications. If your web application opens an HTTP connection somewhere and, as a result, a Keep-Alive timer thread is spawned, congratulations: you have the exact memory leak described in this article.
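A quick way to check whether your JVM is already running such a thread is to walk the live threads and look at their names. A hedged sketch: "Keep-Alive-Timer" is the name the JDK gives the thread, but treat that as an implementation detail that may change between releases.

```java
public class KeepAliveCheck {
    public static void main(String[] args) {
        // Walk all live threads; if the JDK's keep-alive timer is among
        // them, its context classloader shows which (webapp) loader it pins.
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            boolean suspect = t.getName().contains("Keep-Alive");
            System.out.println((suspect ? "LEAK SUSPECT: " : "")
                    + t.getName() + " ctxCl=" + t.getContextClassLoader());
        }
    }
}
```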

So indeed, we have verified the assumption that “Java is broken” – and it has been broken ever since Java 1.4 was released, which was 12 years ago. Luckily the new patches to Java 7 no longer have this problem. But as various statistics show, the vast majority of applications out there have not migrated to Java 7 as we speak. So more often than not, your application at hand has the very same problem waiting to surface.

In either case, the story definitely serves as a great case study on how hard it is to track down a memory leak – or how difficult it used to be without Plumbr. It took just one customer with one report and the culprit was staring us right in the face. But this is now turning into a commercial, and that is not what you guys are into, so I am going to stop here.

If you enjoyed the post – stay tuned for more. We get new and interesting insights from the JVM on a daily basis nowadays. Unfortunately we do have to work on our product every once in a while as well, but I do promise interesting posts on a weekly basis!

Can't figure out what causes your OutOfMemoryError? Read more


COMMENTS

Good article, but I can’t help but think that it implies that calling System.gc() will trigger a GC when the javadoc for that method indicates that it will make a ‘best effort’ and really makes no guarantees. I could be wrong but my understanding is that you cannot assume that GC has run simply because System.gc() has returned, it may have, it may not have.

Frederic Bull

In terms of ‘Java is broken’, well yes, almost every application has at least one bug, ergo they are all broken. The questions should be ‘How serious is this bug?’ and ‘How difficult is it to fix?’. It is quite annoying but not super critical. It’s also possible it was delayed to v7 because it was caused by a fundamental problem in the way class loading works.

Philip Whitehouse

Judging by the simplicity of the fix, I fail to see any reason why this bug could not have been fixed way earlier. Context classloaders were introduced in Java 1.2.

iNikem

That was actually fixed in JDK7

ArtemZ

That’s why he refused to test it with 7.

JJRambo

“- Now all the references are set to null and System.gc() is called within the code. - One should now expect that the ApplicationClass declaration is now garbage collected as it is no longer referenceable from anywhere.”

Incorrect.

A call to System.gc() is a _suggestion_ to the garbage collector. It may or may not result in garbage collection. Even if the GC takes the hint, there is no guarantee that a particular piece of garbage will be gc’ed by the GC, as it seldom collects all garbage in one pass.

Glen Newton

It should not make the bug description or steps to reproduce invalid. But I do agree, that it could be rephrased that: “One should now expect there is no path via hard references from any GC root to the ApplicationClass’s definition”. This would be more precise. I personally prefer the original, not that accurate, but definitely simpler, explanation :)

iNikem

I agree that not every Java application is written the same. The article’s title should probably mention class loaders, application servers or leaks. The real issue is that there is a known bug. This bug is exposed in many application servers’ internal code used in custom class loaders which developers don’t have immediate control over. Simply using the Apache Commons HttpClient class is not a solution in this case.

On another note:

I would hope most developers would never use the com.sun package. I am disappointed that Apache did this. Though, I do believe the Oracle technical document stating that usage of ‘sun’ packages should be avoided needs to be updated. It says the “sun.*” packages and not the “com.sun.*” packages. So, I am not completely sure if “com.sun” is okay. Here is a link to a tech note: http://www.oracle.com/technetwork/java/faq-sun-packages-142232.html

Marc Miller

Marc, is your comment based on looking at the code in question? I would have thought that Tomcat was using a java.net.HttpURLConnection, which was using the sun.net.www.http.HttpClient class internally, as code in the standard API is entitled to do. But I haven’t looked at the code in question either.

Paul Clapham

An application written in Java != an application that has the problem you describe. Many apps don’t use this class, and most of the apps which need an HTTP client use HttpClient from the Apache Software Foundation. If you dig deeper you will find many old bugs in Java. But we (developers) know about them and know how to work around them. So no shocker here – every app/software/platform has bugs, right? So please, cool down, etc…

Cyber123

Sorry if the post sounds like trolling towards a bug in Java. It was intended to be a showcase of how difficult it can be to spot a bug which, from our best guess, has affected thousands of applications across the world.

But I would definitely disagree that it is normal for developers to have to be aware of different workarounds. If you have 10 years of history on the platform, most likely it won’t bother you. But if you are new to the platform and trying to figure out what is going on, there is a chance that the Java community will lose a good developer – because of the workarounds and hacks needed…

There has to be an easier way to backport fixes to older releases – if up to 80% of your customer base could be affected, it just seems to make sense to patch.

Ivo Mägi

