RMI enforcing Full GC to run hourly

December 19, 2013 by Nikita Salnikov-Tarnovski

During all the troubleshooting exercises I have carried out in my career, I have picked up a feeling that the bugs I am after are evolving and becoming meaner and uglier over time. Or maybe it is just my age kicking in. This particular Heisenbug lookalike, which this post is about, has again kept me awake for longer than I would have liked.

Like other particularly nasty bugs, the one I was now after surfaced with the symptom that “sometimes the system is slow”. Great, I could already feel the shivers going down my spine. Things got a bit better when I discovered that a simple monitoring solution was set up in production. Essentially it was not much more than Pingdom, measuring the response time once a second, but the latency graphs immediately caught my eye.

[Latency graph: periodic spikes in JVM response times]

There were seemingly random spikes in the plotted response times – every once in a blue moon, responses started taking almost 10x longer than usual. And within a short while everything was back to normal again.

The first suspects – regular cron jobs or expensive reporting processes – were nowhere in sight. Neither did the storage monitoring expose any particularly ugly queries hitting the databases. But before calling it a night, I added some more logging options to the application configuration, restarted the whole stack and went to bed puzzled about the behaviour.

The next morning I had something extra to look at, GC logs among other things. Within ten minutes I found myself staring at something like the following:

11.408: [Full GC [PSYoungGen: 192K->0K(48640K)] [ParOldGen: 16324K->14750K(114688K)] 16516K->14750K(163328K) [PSPermGen: 31995K->31637K(64000K)], 0.1543430 secs] [Times: user=0.58 sys=0.00, real=0.16 secs]
...
3613.362: [Full GC [PSYoungGen: 256K->0K(47104K)] [ParOldGen: 15142K->13497K(114688K)] 15398K->13497K(161792K) [PSPermGen: 32429K->32369K(72192K)], 0.1129070 secs] [Times: user=0.51 sys=0.00, real=0.11 secs]
...
7215.278: [Full GC [PSYoungGen: 224K->0K(44544K)] [ParOldGen: 13665K->13439K(114688K)] 13889K->13439K(159232K) [PSPermGen: 32426K->32423K(70144K)], 0.0881450 secs] [Times: user=0.44 sys=0.00, real=0.09 secs]
...

I could now correlate the seemingly random performance degradation with a Full GC running roughly every 3,600 seconds. Attaching VisualVM to the JVM left me puzzled – there was no evidence of the high memory usage that would normally cause the full garbage collector to run. Checking with the developers, I made sure they were not running explicit GCs via cron jobs. So there I stood, confused by the regularity of the GC.

What does an engineer do when he is puzzled by a question? He googles. And there it was, staring me right in the face: page after page of people concerned about an hourly Full GC taking place without any apparent reason.

And the culprit – RMI. Apparently, when your application either exposes its services via RMI or consumes services over RMI, you are bound to get an additional periodic garbage collection cycle. As the RMI documentation says:

“When it is necessary to ensure that unreachable remote objects are unexported and garbage collected in a timely fashion, the value of this property represents the maximum interval (in milliseconds) that the Java RMI runtime will allow between garbage collections of the local heap. The default value is 3600000 milliseconds (one hour).”
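If you wish to see how little it takes to trigger this, a sketch along the following lines should be enough – exporting a single remote object brings RMI’s distributed garbage collector (DGC) to life, which in turn schedules the periodic collections. The Ping interface and RmiGcDemo class are just illustrative names of my own:

import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;
import java.rmi.server.UnicastRemoteObject;

// A hypothetical no-op remote service – any exported remote object will do.
interface Ping extends Remote {
    String ping() throws RemoteException;
}

public class RmiGcDemo implements Ping {
    public String ping() { return "pong"; }

    public static void main(String[] args) throws Exception {
        Registry registry = LocateRegistry.createRegistry(1099);
        // Exporting the object is what starts the RMI DGC machinery.
        Ping stub = (Ping) UnicastRemoteObject.exportObject(new RmiGcDemo(), 0);
        registry.rebind("ping", stub);
        // Run with -verbose:gc and a Full GC should show up roughly every
        // sun.rmi.dgc.server.gcInterval milliseconds – one hour by default.
        Thread.sleep(Long.MAX_VALUE);
    }
}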

The solution for the time being was to increase sun.rmi.dgc.server.gcInterval from its default of 3,600,000 milliseconds (one hour). I wonder how things used to look prior to the change introduced in JDK 6, when RMI forced a full GC once a minute. Considering all the EJB frenzy back then, I guess none of those apps even stood a chance performance-wise. If you have memories from those ancient times, maybe you can shed some light on how the apps managed to survive this madness.
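For reference, the workaround boiled down to a couple of JVM startup flags along these lines – the 24-hour value and the my-application.jar name are just examples, and both the client- and server-side properties are typically bumped together:

java -Dsun.rmi.dgc.client.gcInterval=86400000 \
     -Dsun.rmi.dgc.server.gcInterval=86400000 \
     -jar my-application.jar

Another option is -XX:+DisableExplicitGC, which turns System.gc() calls into no-ops – but bear in mind that it silences every explicit GC request, not just the RMI-induced ones.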

And – if the content was interesting, consider subscribing to our Twitter feed to be alerted on future posts.

COMMENTS

so a GC pause of around 100 ms causes a response time peak of about 2000 ms? how does that add up exactly?

dimo

You can always use -XX:+DisableExplicitGC

What for?

“What for” what? Why increase the default interval? For now, I have no idea whether this RMI-induced GC is really needed in the long run. Hopefully not, and the usual GC will take care of those RMI objects. But I have not verified it yet.

iNikem

With CMS, explicit GC triggered via the RMI settings is usually not ideal, since it doesn’t let the GC heuristics kick in. I think this benefit can also be extracted with ParallelGC on both new and old gens, but I might be wrong on that.

What for?

There are no benefits in explicit GC whatsoever, IMHO. But some parts of the Internet claim that -XX:+DisableExplicitGC does not change RMI’s behaviour. Other parts claim that RMI remote objects are not eligible for the usual GC.

Investigation is under way…

iNikem

From experience, if you are setting -Dsun.rmi.dgc.client.gcInterval=3600000 -Dsun.rmi.dgc.server.gcInterval=3600000, adding -XX:+DisableExplicitGC might look counterproductive, but some application servers understand that DisableExplicitGC should override the sun.rmi settings. I usually recommend removing the sun.rmi settings and adding DisableExplicitGC just to be on the safe side.

What for?

Hi,

-XX:+DisableExplicitGC will disable the RMI-introduced Full GC on any reasonably new HotSpot-based JVM.

The other option is to set both(?) sun.rmi.dgc.client.gcInterval and sun.rmi.dgc.server.gcInterval to Long.MAX_VALUE and it will disable the RMI GC as well.

Jaromir Hamala

> But some parts of the Internet claim that -XX:+DisableExplicitGC does not change RMI’s behaviour.

-XX:+DisableExplicitGC is a flag used by the JVM itself (native code). sun.rmi.dgc.server.gcInterval is a property for RMI itself (Java code). IMHO it’s not possible to change the behaviour of the JVM from Java code.

> Other parts claim that RMI remote objects are not eligible for the usual GC.

In addition to its own logic, DGC uses the GC to perform garbage collection and schedules its calls to System.gc() according to sun.rmi.dgc.client.gcInterval and/or sun.rmi.dgc.server.gcInterval.

Roman
