I have spent a lot of time recently measuring Plumbr’s overhead and figuring out some ways of reducing it. When I started the measurements, I was frustrated to the extent that I had to let my feelings out in a form of a previous blog post. But now I am back in my constructive mode and ready to share more insights on the subject.
How do you measure memory overhead of a Java agent? The answer, as seen below, seems trivial at first:
“You measure memory consumption of the application with that agent and without it, then compare”.
As it turns out, even the “measure” part is not that obvious, not to speak of “comparing”. Let me explain. We are using our own performance monitor to judge how well Plumbr fairs. It runs as a separate thread in JVM and periodically measures the amount of memory consumed by application. Here is an example output for one of our test application:
We are using java.lang.Runtime to get the information from the JVM. In the picture we have lined up different metrics:
- JVM heap’s capacity is measured by Runtime.totalMemory()
- Used memory is calculated using Runtime.freeMemory()
- Native memory usage is reported from OS’s /proc/self/status.
As we can see, there are 3 different memory related metrics. And seemingly the most important of them – used memory amount – looks more like a cardiologist’s nightmare. So it seems not the right source of information to base your overhead calculations upon.
We needed a better idea. And as usual it was truly simple – we turned our eyes into monitoring the size of live data instead. Live data denotes objects that are still alive after full garbage collection cycle. This effectively serves as the lowest bound of the heap required for the application. You can’t get away with less. How much heap is occupied by application’s live data can be easily read from GC log: find a line corresponding to Full GC and read it from there. Here is an example:
[Full GC [PSYoungGen: 19377K->0K(217024K)] [PSOldGen: 161109K->152083K(261568K)] 180486K->152083K(478592K) [PSPermGen: 125886K->125886K(258688K)], 1.6528730 secs] [Times: user=1.65 sys=0.01, real=1.66 secs]
If you are unfamiliar with the GC logs then lets see what are the useful bits in it:
- PSYoungGen: 19377K->0K(217024K) indicates that the objects in young generation were fully collected. Its used size was reduced from 19377KB to 0KB. And the total available size for the young generation is 217024KB
- PSOldGen and PSPermGen sections contain the same information, but for different areas in memory – correspondingly showing the changes in old and permanent generations
- The total time the full GC ran was 1.66 seconds.
Now, the live data size is equal to the size of old generation after the full GC. In our case it equals to 152083KB of heap memory. During application’s stable phase this number doesn’t fluctuates very much. So usually you could take either lower convex boundary or average the old generation size over several full GC runs.
What it meant for us was that – we now had a foundation to build the test suites for Plumbr to do the measuring. And we are ready to publish the numbers soon. After I have gotten rid of some obvious overheads we have already found during the measurements.