Underutilized Collections wasting memory

August 29, 2013 by Nikita Salnikov-Tarnovski

Plumbr has been recognized as memory leak hunters. But our goal has never been to stick with just the leak hunting. Your feedback about how easy it is to solve memory leaks with Plumbr made us wonder what if we could take the same approach we used to solve memory leaks and expand to other similar problem domains?

And indeed we have progressed in several other areas. In this article we can open up one of the research domains we have been working. Namely – Collection API overhead. As one picture is worth more than the words describing it, let us start with a graph illustrating one aspect of the problem:

Colleciton utilization underutilized

This diagram is plotted using data from more than 500 different applications, all of which use thousands of Collection API class instances under the hood. What we have outlined in the diagram above is the collection utilization ratio. You see for example that more than 35% of the collections created are completely empty. And that approximately 9% of the collections are half-empty. Or half full if you are one of the rare optimists out there.

More interesting is the aggregate of the facts:

  • 72% of the collections are utilized to 50% or less
  • 90% of the collections are utilized to 75% or less
  • And just 5% of the collections are fully utilized to the full capacity

But this data would just serve as a fun trivia without digging further and trying to understand the reasons and impact beyond the facts.

First lets look at the reasons. If you recall our recent post about Collection overhead, we analyzed the memory consumption of an ArrayList where we covered the internals of the ArrayList:

  • When you initialize an ArrayList via its default new ArrayList() constructor, you have created a list with capacity to accommodate 10 elements. The size of the list is zero, as no elements have been yet added.
  • When adding elements to the list, the capacity of underlying array is dynamically increased at certain thresholds.

Lets see how the increasing occurs by extracting the relevant parts from the source code of the ArrayList:

public boolean add(E e) {
	ensureCapacityInternal(size + 1);  // Increments modCount!!
	elementData[size++] = e;
	return true;
}

private void ensureCapacityInternal(int minCapacity) {
	modCount++;
	if (minCapacity - elementData.length > 0)
		grow(minCapacity);
}

private void grow(int minCapacity) {
	int oldCapacity = elementData.length;
	int newCapacity = oldCapacity + (oldCapacity >> 1);
	//cut for simplicity
	elementData = Arrays.copyOf(elementData, newCapacity);
}

As seen from the JDK API above, the list capacity is being checked when new elements are added. If the size of the list is equal to the capacity, the underlying array is being copied into a new, larger one. This increase happens at 50% of the previous capacity. So for example when our original list has grown to 10 elements, adding a single new object grows the list to a capacity of 15. And when our list capacity is 1,000,000 and a single new object is added, we would have a list with a capacity of 1,500,000

So here we have the reason for the Collections being underutilized staring right into our face in the form of the JDK source code. But how severe is the impact of this behaviour?

In order to estimate this impact, we need to turn to the basic math of calculating how much space do those empty references require. It can be platform dependent and you can read more on how to estimate and/or calculate this from this post, but more often than not 4 bytes is required for each of those references. So applying this math to the case where you grew your ArrayList from 1,000,000 to 1,500,000 you have just wasted nearly 500,000 times 4 = 2,000,000 bytes to empty references.

Getting rid of this excess heap consumption is trivial – either initialize your collections to the expected sizes by using the ArrayList(int initialCapacity) instead of the default or invoke trimToSize() when you have populated the list you need.

This ~2MB might not seem much though – at least when trying to recall the last time when you actually created a collection accommodating more than 1,000,000 elements.

On the other hand – have you given a thought about

  • How many smaller collections you have in place in your application?
  • How many collections do your 3rd party libraries use under the hood?
  • What would be the aggregate amount of memory lost to this excess reserve?
  • And what if you could just discover and remove this waste easily?

If you are eager to be among the first ones to get your hands on to the solution, stay tuned – we are working towards making this data transparent. Equipped with this knowledge, we bet you are the best to decide whether you need to take action or not.

Can't figure out what causes your OutOfMemoryError? Read more

ADD COMMENT

COMMENTS

0.75 is default load factor for hashmap, which also backs lots of other collections. Which explains little to no utilization between 75-100%.

Would be cool to see statistics for lists and maps independently :)

Oleg

We are currently working on extracting more useful data in these parts. Lots of surprises ahead, I am sure :)

iNikem

Can't figure out what causes your OutOfMemoryError? Read more

Latest
Recommended
You cannot predict the way you die
When debugging a situation where systems are failing due to the lack of resources, you can no longer count on anything. Seemingly unrelated changes can trigger completely different messages and control flows within the JVM. Read more
Tuning GC - it does not have to be that hard
Solving GC pauses is a complex task. If you do not believe our words, check out the recent LinkedIn experience in garbage collection optimization. It is a complex and tedious task, so we are glad to report we have a whole lot simpler solution in mind Read more
Building a nirvana
We have invested a lot into our continuous integration / delivery infrastructure. As of now we can say that the Jenkins-orchestrated gang consisting of Ansible, Vagrant, Gradle, LiveRebel and TestNG is something an engineer can call a nirvana. Read more
Creative way to handle OutOfMemoryError
Wish to spend a day troubleshooting? Or make enemies among sysops? Registering pkill java to OutOfMemoryError events is one darn good way to achieve those goals. Read more