Put your fat Collections on a diet!

October 26, 2011 by Nikita Salnikov-Tarnovski

Java Collection API is one of the most used APIs in the Java world. It provides a convenient and pretty solid way to implement and use some every-day data structures. But are they suitable and efficient in all situations?

In this article I will investigate one case from our own experience where Java Collection API turned out to be a huge waste of memory in a pretty trivial usage scenario.

Background

Every Java program uses some sort of data structures, be it a trivial array or a Fibonacci Heap or even something more exotic that only Google search knows about. In most cases developers do not write their own implementations of these structures but use either the one provided by Java core APIs or some third-party library, such as Apache Collections or Google Guava. In my 10+ years of Java development not a day passed by without me using some data structure from Java Collection API. These Lists, Sets and Maps are so natural to me that I don’t hesitate a second before writing

Map<Integer, String> = new HashMap<Integer, String>();

And everything was fine until recently…

One of the classes inside our Plumbr java agent needs to store a bunch of integers as one of its fields. The semi-formal requirements are as follows:

  • We need a data structure for storing integers.
  • No duplicates
  • Order is unimportant
  • We need to add to this structure new elements
  • We need to look up if some element exists in this structure
  • Number of different elements is limited to a couple of hundreds at most
  • Memory consumption is more important than speed
  • Nevertheless the performance  must be decent, so MemoryMapped files, database etc are out of question.

The natural choice for this requirement is, at least considering my experience so far, java.util.HashSet<Integer>. So, without thinking twice I gave it a try. That was a disaster!

Experiment

Well, in order to illustrate my point, we need some way to measure memory usage of different data structures. For this blog post I used the following procedure:

  1. Write a java class with main method, which holds needed data structure as a local variable
  2. Add infinite cycle to the end of this main method in order for this thread not to die too quickly.
  3. Using Eclipse Memory Analyzer take a memory dump and find out the size of retained heap for the local variable of interest.

As a baseline I used the fact that in Java integer takes 4 bytes. So, for a COUNT number of integers we need 4*COUNT bytes. Then we can calculate the overhead of the given data structure as follows:

Overhead = Structure's retained heap/(4*COUNT)

Please note, that Java distinguishes between primitives and objects and Collection API operates only on objects. It means that total overhead consists of the overhead of given collection data structure and overhead of using Integer objects, not primitives.

Results

So, having settled that, let us measure how big are java.util.HashSets. In order to do that I used the following code:

Set<Integer> set = new HashSet<Integer>();
int COUNT = 10;
for (int i = 0; i < COUNT; i++) {
set.add(i);
}

Let us look at the results:

COUNT Retained heap Overhead
10 720 18
100 6 960 17.4
1000 85 424 21.36
1000000 88 774 256 22.19

Wow! Just wow! Storing integers in java.util.HashSet takes about 20(!) times as much memory as the information we are storing. This is a HUGE overhead in my opinion. Taking into account our need to work in really constrained memory conditions that was unacceptable. We had to find some other way. We started with reviewing our requirements: what do we really need? It turns out, that plain old java array suits our requirements all right. Changing my code to this:

int COUNT = 10;
Integer[] array = new Integer[COUNT];
for (int i = 0; i < COUNT; i++) {
array[i] = i;
}

yielded the following results:

COUNT Retained heap Overhead
10 344 8.6
100 3 224 8.06
1000 32 024 8.006
1000000 32 000 024 8.000006

So far so good. This reduced the overhead by 2-3 times in comparison to the HashSet. But it is still too large to my taste. So I just replaced Integer[] with int[]:

int COUNT = 10;
int[] array = new int[COUNT];
for (int i = 0; i < COUNT; i++) {
array[i] = i;
}

and got:

COUNT Retained heap Overhead
10 64 1.6
100 424 1.06
1000 4 024 1.006
1000000 4 000 024 1.000006

Now, that’s much better. We have a constant overhead of 24 bytes per array. And implementing the needed operations, such as element addition and duplicate elimination is as easy as pie.

Conclusion

The first conclusion is quite obvious: array of primitives is the most compact data structure. That is not surprising :) What was really a big surprise for me, was the magnitude of overhead that Java Collection API has. I hope to keep that in mind next time I choose the structure for my data.

Of course, Java Collection API will not become deprecated as a result of this post. And I certainly will use it again and again, as it provides a very easy-to-use API. But in those rare cases when every byte really matters it is much better to be aware of discovered overhead and design your piece of software accordingly.

Another case where this overhead may be important, is storing large amount of data in e.g. a HashSet. I mean, in case of a couple of millions of elements this overhead, in absolute numbers, grows to a couple of hundreds of megabytes. So, the overhead alone requires a significant increase in your heap size.

Can't figure out what causes your OutOfMemoryError? Read more

ADD COMMENT

COMMENTS

Wow, I figured my search for a comparison between int[] and HashArray would be too generic, but this is exactly what I was looking for. Thanks!

Guest

A long time back I was stumped by this approach and I did not have the choice of primitive types – I had to use POJOs in y program. I did a benchmark to understand how each of the APIs will perform – something that helped me then and still helps me when every millisecond is critical. The benchmark is available here: http://scrtchpad.wordpress.com/?s=collectionsn

Kapil Viren Ahuja

A long time back I was stumped by this approach and I did not have the choice of primitive types – I had to use POJOs in y program. I did a benchmark to understand how each of the APIs will perform – something that helped me then and still helps me when every millisecond is critical.

Kapil Viren Ahuja

Thats the reason why the Eclipse Memory Analyzer almost always uses handcrafted primitive collections. nas others pointed out fastutil and trove offer good implementations, in case you dont want to implement everything by yourself

Markus Kohler

I agree when you don’t have much memory like when I worked on those tiny heap of nokia mobile I rarely used Collections like Vector or hashtable until there is no choice or I can afford, the way you measure the overhead is simplest possible and easy to grasp for even beginners. thanks How HashMap works in Java

Rajiv Arraylist

When memory is a constrain, primitive types are you friends.. why not only create an array of int instead? .. for the kind of task described, from the start it was clear to me (This is before reading the entire post) that primitives were the answer. I mean you said you have 10+ year programming in java, objects are expensive!

Guest

Gnu Trove has a nice collection of primitive … collections.nhttp://trove.starlight-systems.com/nLGPLn

Noemail

I like pictures as much as the next guy, but I get insulted when they are gratuitously included where they don’t belong.

Zed

I had a similar experience: I tried to replace a Multimap (guava) with some structure based on array of arrays holding long values. The gain in the retained heap size was about 50%. nOne of your assumption is: “Number of different elements is limited to a couple of hundreds at most”. Which was not true for me, unfortunately :) . As an exercise, maybe you try to modify the code and measure the gain without this assumption?nJust to point out, in my case, memory consumption was more important than speed, as well.

kaltman

This is only surprising if you don’t read the api documents.nhttp://download.oracle.com/javase/6/docs/api/java/util/HashSet.htmln”This class implements the Set interface, backed by a hash table (actually a HashMap instance). “nHashSet uses the HashMap so has some of overhead for just storing items. The HashMap is necessary for fast lookup and uniqueness which are your requirements. There are other collections (trove, colt) that might be a little more memory efficient.nBut this is optimization.

Ido

The existence of the overhead is not surprising at all. The magnitude of it and the easy of bringing it to your application – that was surprising for me :)

iNikem

Can't figure out what causes your OutOfMemoryError? Read more

Latest
Recommended
You cannot predict the way you die
When debugging a situation where systems are failing due to the lack of resources, you can no longer count on anything. Seemingly unrelated changes can trigger completely different messages and control flows within the JVM. Read more
Tuning GC - it does not have to be that hard
Solving GC pauses is a complex task. If you do not believe our words, check out the recent LinkedIn experience in garbage collection optimization. It is a complex and tedious task, so we are glad to report we have a whole lot simpler solution in mind Read more
Building a nirvana
We have invested a lot into our continuous integration / delivery infrastructure. As of now we can say that the Jenkins-orchestrated gang consisting of Ansible, Vagrant, Gradle, LiveRebel and TestNG is something an engineer can call a nirvana. Read more
Creative way to handle OutOfMemoryError
Wish to spend a day troubleshooting? Or make enemies among sysops? Registering pkill java to OutOfMemoryError events is one darn good way to achieve those goals. Read more