To blog |

Squeezing data into the data structure

September 1, 2015 by Nikita Salnikov-Tarnovski Filed under: Performance

Data structure optimization javaThis story is about a capacity optimization task that we recently carried out at Plumbr. It all started with an innocent-looking requirement being added to the existing mix.

As you might know, Plumbr monitoring solution is distributed as a Java Agent which connects to a Server. The small addition required to keep track of all the connected Agents over time so that questions such as following could be answered in real time:

  • For how long have we not heard from this particular JVM?
  • What was the last known downtime of that other JVM?

As each of the Agents is sending out a heartbeat every second, all that we need to do in the server-side is to keep track of all the heartbeats. As each heartbeat has a unique timestamp attached, the naive solution would be as easy as tossing all the heartbeats in a Set or a Map. So – easy, done, next, please?

However, some quick back-of-the-envelope math demonstrated that the initial idea just might not work. Taking into account that:

  • a timestamp is of type long and requires 8 bytes to accommodate itself
  • in a year there are 365 x 24 x 60 x 60 = 31,536,000 seconds

we can quickly do the math and see that the raw data alone for a single JVM for one year would require 240MB. The size of the raw data alone was scary enough, but when packaged to a HashSet the retained size of the structure exploded to about 2GB with all the overhead java.util.Collection API implementations hide in their belly.

The naive solution was off the table and we needed an alternative. We did not have to look very far initially, as in the very same java.util package a surprise called java.util.BitSet was waiting to be discovered. According to the javadoc of the class:

The BitSet class implements a vector of bits that grows as needed. Each component of the bit set has a boolean value. The bits of a BitSet are indexed by nonnegative integers. Individual indexed bits can be examined, set, or cleared.

So what if we stored the heartbeat acquired from the Agent as boolean values indexed by the heartbeat’s timestamp? Timestamps in Java are represented as the difference in milliseconds between the current time and midnight, January 1, 1970 UTC. Knowing this, we can represent the 1st of September, 2015, 12:00 UTC as the number 1441108800. So what if when we see an Agent sending us a heartbeat at timestamp 1441108800 we would set the bit with the index 1441108800 to true, otherwise being left as the default false?

The issue with the solution is hidden in the fact that bits in a BitSet are indexed by integer instead of long. To proceed with this solution, we would thus need a way to map the integers to long without losing any information. If it seems impossible then let’s look back at the fact that the precision of a second instead of a millisecond was needed. Knowing this, we can shrink the index 1,000x and stamp the time with the precision of a second instead of a millisecond.

But how many seconds can be represented using just Integers? Apparently Integer.MAX_VALUE is big enough to represent each second from 01.01.1970 to until 19.01.2038. Besides creating a year-2038 problem, it should be good enough, right?

Unfortunately,  as our back-of the-napkin calculations show, one year worth of data would still require around 800MB of heap. This is a small step in the right direction from the original 2GB of the HashSet, but still way too much for practical use.

To overcome the problem, one might need to re-read/re-think the part that said “enough to represent each second from 01.01.1970”. (Un)fortunately mr. Gosling did not invent the Java Virtual Machine until 1995. And Plumbr itself saw the light 18 more years later. Consequently, we do not need to rewind the history until 1970 and have a bunch of zeroes padding each integer. Instead of starting from 01.01.1970, we can start with 01.01.2013 and have a bit with index 0 to correspond to 01.01.2013 00:00 (UTC).

Redoing our back-of the napkin math and checking the results in practice gave us a winner. Now a year’s worth of data could be stored in only 20MB. Comparing this with the original 2GB we have shrunk the needed capacity by 100x. This was already in the comfort zone as the existing infrastructure was able to cope with it, so we did not go further down the optimization path.

Moral of the story? When you have a requirement in your hands, find out what it might mean in terms of your application’s performance. And I mean all aspects of performance as there is more than just latency and throughput, one should not forget about capacity. And – know your domain. Without it, you cannot make decisions which, if just equipped with book-smarts about data structures, seem insecure and dangerous.

ADD COMMENT

Comments

All you need to store on server is a bunch of dates when you lost connection to JVM and when it was fixed. The whole idea of storing each second heartbeat looks overengineered.

Sergey

For this you need:
1. Persistent connection between a client and a server
2. This connection to be reliable and not break too often
3. Be able to reliable detect if client disconnects

In our experience, all these things are not trivial.

Plumbr Monitoring

Why don’t you use sth between com.google.common.collect.RangeSet and com.google.common.collect.ContiguousSet? With the current solution you “loose” 1 byte every 8 seconds. If agents are more likely to send heartbeat every _next_ second than not, why not just store intervals instead of (strangely modified) timestamps? Wouldn’t that be more domain-aware solution?

Michał

It should be: “If agents are more likely to be _in the same state_ in next second” – slightly more general, surprisingly more likely to be true (I suppose you deal with periods of active/inactive state).

Michał

Thank you for your suggestion! I will look into it, sounds interesting. As I mentioned in the article, we have reached our comfort zone with this solution and so stopped digging further.

Nikita