Why is your software aging?

August 15, 2013 by Nikita Salnikov-Tarnovski

I recently stumbled upon the term software aging. My first thoughts on the subject were not too positive, especially after reading the Wikipedia definition. Just Another Buzzword was the only thing resonating in my head. But after digging further into the concept I started thinking differently, even about our own product, which essentially offers protection against the outcomes of software aging. So I thought some of the concepts were worth sharing with you.

But let's start with what Wikipedia has to say about the subject:

Software aging refers to progressive performance degradation or a sudden hang/crash of a software system due to exhaustion of operating system resources, fragmentation and accumulation of errors.

This definition is boring as hell. But I think you all remember the days when your freshly booted Windows ran just fine. Within just a few days, though, it became so sluggish that the only solution was a reboot. And in a year or so you needed a clean install, because reboots no longer helped.

Rebooting and reinstalling Windows serves as a good example which I guess most of you can easily relate to. And maybe you even agree with what David Lorge Parnas has said on the subject:

Programs, like people, get old. We can’t prevent aging, but we can understand its causes, take steps to limit its effects, temporarily reverse some of the damage it has caused, and prepare for the day when the software is no longer viable. We must lose our preoccupation with the first release and focus on the long term health of our products.

In this quote, Mr. Parnas also implies that legacy applications are more susceptible to aging. But regardless of the size of your code base, you are likely to suffer from different causes of software aging, such as:

  • Memory leaks (our current bread and butter)
  • Lock contention issues
  • Unreleased file handles
  • Memory/swap space bloat
  • Data corruption
  • Storage space fragmentation
  • Round off error accumulation

As the list is a bit too dry, I will try to enhance it by bringing examples from the Java world, demonstrating the relevance (or irrelevance) of the causes.

Memory leaks. This is our current bread and butter – each day I face tens of different situations where applications are suffering from leaks. As a matter of fact, our current data set of several thousand applications shows that roughly 50% of them contain one. The following sample illustrates the case.

The program reads one number at a time and calculates its square. This implementation uses a primitive “cache” for storing the results of the calculation. But since these results are never read from the cache, the code block represents a memory leak. If we let this program run and interact with users long enough, the “cached” results consume a lot of memory. It serves as a good example of aging – this program could be used for days before the end users are affected.

import java.util.HashMap;
import java.util.Map;

public class Calc {
  Map<Integer, Integer> cache = new HashMap<>();

  public int square(int i) {
     int result = i * i;
     // the result is stored but never read back - the map only grows
     cache.put(i, result);
     return result;
  }

  public static void main(String[] args) throws Exception {
     Calc calc = new Calc();
     while (true) {
        System.out.println("Enter a number between 1 and 100");
        int i = readUserInput(); //not shown
        System.out.println("Answer " + calc.square(i));
     }
  }
}
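One way to stop this particular leak (a sketch of one possible fix, not the only one – the class name and size limit below are my own) is to bound the cache, for example via the removeEldestEntry hook of LinkedHashMap:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class BoundedCalc {
   // hypothetical limit: once the cache grows past MAX_ENTRIES,
   // the eldest entry is evicted, so memory use stays constant
   // no matter how long the program runs
   static final int MAX_ENTRIES = 10_000;

   final Map<Integer, Integer> cache =
         new LinkedHashMap<Integer, Integer>() {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Integer, Integer> eldest) {
               return size() > MAX_ENTRIES;
            }
         };

   public int square(int i) {
      // compute only on a cache miss; hits are served from the map
      return cache.computeIfAbsent(i, k -> k * k);
   }
}
```

An LRU cache (passing accessOrder=true to the LinkedHashMap constructor) or a dedicated caching library would work just as well; the point is merely that the map must have an upper bound.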

Lock contention. You must all have been in a situation where the application behaves just fine for years, and then after a small bump in load the threads start waiting behind synchronized blocks and are either starved or completely locked out.

The following sample serves as a textbook illustration of the case. The code will work just fine until you launch two threads which attempt to run transfer(a, b) and transfer(b, a) at the same time, resulting in a deadlock. And again, you could be happily running the code for months or years before a situation like this escalates into locked threads.

class Account {
  double balance;
  int id;

  void withdraw(double amount) {
     balance -= amount;
  }

  void deposit(double amount) {
     balance += amount;
  }

  void transfer(Account from, Account to, double amount) {
     // locks are acquired in caller-supplied order, so concurrent
     // transfer(a, b) and transfer(b, a) calls can deadlock
     synchronized (from) {
        synchronized (to) {
           from.withdraw(amount);
           to.deposit(amount);
        }
     }
  }
}
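A common remedy (a sketch only – the class name is mine, and ordering by id assumes ids are unique) is to always acquire the locks in a single global order, for example by account id, so the two competing transfers can no longer hold one lock each:

```java
class SafeAccount {
   double balance;
   final int id;

   SafeAccount(int id) { this.id = id; }

   void withdraw(double amount) { balance -= amount; }
   void deposit(double amount) { balance += amount; }

   // Always lock the account with the smaller id first. Concurrent
   // transfer(a, b) and transfer(b, a) calls then agree on the locking
   // order and cannot deadlock.
   static void transfer(SafeAccount from, SafeAccount to, double amount) {
      SafeAccount first = from.id < to.id ? from : to;
      SafeAccount second = (first == from) ? to : from;
      synchronized (first) {
         synchronized (second) {
            from.withdraw(amount);
            to.deposit(amount);
         }
      }
   }
}
```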

Unreleased file handles. Again, I am sure you have been cursing while looking at something similar to the following, where a fellow developer has forgotten to close the resources after loading. The code might have been running happily for months before the java.io.IOException: Too many open files message is thrown, which again serves as a good case demonstrating the aging problem.

Properties p = new Properties();
try {
   p.load(new FileInputStream("my.properties"));
} catch (Exception ex) {
} finally {
   //no, i will NOT close the stream
}
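Since Java 7, try-with-resources makes the fix trivial (a sketch – the Config class and load method are my own wrapping, the file name is just the one from the sample above):

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Properties;

public class Config {
   static Properties load(String path) throws IOException {
      Properties p = new Properties();
      // try-with-resources guarantees the stream is closed even when
      // load() throws, so no file handle can leak here
      try (FileInputStream in = new FileInputStream(path)) {
         p.load(in);
      }
      return p;
   }
}
```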

Memory/swap space bloat. Modern operating systems tend to quickly page out memory that has not been touched for a while. So you might run into problems when you run out of physical memory and the OS starts swapping your heap. Things go from bad to worse due to garbage collection – a Full GC requires the JVM to walk the object graph and identify every reachable object in order to detect garbage. While doing so, it touches every page in the application heap, triggering pages to be swapped in and out of memory.

Luckily, the effects are reduced in modern JVMs for several reasons, for example:

  • Most objects never escape the young generation, which is close to guaranteed to be resident in memory
  • Objects promoted out of the young generation tend to be accessed frequently, which again tends to keep them resident in memory.

So you might have escaped this one, but I have seen GC cycles stretch from a few hundred milliseconds to tens of seconds due to excessive swapping. So we again have a case where a perfectly nicely behaving application with lazily loaded caches turns into a usability nightmare after a while due to memory bloat.
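One hedged mitigation for such caches (a sketch under my own naming; whether it fits depends on how expensive a cache miss is) is to hold the cached values via SoftReference, so the GC is allowed to reclaim them under memory pressure instead of pushing the heap into swap:

```java
import java.lang.ref.SoftReference;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class SoftCache<K, V> {
   private final Map<K, SoftReference<V>> map = new ConcurrentHashMap<>();

   // Values are softly reachable: when the heap fills up, the GC may
   // clear them, turning the next lookup into a recomputable miss
   // rather than a swapped-out page.
   public V get(K key) {
      SoftReference<V> ref = map.get(key);
      return ref == null ? null : ref.get();
   }

   public void put(K key, V value) {
      map.put(key, new SoftReference<>(value));
   }
}
```

Callers must treat a null result as a cache miss and recompute; the trade-off is occasional recomputation in exchange for a heap that yields gracefully under pressure.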

Considering the samples above – I think you might agree with me that software indeed ages, just like humans do. And I am extremely glad that we have stepped in to the rescue. So far only to cure memory leaks, but I can hint that we have a lot of interesting things brewing in our labs. To stay tuned for the news, subscribe to our RSS feed or follow us on Twitter.


Can't figure out what causes your OutOfMemoryError? Read more
