
How to set meaningful goals towards performance and availability requirements

May 16, 2017 by Ivo Mägi Filed under: Monitoring

Let’s be honest – your software has bugs. Sometimes it is also slower than end users would expect. This is the situation we all live in. The key is to admit that the software you create will end up disappointing the users at some point. After all, even the almighty Google itself cannot guarantee the availability of their services for more than 99.95% of the time.

The sad part is that when the inevitable happens and some service is not behaving as it should, most of us are either not capturing the correct signal or cannot act on the signal because the impact of the issue is unclear. Whenever clearly defined goals and transparent reporting are missing, finger-pointing and blame games start happening.

This post is aimed at any product owner or operations team lead who wants a clear and meaningful measure of whether or not their IT assets behave as they should. After reading the post you will understand how to set a really simple objective and how to track progress towards the performance and availability goal over time.

Status quo

Every good team has a reasonable objective to attain when it comes to functional aspects of the software they are building:

  • Sign-up to product activation ratio must exceed 35%;
  • the recommended products' share of the aggregate shopping cart size must exceed 10%;
  • the nurturing e-mail campaigns targeted towards inactive users must result in 6% or more reactivation rates.

The very same teams also have objectives around performance & availability of the software. However, these types of objectives are often not well chosen. Let me give you an idea by listing objectives set by some real-world teams:

  • CPU utilization must not exceed 80%;
  • 99% of the database queries must respond under 1 second;
  • the application must support 1,200 concurrent users;
  • any application server node must not experience more than 2h downtime per month.

I could continue the list with similar examples, but the pattern is hopefully clear already. None of these requirements really focuses on the user experience aspect of the performance & availability of the software. As a result you will find yourself again and again in situations like the one below:

[Image: setting performance requirements]

We have all been in rooms where you could cut the tension in the air with a knife. On one side of the table there is the product team, claiming that they will not tolerate the availability issues around the product offering. On the other side of the table sits the operations team, pointing out that from their perspective the systems are working just fine.

So let’s admit that we need a different goal to measure. Our real goal is not to make database queries fast, nor is it to keep CPUs idling. The real goal is to make sure the end users of the software are satisfied with the application.

How can you measure user satisfaction?

How can user satisfaction with service availability and performance be expressed in a measurable way? The answer is simpler than you might expect. Measuring user satisfaction builds upon monitoring every interaction end users perform with the application, tracking whether or not:

  • the application performs the interaction the user wanted it to;
  • the interaction completes with the expected outcome;
  • the interaction completes within a reasonable timeframe.

There is a variety of tools on the market capable of capturing the interactions and flagging each one based on whether or not it completed successfully and/or fast enough from the end user’s point of view.

Using the interactions as the input, you can measure the satisfaction across your user base via the following (simplified) formula:

Satisfaction = successful interactions / total interactions

Now if you have agreed on a goal of a 99.9% satisfaction rate on each given day, and 500,000 user interactions were performed on a particular day, the goal would be met if no more than 500 unsuccessful interactions occurred during that day.
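The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function names `satisfaction`, `failure_budget` and `goal_met` are made up for this example, and the counts are assumed to come from whatever monitoring tool you already use. Exact fractions are used so that a day sitting precisely on the 99.9% boundary is not misjudged because of floating-point rounding.

```python
from fractions import Fraction
from math import floor

# Assumed daily goal from the example above: 99.9% satisfaction.
GOAL = Fraction("0.999")


def satisfaction(successful: int, total: int) -> Fraction:
    """Satisfaction = successful interactions / total interactions."""
    if total <= 0:
        raise ValueError("total interactions must be positive")
    return Fraction(successful, total)


def failure_budget(total: int, goal: Fraction = GOAL) -> int:
    """Largest number of unsuccessful interactions that still meets the goal."""
    return floor(total * (1 - goal))


def goal_met(successful: int, total: int, goal: Fraction = GOAL) -> bool:
    """True when the day's satisfaction rate reaches the agreed goal."""
    return satisfaction(successful, total) >= goal


if __name__ == "__main__":
    total = 500_000
    print(failure_budget(total))       # failures the day can absorb
    print(goal_met(499_500, total))    # exactly 500 failures
    print(goal_met(499_499, total))    # 501 failures
```

With 500,000 interactions the budget works out to 500 failures, so a day with 500 failures meets the goal and a day with 501 does not.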

If you have not monitored real user experience before, tracking your users using the formula above is a good starting point. You will learn and improve a lot while using this approach.

The devil is in the details

Unfortunately, understanding and applying this simple formula is only the first step. There are many details to take into account when monitoring for performance & availability, for example:

  • The availability of different services has different impact on your business. What if the failures you experienced during the day all occurred during checkout of a shopping cart? In an e-commerce solution this would indicate a serious issue and would likely require immediate action taken by the operations team.
  • Performance of different services cannot be treated equally. Some operations might have to complete in a few hundred milliseconds, while for others it might be perfectly OK to experience 10+ second response times.
  • Success is sometimes difficult to monitor. When the operation involves a complex calculation, checking whether or not the outcome of the calculation was correct might not be feasible, so the monitoring must rely on just checking the metadata about the operation (duration, response codes, etc.) to decide whether or not the operation completed as expected.
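To make these points a little more concrete, here is a rough sketch of how interactions might be flagged using only metadata, with a different latency limit per service, and how satisfaction could then be reported per service instead of as one global number. Everything in it is an assumption made for illustration: the `Interaction` record, the service names, the latency limits and the rule that status codes below 400 count as success are not taken from any particular tool.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Interaction:
    """Hypothetical metadata captured for one user interaction."""
    service: str
    duration_ms: int
    status_code: int


# Assumed per-service latency limits in milliseconds (illustrative values).
LATENCY_LIMIT_MS = {"checkout": 500, "report-export": 10_000}
DEFAULT_LIMIT_MS = 2_000


def is_successful(interaction: Interaction) -> bool:
    """Flag an interaction using metadata only: response code and duration."""
    limit = LATENCY_LIMIT_MS.get(interaction.service, DEFAULT_LIMIT_MS)
    return interaction.status_code < 400 and interaction.duration_ms <= limit


def satisfaction_by_service(interactions):
    """Return {service: successful / total} for the given interactions."""
    totals = defaultdict(int)
    successes = defaultdict(int)
    for interaction in interactions:
        totals[interaction.service] += 1
        if is_successful(interaction):
            successes[interaction.service] += 1
    return {service: successes[service] / totals[service] for service in totals}


if __name__ == "__main__":
    sample = [
        Interaction("checkout", 300, 200),          # fast and OK
        Interaction("checkout", 900, 200),          # OK but too slow for checkout
        Interaction("report-export", 8_000, 200),   # slow, yet within its limit
        Interaction("report-export", 4_000, 500),   # server error
    ]
    print(satisfaction_by_service(sample))
```

Reporting per service makes it visible when all failures of a day cluster in one critical flow, such as checkout, even if the overall rate still looks healthy.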

So there are many details to cover down the road, but rest assured, you will figure the nuances out along the way. I can only recommend that you adopt the mindset of “failures do happen” and start measuring the real user experience to make sure you stay on top of such failures. Being proud of our own solution in the field, I recommend taking Plumbr out for a trial run to see whether your end users are satisfied with the availability & performance of your application.
