Lies, statistics and vendors


Reading performance results supplied by vendors is a skill in itself.   It can be difficult to compare numbers from different vendors on a fair basis, and even more difficult to estimate how a product will behave in your system.

Lies and statistics

One of the few quotes from university I remember goes roughly like this:

Peak Performance - A manufacturer's guarantee not to exceed a given rating
-- Computer Architecture, A Quantitative Approach. (1st edition)

At first this appears rather cynical, but over the years I have come to the conclusion that it is unavoidable, and once you accept this you can trust the numbers you are given, provided you see them in a new light.

Why is it so hard to give a trustworthy performance number?

There are many challenges in giving good performance numbers.  Most vendors try hard to give trustworthy numbers, but it is not as easy as it looks.
  • Latencies and throughputs don't follow a normal distribution, which is the basis of much mathematically rigorous statistics.  This means you are modelling something for which there isn't a generally accepted mathematical model.
  • There are many different assumptions you can make, ways to test your solution and ways to represent the results.
  • You need to use benchmarks to measure something, but those benchmarks are either a) not standard, b) not representative of your use case, or c) can be optimised for in ways which don't help you.
  • Vendors understand their products and sensibly select the best hardware for their product.  This works best if you only have one product to consider. Multi-product systems may not have a hardware solution that is optimal for all the products, even if your organisation allowed you to buy the optimal hardware.
  • It is easy to report the best results tested and not include results which were not so good.
Any decent vendor will use their benchmarks to optimise their solution.  The downside of this is that the solution will have been optimised more for the benchmarks they report than for use cases the vendor hasn't tested, e.g. your use case.

BTW: I often find it interesting to see what use cases the vendor had in mind when they benchmark their solutions.  This can be a good indication of a) what it is good for, b) the assumptions made in designing the solution, and c) how it is generally used already.

Should we ignore all benchmarks?

This can lead people to give up on micro-benchmarks and benchmarks in general because they have been "lied" to many times before.

However, used correctly, benchmarks can be a good guide even if they cannot give you definitive or completely reliable answers.  As such, I suggest you should be highly sceptical that small differences in performance give you any indication of what you would expect to see, and only take note of wide variations in performance. By wide variations I mean 3 to 10 times differences.

Percentiles for latency

Customers generally remember the worst service they ever got and take the average service for granted.  When looking at the latency of your systems, it is generally the higher latencies which cause the most issues if not customer complaints.

A common approach for modelling the distribution of latencies is to sort all the latencies and report a sample of the worst.

| Percentile | One in N | Scale | Notes |
|---|---|---|---|
| | | | This is a good indication of what is possible. It is the most optimistic figure you could use. |
| 90% | one in 10 | | This is a better indication of performance if tested on a real, complex system. |
| 99% | one in 100 | | For benchmarks of simplified systems, this is a better indication of what you can realistically expect to achieve. |
| 99.9% | one in 1,000 | | For benchmarks of simplified systems, this is a conservative indication of what you can expect. |
| 99.99% | one in 10,000 | 20x-100x | This number is nice to have but difficult to reproduce, even for the same benchmark, let alone for a different use case. See below. |
| 99.999% | one in 100,000 | | This number is almost impossible to reproduce between systems. See below. |

Generally speaking, the latencies escalate geometrically as you get into the higher percentiles. The very high percentiles have limited value, as you have to take more samples to get a reproducible number even on the same system from one day to the next.  They can vary dramatically based on the use case or system.
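As a minimal sketch of the sort-and-sample approach described above, the following picks the "one in N" latency from a set of recorded samples. The class and method names are hypothetical, and the convention used (the fastest of the worst `length / N` samples) is an assumption; other percentile definitions differ slightly at the edges.

```java
import java.util.Arrays;

// Hypothetical helper, not part of any library.
final class LatencyPercentiles {

    /** Returns a sorted copy so the caller's samples are left untouched. */
    static long[] sortedCopy(long[] samples) {
        long[] copy = Arrays.copyOf(samples, samples.length);
        Arrays.sort(copy);
        return copy;
    }

    /**
     * Latency that roughly one sample in n reaches or exceeds.
     * Assumes sortedLatencies is in ascending order.
     */
    static long oneInN(long[] sortedLatencies, int n) {
        if (sortedLatencies.length == 0 || n < 1) {
            throw new IllegalArgumentException("need at least one sample and n >= 1");
        }
        // Number of samples in the "worst" group; never fewer than one.
        int worstCount = Math.max(1, sortedLatencies.length / n);
        return sortedLatencies[sortedLatencies.length - worstCount];
    }

    public static void main(String[] args) {
        // Made-up latencies in nanoseconds, purely for illustration.
        long[] samples = {120, 95, 101, 3400, 99, 110, 105, 98, 250, 102};
        long[] sorted = sortedCopy(samples);
        for (int n : new int[] {2, 10}) {
            System.out.println("one in " + n + ": " + oneInN(sorted, n) + " ns");
        }
    }
}
```

With only ten samples the "one in 10" figure is simply the worst sample, which illustrates why the higher percentiles need many more samples before they mean anything.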

A guide to the number of samples you need for reproducible numbers

Java has an additional feature: it gets faster as it warms up.  In the past I have advocated removing these warm-up figures, but given micro-benchmarks give overly optimistic figures, I am more inclined to include them, if for no other reason than it is simpler.

My rule of thumb for reproducible percentile figures is that for 1 in N, you need N^1.5 samples for simple micro-benchmarks and N^2 samples for complex systems.
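The rule of thumb can be written as a small calculation. This is only a sketch; the class and method names are hypothetical, and the results are rough guides rather than exact requirements.

```java
// Hypothetical helper illustrating the N^1.5 and N^2 rule of thumb.
final class SampleSizeRule {

    /** Approximate samples needed for a simple micro-benchmark: N^1.5. */
    static double simpleSamples(long oneInN) {
        // N * sqrt(N) is the same as N^1.5.
        return oneInN * Math.sqrt((double) oneInN);
    }

    /** Approximate samples needed for a complex system: N^2. */
    static double complexSamples(long oneInN) {
        return (double) oneInN * (double) oneInN;
    }

    public static void main(String[] args) {
        for (long n = 10; n <= 1_000_000; n *= 10) {
            System.out.printf("one in %,d: simple ~%,.0f, complex ~%,.0f%n",
                    n, simpleSamples(n), complexSamples(n));
        }
    }
}
```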

| Percentile | One in N | Simple test | Complex test |
|---|---|---|---|
| 90% | one in 10 | ~30 | ~100 |
| 99% | one in 100 | ~1,000 | ~10,000 |
| 99.9% | one in 1,000 | ~30,000 | ~1 million |
| 99.99% | one in 10,000 | ~1 million | ~100 million |
| 99.999% | one in 100,000 | ~30 million | ~10 billion |
| 99.9999% | one in 1,000,000 | ~1 billion | ~1 trillion |

Maximum or 100%

Based on this rule of thumb, I don't believe a real maximum can be measured empirically. Nevertheless, not reporting it at all isn't satisfactory either.  Some benchmarks report the "worst in sample", which is better than nothing but very hard to reproduce.

To mitigate the cost of warm up in real systems, I suggest latency critical classes should be pre-loaded, if not warmed up on start up of your application.
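One possible way to pre-load classes at start-up is sketched below, using the standard `Class.forName(String, boolean, ClassLoader)` call. The `ClassPreloader` name and the list of classes to load are hypothetical; you would supply your own latency-critical classes. Note that this only loads and initialises the classes. It does not trigger JIT compilation, so a real warm-up still means exercising the critical code paths.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical start-up helper, not part of any library.
final class ClassPreloader {

    /**
     * Loads and initialises each named class.
     * Returns the names that could not be loaded, so start-up can log them.
     */
    static List<String> preload(String... classNames) {
        List<String> failed = new ArrayList<>();
        ClassLoader loader = ClassPreloader.class.getClassLoader();
        for (String name : classNames) {
            try {
                // 'true' asks the JVM to run static initialisers now rather than on first use.
                Class.forName(name, true, loader);
            } catch (ClassNotFoundException | LinkageError e) {
                failed.add(name);
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        // Example class names only; replace with your own latency-critical classes.
        List<String> failed = preload(
                "java.util.concurrent.ConcurrentHashMap",
                "java.util.concurrent.locks.ReentrantLock");
        System.out.println("Failed to pre-load: " + failed);
    }
}
```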

In summary

If you are looking for a performance figure you can use, I suggest using the 99th percentile as a good indication of what you can expect in a real system.  If you want to be cautious, use the 99.9th percentile.

If this number is not given, I would assume you might get about 10x the average or typical latency and 1/10th of the throughput the vendor can get under ideal conditions.  Usually this is still more than enough. 

If the vendor quotes performance figures close to what you need, or worse, doesn't quote figures at all, beware! I am amazed how many vendors will say they are fast, quick, fastest, efficient or high performance but don't quote any figures at all.

