An Inconvenient Latency

Overview

Vendors typically publish numbers they are happy with and avoid telling you about a product's weaknesses.  However, there is a dirty secret behind the numbers, if you know where to look.

Why don't we use GPUs for everything?

Some problems naturally scale to thousands of data points or tasks; for many others, finding that much parallelism is very hard.

GPUs are designed for large vector operations.  However, what counts as "large", and why does it matter?

Say you have a GPU with 1024 cores. It can process a vector of length 1024 all at once, and a vector of length 2048 in double the time.  But what happens if you only have a vector of 100, or 10, or 1?  The inconvenient answer is that it takes the same amount of time, because you can't make use of all of your cores at once.  You get only about 10%, 1% or 0.1% of the peak performance.  To get the best efficiency, you want a problem with many thousands of values which can be processed concurrently.  If your problem is not that concurrent, you are not going to get anywhere near maximum performance.
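
As a back-of-the-envelope sketch (assuming a hypothetical 1024-core device where each pass can process up to 1024 elements, and ignoring warps, scheduling and memory effects), the utilisation for a given vector length looks like this:

    // Sketch: fraction of a hypothetical 1024-core device used for a given
    // vector length, assuming each pass can process up to 1024 elements.
    public class GpuUtilisation {
        static final int CORES = 1024; // assumed core count

        static double utilisation(int vectorLength) {
            int passes = (vectorLength + CORES - 1) / CORES; // passes needed
            return (double) vectorLength / (passes * CORES);
        }

        public static void main(String[] args) {
            for (int n : new int[]{2048, 1024, 100, 10, 1})
                System.out.printf("vector of %4d -> %4.1f%% of peak%n",
                        n, 100 * utilisation(n));
        }
    }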

Unlike a screen full of pixels, a lot of business logic deals with only a small number of values at a time, so you want a small number of fast cores which can each perform independent tasks at once.

Why web servers favour horizontal scalability

Web services scale well up to one task per active user.

To use multiple cores, you need a problem which naturally breaks into many independent tasks. When you provide a web service, you have many users and the work for each user is largely independent.

If you have ten concurrent users, you can expect close to ten times the throughput of having one user at a time. If you have one hundred users, you can expect up to ten times the throughput of ten users, and so on.

The limit of your concurrency is around the number of active/concurrent users you have.
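
To make that concrete, here is a minimal sketch (the request handling is a made-up placeholder) of the shape of a web workload: one independent task per active user, so a pool of worker threads can use every core:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    // Sketch: one independent task per active user. Because the tasks share
    // no state, throughput scales with the number of users, up to roughly
    // the number of cores available.
    public class PerUserTasks {
        public static void main(String[] args) throws Exception {
            int activeUsers = 100; // hypothetical number of concurrent users
            ExecutorService pool = Executors.newFixedThreadPool(
                    Runtime.getRuntime().availableProcessors());
            List<Future<String>> responses = new ArrayList<>();
            for (int u = 0; u < activeUsers; u++) {
                int user = u;
                responses.add(pool.submit(() -> "response for user " + user));
            }
            for (Future<String> response : responses)
                response.get(); // wait for every user's response
            pool.shutdown();
        }
    }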

Throughput, latency and concurrency

When a vendor benchmarks their product, a common measure is throughput: the total number of operations per second, sustained over a significant period of time, ideally many minutes.
Vendors are increasingly publishing average latency as well.  Average latency is a fairly optimistic number, as it is good at hiding a small number of particularly bad latencies. For example, if you want to hide long GC pauses, quote the average latency.

The problem for vendors is that these two measures can be combined to determine the minimum concurrency required to achieve the claimed throughput.

A given problem has a "natural concurrency" it can easily be broken into. If you have 100 concurrent users, you may have a natural concurrency of about 100.  There is also a "minimum concurrency" implied by a benchmark.

minimum concurrency = throughput * latency (with latency in seconds).

To achieve the throughput a benchmark suggests, your natural concurrency should be greater than the minimum concurrency implied by the benchmark.  If you have less natural concurrency, you can expect only pro-rata throughput as well.
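
As a minimal sketch (using the figures from the tables below), the check is a single multiplication:

    // Sketch: minimum concurrency implied by a published throughput and
    // average latency, i.e. minimum concurrency = throughput * latency.
    public class MinimumConcurrency {
        static long minConcurrency(double throughputPerSec, double latencySec) {
            return Math.round(throughputPerSec * latencySec);
        }

        public static void main(String[] args) {
            System.out.println(minConcurrency(1_000_000, 0.046));    // Product 1: 46000
            System.out.println(minConcurrency(  600_000, 0.01));     // Product 2: 6000
            System.out.println(minConcurrency(2_000_000, 0.000002)); // Product 3: 4
        }
    }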

Consider these benchmarks for three key-value stores.
key-store    throughput     latency
Product 1    1,000,000/s    46 ms    or 0.046 s
Product 2      600,000/s    10 ms    or 0.01 s
Product 3    2,000,000/s    0.002 ms or 0.000002 s

You might look at this table and say they all look good; they all have high throughputs and low enough latencies.  The problem is that there is an implied natural concurrency requirement to achieve the stated throughput at the measured latency.  Let's see what minimum concurrency is needed to achieve each benchmark.

key-store    throughput     latency                   minimum concurrency
Product 1    1,000,000/s    46 ms    or 0.046 s       46,000
Product 2      600,000/s    10 ms    or 0.01 s         6,000
Product 3    2,000,000/s    0.002 ms or 0.000002 s         4

Many problems have around 4 independent tasks, but 46K is pretty high.  So what?  What if you only have 10 concurrent tasks/users?

key-store    concurrency    latency                   throughput achieved
Product 1    10             46 ms    or 0.046 s              220/s
Product 2    10             10 ms    or 0.01 s             1,000/s
Product 3    10             0.002 ms or 0.000002 s      2,000,000/s

Note: having more concurrency than you need doesn't help throughput much, but a lack of natural concurrency in your problem will hurt your throughput (and horizontal scalability).
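
A minimal sketch of where that last column comes from (assuming throughput is capped both by the product's peak rate and by concurrency divided by latency, using the figures from the tables above):

    // Sketch: throughput achievable with a given natural concurrency. You are
    // limited by the product's peak rate and by how many operations your
    // concurrency can keep in flight per second (concurrency / latency).
    public class AchievedThroughput {
        static double achieved(double peakThroughput, int concurrency, double latencySec) {
            return Math.min(peakThroughput, concurrency / latencySec);
        }

        public static void main(String[] args) {
            int concurrency = 10; // e.g. ten concurrent tasks/users
            System.out.printf("Product 1: %,.0f/s%n", achieved(1_000_000, concurrency, 0.046));    // ~217/s, about 220/s
            System.out.printf("Product 2: %,.0f/s%n", achieved(  600_000, concurrency, 0.01));     // 1,000/s
            System.out.printf("Product 3: %,.0f/s%n", achieved(2_000_000, concurrency, 0.000002)); // 2,000,000/s
        }
    }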

Conclusion

Next time you read a benchmark which includes throughput and average latency, multiply them together to see what level of concurrency would be required to achieve that throughput.  Compare this with the natural concurrency of the problem you are trying to solve to see whether the solution fits your problem.

If you have a high natural concurrency, you have more solutions you can consider; if you have a low natural concurrency, you need a solution with a low latency.
