Wall vs CPU time, or the cost of asyncio Tasks

Performing I/O concurrently is one of asyncio's superpowers (if not its main superpower). This is accomplished, directly or indirectly (through helpers like asyncio.gather), by creating and using asyncio tasks. Even if you've never created a task yourself or called gather, you've used them, because every asyncio web framework spawns them to handle incoming requests.

So what's the problem?

Let me present three small coroutines, one performing two async operations in sequence, and two performing them concurrently, using tasks (via gather or manually).

from asyncio import gather, sleep, create_task


async def job_a():
    await sleep(1)


async def in_sequence():
    await job_a()
    await job_a()


async def concurrently_gather():
    await gather(job_a(), job_a())


async def concurrently_tasks():
    t1, t2 = create_task(job_a()), create_task(job_a())
    await t1
    await t2

All of them await job_a twice; the second and third use tasks to await it concurrently. If job_a takes N seconds to finish (say it performs a request to a slow third-party service), the concurrent variants finish in about N seconds, while in_sequence takes about 2 * N seconds. This is usually called wall time: the amount of time that elapses in the real world while the operation runs, the time you'd see pass on a clock on your wall. So in this case, the concurrent variants finish in half the wall time of in_sequence. Does this mean you should always use gather (or create asyncio tasks yourself) to do operations concurrently?
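
If you want to see the wall-time difference yourself, here's a quick sketch that times the coroutines defined above with time.perf_counter; exact numbers will vary, but in_sequence should land around two seconds and the concurrent variants around one.

from asyncio import run
from time import perf_counter


async def timed(coro_func):
    # Wall time: real-world time elapsed around a single invocation.
    start = perf_counter()
    await coro_func()
    print(f"{coro_func.__name__}: {perf_counter() - start:.2f}s of wall time")


async def main():
    await timed(in_sequence)
    await timed(concurrently_gather)
    await timed(concurrently_tasks)


run(main())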

First, we need to introduce an additional concept: CPU time.

CPU time is, roughly, the amount of time a CPU core spends performing work. In the example above, running in_sequence takes around two seconds of wall time to complete, but out of those two seconds the CPU performs only a few dozen microseconds of actual work. The rest of that wall time is spent waiting.
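
A quick way to see the difference is to compare time.perf_counter (wall time) with time.process_time (CPU time of the current process); a coroutine that mostly sleeps advances the former but barely the latter. A rough sketch:

from asyncio import run, sleep
from time import perf_counter, process_time


async def main():
    wall, cpu = perf_counter(), process_time()
    await sleep(2)  # stand-in for in_sequence: ~2 seconds of pure waiting
    print(f"wall time: {perf_counter() - wall:.3f}s")  # roughly two seconds
    print(f"CPU time:  {process_time() - cpu:.3f}s")   # close to zero


run(main())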

CPU time is important because it's one of the main resources cloud providers like AWS and Google sell you, and your utilization of it determines how many virtual machines you need to provision to handle your workload. The more CPU time you spend on a request, the more cores you'll need to dedicate to your workload to maintain acceptable latencies as you scale.

Imagine in_sequence used 1 millisecond of CPU time per request and concurrently used 1000 milliseconds. Keeping in mind that Python is limited by the global interpreter lock, one instance of your service, running on one core, could serve roughly 1000 requests per second with in_sequence but only one per second with concurrently. In other words, the approach that uses less CPU time can handle more concurrent clients and scales more effectively.
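
The back-of-the-envelope math behind that claim, using the made-up 1 ms and 1000 ms figures: one core gives you roughly 1000 milliseconds of CPU time per second of wall time, so dividing that by the CPU time each request needs gives an upper bound on requests per core per second.

# Hypothetical CPU-time costs from the paragraph above, in milliseconds.
cpu_ms_per_request = {"in_sequence": 1, "concurrently": 1000}

for name, cost in cpu_ms_per_request.items():
    # One core provides ~1000 ms of CPU time per second of wall time.
    print(f"{name}: ~{1000 // cost} requests per core per second")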

So let's benchmark the coroutines and see the difference in CPU time used. As always, we can use the pyperf utility, which gained the ability to benchmark coroutines in its latest release. We also replace the sleep with an immediate return to eliminate any waiting, making wall time (which is what pyperf reports) and CPU time equal. I'll use my Ubuntu desktop machine tuned with pyperf system tune, Python 3.10, and the uvloop event loop, since it's very commonly used in production deployments.

from asyncio import create_task, gather


async def job_a():
    return 0


async def in_sequence():
    await job_a()
    await job_a()


async def concurrently_gather():
    await gather(job_a(), job_a())


async def concurrently_tasks():
    t1, t2 = create_task(job_a()), create_task(job_a())
    await t1
    await t2

import uvloop
from pyperf import Runner

uvloop.install()

runner = Runner()
runner.bench_async_func("in_sequence", in_sequence)
runner.bench_async_func("concurrently_gather", concurrently_gather)
runner.bench_async_func("concurrently_tasks", concurrently_tasks)

And for the results:

in_sequence: Mean +- std dev: 478 ns +- 16 ns
concurrently_gather: Mean +- std dev: 20.7 us +- 0.5 us
concurrently_tasks: Mean +- std dev: 8.87 us +- 0.20 us

The gather variant uses ~40x the CPU time of the sequential version, and the bare-task variant ~20x. Unsurprisingly, the machinery needed to enable concurrency costs CPU time to use.
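
To put that in perspective, a rough estimate based on the numbers above: at ~20 microseconds of overhead per gather, a single core would be fully occupied by the concurrency machinery alone at somewhere around 50,000 gather calls per second.

gather_overhead_us = 20.7  # mean from the benchmark above
print(f"~{1_000_000 / gather_overhead_us:,.0f} gathers per second saturate one core")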

So, never gather?

Of course not. As mentioned before, the ability to make concurrent requests is one of asyncio's superpowers. And 20 microseconds is definitely not a large number, but it's not zero. What you should do, though, is use tasks and concurrency judiciously.

If the latency gained through concurrency is something that will materially improve your product, like making your website more responsive or letting your mobile game get past its loading screen sooner, you should definitely pay the price and be happy that you have the option to. However, if the latency gained through concurrency doesn't matter, consider not doing things concurrently.

The concurrency afforded to you by your web framework is enough to unlock essentially all of the operational benefits of asyncio. So if a really hot endpoint takes 10 milliseconds of wall time with extra concurrency but would take 20 milliseconds without it, your users probably won't even notice the difference, and you can get some CPU time back.

Background jobs are another common example of a workload where latency doesn't matter that much. Who cares if a background job finishes in 5 seconds or 6, if that leaves you with more CPU time for other endpoints?

As is so often the case in our line of work, the answer to whether you should use a feature or not is "it depends". Hopefully you now understand the tradeoffs and are closer to making an informed decision.

Tin
Zagreb, Croatia