cattrs I: un/structuring speed

Over the years, I've put a lot of effort into making cattrs fast.

Let's take a look at how the cattrs structuring code evolved during this time.

cattrs is a library for transforming data, and one of its main uses is transforming attrs classes to and from dictionaries. (The dictionaries are then fed to a serialization library, like json, ujson or msgpack.)

For the purposes of this article, imagine a class like this:

from attrs import define

@define
class Test:
    a: int
    b: int
    c: float
    d: float
    e: str
    f: str

Now imagine that, having parsed a JSON payload from an HTTP request into a dictionary, you want to transform that dictionary into an instance of this class.
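Here's the end result we're after, using the top-level cattrs API (the payload values here are made up for illustration):

from cattrs import structure

payload = {"a": 1, "b": 2, "c": 3.0, "d": 4.0, "e": "e", "f": "f"}
instance = structure(payload, Test)
# Test(a=1, b=2, c=3.0, d=4.0, e='e', f='f')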

v1: The Early Days

Back in the early days of cattrs, say circa 2019 (which is simultaneously a lifetime ago, and yesterday), if you asked cattrs to structure your payload into an instance of this class, cattrs would perform something roughly similar to this (very simplified) code:

from attrs import fields

def structure_attrs_fromdict(payload, cls):
    res = {}
    for field in fields(cls):
        # The structuring function for the field's type is looked up
        # on every single call - this is the expensive part.
        attribute_converter = get_converter(field.type)
        res[field.name] = attribute_converter(payload[field.name])

    return cls(**res)

Even though this code path is considered suboptimal nowadays, it still exists in the (now legacy, and soon to be renamed) Converter class of cattrs.

This code does an OK job, but it's somewhat slow. The converter of each field (in this case, two ints, two floats, and two strings) needs to be looked up on each and every call, even though the converters will almost certainly not change during the lifetime of the converter. And there's also the overhead of the for loop: creating an iterator and iterating over it on every call.

Another problem is that any additional feature slows it down further. Maybe you want to rename a field - that's more processing on every call. Maybe you want to omit a field altogether - that's an additional if statement for every attribute. I figured there had to be a better way.

As a side note: on my MacBook Pro, on CPython 3.9, applying this structure function to an appropriate dictionary takes around 4.9 microseconds. That's not too bad, but keep in mind this is a very simple case. In real life, classes have more fields, they are often nested, and the field types and their converters are more complex.
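Here's a minimal way to measure something like that yourself, using the standard library's timeit module. This is a rough harness with a trivial get_converter stand-in filling in for the real dispatch, not the exact setup behind the numbers in this post:

from timeit import timeit

# Stand-in for the real converter lookup: for this class, int, float
# and str can act as their own structuring functions.
def get_converter(type):
    return type

payload = {"a": 1, "b": 2, "c": 3.0, "d": 4.0, "e": "e", "f": "f"}

per_call = timeit(lambda: structure_attrs_fromdict(payload, Test), number=100_000) / 100_000
print(f"{per_call * 1e6:.2f} us per call")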

v2: The GenConverter

Back in those days, I had this (in retrospect, very silly) notion that people wouldn't use cattrs if it was slower than what they could write themselves. I asked myself: given a class like Test, how would a human write its structuring function? Then I remembered this is Python, and in Python, code can write code. If you can write an optimized structuring function for your class, so could cattrs. And the GenConverter (generating converter) was born.

When the GenConverter sees Test for the first time, it'll generate a custom structuring function for it. By using Python's inspect module, we can look at the actual generated source ourselves.

>>> from inspect import getsourcelines
>>> from cattrs import GenConverter

>>> c = GenConverter()
>>> f = c._structure_func.dispatch(Test)

>>> for l in getsourcelines(f)[0]: print(l)

def structure_Test(o, *_):
  res = {
    'a': structure_a(o['a'], type_a),
    'b': structure_b(o['b'], type_b),
    'c': structure_c(o['c'], type_c),
    'd': structure_d(o['d'], type_d),
    'e': structure_e(o['e'], type_e),
    'f': structure_f(o['f'], type_f),
    }
  return __cl(**res)

As you can see, this is a lot like what you'd write yourself. When cattrs generates this function, it provides a dictionary of global variables to it - that's where structure_a et al. come from.

The main benefit of this approach is that it splits the work into two phases: the generation phase and the run phase. The generation phase takes significantly longer, but it happens only once per class. It outputs the structure_Test function, which is then cached and run on every structure call.
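Conceptually, the caching works something like this (a rough sketch with hypothetical names, not the actual cattrs internals):

_structure_fn_cache = {}

def structure(payload, cls):
    try:
        fn = _structure_fn_cache[cls]
    except KeyError:
        # Generation phase: expensive, but runs only once per class.
        # make_structure_fn is a hypothetical stand-in for the code generator.
        fn = _structure_fn_cache[cls] = make_structure_fn(cls)
    # Run phase: cheap, runs on every call.
    return fn(payload)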

This also means we can do more work in the generation phase, which enables an entire class of features at essentially no runtime cost. Renaming fields and handling generic classes fall into this category, alongside resolving the individual field converters.
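Renaming, for instance, can be requested when the function is generated, so the renamed key is baked directly into the generated source and costs nothing extra at run time. A quick sketch using the gen module (this mirrors the documented cattrs customization API):

from cattrs.gen import make_dict_structure_fn, override

c = GenConverter()
# Generate a hook that reads the 'a' field from the 'A' key instead.
hook = make_dict_structure_fn(Test, c, a=override(rename="A"))
c.register_structure_hook(Test, hook)

c.structure({"A": 1, "b": 2, "c": 3.0, "d": 4.0, "e": "e", "f": "f"}, Test)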

(Note that if the app you're working on cannot afford to pay the upfront cost of compiling the function, you can still use the old Converter code path instead. For example, if you're writing a CLI app where fast startup is crucial.)

This code path has been the default since cattrs 1.7, released in May 2021. On my Mac, it takes ~2.3 microseconds to run. That's more than twice the speed of the old approach.

For a while, I imagined this was as optimized as Python code could get. Then, a few days ago, while reading a thread on the python-dev mailing list, I realized I was wrong.

v3: Better Living Through Better Bytecode

cattrs generates Python code as plain lines of textual source code. The Python interpreter then ingests this text using the compile and eval builtins and produces a function object we can call.
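In miniature, the mechanism looks something like this (a simplified sketch of the idea, not the exact machinery cattrs uses):

# The generated source, as a plain string.
src = """
def structure_Test(o, *_):
    return Test(int(o['a']), int(o['b']), float(o['c']),
                float(o['d']), str(o['e']), str(o['f']))
"""

globs = {"Test": Test}  # the globals dictionary provided to the function
eval(compile(src, "<cattrs generated>", "exec"), globs)
f = globs["structure_Test"]

f({"a": 1, "b": 2, "c": 3.0, "d": 4.0, "e": "e", "f": "f"})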

There's another layer sitting in the middle, though, and it's not as obvious: the function bytecode. This bytecode is a sequence of low-level instructions the Python interpreter executes when the function is called. Let's take a look at the bytecode of our generated structure_Test using Python's dis module.

>>> from dis import dis
>>> dis(f)
  3           0 LOAD_GLOBAL              0 (structure_a)
              2 LOAD_FAST                0 (o)
              4 LOAD_CONST               1 ('a')
              6 BINARY_SUBSCR
              8 LOAD_GLOBAL              1 (type_a)
             10 CALL_FUNCTION            2

  4          12 LOAD_GLOBAL              2 (structure_b)
             14 LOAD_FAST                0 (o)
             16 LOAD_CONST               2 ('b')
             18 BINARY_SUBSCR
             20 LOAD_GLOBAL              3 (type_b)
             22 CALL_FUNCTION            2
             
         ... similar lines omitted ...
         
  2          72 LOAD_CONST               7 (('a', 'b', 'c', 'd', 'e', 'f'))
             74 BUILD_CONST_KEY_MAP      6
             76 STORE_FAST               2 (res)

 10          78 LOAD_GLOBAL             12 (__cl)
             80 BUILD_TUPLE              0
             82 BUILD_MAP                0
             84 LOAD_FAST                2 (res)
             86 DICT_MERGE               1
             88 CALL_FUNCTION_EX         1
             90 RETURN_VALUE

This function uses LOAD_GLOBAL a lot - no wonder, since we provide a lot of data through the function globals. It also turns out that Python functions essentially have access to two scopes - the global scope and the local scope - and loading objects from the global scope is a lot slower than loading them from the local scope!

When an object is loaded from the local scope, you'll see the LOAD_FAST instruction instead of LOAD_GLOBAL. The local scope is mostly used for variables defined in the function, hence the name, and it's fast because locals live in a fixed-size array and are read by index, while LOAD_GLOBAL has to perform dictionary lookups (first in the module globals, then in the builtins). Wouldn't it be great if we could generate our function to read all the objects we've prepared for it from the local namespace?

We can, using a trick: we can get all these objects into the local scope by setting them as the default values of dummy parameters. The expected interface of the structure function is simple, so no one will know or mind if we stick a bunch of keyword-only parameters with defaults into the function signature.
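Here's the trick in isolation (an illustrative sketch; the names are made up):

def lookup_global(o):
    # len is read from the global scope with LOAD_GLOBAL on every call.
    return len(o)

def lookup_local(o, *, _len=len):
    # len was captured as a default value at definition time;
    # _len is a local, read with LOAD_FAST.
    return _len(o)

Both behave identically from the caller's point of view, but running dis() on them shows LOAD_GLOBAL in the first and LOAD_FAST in the second.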

So the next version of cattrs, 22.1.0 (oh yeah, we're switching to CalVer, just like attrs), will do just that.

First, let's take a look at the new Python source:

>>> for l in getsourcelines(f)[0]: print(l)
def structure_Test(o, _, *, __cl=__cl, __c_structure_a=__c_structure_a, __c_structure_b=__c_structure_b, __c_structure_c=__c_structure_c, __c_structure_d=__c_structure_d, __c_structure_e=__c_structure_e, __c_structure_f=__c_structure_f):
  return __cl(
    __c_structure_a(o['a']),
    __c_structure_b(o['b']),
    __c_structure_c(o['c']),
    __c_structure_d(o['d']),
    __c_structure_e(o['e']),
    __c_structure_f(o['f']),
  )

As I mentioned, we've stuck a bunch of variables into the function signature, so the signature is super messy right now. We could deduplicate the names later, but it doesn't really matter.

I've also applied one more optimization: we don't generate a temporary dictionary to hold the individual fields unless we need to, and here we don't - the field values are passed to the class positionally.

Now if we look at the bytecode:

>>> dis(f)
  2           0 LOAD_FAST                2 (__cl)

  3           2 LOAD_FAST                3 (__c_structure_a)
              4 LOAD_FAST                0 (o)
              6 LOAD_CONST               1 ('a')
              8 BINARY_SUBSCR
             10 CALL_FUNCTION            1

  4          12 LOAD_FAST                4 (__c_structure_b)
             14 LOAD_FAST                0 (o)
             16 LOAD_CONST               2 ('b')
             18 BINARY_SUBSCR
             20 CALL_FUNCTION            1
             
         ... similar lines omitted ...
             
  8          52 LOAD_FAST                8 (__c_structure_f)
             54 LOAD_FAST                0 (o)
             56 LOAD_CONST               6 ('f')
             58 BINARY_SUBSCR
             60 CALL_FUNCTION            1

  2          62 CALL_FUNCTION            6
             64 RETURN_VALUE

Sweet, sweet LOAD_FAST, and no LOAD_GLOBAL in sight. But what are the fruits of our labor?

On my Mac, this version takes ~1.39 microseconds to run. That's 60% of the v2 run time, and 28% of the v1 run time. Not too bad, if I do say so myself.

CPython 3.11a4 and PyPy

I've been using CPython 3.9 to run these benchmarks, as that's the version I'm using for work at this time. Out of curiosity, I've also run them on the latest alpha of CPython 3.11 (a4) and the latest stable PyPy release (7.3.7), to see how they compare to my old workhorse, 3.9.

CPython 3.11a4

  • v1: 3.57 us +- 0.14 us
  • v2: 1.70 us +- 0.04 us
  • v3: 1.09 us +- 0.02 us

Looks like 3.11 is ~27% faster than 3.9 for the v3 approach. Great news!

PyPy

  • v1: 1.68 us +- 0.05 us
  • v2: 479 ns +- 13 ns
  • v3: 228 ns +- 6 ns

PyPy is still the king of crunching CPU-bound code. 228 nanoseconds to structure a 6-field class, that's gotta be close to native, right?

Tin
Zagreb, Croatia