On structured and unstructured data, or the case for cattrs

If you've ever gone through the Mypy docs, you might have seen the section on TypedDict. The section goes on to introduce the feature by stating:

Python programs often use dictionaries with string keys to represent objects. [...] you can use a TypedDict to give a precise type for objects like movie, where the type of each dictionary value depends on the key:

from typing_extensions import TypedDict

Movie = TypedDict('Movie', {'name': str, 'year': int})

movie = {'name': 'Blade Runner', 'year': 1982}  # type: Movie

In other words, TypedDict exists to make dictionaries a little more like classes (in the eyes of Mypy, in this particular case), and is only one example of a growing menagerie of similar efforts to make dictionaries classes.

In this post, I maintain that in modern Python classes already exist, are fit-for-purpose and dictionaries should just be left to be dictionaries.

Value Objects

Pretty much every application and every API has a notion of data models on some level. These are prime examples of structured data - pieces of information with a defined shape (usually the names and types of subfields). The TypedDict example from the introduction defines a data model with two fields. Let's call these pieces of data value objects. Value objects come in a million flavors on many different abstraction layers; they can range from a Django model to a class you define in a one-liner to be able to return multiple values from a function, to just a dictionary. Value objects usually don't have a lot of business logic attached to them so it might be a stretch calling some of these value objects, but let's roll with it here.

In Python, the most natural way of modeling value objects is a class; since an instance of a class is just that - a piece of structured data.

When the TypedDict docs claim that Python programs often use dictionaries to model value objects, they aren't incorrect. The reason for this is, however, that historically Python has not had good tools for using classes for value objects, not that dictionaries are actually good or desireable for this purpose. Let's look at why this is the case.

JSON Value Objects

One of the biggest reasons, I believe, is JSON, probably the most popular serialization format of our time. Python has great tools for converting a piece of JSON into unstructured data (Python primitives, lists and dictionaries) - there's a JSON library included in Python's standard library, and very robust, well-known and performant third-party JSON libraries. Pretty much all Python HTTP libraries (client and server) have special cases for easy handling of JSON payloads.

Now, take into account that the most straightforward way to model a value object in JSON is simply using a JSON object with fields corresponding to the value object fields. So, parsing the JSON payload {"name": "Blade Runner", "year": 1982} into a dictionary is extremely easy, and converting this into a proper Python value object much less so.

Modern Python Value Objects

Historically, creating Python value object classes and populating them with data from somewhere (like a JSON payload) has been very cumbersome. There have been three recent development in the broader Python ecosystem to make this much better.

attrs

We now have attrs. attrs is a Python library for declaratively defining Python classes, and is particularly amazing for modeling value objects. attrs itself has excellent docs and makes a great case against manually writing classes (which it whimsically calls artisinal classes) here. The example nicely illustrates the amount of code needed for a well-behaved value object. No wonder the Python ecosystem used to default to dictionaries.

A small note on dataclasses: the dataclasses module is basically a subset clone of attrs present in the Python standard library. In my opinion, the only use of dataclasses is if you don't have access to third-party libraries (i.e. attrs), for example if you're creating simple scripts that don't require a virtual environment or are writing code for the standard library. If you can use pip you should be using attrs instead, since it's just better.

Field-level type annotations

We now (since Python 3.6) have field-level type annotations in classes (aka PEP 526).

This makes it possible to define a value object thusly:

@attr.define
class Movie:
    name: str
    year: int

The most important part of this PEP is that the type information for the value object fields is available at runtime. (Classes like this were possible before this PEP using type comments, but that's not usable in runtime.)

The field type information is necessary for handling structured data; especially any kind of nested structured data.

cattrs

We now have cattrs. cattrs is my library for efficiently converting between unstructured and structured Python data. To simplify, cattrs ingests dictionaries and spits out classes, and ingests classes and spits out dictionaries. attrs classes are supported out of the box, but anything can be structured and unstructured. For example, the usage docs show how to convert Pendulum DateTime instances to strings, which can then be embedded in JSON.

cattrs uses converters to perform the actual transformations, so the un/structuring logic is not on the value objects themselves. This keeps the value objects leaner and allows you to use different rules for the same value object, depending on the context.

So cattrs is the missing layer between our existing unstructured infrastructure (our JSON/msgpack/bson/whatever libraries) and the rich attrs ecosystem, and the Python type system in general. (cattrs goes to efforts to support higher-level Python type concepts, like enumerations and unions.)

I believe this functionality is sufficiently complex for it to have a layer of its own and that it doesn't really make sense for lower-level infrastructure (like JSON libraries) to implement it itself, since the conversion rules between higher-level components (like Pendulum DateTimes) and their serialized representations need to be very customizable. (In other words, there's a million ways of dumping DateTimes to JSON.)

Also, if the unstructured layer only concerns itself with creating unstructured data, the structuring logic can be in one place. In other words, if you use ujson + cattrs, you can easily switch to msgpack + cattrs later (or at the same time).

Putting it all to use

Let's try putting this to use. Let's say we want to load a movie from a JSON HTTP endpoint.

First, define our value object in code. This serves as documentation, runtime information for cattrs, and type information for Mypy.

@attr.frozen
class Movie:
    name: str
    year: int

Second, grab the unstructured JSON payload.

>>> payload = httpx.get('http://my-movie-url.com/movie').json()

Third, structure the data into our value object (this will throw exceptions if the data is not the shape we expect). If our data is not exotic and doesn't require manual customization, we can just import structure from cattr and use that.

>>> movie = cattr.structure(payload, Movie)

Done!

Addendum: What should dictionaries actually be used for?

The attrs docs already have a great section on what dictionaries should be, so I'll be short in adding my two cents.

If the value type of your dictionary is any sort of union, it's not really a dictionary but a value object in disguise. For the movie example, the type of the dictionary would be dict[str, Union[str, int]], and that's a tell-tale sign something's off (and the raison d'etre for TypedDict). A true dictionary would, for example, be a mapping of IDs to Movies (if movies had IDs), the type of which would be dict[int, Movie]. There's no way to turn this kind of data into a class.