On structured and unstructured data, or the case for cattrs
If you've ever gone through the Mypy docs, you might have seen the section on TypedDict. The section introduces the feature by stating:
Python programs often use dictionaries with string keys to represent objects. [...] you can use a TypedDict to give a precise type for objects like movie, where the type of each dictionary value depends on the key:
from typing_extensions import TypedDict
Movie = TypedDict('Movie', {'name': str, 'year': int})
movie = {'name': 'Blade Runner', 'year': 1982} # type: Movie
In other words, TypedDict exists to make dictionaries a little more like classes (in the eyes of Mypy, in this particular case), and is only one example of a growing menagerie of similar efforts to make dictionaries behave like classes.
In this post, I maintain that in modern Python, classes already exist and are fit for purpose for modeling structured data, and that dictionaries should just be left to be dictionaries.
Value Objects
Pretty much every application and every API has a notion of data models on some level. These are prime examples of structured data - pieces of information with a defined shape (usually the names and types of subfields). The TypedDict example from the introduction defines a data model with two fields. Let's call these pieces of data value objects. Value objects come in a million flavors on many different abstraction layers; they can range from a Django model, to a class you define in one line so a function can return multiple values, to just a dictionary. Value objects usually don't have a lot of business logic attached to them, so calling some of these value objects might be a stretch, but let's roll with it here.
In Python, the most natural way of modeling value objects is a class, since an instance of a class is just that - a piece of structured data.
When the TypedDict docs claim that Python programs often use dictionaries to model value objects, they aren't incorrect. The reason for this is, however, that historically Python has not had good tools for using classes for value objects, not that dictionaries are actually good or desirable for this purpose. Let's look at why this is the case.
JSON Value Objects
One of the biggest reasons, I believe, is JSON, probably the most popular serialization format of our time. Python has great tools for converting a piece of JSON into unstructured data (Python primitives, lists and dictionaries) - there's a JSON library included in Python's standard library, and very robust, well-known and performant third-party JSON libraries. Pretty much all Python HTTP libraries (client and server) have special cases for easy handling of JSON payloads.
Now, take into account that the most straightforward way to model a value object in JSON is a JSON object with fields corresponding to the value object's fields. So parsing the JSON payload {"name": "Blade Runner", "year": 1982} into a dictionary is extremely easy, while converting it into a proper Python value object is much less so.
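To make the asymmetry concrete, here's a minimal standard-library sketch (the payload is the movie example from above):

import json

payload = '{"name": "Blade Runner", "year": 1982}'

# The unstructured half is a one-liner:
movie_dict = json.loads(payload)
movie_dict["name"]  # 'Blade Runner'

# Going from movie_dict to an actual Movie instance - checking
# fields, converting types, handling nesting - is left entirely to you.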
Modern Python Value Objects
Historically, creating Python value object classes and populating them with data from somewhere (like a JSON payload) has been very cumbersome. There have been three recent developments in the broader Python ecosystem that make this much better.
attrs
We now have attrs. attrs is a Python library for declaratively defining Python classes, and is particularly amazing for modeling value objects. attrs itself has excellent docs and makes a great case against manually writing classes (which it whimsically calls artisanal classes) here. The example nicely illustrates the amount of code needed for a well-behaved value object. No wonder the Python ecosystem used to default to dictionaries.
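To give a flavor of the argument (a condensed sketch in the same spirit, not the actual example from the attrs docs): a minimally well-behaved Movie needs at least __init__, __repr__ and __eq__, all hand-written and all kept in sync by hand.

class Movie:
    def __init__(self, name, year):
        self.name = name
        self.year = year

    def __repr__(self):
        return f"Movie(name={self.name!r}, year={self.year!r})"

    def __eq__(self, other):
        if other.__class__ is self.__class__:
            return (self.name, self.year) == (other.name, other.year)
        return NotImplemented

attrs generates all of this from the three-line declaration we'll see in the next section.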
A small note on dataclasses: the dataclasses module is basically a subset clone of attrs that lives in the Python standard library. In my opinion, the only reason to use dataclasses is if you don't have access to third-party libraries (i.e. attrs), for example if you're writing simple scripts that don't warrant a virtual environment, or code for the standard library itself. If you can use pip, you should be using attrs instead, since it's just better.
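For reference, the dataclasses spelling of the same value object looks nearly identical to the attrs version shown in the next section (a sketch):

from dataclasses import dataclass

@dataclass
class Movie:
    name: str
    year: int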
Field-level type annotations
We now (since Python 3.6) have field-level type annotations in classes (aka PEP 526).
This makes it possible to define a value object thusly:
import attr

@attr.define
class Movie:
    name: str
    year: int
The most important part of this PEP is that the type information for the value object fields is available at runtime. (Classes like this were possible before this PEP using type comments, but those aren't usable at runtime.)
The field type information is necessary for handling structured data, especially any kind of nested structured data.
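As a small illustration (using the Movie class just defined; both calls are standard attrs/typing introspection):

import typing

import attr

typing.get_type_hints(Movie)  # {'name': <class 'str'>, 'year': <class 'int'>}
attr.fields(Movie).name.type  # <class 'str'>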
cattrs
We now have cattrs. cattrs is my library for efficiently converting between unstructured and structured Python data. To simplify: cattrs ingests dictionaries and spits out classes, and ingests classes and spits out dictionaries. attrs classes are supported out of the box, but anything can be structured and unstructured. For example, the usage docs show how to convert Pendulum DateTime instances to strings, which can then be embedded in JSON.
cattrs uses converters to perform the actual transformations, so the un/structuring logic is not on the value objects themselves. This keeps the value objects leaner and allows you to use different rules for the same value object, depending on the context.
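Here's a small sketch of that idea; it uses the standard library's datetime instead of Pendulum to stay dependency-free, and the Showing class is made up for the example:

from datetime import datetime

import attr
import cattr

@attr.define
class Showing:
    movie: str
    starts_at: datetime

converter = cattr.Converter()

# This converter renders datetimes as ISO 8601 strings...
converter.register_unstructure_hook(datetime, lambda dt: dt.isoformat())
# ...and parses them back when structuring.
converter.register_structure_hook(datetime, lambda v, _: datetime.fromisoformat(v))

showing = Showing("Blade Runner", datetime(2021, 1, 1, 20, 0))
data = converter.unstructure(showing)
# {'movie': 'Blade Runner', 'starts_at': '2021-01-01T20:00:00'}
assert converter.structure(data, Showing) == showing

A second Converter could register a different datetime representation (say, Unix timestamps) for the same class; the value object itself never changes.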
So cattrs is the missing layer between our existing unstructured infrastructure (our JSON/msgpack/bson/whatever libraries) and the rich attrs ecosystem, and the Python type system in general. (cattrs goes to some effort to support higher-level Python type concepts, like enumerations and unions.)
I believe this functionality is sufficiently complex to deserve a layer of its own, and that it doesn't really make sense for lower-level infrastructure (like JSON libraries) to implement it itself, since the conversion rules between higher-level components (like Pendulum DateTimes) and their serialized representations need to be very customizable. (In other words, there are a million ways of dumping DateTimes to JSON.)
Also, if the unstructured layer only concerns itself with creating unstructured data, the structuring logic can live in one place. In other words, if you use ujson + cattrs, you can easily switch to msgpack + cattrs later (or use both at the same time).
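Concretely, the split looks something like this (a sketch; msgpack's packb/unpackb are its usual entry points):

import json

import attr
import cattr
import msgpack

@attr.define
class Movie:
    name: str
    year: int

movie = Movie("Blade Runner", 1982)
unstructured = cattr.unstructure(movie)

# The same unstructured dict feeds either serialization library...
as_json = json.dumps(unstructured)
as_msgpack = msgpack.packb(unstructured)

# ...and structuring stays in one place, whatever the wire format:
assert cattr.structure(json.loads(as_json), Movie) == movie
assert cattr.structure(msgpack.unpackb(as_msgpack), Movie) == movie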
Putting it all to use
Let's put this to use. Say we want to load a movie from a JSON HTTP endpoint.
First, define our value object in code. This serves as documentation, runtime information for cattrs, and type information for Mypy.
import attr

@attr.frozen
class Movie:
    name: str
    year: int
Second, grab the unstructured JSON payload.
>>> import httpx
>>> payload = httpx.get('http://my-movie-url.com/movie').json()
Third, structure the data into our value object (this will throw exceptions if the data is not the shape we expect). If our data is not exotic and doesn't require manual customization, we can just import structure from cattr and use that.
>>> movie = cattr.structure(payload, Movie)
Done!
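The same call scales to nested shapes. For example, if a hypothetical variant of the endpoint returned a JSON array of movies:

>>> from typing import List
>>> payload = httpx.get('http://my-movie-url.com/movies').json()
>>> cattr.structure(payload, List[Movie])
[Movie(name='Blade Runner', year=1982)]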
Addendum: What should dictionaries actually be used for?
The attrs docs already have a great section on what dictionaries should be, so I'll keep my two cents short.
If the value type of your dictionary is any sort of union, it's not really a dictionary but a value object in disguise. For the movie example, the type of the dictionary would be dict[str, Union[str, int]], and that's a tell-tale sign something's off (and the raison d'être of TypedDict). A true dictionary would, for example, be a mapping of IDs to Movies (if movies had IDs), the type of which would be dict[int, Movie]. There's no way to turn this kind of data into a class.
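That said, the two layers compose: here's a sketch of structuring such a mapping with cattrs (using the Movie class from above and made-up data), where the dictionary stays a dictionary and the values become value objects.

>>> from typing import Dict
>>> data = {"1": {"name": "Blade Runner", "year": 1982}}
>>> cattr.structure(data, Dict[int, Movie])
{1: Movie(name='Blade Runner', year=1982)}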