The Categories of Bugs in Python Apps

Say you're writing a Python program, of any kind but maybe a network service. You're likely to err (you're human, after all) and produce errors (bugs, defects) during this process. We can't control whether we make mistakes or not but there are steps we can take to control what kinds of errors we write.

For this article, let's invent a really simple model for the categories of errors we might write, from best to worst. Then let's look at how, by being mindful of the tools we use and how we use them, we can lower the categories of some of our defects to create better software.

Category 1: Type-checking and Linting Errors

Simple: you call a function and instead of passing in an integer, you pass in a string. You save the file and your editor runs Mypy, or maybe you switch to your terminal and run make lint, which then runs Mypy. Mypy yells at you, you fix it, you move on.
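
Here's a minimal sketch of what that looks like. The function name retry_delay and its parameters are made up for illustration; the commented-out call at the bottom is the kind of mistake Mypy reports before the code ever runs.

```python
def retry_delay(attempt: int, base_seconds: int = 2) -> int:
    """Return the back-off delay for a given retry attempt."""
    return base_seconds * attempt


# Correct call: the argument is an integer.
print(retry_delay(3))  # prints 6

# Category 1 defect, left commented out on purpose:
# a type checker rejects this call because "3" is a str, not an int.
# retry_delay("3")
```

Worth noting: without the type checker, this particular bad call wouldn't even raise. In Python, 2 * "3" evaluates to the string "33", so the function would quietly return garbage.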

These are the best defects because:

  • they are surfaced immediately; the time to discovery is measured in seconds.
  • they are fairly well-localized; it's generally obvious where the issue is exactly.
  • if you have typechecking or linting already set up, they are very low-cost: you just need to run the checkers instead of, for example, writing a test.
  • assuming your CI is set up correctly and refuses to deploy on linting failures, they are stopped early so they're safe; your users (and your SLA, including any folks on-call) will not be affected by them.

A small sidenote: type errors are cheap only if you're actually in a position to use typing. If you have experience with typing and you're starting a greenfield project (so you can choose your libraries), the cost of setting up typing initially is practically zero. If you have little-to-no typing experience and work with an existing codebase using frameworks without robust typing support (Django being somewhat in this category), the cost may be prohibitive. In that case let the existence of cat 1 errors be a motivator to learn and put yourself in the position to use typing in the future.

Category 2: Import-time Explosions

Your service starts up, the start-up procedure runs some setup code, this code raises an exception thus aborting the start-up procedure and crashing your service.

These defects are the second-best you can have. To make full use of them you need to either have a test suite that executes the start-up procedure or have your deployment platform require your service to pass a readiness check before the service is put in commission.

  • they are also fairly well-localized; by looking at your service logs you should be able to see where exactly the error is thrown, and why.
  • having your CI run a test suite that runs this logic or having your deployment platform run a readiness check are also fairly low-cost, not that difficult to set up in terms of time and complexity. The cost is still higher than just running a lint pass.
  • they are also stopped early and somewhat safe; your production shouldn't be affected. You will need some sort of alerting that your deployment cannot successfully start so someone can figure out what's happening. For example, if a pod from a new version of a deployment won't start up, Kubernetes will not proceed with the rollout, thus saving you from a non-functional service. This is not a panacea though: if left unchecked for too long, the pods from the old deployment might get removed anyway, maybe by your cluster autoscaler starting and stopping the underlying machines.
  • they aren't surfaced immediately; at best when you run the test suite, and at worst when the service gets deployed and fails to start.
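
As a rough sketch of setup code that fails loudly during start-up, here's a settings loader that validates its input once, up front. The function load_settings and the setting names are hypothetical, chosen only for illustration.

```python
from collections.abc import Mapping


def load_settings(env: Mapping[str, str]) -> dict[str, str]:
    """Validate required configuration once, during start-up."""
    required = ("DATABASE_URL", "API_KEY")  # hypothetical setting names
    missing = [name for name in required if name not in env]
    if missing:
        # Raising here aborts start-up, so a test or readiness check sees it.
        raise RuntimeError(f"missing settings: {', '.join(missing)}")
    return {name: env[name] for name in required}


# A real service would call load_settings(os.environ) at module level or in
# its start-up hook; a plain dict keeps this sketch self-contained.
settings = load_settings({"DATABASE_URL": "postgres://example", "API_KEY": "secret"})
print(sorted(settings))
```

If a setting is missing, the service never finishes starting, instead of failing later on the first request that needs the value.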

Certain kinds of defects cannot be handled by the Python type system (or maybe any type system) so this is the best that can be done.

This approach has the downside that it requires logic to run during start-up, and the more logic the better. This can make your start-up slow, which makes for a worse development experience. If your application is a CLI thing, the start-up time can be a very important feature by itself.

Now you get to choose between conflicting constraints. If only these were typechecking errors, huh?

Category 3: Runtime Explosions

Your service deploys correctly, but whenever a user hits a particular endpoint an exception is raised and the user gets an error response.

We're getting into pretty bad territory now.

  • while it may be easy to see what the problem is when the defect finally triggers, it may not be obvious when the defect will actually get triggered. It may be right after the deploy, and it may be on a Saturday at 3 AM when a user finally hits that particular if branch of that particular endpoint. If the endpoint doesn't get much traffic, you'll require pretty good alerting and observability to actually learn you have a problem.
  • the only way to guard against this is a thorough test suite for all your endpoints. This is pretty expensive in terms of developer time and codebase complexity.
  • these defects are caught late, so your users will see them and maybe be frustrated with your product or lose trust. Your SLA may be affected.

One good thing about runtime explosions is that it's much better to raise an error than do the wrong thing silently. At least your database state won't get inconsistent (you're doing stuff transactionally, right?) and, if someone looks at your error logging, they will actually see the defect and hopefully a stack trace.

Even so, wouldn't it be so much better if these were import-time errors?

Category 4: Doing the Wrong Thing Silently

Your endpoint handles the request without an error, but instead of subtracting N dollars from a user, it adds N dollars to the user's account. No one is any wiser except maybe the user.

Have you ever received an email that begins with the literal string "Hello ${firstName}"? That's someone getting a category 4 into production.

These are terrible. We're getting to defects that could existentially threaten your project or employer.

  • the defect doesn't generate an actual error, so it's extremely difficult to detect. You'll probably hear about it from support or the company leadership at the worst possible time, and it'll need handling immediately. Hope you didn't have plans this weekend.
  • apart from the aforementioned thorough test suite, if this is a core thing you will likely need a tracking system to do clawbacks. The tracking system can just be normal logs, as long as your logging system is reliable and has good retention. You might want a periodic auditing job running. This gets into really expensive territory; the kind that might require a team of its own.
  • because there are no actual errors, good luck figuring out where exactly the issue is.

Ooph, these sure do suck. I'd trade these for a runtime explosion defect anytime.
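
The standard library's string.Template happens to illustrate that trade nicely. This is only a sketch with made-up data, but both methods are real: safe_substitute leaves unknown placeholders in place, while substitute raises.

```python
from string import Template

greeting = Template("Hello ${firstName}")
data = {"first_name": "Ada"}  # key is misspelled relative to the template

# safe_substitute leaves unknown placeholders untouched: a silent category 4.
print(greeting.safe_substitute(data))  # prints: Hello ${firstName}

# substitute raises KeyError for the same mistake: a loud category 3.
try:
    greeting.substitute(data)
except KeyError as exc:
    print(f"missing template field: {exc}")
```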

Now, this model isn't perfect. The fact of the matter is, even sophisticated type-checking cannot guard against certain types of category 4's, so you will probably want a test suite in any case. This means the cost of a test suite is amortized somewhat, which can change the calculus a little. There are other factors in play, such as how much development velocity means versus correctness; a gaming backend will have different constraints than a financial services one.

That said, I've found this model to still be very useful.

Let's Talk Strategy

The conclusion is simple: if you want to make your software more robust, you need to lower the categories of as many defects as you can.

Turn your silent manglings into runtime explosions. Turn your runtime explosions into start-up explosions, and turn your start-up errors into typechecking errors. Turning a category 4 into a category 1 would be an amazing win in my book; I'd be willing to compromise a lot to get a PR like that merged into something I'm responsible for. Then test what's left; the more defects you demote in category, the less testing you'll need.
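
One small, everyday instance of turning a runtime explosion into a start-up explosion is compiling a regular expression at module level rather than inside the handler. The sketch below assumes a hypothetical order-id format and helper name; only re.compile and its error behavior come from the standard library.

```python
import re

# Compiled at import time: a malformed pattern raises re.error as soon as the
# module is loaded (category 2) rather than when a request arrives (category 3).
ORDER_ID = re.compile(r"ORD-(\d{6})")  # hypothetical order-id format


def parse_order_id(raw: str) -> int:
    """Extract the numeric part of an order id such as 'ORD-000042'."""
    match = ORDER_ID.fullmatch(raw)
    if match is None:
        raise ValueError(f"not an order id: {raw!r}")
    return int(match.group(1))


print(parse_order_id("ORD-000042"))  # prints 42
```

Had the pattern been passed as a string to re.fullmatch inside the function, a typo in it would only surface the first time the function is called.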

The conclusion has a corollary though: we (the Python open-source community) need to keep working on tools that let users lower their defect categories. And users should carefully consider which categories of errors a given tool will make them handle.

This is why I get excited for new libraries that expose their stuff in a type-safe way. A type-safe templating library turns the ${firstName} email from a category 4 into a category 1. The type-safer your new ORM, the more I'm interested in learning about it.
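
To make the templating point concrete without picking a particular library, here's a toy stand-in: the template is just a function with typed, keyword-only fields. The name render_greeting is hypothetical, and a real type-safe templating library would look different.

```python
def render_greeting(*, first_name: str) -> str:
    """A 'template' expressed as a function with typed, keyword-only fields."""
    return f"Hello {first_name}"


print(render_greeting(first_name="Ada"))  # prints: Hello Ada

# A type checker rejects both of these before the code ships:
# render_greeting()                  # missing required argument
# render_greeting(firstName="Ada")   # unexpected keyword argument
```

Even without a type checker, both bad calls raise a TypeError instead of producing a half-rendered email.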

A Case Study: cattrs

Here's a real-life example of a change I plan on making in cattrs to help this situation.

Let's assume you're using cattrs for deserialization and you'd like to use Glyph's DateType thing for dates. You want to convert some JSON into a DateType in an endpoint.

from datetype import AwareDateTime

from cattrs import Converter

c = Converter()

def handler(payload: str) -> None:
    print(c.structure(payload, AwareDateTime))

Now, cattrs doesn't know how to convert a string into an AwareDateTime since they are completely independent libraries, so this will explode at runtime; a category 3.

With how cattrs is designed I don't think the structure method can be made type-safe, so turning this into a cat 1 isn't feasible. Could we turn it into a cat 2 at least?

We can fetch the actual structure hook at import time. This is currently possible with an internal API, so in the next version of cattrs this API will be public.

from datetype import AwareDateTime

from cattrs import Converter

c = Converter()

hook = c.get_structure_hook(AwareDateTime)

def handler(payload: str) -> None:
    print(hook(payload, AwareDateTime))

(Note: ideally you wouldn't be doing this yourself like this but delegate this to your web framework of choice.)

We have an additional problem though. At hook generation time, cattrs knows it can't handle the given type but instead of raising an error it will return a function that raises an error. So in the next version of cattrs, I will rework this API to actually raise errors during hook generation time, instead of hook execution time.
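
The difference between the two behaviors can be sketched with a toy registry. To be clear, this is not cattrs code: the names _HOOKS, get_hook_lazy and get_hook_eager are invented purely to illustrate failing at execution time versus failing at lookup time.

```python
from typing import Any, Callable

Hook = Callable[[Any, type], Any]

# Toy registry standing in for a converter's known types; not cattrs code.
_HOOKS: dict[type, Hook] = {int: lambda value, _: int(value)}


def get_hook_lazy(tp: type) -> Hook:
    """Always returns a callable; an unsupported type fails only when called."""
    def unsupported(value: Any, _: type) -> Any:
        raise TypeError(f"unsupported type: {tp!r}")
    return _HOOKS.get(tp, unsupported)


def get_hook_eager(tp: type) -> Hook:
    """Fails at lookup time, so a module-level lookup fails at import."""
    try:
        return _HOOKS[tp]
    except KeyError:
        raise TypeError(f"unsupported type: {tp!r}") from None


hook = get_hook_lazy(complex)   # succeeds; the defect is still hidden
try:
    get_hook_eager(complex)     # surfaces the defect immediately
except TypeError as exc:
    print(exc)
```

With the lazy variant, a module-level lookup gives you nothing; the category 3 is still waiting for the first request. With the eager variant, the same lookup becomes a category 2.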

This is an example of how we, as library authors, can make sure our users are enabled to make more robust software.

(If you're curious here's the actual fix for the example snippet:)

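from datetime import datetime

from datetype import AwareDateTime, aware

# Register the hook on the converter before fetching it with get_structure_hook.
c.register_structure_hook(
    AwareDateTime,
    lambda v, _: aware(datetime.fromisoformat(v)),
)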