Type checking with JSON Schema in Python

In this post I’ll recount some thoughts on merging the activity of JSON Schema-writing with type checking in Python, and some blundering hacks with mypy’s plugin system.

TL;DR

Here is a thing: erickpeirson/jsonschema-typed on GitHub (https://github.com/erickpeirson/jsonschema-typed), which uses JSON Schema for type checking in Python.

Also, mypy’s plugin system is fun and interesting (albeit challenging).

Also, I think that extending the class-based syntax for TypedDict (PEP 589) to allow instance methods might be worth considering.

Motivation

Occasionally (usually while Little Human #1 is down for a nap) I get a few minutes to play around with something not on the critical path for work projects.

On one such occasion, I started to wonder about the not-uncommon case in which an API resource (described with JSON Schema) is very similar if not identical in structure to the data we’re passing around inside an app (described with type annotations).
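This kind of thing might look familiar. Here is a minimal sketch of the duplication (the schema and field names are made up, just to illustrate):

```python
from mypy_extensions import TypedDict

# A JSON Schema describing an API resource...
THING_SCHEMA = {
    "title": "Thing",
    "type": "object",
    "properties": {
        "id": {"type": "integer"},
        "name": {"type": "string"},
    },
    "required": ["id", "name"],
}


# ...and a near-identical description of the same data for the type checker.
class Thing(TypedDict):
    id: int
    name: str
```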

The example above is not terribly interesting, but you get the idea: we have a JSON Schema for API tests (and our poor API consumers), and we have a TypedDict (thanks to PEP 589 and mypy_extensions) for our static type checks.

Is it really always necessary to maintain two descriptions of what are essentially the same data structure?

Of course, there is not a perfect 1:1 mapping between concepts in JSON Schema and Python’s type annotations.

Most notably, JSON Schema tends toward open-endedness, asserting constraints on data rather than providing a complete description.

Unless specifically prohibited, a document may go beyond the specific properties enumerated in its schema.

Type annotations in Python fall at various points along that spectrum, ranging from permissive approaches like structural duck typing with Protocols (PEP 544) on one end to approaches like TypedDict that provide a complete and total description on the other.

At a more basic level, there isn’t a direct mapping between primitive data types.

For example, JSON inherits JavaScript’s ambivalence about integers and floating point numbers, so the JSON struct {"luckyInteger": 4.0} would satisfy the schema {"luckyInteger": {"type": "integer"}}; however, that JSON could plausibly be deserialized to the Python struct {'luckyInteger': 4.0}, which would not satisfy {'luckyInteger': int} in a TypedDict declaration.

Nevertheless, I suspect that there is enough of an isomorphism between JSON Schema and Python typing to warrant some exploration of the possibilities.

Since a JSON document is generally represented as a dict in Python programs, I started looking specifically at interpreting JSON Schema as TypedDict definitions.

Problem

The driving intuition is that we shouldn’t have to write and maintain (nearly) the same type constraints for what are essentially two views on the same data.

Not all applications that work with JSON resources are structured this way.

But for those that are, eliminating duplicated information may reduce errors and will certainly reduce effort.

To be clear, this isn’t about validating JSON.

JSON Schema validation (e.g. as implemented in jsonschema) is about the correctness/conformity of an instance of the schema, a JSON document.

What I’m interested in here is translating a JSON Schema into a set of Python type constraints that can be leveraged in static type checking.

What this means in practical terms is that we need a way to refer to a JSON Schema document in a Python program, load that document during static analysis (i.e. not in the program runtime), and integrate type information from the schema into the type checking process.
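Something like the sketch below, where the JSONSchema name and the jsonschema_typed module are placeholders for whatever the eventual mechanism turns out to be:

```python
from jsonschema_typed import JSONSchema  # module name is illustrative

# The type checker loads the referenced schema during analysis and treats
# ThingData as the equivalent TypedDict; at runtime this is just a dict.
ThingData = JSONSchema['path/to/a/schema.json']


def handle_resource(data: ThingData) -> None:
    ...
```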

Approach

The approach that I am considering here is to leverage mypy’s plugin system.

Mypy checks Python code in (roughly) three steps:

1. parsing the code into an Abstract Syntax Tree;
2. analyzing the AST to bind references to type definitions, i.e. building a symbol table;
3. checking the type-correctness of the Python code using the symbol table.

The plugin system provides a set of hooks into various moments during semantic analysis, the second step.

I am no expert on compilers nor the mypy codebase, so the brief description of plugins in the mypy documentation seemed a bit cryptic at first.

Thankfully the code that supports plugins includes comments that go into a bit more helpful detail, and there is a “default” plugin that focuses heavily on TypedDict.

And the folks on the python/typing Gitter channel are super helpful, too.
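Here’s what a minimal plugin can look like (a bare-bones sketch; the class name and module layout are mine):

```python
from typing import Callable, Optional

from mypy.plugin import AnalyzeTypeContext, Plugin
from mypy.types import Type


class MinimalPlugin(Plugin):
    """Does nothing yet: returning None leaves every type to mypy's defaults."""

    def get_type_analyze_hook(
        self, fullname: str
    ) -> Optional[Callable[[AnalyzeTypeContext], Type]]:
        return None


def plugin(version: str):
    # mypy imports this module (via the `plugins` setting in mypy.ini or
    # setup.cfg) and calls `plugin()` to obtain the plugin class.
    return MinimalPlugin
```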

Of course, it’s probable that I nevertheless managed to badly abuse the mypy internals here, so take the rest of this section with a grain of salt.

Generate a custom type during analysis

This uses the get_type_analyze_hook(name: str) -> Callable[...] hook, which can intervene on any unbound (i.e. mypy doesn’t know what it is yet) reference to a type that the semantic analyzer encounters.

I first defined a placeholder for JSONSchema that we can import in our Python program, so that we don’t have problems during runtime.
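A sketch of what such a placeholder could look like (my guess at its shape, not necessarily how the repository defines it):

```python
from typing import Any


class JSONSchema(dict):
    """Runtime stand-in for schema-derived types.

    Subscripting it (JSONSchema['path/to/a/schema.json']) is a no-op at
    runtime; the mypy plugin is what swaps in a real TypedDict type during
    semantic analysis.
    """

    def __class_getitem__(cls, schema_path: str) -> Any:
        return cls
```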

In the plugin hook, I then attempted to intercept the name 'JSONSchema' and generate a TypedDict.

Note that mypy is not working directly with the TypedDict class (which is implemented in a different package, mypy_extensions) during analysis.

Instead, we are working with mypy’s TypedDictType, an abstract representation of the TypedDict.

The AnalyzeTypeContext object is a named three-tuple that contains the unbound type, the context in which the reference to the type was encountered, and the TypeAnalyzerPluginInterface that exposes some convenient functionality from the type analyzer.

In type annotations, mypy treats anything in square brackets as “arguments”.

We can access those arguments on the unbound type.

So if we encounter the expression JSONSchema['path/to/a/schema.json'], we can access the path at ctx.type.args[0].literal_value.
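So the plugin ends up looking something like the sketch below. This is my reconstruction rather than the repository’s exact code, and the TypedDictGenerator helper it calls is sketched in the next fragment:

```python
import json
from typing import Callable, Optional

from mypy.plugin import AnalyzeTypeContext, Plugin
from mypy.types import AnyType, Type, TypeOfAny


class JSONSchemaPlugin(Plugin):
    def get_type_analyze_hook(
        self, fullname: str
    ) -> Optional[Callable[[AnalyzeTypeContext], Type]]:
        # Intercept any reference to the JSONSchema placeholder.
        if fullname.endswith('JSONSchema'):
            return analyze_schema_type
        return None


def analyze_schema_type(ctx: AnalyzeTypeContext) -> Type:
    """Replace JSONSchema['path/to/a/schema.json'] with a generated TypedDictType."""
    if not ctx.type.args:
        ctx.api.fail('JSONSchema requires a path to a schema document', ctx.context)
        return AnyType(TypeOfAny.from_error)
    path = ctx.type.args[0].literal_value  # the schema path from the square brackets
    with open(path) as f:
        schema = json.load(f)
    # TypedDictGenerator (sketched below) walks the schema and assembles a
    # TypedDictType via ctx.api.
    return TypedDictGenerator(ctx.api).generate(schema)


def plugin(version: str):
    return JSONSchemaPlugin
```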

Now all we have left to do is actually generate a TypedDictType from the JSON Schema.

I had really hoped that I could lean heavily on jsonschema for this part, especially for traversing the schema—I’d rather not reinvent the necessary recursion and semantics of JSON Schema.

Unfortunately (and this is not a criticism by any means) the underlying validator implementation is really not written in an extensible way, but it did provide a nice starting point, and luckily the recursion involved is straightforward.

My implementation of TypedDict-from-schema generation ended up using a similar dynamic dispatch pattern.
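Here’s a fragment that shows how the TypedDictType (including any nested object schemas) might get generated. Again, this is my own sketch of that dynamic dispatch pattern rather than the repository’s actual code; in particular, the 'mypy_extensions._TypedDict' fallback is an assumption that may differ across mypy versions:

```python
from collections import OrderedDict
from typing import Any, Dict, Set

from mypy.plugin import TypeAnalyzerPluginInterface
from mypy.types import AnyType, Type, TypedDictType, TypeOfAny


class TypedDictGenerator:
    """Walk a JSON Schema and build the corresponding mypy types."""

    def __init__(self, api: TypeAnalyzerPluginInterface) -> None:
        self.api = api

    def generate(self, schema: Dict[str, Any]) -> Type:
        # Dynamic dispatch on the schema's "type" keyword (single string
        # types only, for brevity).
        handler = getattr(self, 'handle_%s' % schema.get('type', 'object'), None)
        if handler is None:
            return AnyType(TypeOfAny.unannotated)
        return handler(schema)

    def handle_object(self, schema: Dict[str, Any]) -> Type:
        items = OrderedDict(
            (name, self.generate(subschema))  # recurse into nested object schemas
            for name, subschema in schema.get('properties', {}).items()
        )
        required: Set[str] = set(schema.get('required', []))
        # The fallback instance here is an assumption on my part; mypy's own
        # TypedDict machinery used 'mypy_extensions._TypedDict' around this time.
        fallback = self.api.named_type('mypy_extensions._TypedDict', [])
        return TypedDictType(items, required, fallback)

    def handle_string(self, schema: Dict[str, Any]) -> Type:
        return self.api.named_type('builtins.str', [])

    def handle_integer(self, schema: Dict[str, Any]) -> Type:
        return self.api.named_type('builtins.int', [])

    def handle_boolean(self, schema: Dict[str, Any]) -> Type:
        return self.api.named_type('builtins.bool', [])
```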

After a bit of trial and error, this ended up working fairly well.
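The following example works as expected. The jsonschema_typed module name and the schema path are illustrative; assume the schema describes an object with an integer id and a string name:

```python
from jsonschema_typed import JSONSchema  # module name assumed for illustration

ThingData = JSONSchema['path/to/a/schema.json']


def describe(thing: ThingData) -> str:
    return f"{thing['id']}: {thing['name']}"


describe({'id': 1, 'name': 'a thing'})       # OK
describe({'id': 'one', 'name': 'a thing'})   # mypy flags the incompatible "id"
```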

The complete plugin implementation, including a second exploratory approach (ab)using dynamic classes, can be found in the erickpeirson/jsonschema-typed repository on GitHub (https://github.com/erickpeirson/jsonschema-typed).

Disclaimer: I have not extensively tested this in Real Life Situations, but this is reasonably satisfying for exploratory purposes.

Limitations

additionalProperties in JSON Schema doesn’t really have an equivalent in TypedDict.

JSON Schema allows array values for type, including the root of the schema.

Cases in which the root of the schema is anything other than an object are not terribly interesting for this project, so I ignored them for now.

Array values for type (e.g. "type": ["object", "boolean"]) are otherwise supported with Union.

The JSON Schema default keyword does not have an equivalent in TypedDict, but there is hope for the future.

Self-references (e.g. "#") can’t really work properly until nested forward-references are supported in mypy.

But this is coming soon.

Thoughts on TypedDicts and encapsulation

This is a bit of a tangent, but fiddling around with JSON Schema and TypedDict got me thinking a bit more about TypedDict itself.

The class-based syntax is a nice way to write a TypedDict, but it feels like a bit of a tease, since:

"Methods are not allowed, since the runtime type of a TypedDict object will always be just dict (it is never a subclass of dict)." — PEP 589

It’s a bit of a bummer to prohibit encapsulation—bundling data together with methods—a practice that is fundamental to Object-Oriented Programming.

Clearly we don’t want to fall into the trap of confusing our domain model with our resource model, but I can imagine plenty of good reasons to want instance methods on my resources.
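For instance, a definition along these lines is rejected during type checking (my own illustration, not from the PEP):

```python
from mypy_extensions import TypedDict


class Thing(TypedDict):
    id: int
    name: str

    # Rejected by mypy: methods are not allowed in a TypedDict definition.
    def display_name(self) -> str:
        return f"Thing #{self['id']}: {self['name']}"
```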

I poked around for some background on why the prohibition on instance methods was included in the PEP, and came up empty.

Next steps

I’m curious to see how useful the initial implementation of this mypy plugin can be in practice.

It will need a bit of pruning before prime time, but I threw it up on PyPI to make it easier to test drive.

It’s also possible that there is a more pragmatic direction than TypedDict, especially given the limitations on the class-based syntax (above).

Perhaps on another nap…

