An API for data that changes over time

The reason is that if you query keys A and B from a database, and the cache has A stored locally, it can’t return the cached value for A and just fetch B.

The cache also can’t store B alongside the older result for A.

Without versions, this problem is basically impossible to solve correctly in a general way.

But we can solve it for HTTP because we have ETags.

The caching problem is sort of solved by read-only replicas — but I find it telling that read-only replicas often need private APIs to work.

The main APIs of most databases aren't powerful enough to support a feature that the database itself needs to scale and function.

(This is getting better though — Mongo / Postgres.)

Personally I think this problem alone is one of the core reasons behind the NoSQL movement.

Our database APIs make it impossible to correctly implement caching, secondary indexing and computed views in separate processes.

So SQL databases have to do everything in-process, and this in turn kills write performance — they have ever more work to do on each write.

Developers have solved these performance problems by looking elsewhere.

It doesn’t have to be like this — I think we can have our cake and eat it too; we just need better APIs.

(Credit where credit is due — Riak, FoundationDB and CouchDB all provide version information in their fetch APIs.

I still want better change feed APIs though.)

Minimal Viable Spec

What would a baseline API for data that changes over time look like? The way I see it, we need two basic APIs:

- fetch(query) -> data, version
- subscribe(query, version) -> stream of (update, version) pairs (or maybe an error if the version is too old)

There are many forms the version information could take — it could be a timestamp, a number, an opaque hash, or something else.

It doesn’t really matter so long as it can be passed into subscribe calls.

Interestingly, HTTP already has a fetch function with this API: the GET method.

The server returns data and usually either a Last-Modified header or an ETag.

But HTTP is missing a standard way to subscribe.
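To make that concrete, here's the fetch half over plain HTTP, with the ETag standing in as the version (the URL is made up):

```typescript
// Sketch: GET returns data plus a version (the ETag header).
async function fetchWithVersion(url: string) {
  const res = await fetch(url);
  return { data: await res.text(), version: res.headers.get('ETag') };
}

// Re-validate later against that version. A 304 response means our
// cached copy is still current; anything else is fresh data.
async function revalidate(url: string, version: string) {
  const res = await fetch(url, { headers: { 'If-None-Match': version } });
  return res.status === 304 ? null : await res.text();
}
```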

The update objects themselves should be small and semantic.

The gold standard for operations is usually that they should express user intent.

And I also believe we should have a MIME-type-equivalent set of standard update functions (like JSON-patch).

Let's look at some examples.

For Google Docs, we can't re-send the whole document with every keystroke.

Not only would that be slow and wasteful, but it would make concurrent editing almost impossible.

Instead Docs wants to send a semantic edit, like insert 'x' at position 4.

With that we can update cursor positions correctly and handle concurrent edits from multiple users.

Diffing isn't good enough here – if a document is aaaa and I have a cursor in the middle (aa|aa), inserting another a at the start or the end of the document has the same effect on the document.

But those changes have different effects on my cursor position and speculative edits.
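Here's a tiny sketch of the cursor half of that argument, with made-up types. Both inserts leave the document as aaaaa, but only a semantic edit carries enough information to move my cursor correctly:

```typescript
// Hypothetical insert operation: position plus inserted text.
type Insert = { pos: number; text: string };

// Transform a cursor position against an incoming insert. An insert at
// or before the cursor shifts it right; an insert after it leaves it alone.
function transformCursor(cursor: number, op: Insert): number {
  return op.pos <= cursor ? cursor + op.text.length : cursor;
}

// Document "aaaa", cursor in the middle (position 2). Both edits produce
// "aaaaa", but the cursor must end up in different places.
console.log(transformCursor(2, { pos: 0, text: 'a' })); // 3
console.log(transformCursor(2, { pos: 4, text: 'a' })); // 2
```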

The indie game Factorio uses a deterministic game update function.

Both save games and the network protocol are streams of actions which modify the game state in a well-defined way (mine coal, place building, tick, etc).

Each player applies the stream of actions to a local snapshot of the world.

Note in this case the semantic content of the updates is totally application specific — I doubt any generic JSON-patch like type would be good enough for a game like this.
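For illustration, here's a minimal sketch of that pattern with invented action and state shapes; Factorio's real types are of course far richer:

```typescript
// Made-up action and state shapes. Because the update function is
// deterministic, everyone who applies the same action stream ends up
// with an identical snapshot. No need to ship the world state itself.
type Action =
  | { type: 'mineCoal' }
  | { type: 'placeBuilding'; x: number; y: number }
  | { type: 'tick' };

type State = { coal: number; buildings: { x: number; y: number }[]; ticks: number };

function apply(state: State, action: Action): State {
  switch (action.type) {
    case 'mineCoal':
      return { ...state, coal: state.coal + 1 };
    case 'placeBuilding':
      return { ...state, buildings: [...state.buildings, { x: action.x, y: action.y }] };
    case 'tick':
      return { ...state, ticks: state.ticks + 1 };
  }
}

// A save game or a network session is just actions.reduce(apply, initial).
```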

For something like a gamepad API, it's probably fine to just send the entire new state every time it changes.

The gamepad state data is so small and diffing is so cheap and easy to implement that it doesn’t make much difference.

Even versions feel like overkill here.

GraphQL subscriptions should work this way.

GraphQL already allows me to define a schema and send a query with a shape that mirrors the schema.

I want to know when the query result set changes.

To do so I should be able to use the same query — but subscribe to the results instead of just fetch them.

Under the hood GraphQL could send updates using JSON-patch or something like it.

Then the client can locally update its view of the query.

With this model we could also write tight integrations between that update format and frontend frameworks like Svelte.

That would allow us to update only and exactly the DOM nodes that need to be changed as a result of the new data.
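Here's a hypothetical sketch of the client side. To be clear, this is my invention, and the subscription transport and library choice (fast-json-patch) are assumptions:

```typescript
// Hypothetical: the server pushes JSON Patch (RFC 6902) operations
// describing how the subscribed query's result set changed.
import { applyPatch, Operation } from 'fast-json-patch';

let view: any = { todos: [{ id: 1, title: 'write post', done: false }] };

// Imagine each subscription message carrying a patch like this:
const update: Operation[] = [
  { op: 'replace', path: '/todos/0/done', value: true },
];

// The client applies the patch to its local copy of the query result.
// A framework binding could then map each patched path to exactly the
// DOM nodes that depend on it.
view = applyPatch(view, update).newDocument;
```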

This is not how GraphQL subscriptions work today.

But in my opinion it should be!

To make GraphQL and Svelte (and anything else) interoperate, we should define some standard update formats for structured data.

Games like Factorio will always need to do their own thing, but the rest of us can and should use standard stuff.

I’d love to see a Content-Type: for update formats.

I can imagine one type for plain text updates, another for JSON (probably a few for JSON).

Another type for rich text, which applications like Google Docs could use.
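JSON already has one such type today: JSON Patch's registered media type is application/json-patch+json (RFC 6902). A hypothetical request to a made-up endpoint might look like this:

```typescript
// The endpoint is invented; the Content-Type is real and registered.
await fetch('https://example.com/doc/42', {
  method: 'PATCH',
  headers: { 'Content-Type': 'application/json-patch+json' },
  body: JSON.stringify([
    { op: 'replace', path: '/title', value: 'Hello' },
  ]),
});
```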

I have nearly a decade of experience goofing around with realtime collaborative editing, and this API model would work perfectly with collaborative editors built on top of OT or CRDTs.

Coincidentally, I wrote this JSON operation type that also supports alternate embedded types and operational transform.

And Jason Chen wrote this rich text type.

There’s also plenty of CRDT-compatible types floating around too.

The API I described above is just one way to cut this cake.

There’s plenty of alternate ways to write a good API for this sort of thing.

Braid is another approach.

There’s also a bunch of ancillary APIs which could be useful:fetchAndSubscribe(query) -> data, version, stream of updates.

This saves a round-trip in the common case, and saves re-sending the query.

getOps(query, fromVersion, toVersion / limit) -> list of updates.

Useful for some applicationsmutate(update, ifNotChangedSinceVersion) -> new version or conflict errorMutate is interesting.

By adding a version argument, we can reimplement atomic transactions on top of this API.

It can support all the same semantics as SQL, but it could also work with caches and secondary indexes.
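Extending the earlier interface sketch with these calls (same caveat: hypothetical names and shapes):

```typescript
// Builds on the ChangingData and Version sketches from earlier.
interface ChangingDataExt<Query, Data, Update> extends ChangingData<Query, Data, Update> {
  // Saves a round-trip and re-sending the query.
  fetchAndSubscribe(query: Query): Promise<{
    data: Data;
    version: Version;
    updates: AsyncIterable<{ update: Update; version: Version }>;
  }>;

  // The updates between two versions, for clients that need history.
  getOps(query: Query, from: Version, to?: Version): Promise<{ update: Update; version: Version }[]>;

  // Resolves with the new version, or rejects with a conflict error if
  // the data changed since `ifNotChangedSince`.
  mutate(update: Update, ifNotChangedSince: Version): Promise<Version>;
}
```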

Having a way to generate version conflicts lets you build realtime collaborative editors with OT on top of this, using the same approach as Firepad.

The algorithm is simple — put a retry loop with some OT magic in the middle, between the frontend application and database.

Like this.

It composes really well — with this model you can do realtime editing without support from your database.
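Here's a rough sketch of that retry loop, reusing the hypothetical interfaces from above; transform stands in for the OT transform function of whatever update type the document uses:

```typescript
// Firepad-style retry loop: try the write, and on conflict transform
// our op past the ops we raced against, then try again.
async function submitWithRetry<Q, U>(
  store: ChangingDataExt<Q, unknown, U>,
  query: Q,
  op: U,
  version: Version,
  transform: (op: U, against: U) => U,
): Promise<Version> {
  for (;;) {
    try {
      return await store.mutate(op, version); // fast path: no conflict
    } catch (err) {
      // Someone else wrote first. Fetch the ops we missed, transform
      // our op past each of them, and retry at the newer version.
      for (const { update: theirs, version: v } of await store.getOps(query, version)) {
        op = transform(op, theirs);
        version = v;
      }
    }
  }
}
```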

Obviously not all data is mutable, and for data that is, it won’t necessarily make sense to funnel all mutations through a single function.

But it's a neat property! It's also interesting to note that HTTP POST already supports doing this sort of thing with the If-Match / If-Unmodified-Since headers.
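For example (the URL is made up, but the header is real; a server replies 412 Precondition Failed when the ETag no longer matches):

```typescript
// Optimistic concurrency over plain HTTP, as a sketch.
const res = await fetch('https://example.com/doc/42', {
  method: 'POST',
  headers: { 'If-Match': '"v123"', 'Content-Type': 'application/json' },
  body: JSON.stringify({ title: 'Hello' }),
});
if (res.status === 412) {
  // Conflict: refetch, rebase our change, and retry.
}
```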

Standards

So to sum up, we need a standard for how we observe data that changes over time.

We need:

- A local programmatic API for kernels (and stuff like that)
- A standard API we can use over the network: a REST equivalent, or a protocol that extends REST directly

Both of these APIs should support:

- Versions (or timestamps, ETags, or some equivalent)
- A standard set of update operations, like Content-Type in HTTP but for modifications. Sending a fresh copy of all the data with each update is bad.
- The ability to reconnect from some point in time

And we should use these APIs basically everywhere, from databases, to applications, and down into our kernels.

Personally I’ve wasted too much of my professional life implementing and reimplementing code to do this.

And because our industry builds this stuff from scratch each time, the implementations we have aren’t as good as they could be.

Some have bugs (fs watching on macOS), some are hard to use (parsing sysfs files), some require polling (Contentful), some don't allow you to reconnect to feeds (GraphQL, RethinkDB, most pubsub systems).

Some don’t let you send small incremental updates (observables).

The high-quality tools we do have for building this sort of thing are too low-level (streams, websockets, MQs, Kafka).

The result is a total lack of interoperability and common tools for debugging, monitoring and scaling.

I don’t want to rubbish the systems that exist today — we’ve needed them to explore the space and figure out what good looks like.

But having done that, I think we're ready for a standard, simple, forward-looking protocol for data that changes over time.

Whew.

By the way, I’m working to solve some problems in this space with Statecraft.

But that's another blog post. ;)

Inspirations

- Datomic and everything Rich Hickey — The Value of Values talk is great
- Kafka and the event sourcing / DDD communities
- GraphQL subscriptions
- RethinkDB change feeds
- RxJS / Obj-C observables and everything in between
- Svelte
- Firebase
- Google Realtime API (discontinued)
- Everything Martin Kleppmann does: Fav talk 1, Talk 2
- Statebus / Braid
- React Flux
