A Technical Article Series for Data Architects

This multi-part article series is intended for data architects and anyone else interested in learning how to design modern real-time data analytics solutions.
It explores key principles and implications of event streaming and streaming analytics, and concludes that the biggest opportunity to derive meaningful value from data – and gain continuous intelligence about the state of things – lies in the ability to analyze, learn and predict from real-time events in concert with contextual, static and dynamic data.
This article series places continuous intelligence in an architectural context, with reference to established technologies and use cases in place today.
Part 3: Smarter Databases Can’t Help

Event streaming has risen to prominence because what matters most is how a system’s behavior changes over time.
The challenge at the application layer is to deliver intelligence gleaned from streamed events, continuously, over time.
But a major drawback of event streaming architectures is that they store events in topic queues ordered by event arrival time, which forces applications to consume events only in arrival order, from the head of the queue.
What we’d really like is a stream of intelligence – delivered like events – that results from the continuous, concurrent interpretation of the effects of all events on a stateful model of the system.
Instead, today’s application developers must use some sort of database to represent the state of the system.
And that just isn’t enough: modifying a representation of the system state in response to changes communicated in events is one thing, but delivering a continuous stream of intelligence that results from those changes is quite another.
Databases can help with the first, but at a cost in performance.
They do nothing for the second (see figure 4).
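The drawback of arrival-ordered topics can be made concrete with a minimal Python sketch. The `Topic` class and the `current_state` helper are hypothetical names invented for this illustration; they do not correspond to any particular streaming platform's API. The sketch assumes events are small dictionaries with an `id` field. It shows that a log ordered by arrival time holds no per-entity state, so answering "what is the latest state of truck-1?" means reading events in order.

```python
class Topic:
    """Hypothetical append-only topic: events are kept in arrival order."""

    def __init__(self):
        self._log = []

    def append(self, event):
        self._log.append(event)
        return len(self._log) - 1  # offset of the event just appended

    def read(self, offset, max_events=2):
        """Return up to max_events events starting at the given offset."""
        return self._log[offset:offset + max_events]


def current_state(topic, entity_id):
    """Rebuild one entity's latest state by replaying the topic in order.

    Returns the latest matching event (or None) and the number of events
    that had to be scanned to find it.
    """
    latest, offset, scanned = None, 0, 0
    while True:
        batch = topic.read(offset)
        if not batch:
            break
        for event in batch:
            scanned += 1
            if event["id"] == entity_id:
                latest = event
        offset += len(batch)
    return latest, scanned


topic = Topic()
arrivals = ["truck-1", "truck-2", "truck-1", "truck-3", "truck-2"]
for i, truck_id in enumerate(arrivals):
    topic.append({"id": truck_id, "x": float(i), "y": 0.0})

latest, scanned = current_state(topic, "truck-1")
print(latest, scanned)
```

In this toy run every event in the topic is scanned to answer a question about a single truck, which is why applications typically keep a separate representation of state alongside the stream.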
There is a vast number of databases and cloud database services available.
Most can store streaming data, and many have evolved powerful features to confirm their role as masters of application state.
Sophisticated data management capabilities are migrating into the database engine to deal with latency challenges.
Leading the feature development race are the hosted database services from the major cloud providers.
But there are hundreds of others.
Broadly, the trend is toward large in-memory stores, grids and caches that attempt to reduce latency.
All of today’s database engines can ingest events at huge rates.
But that’s not the problem.
No database, in-memory or otherwise, can understand the meaning of data, or deliver real-time, situationally relevant responses.
Applications interpret events from the real world to change a model of the state of the system, but a single event may cause state changes to multiple related entities.
By way of example: A truck needing maintenance enters a geo-fence, meaning that the truck is near an inspector, so the inspector is alerted.
A single event with the GPS coordinates of the truck might change the states of the geo-fence and the inspector.
Every time the states or relationships between entities change, the application may need to evaluate sophisticated logical or mathematical predicates, joins, maps or other aggregations, and execute business logic.
Each of these might require scores of round trips to the database.
For every truck, and every inspector, in real time.
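The truck example can be sketched in Python to show how quickly round trips add up. Everything here is an assumption made for illustration: `CountingStore` is a hypothetical stand-in for a remote database that simply counts calls, the row layouts are invented, and the geo-fence is modeled as a circle. Handling of trucks leaving the fence is omitted to keep the sketch short.

```python
from math import hypot


class CountingStore:
    """Hypothetical stand-in for a remote database that counts round trips."""

    def __init__(self, rows):
        self.rows = dict(rows)
        self.round_trips = 0

    def get(self, key):
        self.round_trips += 1
        return self.rows[key]

    def put(self, key, value):
        self.round_trips += 1
        self.rows[key] = value


def handle_gps_event(store, truck_id, x, y):
    """Apply one GPS event; every state read or write is a round trip."""
    truck_key = ("truck", truck_id)
    truck = {**store.get(truck_key), "x": x, "y": y}
    store.put(truck_key, truck)

    fence_key = ("fence", truck["fence_id"])
    fence = store.get(fence_key)
    inside = hypot(x - fence["cx"], y - fence["cy"]) <= fence["radius"]

    if inside and truck_id not in fence["trucks_inside"]:
        # The same event changes the geo-fence state...
        store.put(fence_key, {**fence,
                              "trucks_inside": fence["trucks_inside"] + [truck_id]})
        if truck["needs_maintenance"]:
            # ...and the inspector's state.
            inspector_key = ("inspector", fence["inspector_id"])
            inspector = store.get(inspector_key)
            store.put(inspector_key, {**inspector,
                                      "alerts": inspector["alerts"] + [truck_id]})


store = CountingStore({
    ("truck", "truck-7"): {"x": 0.0, "y": 0.0, "needs_maintenance": True,
                           "fence_id": "depot"},
    ("fence", "depot"): {"cx": 10.0, "cy": 10.0, "radius": 2.0,
                         "inspector_id": "ana", "trucks_inside": []},
    ("inspector", "ana"): {"alerts": []},
})
handle_gps_event(store, "truck-7", 10.5, 9.5)
print(store.round_trips, store.rows[("inspector", "ana")]["alerts"])
```

Even this stripped-down handler makes six round trips for a single event touching three entities, before any joins, aggregations or further business logic are evaluated.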
For an application at scale, this leads rapidly to a situation where the database is the bottleneck.
For distributed applications, the round-trip latency for database access can quickly dominate performance.
For an application processing hundreds of thousands of events per second, the only way to reduce latency is to execute application logic in the memory context of each impacted entity, avoiding database latency entirely.
That is exactly what Swim does.
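To illustrate the general idea of executing logic in the memory context of each entity, here is a minimal single-process Python sketch. It is not Swim's API: the `Truck`, `GeoFence` and `Inspector` classes and their methods are hypothetical, and a real system would also need to handle distribution, concurrency and persistence. The sketch assumes each entity holds its own state and direct references to the entities it is related to.

```python
from math import hypot


class Inspector:
    """Hypothetical in-memory entity holding its own alert state."""

    def __init__(self, name):
        self.name = name
        self.alerts = []

    def notify(self, truck_id):
        self.alerts.append(truck_id)


class GeoFence:
    """Circular geo-fence that reacts to truck positions in its own context."""

    def __init__(self, cx, cy, radius, inspector):
        self.cx, self.cy, self.radius = cx, cy, radius
        self.inspector = inspector
        self.trucks_inside = set()

    def on_position(self, truck):
        inside = hypot(truck.x - self.cx, truck.y - self.cy) <= self.radius
        if inside and truck.truck_id not in self.trucks_inside:
            # Truck has just entered: update fence state, then alert if needed.
            self.trucks_inside.add(truck.truck_id)
            if truck.needs_maintenance:
                self.inspector.notify(truck.truck_id)
        elif not inside:
            self.trucks_inside.discard(truck.truck_id)


class Truck:
    """Truck entity: a GPS event updates its state and informs its fence."""

    def __init__(self, truck_id, needs_maintenance, fence):
        self.truck_id = truck_id
        self.needs_maintenance = needs_maintenance
        self.fence = fence
        self.x = self.y = None

    def on_gps(self, x, y):
        self.x, self.y = x, y
        self.fence.on_position(self)  # plain in-memory call, no database


ana = Inspector("ana")
depot = GeoFence(10.0, 10.0, 2.0, ana)
truck = Truck("truck-7", needs_maintenance=True, fence=depot)
truck.on_gps(10.5, 9.5)   # enters the fence
truck.on_gps(10.6, 9.5)   # still inside, no repeat alert
print(ana.alerts)
```

The same event that cost several database round trips in a store-centric design becomes a pair of method calls here, because the state of each impacted entity is already in memory where the logic runs.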
There’s another reason that smarter databases can’t help with continuous intelligence: They don’t drive computation of insights or “push” them to users.
The inversion of the control loop is fundamental: In most applications that claim to be real-time, the query to a database drives computation, and the results are delivered to the user.
But that’s not enough for today’s continuous intelligence use cases.
Users want applications to deliver real-time responses driven by analysis, learning and prediction as events occur, concurrently for all entities in the system.
They want the application to always have an answer.
Databases don’t do that.
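The inverted control loop can be sketched with a small observer pattern in Python. This is an illustration under stated assumptions, not any product's API: `LiveValue` and `FleetMonitor` are hypothetical names, and the "insight" is simply a count of trucks currently needing maintenance. Events drive the computation, and results are pushed to subscribers instead of waiting for a query.

```python
class LiveValue:
    """Hypothetical observable value that pushes every change to subscribers."""

    def __init__(self, initial):
        self._value = initial
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)
        callback(self._value)  # a new subscriber immediately has an answer

    def set(self, value):
        if value == self._value:
            return  # nothing changed, so nothing to push
        self._value = value
        for callback in list(self._subscribers):
            callback(value)


class FleetMonitor:
    """Recomputes an insight on every event and pushes it downstream."""

    def __init__(self):
        self.trucks_needing_service = LiveValue(0)
        self._flags = {}

    def on_event(self, truck_id, needs_maintenance):
        self._flags[truck_id] = needs_maintenance
        # Event arrival, not a user query, triggers the computation.
        self.trucks_needing_service.set(sum(self._flags.values()))


received = []
monitor = FleetMonitor()
monitor.trucks_needing_service.subscribe(received.append)

monitor.on_event("truck-a", True)
monitor.on_event("truck-b", False)
monitor.on_event("truck-c", True)
monitor.on_event("truck-a", False)
print(received)
```

The subscriber never issues a query: it receives the current answer on subscription and a new one each time the answer changes.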
To read parts 1 and 2 of this guest article series, please visit the Swim blog.
About the Author

Simon Crosby is CTO at Swim.
Swim offers the first open core, enterprise-grade platform for continuous intelligence at scale, providing businesses with complete situational awareness and operational decision support at every moment.
Simon co-founded Bromium (now HP SureClick) in 2010 and currently serves as a strategic advisor.
Previously, he was the CTO of the Data Center and Cloud Division at Citrix Systems; founder, CTO, and vice president of strategy and corporate development at XenSource; a principal engineer at Intel; and a faculty member at Cambridge University, where he led research on network performance and control and on multimedia operating systems.
Simon is an equity partner at DCVC, serves on the board of Cambridge in America, and is an investor in and advisor to numerous startups.
He is the author of 35 research papers and patents on a number of data center and networking topics, including security, network and server virtualization, and resource optimization and performance.
He holds a PhD in computer science from the University of Cambridge, an MSc from the University of Stellenbosch, South Africa, and a BSc (with honors) in computer science and mathematics from the University of Cape Town, South Africa.