In this special guest feature, John Hammink, Developer Advocate at Aiven.io, discusses how there are numerous ways to go about designing and maintaining a viable data pipeline, and there is no silver-bullet solution for every organization.
John Hammink was an early engineer in a variety of roles at Skype and F-Secure, and after stints at Nokia, Cisco and Mozilla, eventually became a content creator and developer evangelist at Arm Treasure Data.
Since then he’s focused on producing content for early stage startups, including Algorithmia, RadixDLT, and Alooma.
He’s recently become the Developer Advocate at Aiven.io.
Data pipelines may not be useful unless they connect with where the data is housed — a frustration that engineers know all too well.
Here’s how this might look: imagine running a site that tracks commodities, but you’re limited to batch importing the commodity prices once every 24 hours.
No savvy trader will trust your platform to make accurate, well-timed decisions about when to buy or sell.
Or picture being tasked to develop a mobile game that monetizes players’ progress on a level-by-level basis.
What if the game suddenly rockets to the top of the app store, and you quickly need to collect and accommodate variable-length event data like level progress along with points, player name, rank and position matched with a timestamp? Can the MySQL server sitting under your desk really keep the pipeline connected and flowing to keep up with demand? These scenarios shouldn’t seem unfamiliar.
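The variable-length event data in the game scenario could be modeled along these lines. This is a minimal sketch, not a design taken from the article: the `GameEvent` class, its field names, the `extra` dictionary and the `make_event` helper are all hypothetical, chosen only to show fixed fields sitting alongside a variable part and serializing to one JSON line per event.

```python
import json
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone


@dataclass
class GameEvent:
    """One player event. All names here are illustrative assumptions."""
    player_name: str
    level: int
    points: int
    rank: int
    timestamp: str  # ISO 8601 string, UTC
    # Variable-length part: whatever else a given event carries,
    # for example the player's position.
    extra: dict = field(default_factory=dict)

    def to_json_line(self) -> str:
        # One JSON object per line is easy to append to a log or stream.
        return json.dumps(asdict(self), sort_keys=True)


def make_event(player_name, level, points, rank, when=None, **extra):
    """Build an event, stamping it with the current UTC time by default."""
    when = when or datetime.now(timezone.utc)
    return GameEvent(player_name, level, points, rank, when.isoformat(), extra)


if __name__ == "__main__":
    event = make_event("ada", level=3, points=1200, rank=7, position=[4, 9])
    print(event.to_json_line())
```

Keeping the open-ended fields in a single `extra` mapping is one way to let new event attributes appear without changing the fixed part of the schema; whether that suits a given datastore is a separate decision.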
Data from disparate sources flowing into countless silos with piecemeal permissions is a challenge encountered by nearly every data engineer or developer.
Until recently, viewing, interpreting and analyzing data in transit wasn’t possible.
Instead, data was collected and processed in batches, nowhere near real time.
The volume and velocity of data have grown far faster than homespun pipelines were built to handle.
This has led to painful restart cycles, inconsistent data formats and a host of other challenges with costly consequences for organizations.
As a result, insights are stale and inconclusive, latency is untenable and overall performance is bottlenecked.
Developers simply want to get their data where they want it to go in order to achieve a singular, canonical store that is primed for analysis.
But in the face of limited resources and higher priorities for engineering teams, this isn’t always possible.
The top three challenges in creating and maintaining a modern data pipeline

Before we can identify the challenges that developers and engineers must tackle to build and optimize a data pipeline, we need to define the term “data pipeline.” For our purposes, it’s a set of automated workflows that extract data from multiple sources, and connect to those sources with a certain level of elasticity and schema flexibility that enables data mobility, transformation and visualization.
Within the pipeline, what, where and how data is collected should be clearly defined, as should the process for automatically extracting, transforming, combining and preparing the data for deeper analysis and visualization.
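To make the definition concrete, here is a minimal sketch of a pipeline as a chain of small, automated stages. Everything in it is an assumption for illustration: the stage names (`extract`, `transform`, `combine`), the `run_pipeline` helper and the `symbol`/`price` record shape are hypothetical, and in-memory lists stand in for real sources.

```python
from typing import Callable, Iterable, Iterator

Record = dict
Stage = Callable[[Iterable[Record]], Iterator[Record]]


def extract(sources: Iterable[Iterable[Record]]) -> Iterator[Record]:
    """Pull records from every source in turn."""
    for source in sources:
        yield from source


def transform(records: Iterable[Record]) -> Iterator[Record]:
    """Normalize names and types so records from different sources line up."""
    for r in records:
        yield {"symbol": str(r["symbol"]).upper(), "price": float(r["price"])}


def combine(records: Iterable[Record]) -> Iterator[Record]:
    """Keep only the most recently seen record for each symbol."""
    latest = {}
    for r in records:
        latest[r["symbol"]] = r
    yield from latest.values()


def run_pipeline(sources, stages: Iterable[Stage]) -> list:
    """Extract from all sources, then apply each stage in order."""
    data = extract(sources)
    for stage in stages:
        data = stage(data)
    return list(data)


if __name__ == "__main__":
    feed_a = [{"symbol": "gold", "price": "1500.5"}]
    feed_b = [{"symbol": "oil", "price": 60}, {"symbol": "GOLD", "price": 1501}]
    print(run_pipeline([feed_a, feed_b], [transform, combine]))
```

Because each stage only consumes and yields records, a stage can be added, swapped or reordered without touching the others, which is one simple reading of the "clearly defined" steps described above.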
Keeping this (complex!) definition in mind, let’s examine some common challenges to consider when designing a data pipeline.
1. Getting data where (and how) you need it to be

Achieving a complete picture of your data means getting it to a state where you can draw insights from combined information.
Your tools must support connections to as many data formats and sources as possible, including unstructured data.
The challenge here is to identify the data that is needed in the first place, and the strategy you’ll use to ingest, combine and augment the data within your pipeline.
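As a rough illustration of ingesting, combining and augmenting data that arrives in different formats, the sketch below reads prices from CSV text and metadata from JSON text, then joins them. The function names, the column names (`symbol`, `price`) and the `unit` metadata field are hypothetical; real sources would be files, APIs or streams rather than inline strings.

```python
import csv
import io
import json


def read_csv_prices(text):
    """Ingest rows from CSV text with the header columns symbol,price."""
    for row in csv.DictReader(io.StringIO(text)):
        yield {"symbol": row["symbol"], "price": float(row["price"])}


def read_json_meta(text):
    """Ingest a JSON object that maps each symbol to its metadata."""
    return json.loads(text)


def combine_and_augment(prices, meta):
    """Join each price with its metadata and flag rows that have none."""
    for p in prices:
        info = meta.get(p["symbol"])
        yield {
            **p,
            "unit": info["unit"] if info else None,
            "has_meta": info is not None,
        }


if __name__ == "__main__":
    csv_text = "symbol,price\nGOLD,1500.5\nTIN,17.2\n"
    json_text = '{"GOLD": {"unit": "oz"}}'
    prices = read_csv_prices(csv_text)
    meta = read_json_meta(json_text)
    for row in combine_and_augment(prices, meta):
        print(row)
```

Flagging rows that lack metadata, rather than dropping them, keeps the decision about incomplete data visible further down the pipeline.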
2. Finding a home for your data

Once a pipeline exists, it has to take all of this newly combined data somewhere.
Will it deposit it to an on-prem location? If so, there is a litany of choices you’ll have to make, including where precisely the data will reside and in what format, whether there should be redundancy within the system, and what performance benchmark is needed to meet the service level agreement (SLA).
Your data solution could also utilize managed services, which can be more expensive but are far less variable (and customizable) than running on-prem.
Managed cloud services do, however, offer more tailored support along with scalable storage and memory.
3. Future-proofing your pipeline

Some enterprises are still importing data in all-or-nothing static batches.
But looking to the future is critical when building pipelines that handle data, especially considering the mind-blowing rate at which data is being created.
Your organization might currently draw data from one device, system or set of sensors to power your applications, but it’s unlikely that this will be the case forever.
Again, if you choose to host your solution on-prem, you have to consider the viability of that system far into the future.
There are numerous ways to go about designing and maintaining a viable data pipeline, and there is no silver-bullet solution for every organization.
Those that choose a managed service route to address these challenges may realize the most future-proof and direct path to a solution, while those that keep things on-prem can — with the right competencies and resources — architect the most custom path forward.
But whichever path you choose, it’s important to make decisions quickly and strategically.
Because the accelerating pace of data won’t wait for you to decide.