Processing Time Series Data in Real-Time with InfluxDB and Structured Streaming

This article focuses on how to use the popular open source database InfluxDB together with Spark Structured Streaming to process, store and visualize data in real time. We will go over in detail how to set up a single-node instance of InfluxDB, how to extend Spark's ForeachWriter to write to InfluxDB, and what to keep in mind while designing an InfluxDB database.

By Vibhor Nigam, Dec 15

In the data world, one of the major things people want to see is how a metric progresses with time. This makes managing and handling time series data (data whose values depend on time) a very important part of a data scientist's life.

Many tools and databases have been developed around the idea of handling time series data efficiently. During a recent project I got to explore one such popular open source database, InfluxDB, and this post is about how to process real-time data with InfluxDB and Spark.

InfluxDB

By way of definition, InfluxDB is used as a data store for any use case involving large amounts of time-stamped data, including DevOps monitoring, log data, application metrics, IoT sensor data, and real-time analytics.

Within the scope of this article I will not go into how the database works internally or the algorithms it uses; those details are documented elsewhere. Here I will focus mainly on installation, write and read capacity, writing through Spark, and how InfluxDB behaves as the volume of data grows.

Installation

InfluxDB comes in two versions: an open source edition, which can be installed only on a single instance, and an enterprise edition, which is paid and can be installed on a cluster. For many use cases the open source edition is sufficient.

A single-instance installation of InfluxDB is very simple. The steps I followed differ from those in the documentation (which I found a bit tricky to follow) and are as follows:

1. Download an rpm file of InfluxDB.
2. Install the alien package, if it is not already installed, with “sudo apt-get install alien”.
3. Convert the rpm to a .deb file with “alien name.rpm”.
4. Install InfluxDB with “sudo dpkg -i name.deb”.
5. Start the InfluxDB server with “sudo influxd” or with “sudo service influxdb start”.

Hardware Sizing Guidelines

InfluxDB is generous enough to provide hardware sizing guidelines. The guidelines for a single-node instance are described in detail in the InfluxData documentation (docs.influxdata.com).

InfluxDB Basic Concepts

There are some important InfluxDB concepts to understand here.

1. Measurement: A measurement is loosely equivalent to a table in a relational database. Data is stored inside a measurement, and a database can have multiple measurements. A measurement primarily consists of three types of columns: Time, Tags and Fields.

2. Time: The time column tracks the timestamp of each point so that time series operations can be performed efficiently. By default it holds the InfluxDB time, which has nanosecond precision; however, it can be replaced with the event time.
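To make the measurement, time, tag and field columns more concrete, here is a minimal Python sketch of how one point could be serialized into InfluxDB's line protocol, the text format InfluxDB accepts for writes. The helper name to_line_protocol, the measurement cpu_load and the sample tag and field values are hypothetical and chosen only for illustration; the sketch covers the common escaping rules and is not a complete implementation of the protocol.

```python
def _escape(value: str, chars: str) -> str:
    """Backslash-escape every character from `chars` that occurs in `value`."""
    for ch in chars:
        value = value.replace(ch, "\\" + ch)
    return value


def _format_field(value) -> str:
    """Render a field value using line protocol type conventions."""
    # bool must be checked before int because bool is a subclass of int.
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"  # integers carry an "i" suffix
    if isinstance(value, float):
        return repr(value)
    # Anything else is written as a double-quoted string.
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_line_protocol(measurement, tags, fields, timestamp_ns=None):
    """Build one line: measurement[,tag=value...] field=value[,...] [timestamp]."""
    if not fields:
        raise ValueError("a point needs at least one field")
    head = _escape(measurement, ", ")
    # Tags are emitted sorted by key, which InfluxDB recommends for writes.
    for key in sorted(tags):
        head += f",{_escape(key, ',= ')}={_escape(str(tags[key]), ',= ')}"
    body = ",".join(
        f"{_escape(key, ',= ')}={_format_field(value)}"
        for key, value in fields.items()
    )
    line = f"{head} {body}"
    # Without a timestamp, InfluxDB assigns its own time on write.
    if timestamp_ns is not None:
        line += f" {timestamp_ns}"
    return line


# Hypothetical point: event time supplied in nanoseconds.
print(to_line_protocol(
    "cpu_load",
    {"region": "us-west", "host": "server01"},
    {"value": 0.64, "cores": 4},
    1544832000000000000,
))
# prints: cpu_load,host=server01,region=us-west value=0.64,cores=4i 1544832000000000000
```

Passing timestamp_ns corresponds to replacing the default InfluxDB time with the event time; leaving it out lets the server stamp the point when it is written.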
