An Introduction to Big Data, Apache Spark, and RDDs

An overview of the computing machinery Apache Spark uses to handle big data.

We write code for Apache Spark in Python, R, Scala, or Java, in scripts typically run on platforms built to execute these heavyweight computations.

A Big Number of Big Machines Process Big Data

These platforms that run Spark are typically cloud-based (Microsoft Azure, AWS, Google Cloud, etc.), and the code is usually written in notebooks connected to the cloud environment.

These notebooks are backed by networked clusters of computers that can efficiently process the large datasets your lone laptop or PC wouldn't be able to handle.

(More on the how later.)

The RDD as Spark's Fundamental Abstraction

Apache Spark handles these huge volumes of data through an abstraction called a Resilient Distributed Dataset (RDD).

You'll be hearing this term a lot, as it is the foundation of Spark's robust data processing engine.

At a high level, an RDD is a logical construct that represents our data as a distributed collection of elements we can transform and query; higher-level APIs such as DataFrames and Spark SQL build on top of it to give us the familiar table format and SQL queries.

At its core, it is a Java object with built-in methods (e.g. RDD_example.map(), RDD_example.filter(), etc.) provided by Spark that allow us to manipulate the original data we pass in.

An RDD is merely an abstraction responsible for handling (storing & transforming) the data we pass in.

Under the hood, however, an RDD handles big data by partitioning it into subsets that can be operated on in parallel across various nodes, with replicas of each partition kept to prevent loss of data.
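As a minimal sketch of this partitioning (assuming a SparkContext named sc is already available, as it usually is in a Spark notebook), we can ask Spark how it split a parallelized dataset:

# Assumes `sc` (a SparkContext) already exists in the environment.
data = sc.parallelize(range(10), numSlices=4)   # ask Spark to split the data into 4 partitions
print(data.getNumPartitions())                  # 4
print(data.glom().collect())                    # e.g. [[0, 1], [2, 3, 4], [5, 6], [7, 8, 9]]

Here glom() gathers the elements of each partition into a list, so we can see exactly how the data was sharded across partitions.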

Here's a tangible example of how data is represented by an RDD.

Let's Give Spark 100 GB of Data

Assume we have 5 worker nodes available to us in this setting.

Spark recognizes that it would be inefficient to give all 100 GB to one worker node and leave the other four empty.

Instead, it chooses to partition out this original dataset across the worker nodes to evenly distribute the workload.

Spark decides to split the data into 100 partitions, each partition holding a different 1 GB chunk of the dataset.

Now Node #1 gets the first 20 GB, Node #2 gets 21–40 GB, Node #3 gets 41–60 GB, etc.

But not all data is created equal: some partitions might be faster to process than others, and some might be so burdensome that they crash a node.

So Spark also gives Node #1 a copy of the 21–40 GB and 41–60 GB partitions, Node #2 a copy of the 41–60 GB and 61–80 GB partitions, and so on.

So while each worker node has its original 20 GB it is primarily responsible for, it also stores copies of other GB partitions.

As such, Spark effectively partitions data across nodes so that worker nodes can compute on the data in parallel while avoiding idle nodes.

That is, if a worker node finishes its computation early, it can “pick up the slack” from another worker node and support computation of another GB partition.

From the example we can see the following takeaways:

The data is resilient: by storing multiple copies of the data, each on different nodes, we avoid losing data in the rare case that one or more nodes fail.

The data is distributed: the original 100 GB has been partitioned, or sharded, across multiple nodes.

This gives each node less data to work with, resulting in faster overall operation.

The data is a dataset: a data structure holding the records we pass in, which we can manipulate through Spark's built-in methods. (When that data is wrapped in a DataFrame object, organized into named columns in a table format, richer optimizations become available.)

This gives rise to the name Resilient Distributed Dataset.

Of course, there are more nuances to this process, but the example above illustrates, in simplified form, how Spark achieves efficiency with RDDs.
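As a hedged sketch of the replication idea (again assuming an existing SparkContext sc), Spark lets us explicitly ask for two in-memory copies of every cached partition through a storage level:

from pyspark import StorageLevel

# Assumes `sc` (a SparkContext) already exists.
big_rdd = sc.parallelize(range(1000), numSlices=100)   # 100 partitions, mirroring the example above
big_rdd.persist(StorageLevel.MEMORY_ONLY_2)            # the "_2" suffix keeps a replica of each partition on a second node
print(big_rdd.count())                                 # an action, which materializes and caches the partitions

In practice, Spark can also recompute a lost partition from its recorded lineage, so replication is just one of the ways an RDD stays resilient.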

A 4-Line Code Example for RDDs

RDD_example = sc.parallelize([1, 2, 3])
o = RDD_example.filter(lambda i: i % 2).map(lambda i: i * 2)
result = o.collect()
print(result)   ## Output: [2, 6]

Note that on the first line, with the parallelize method, we are creating the RDD from the list of integers we pass in.

Think of parallelizing as just that: creating the RDD so that we can operate on our data in parallel.

We then apply two additional methods to the RDD we created, both of which return another RDD object.

Filter and map both come from the MapReduce style of computation, and each takes an anonymous function as its argument.

In this case, we filter the RDD down to the odd elements, then map a doubling function over those elements.
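To make the MapReduce connection concrete, here is a short sketch of the classic word count, again assuming an existing SparkContext sc (the input strings are made up purely for illustration):

# Hypothetical input data, purely for illustration.
lines = sc.parallelize(["big data", "big spark", "spark spark"])
counts = (lines.flatMap(lambda line: line.split())   # "map" side: break lines into words
               .map(lambda word: (word, 1))          # emit (word, 1) pairs
               .reduceByKey(lambda a, b: a + b))     # "reduce" side: sum the counts per word
print(counts.collect())                              # e.g. [('big', 2), ('data', 1), ('spark', 3)]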

Lazy Evaluation & DAGs

At this point, the methods of the RDD seem identical to ones we've seen in introductory CS courses.

So why are we going through all this effort to filter and map functions onto a list? To see how Spark computes lazily.

The example above, in the final lines where we call collect, succinctly shows the difference between a Spark transformation and a Spark action.

Here is where we can dive into a deeper layer of understanding: the .filter() and .map() transformations from earlier are not actually performed right away; instead, they are placed into an execution plan, only to be activated later by an action.

The .collect() action method is what actually executes the entire sequence and starts the computation across the nodes.
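A small sketch of this laziness (assuming sc exists): the transformations below return almost instantly because nothing is computed yet, and only the action at the end triggers actual work.

import time

# Assumes `sc` (a SparkContext) already exists.
rdd = sc.parallelize(range(10_000_000))

start = time.time()
transformed = rdd.filter(lambda i: i % 2).map(lambda i: i * 2)   # transformations: merely recorded
print(f"transformations took {time.time() - start:.4f}s")        # effectively instantaneous

start = time.time()
total = transformed.count()                                      # action: the computation actually runs
print(f"action took {time.time() - start:.4f}s, count = {total}")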

[Image: the local environment's execution plan as a Directed Acyclic Graph (DAG)]

The execution plan that the transformations build up in our local environment is stored as a directed acyclic graph (DAG).

That is, the steps of the computation flow in one continuous direction, never looping back on themselves.

Long story short, the transformation methods are RDD methods that return RDDs and are recorded in this DAG.

The action methods are typically the final operation, the one that starts the Spark engine's computation.

Transformations add to the DAG in the local environment, and upon an action call the DAG is sent to the master Spark interpreter in the main driver program.
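If you want to peek at the plan Spark has recorded so far, an RDD exposes its lineage through toDebugString(); a small sketch, assuming sc exists:

# Assumes `sc` (a SparkContext) already exists.
rdd = sc.parallelize([1, 2, 3])
o = rdd.filter(lambda i: i % 2).map(lambda i: i * 2)

# In PySpark, toDebugString() returns the recorded lineage as bytes;
# it lists the chain of RDDs (e.g. a PythonRDD built on a ParallelCollectionRDD)
# without running any computation.
print(o.toDebugString().decode("utf-8"))

Nothing in that output is computed data; it is only the recipe that an action will later execute.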

The Big Picture

In the larger scope of how Spark operates and how the RDD fits into the framework, the diagram below captures the core of how a central driver program distributes work across the worker nodes.

The sc object, or SparkContext, that we call the initial .parallelize() method on is the built-in piece of Apache Spark that lives in the main Driver Program.

This SparkContext object is what's given to us so that we can create the RDD in the first place.

From there, the master Driver Program receives the DAG execution plan from your PySpark script, serializes the code (i.e., turns it into bits and bytes), and sends the respective partitions to the worker nodes, which receive them through their API calls.

The workers then each execute the operations assigned to them from the DAG.

This gives rise to Spark’s ability to compute in parallel and efficiently handle big data.
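For completeness, here is a hedged sketch of how that driver-side sc comes into existence when you are not in a managed notebook; the app name and the local master setting are just illustrative choices:

from pyspark import SparkConf, SparkContext

# Hypothetical configuration: run the driver locally with 4 worker threads.
conf = SparkConf().setAppName("rdd_intro_example").setMaster("local[4]")
sc = SparkContext(conf=conf)

RDD_example = sc.parallelize([1, 2, 3])
print(RDD_example.filter(lambda i: i % 2).map(lambda i: i * 2).collect())   # [2, 6]

sc.stop()   # shut down the driver's connection to the (local) cluster

With that, the driver program, the SparkContext, and the RDD abstraction come together to let Spark spread both our data and our work across a cluster.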
