Scalable Log Analytics with Apache Spark — A Comprehensive Case-Study

In this tutorial, our main objective is focusing on one of the most popular case studies in the industry — log analytics.

Typically, server logs are a very common data source in enterprises and often contain a gold mine of actionable insights and information.

Log data comes from many sources in an enterprise, such as the web, client and compute servers, applications, user-generated content, flat files.

They can be used for monitoring servers, improving business and customer intelligence, building recommendation systems, fraud detection, and much more.

Spark allows you to dump and store your logs in files on disk cheaply, while still providing rich APIs to perform data analysis at scale.

This hands-on case study will show you how to use Apache Spark on real-world production logs from NASA and learn data wrangling and basic yet powerful techniques in exploratory data analysis.

In this case study, we will analyze log datasets from NASA Kennedy Space Center web server in Florida.

The full data set is freely available for download here.

These two datasets contain two months’ worth of all HTTP requests to the NASA Kennedy Space Center WWW server in Florida.

You can head over to the website and download the following files as needed (or click on the following links directly).

Jul 01 to Jul 31, ASCII format, 20.

7 MB gzip compressed, 205.

2 MB uncompressed: ftp://ita.




gzAug 04 to Aug 31, ASCII format, 21.

8 MB gzip compressed, 167.

8 MB uncompressed: ftp://ita.




gzMake sure both the files are in the same directory as the notebook containing the tutorial which is available on my GitHub.

Setting up DependenciesThe first step is to make sure you have access to a Spark session and cluster.

For this you can use your own local setup or a cloud based setup.

Typically most cloud platforms will provide a Spark cluster these days and you also have free options including Databricks community edition.

This tutorial assumes you already have Spark setup hence we will not be spending additional time configuring or setting up Spark from scratch.

Often pre-configured Spark setups already have the necessary environment variables or dependencies pre-loaded when you start your jupyter notebook server.

In my case, I can check them using the following commands in my notebook.

sparkThis shows me that my cluster is running Spark 2.


0 at the moment.

We can also check if sqlContext is present using the following code.




SQLContext at 0x7fb1577b6400>Now in case you don’t have these variables pre-configured and get an error, you can load them up and configure them using the following code.

Besides this we also load up some other libraries for working with dataframes and regular expressions.

Working with regular expressions will be one of the major aspects of parsing log files.

Regular expressions are a really powerful pattern matching technique which can be used to extract and find patterns in semi-structured and unstructured data.

Source: xkcdRegular expressions can be extremely effective and powerful, yet they can sometimes be overwhelming or confusing.

Not to worry though, with more practice, you can really leveraging its maximum potential.

The following example showcases a way of using regular expressions in Python.


SRE_Match object; span=(0, 25), match="I'm searching for a spark"> 0 25<_sre.

SRE_Match object; span=(25, 36), match=' in PySpark'> 25 36Let’s move on to the next part of our analysis.

Loading and Viewing the NASA Log DatasetGiven that our data is stored in the following mentioned path (in the form of flat files), let’s load it into a DataFrame.

We’ll do this in steps.

The following code get’s us the log data file names in our disk.


gz', 'NASA_access_log_Aug95.

gz']Now, we’ll use sqlContext.


text() or spark.


text() to read the text file.

This will produce a DataFrame with a single string column called value.

root |– value: string (nullable = true)This allows us to see the schema for our log data which apparently looks like text data which we shall inspect soon.

You can view the type of data structure holding our log data using the following code.




DataFrameWe will be using Spark DataFrames throughout our tutorial.

However if you want, you can also convert a dataframe into an RDD if needed, Spark’s original data structure (resilient distributed datasets).

base_df_rdd = base_df.



RDDLet’s now take a peek at the actual log data in our dataframe.


show(10, truncate=False)This definitely looks like standard server log data which is semi-structured and we will definitely need to do some data processing and wrangling before this can be useful.

Do remember accessing data from RDDs is slightly different as seen below.


take(10)Now that we have loaded up and viewed our log data, let’s process and wrangle it.

Data WranglingIn this section, we will try and clean and parse our log dataset to really extract structured attributes with meaningful information from each log message.

Log Data UnderstandingIf you’re familiar with web server logs, you’ll recognize that the above displayed data is in Common Log Format.

The fields are: remotehost rfc931 authuser [date] "request" status bytesWe will need to use some specific techniques to parse, match and extract these attributes from the log data.

Data Parsing and Extraction with Regular ExpressionsNext, we have to parse our semi-structured log data into individual columns.

We’ll use the special built-in regexp_extract() function to do the parsing.

This function matches a column against a regular expression with one or more capture groups and allows you to extract one of the matched groups.

We’ll use one regular expression for each field we wish to extract.

You must have heard or used a fair bit of regular expressions by now.

If you find regular expressions confusing (and they certainly can be), and you want to learn more about them, we recommend checking out the RegexOne web site.

You might also find Regular Expressions Cookbook, by Goyvaerts and Levithan, to be useful as a reference.

Let’s take a look at the total number of logs we are working with in our dataset.


count(), len(base_df.

columns)))#Output(3461613, 1)Looks like we have a total of approximately 3.

46 million log messages.

Not a small number!.Let’s extract and take a look at some sample log messages.

Extracting host namesLet’s try and write some regular expressions to extract the host name from the logs.




55', 'unicomp6.


net', '199.



21', 'burger.


com', .

, .

, 'unicomp6.


net', 'd104.


net', 'd104.


net']Extracting timestampsLet’s now try and use regular expressions to extract the timestamp fields from the logs['01/Jul/1995:00:00:01 -0400', '01/Jul/1995:00:00:06 -0400', '01/Jul/1995:00:00:09 -0400', .

, .

, '01/Jul/1995:00:00:14 -0400', '01/Jul/1995:00:00:15 -0400', '01/Jul/1995:00:00:15 -0400']Extracting HTTP Request Method, URIs and ProtocolLet’s now try and use regular expressions to extract the HTTP request methods, URIs and Protocol patterns fields from the logs.

[('GET', '/history/apollo/', 'HTTP/1.

0'), ('GET', '/shuttle/countdown/', 'HTTP/1.

0'), .

, .

, ('GET', '/shuttle/countdown/count.

gif', 'HTTP/1.

0'), ('GET', '/images/NASA-logosmall.

gif', 'HTTP/1.

0')]Extracting HTTP Status CodesLet’s now try and use regular expressions to extract the HTTP status codes from the logs.

['200', '200', '200', '304', .

, '200', '200']Extracting HTTP Response Content SizeLet’s now try and use regular expressions to extract the HTTP response content size from the logs.

['6245', '3985', '4085', '0', .

, '1204', '40310', '786']Putting it all togetherLet’s now try and leverage all the regular expression patterns we previously built and use the regexp_extract(.

) method to build our dataframe with all the log attributes neatly extracted in their own separate columns.

Finding Missing ValuesMissing and null values are the bane of data analysis and machine learning.

Let’s see how well our data parsing and extraction logic worked.

First, let’s verify that there are no null rows in the original dataframe.

0All good!.Now, if our data parsing and extraction worked properly, we should not have any rows with potential null values.

Let’s try and put that to test!33905Ouch!.Looks like we have over 33K missing values in our data!.Can we handle this?Do remember, this is not a regular pandas dataframe which you can directly query and get which columns have null.

Our so-called big dataset is residing on disk which can potentially be present in multiple nodes in a spark cluster.

So how do we find out which columns have potential nulls?Finding Null CountsWe can typically use the following technique to find out which columns have null values.

(Note: This approach is adapted from an excellent answer on StackOverflow.

)Well, looks like we have one missing value in the status column and everything else is in the content_size column.

Let's see if we can figure out what's wrong!Handling nulls in HTTP statusOur original parsing regular expression for the status column was:regexp_extract('value', r's(d{3})s', 1).

cast('integer') .

alias( 'status')Could it be that there are more digits making our regular expression wrong?.or is the data point itself bad?.Let’s try and find out!Note: In the expression below, ~ means "not".

1Let’s look at what this bad record looks like!null_status_df.

show(truncate=False)Looks like a record with a lot of missing information!.Let’s pass this through our log data parsing pipeline.

Looks like the record itself is an incomplete record with no useful information, the best option would be to drop this record as follows!Handling nulls in HTTP content sizeBased on our previous regular expression, our original parsing regular expression for the content_size column was:regexp_extract('value', r's(d+)$', 1).

cast('integer') .

alias('content_size')Could there be missing data in our original dataset itself?.Let’s try and find out!.We first try to find out the records in our base dataframe with potential missing content sizes.

33905The number seems to match the number of missing content size values in our processed dataframe.

Let’s take a look at the top ten records of our data frame having missing content sizes.


take(10)It is quite evident that the bad raw data records correspond to error responses, where no content was sent back and the server emitted a “-" for the content_size field.

Since we don’t want to discard those rows from our analysis, let’s impute or fill them to 0.

Fix the rows with null content_sizeThe easiest solution is to replace the null values in logs_df with 0 like we discussed earlier.

The Spark DataFrame API provides a set of functions and fields specifically designed for working with null values, among them:fillna(), which fills null values with specified non-null values.

na, which returns a DataFrameNaFunctions object with many functions for operating on null columns.

There are several ways to invoke this function.

The easiest is just to replace all null columns with known values.

But, for safety, it’s better to pass a Python dictionary containing (column_name, value) mappings.

That’s what we’ll do.

A sample example from the documentation is depicted below>>> df4.


fill({'age': 50, 'name': 'unknown'}).

show()+—+——+——-+|age|height| name|+—+——+——-+| 10| 80| Alice|| 5| null| Bob|| 50| null| Tom|| 50| null|unknown|+—+——+——-+Now we use this function and fill all the missing values in the content_size field with 0!Look at that, no missing values!Handling Temporal Fields (Timestamp)Now that we have a clean, parsed DataFrame, we have to parse the timestamp field into an actual timestamp.

The Common Log Format time is somewhat non-standard.

A User-Defined Function (UDF) is the most straightforward way to parse it.

Let’s now use this function to parse our time column in our dataframe.

Things seem to be looking good!.Let’s verify this by checking the schema of our dataframe.


printSchema()root |– host: string (nullable = true) |– method: string (nullable = true) |– endpoint: string (nullable = true) |– protocol: string (nullable = true) |– status: integer (nullable = true) |– content_size: integer (nullable = false) |– time: timestamp (nullable = true)Let’s now cache logs_df since we will be using it extensively for our data analysis section in the next part!logs_df.

cache()Data Analysis on our Web LogsNow that we have a DataFrame containing the parsed and cleaned log file as a data frame, we can perform some interesting exploratory data analysis (EDA) to try and get some interesting insights!Content Size StatisticsLet’s compute some statistics about the sizes of content being returned by the web server.

In particular, we’d like to know what are the average, minimum, and maximum content sizes.

We can compute the statistics by calling .

describe() on the content_size column of logs_df.

The .

describe() function returns the count, mean, stddev, min, and max of a given column.

Alternatively, we can use SQL to directly calculate these statistics.

You can explore many useful functions within the pyspark.


functions module in the documentation.

After we apply the .

agg() function, we call toPandas() to extract and convert the result into a pandas dataframe which has better formatting on Jupyter notebooks.

We can validate the results and see they are the same as expected.

HTTP Status Code AnalysisNext, let’s look at the status code values that appear in the log.

We want to know which status code values appear in the data and how many times.

We again start with logs_df, then group by the status column, apply the .

count() aggregation function, and sort by the status column.

Total distinct HTTP Status Codes: 8Looks like we have a total of 8 distinct HTTP status codes.

Let’s take a look at their occurrences in the form of a frequency table.

Looks like status code 200 OK is the most frequent code which is a good sign that things have been working normally most of the time.

Let’s visualize this.

HTTP Status Code occurrencesNot too bad!.But several status codes are almost not visible due to the huge skew in the data.

Let’s take a log transform and see if things improve.

The results definitely look good and seem to have handled the skewness, let’s verify this by visualizing this data.

HTTP Status Code occurrences — Log TransformedThis definitely looks much better and less skewed!Analyzing Frequent HostsLet’s look at hosts that have accessed the server frequently.

We will try to get the count of total accesses by each host and then sort by the counts and display only the top ten most frequent hosts.

This looks good but let’s inspect the blank record in row number 9 more closely.

host_sum_pd_df = host_sum_df.


iloc[8]['host']''Looks like we have some empty strings as one of the top host names!. More details

Leave a Reply