Scalable Web Server Log Analytics with Apache Spark

Apache Spark is an excellent framework for wrangling, analyzing, and modeling structured and unstructured data at scale. In this tutorial, we will focus on one of the most popular industry case studies: log analytics.

Server logs are a common data source in enterprises and often contain a gold mine of actionable insights. Log data comes from many sources, such as web, client, and compute servers, applications, user-generated content, and flat files. It can be used for monitoring servers, improving business and customer intelligence, building recommendation systems, detecting fraud, and much more.

Spark lets you store your logs cheaply in files on disk while still providing rich APIs to analyze them at scale. This hands-on tutorial shows you how to use Apache Spark on real-world production logs from NASA, covering data wrangling and simple yet powerful techniques for exploratory data analysis.

Part 1 - Setting up Dependencies

In [1]:
spark
Out[1]:

SparkSession - hive

SparkContext
Version: v2.4.0 | Master: local[*] | AppName: PySparkShell
In [2]:
sqlContext
Out[2]:
<pyspark.sql.context.SQLContext at 0x7fb1577b6400>
In [3]:
from pyspark.context import SparkContext
from pyspark.sql.context import SQLContext
from pyspark.sql.session import SparkSession

# Note: in a PySpark shell or a notebook kernel that already provides sc, sqlContext, and spark
# (as shown above), you can skip this cell; creating a second SparkContext would raise an error.
sc = SparkContext()
sqlContext = SQLContext(sc)
spark = SparkSession(sc)
In [4]:
import re
import pandas as pd

Basic Regular Expressions

In [5]:
m = re.finditer(r'.*?(spark).*?', "I'm searching for a spark in PySpark", re.I)
for match in m:
    print(match, match.start(), match.end())
<_sre.SRE_Match object; span=(0, 25), match="I'm searching for a spark"> 0 25
<_sre.SRE_Match object; span=(25, 36), match=' in PySpark'> 25 36
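
The rest of this tutorial relies on re.search() with capture groups rather than finditer(). Here is a minimal sketch of that idiom on the same string (the printed values are what we'd expect, but verify in your own environment):

m = re.search(r'(spark)', "I'm searching for a spark in PySpark", re.I)
if m:
    # group(0) is the full match, group(1) is the text captured by the parentheses
    print(m.group(0), m.group(1), m.start(), m.end())   # expect: spark spark 20 25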

In this case study, we will analyze log datasets from the NASA Kennedy Space Center web server in Florida. The full dataset is freely available for download online.

These two datasets contain two months' worth of all HTTP requests to the NASA Kennedy Space Center WWW server in Florida. Head over to the download site and grab both gzipped log files (one for July 1995 and one for August 1995) if you haven't already.

Make sure both files are in the same directory as this notebook.

Part 2 - Loading and Viewing the NASA Log Dataset

Now that the data files are in place, let's load them into a DataFrame. We'll do this in steps. First, we'll use sqlContext.read.text() or spark.read.text() to read the text files. This produces a DataFrame with a single string column called value.

In [6]:
import glob

raw_data_files = glob.glob('*.gz')
raw_data_files
Out[6]:
['NASA_access_log_Jul95.gz', 'NASA_access_log_Aug95.gz']

Taking a look at the schema of our DataFrame

In [7]:
base_df = spark.read.text(raw_data_files)
base_df.printSchema()
root
 |-- value: string (nullable = true)

In [8]:
type(base_df)
Out[8]:
pyspark.sql.dataframe.DataFrame

You can also convert a dataframe to an RDD if needed

In [9]:
base_df_rdd = base_df.rdd
type(base_df_rdd)
Out[9]:
pyspark.rdd.RDD

Viewing sample data in our dataframe

Looks like it needs to be wrangled and parsed!

In [10]:
base_df.show(10, truncate=False)
+-----------------------------------------------------------------------------------------------------------------------+
|value                                                                                                                  |
+-----------------------------------------------------------------------------------------------------------------------+
|199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245                                 |
|unicomp6.unicomp.net - - [01/Jul/1995:00:00:06 -0400] "GET /shuttle/countdown/ HTTP/1.0" 200 3985                      |
|199.120.110.21 - - [01/Jul/1995:00:00:09 -0400] "GET /shuttle/missions/sts-73/mission-sts-73.html HTTP/1.0" 200 4085   |
|burger.letters.com - - [01/Jul/1995:00:00:11 -0400] "GET /shuttle/countdown/liftoff.html HTTP/1.0" 304 0               |
|199.120.110.21 - - [01/Jul/1995:00:00:11 -0400] "GET /shuttle/missions/sts-73/sts-73-patch-small.gif HTTP/1.0" 200 4179|
|burger.letters.com - - [01/Jul/1995:00:00:12 -0400] "GET /images/NASA-logosmall.gif HTTP/1.0" 304 0                    |
|burger.letters.com - - [01/Jul/1995:00:00:12 -0400] "GET /shuttle/countdown/video/livevideo.gif HTTP/1.0" 200 0        |
|205.212.115.106 - - [01/Jul/1995:00:00:12 -0400] "GET /shuttle/countdown/countdown.html HTTP/1.0" 200 3985             |
|d104.aa.net - - [01/Jul/1995:00:00:13 -0400] "GET /shuttle/countdown/ HTTP/1.0" 200 3985                               |
|129.94.144.152 - - [01/Jul/1995:00:00:13 -0400] "GET / HTTP/1.0" 200 7074                                              |
+-----------------------------------------------------------------------------------------------------------------------+
only showing top 10 rows

Getting data from an RDD is slightly different. Notice how the data representation differs in the RDD below.

In [11]:
base_df_rdd.take(10)
Out[11]:
[Row(value='199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245'),
 Row(value='unicomp6.unicomp.net - - [01/Jul/1995:00:00:06 -0400] "GET /shuttle/countdown/ HTTP/1.0" 200 3985'),
 Row(value='199.120.110.21 - - [01/Jul/1995:00:00:09 -0400] "GET /shuttle/missions/sts-73/mission-sts-73.html HTTP/1.0" 200 4085'),
 Row(value='burger.letters.com - - [01/Jul/1995:00:00:11 -0400] "GET /shuttle/countdown/liftoff.html HTTP/1.0" 304 0'),
 Row(value='199.120.110.21 - - [01/Jul/1995:00:00:11 -0400] "GET /shuttle/missions/sts-73/sts-73-patch-small.gif HTTP/1.0" 200 4179'),
 Row(value='burger.letters.com - - [01/Jul/1995:00:00:12 -0400] "GET /images/NASA-logosmall.gif HTTP/1.0" 304 0'),
 Row(value='burger.letters.com - - [01/Jul/1995:00:00:12 -0400] "GET /shuttle/countdown/video/livevideo.gif HTTP/1.0" 200 0'),
 Row(value='205.212.115.106 - - [01/Jul/1995:00:00:12 -0400] "GET /shuttle/countdown/countdown.html HTTP/1.0" 200 3985'),
 Row(value='d104.aa.net - - [01/Jul/1995:00:00:13 -0400] "GET /shuttle/countdown/ HTTP/1.0" 200 3985'),
 Row(value='129.94.144.152 - - [01/Jul/1995:00:00:13 -0400] "GET / HTTP/1.0" 200 7074')]

Part 3 - Data Wrangling

In this section, we will clean and parse our log dataset to extract structured attributes with meaningful information from each log message.

Data understanding

If you're familiar with web server logs, you'll recognize that the data displayed above is in the Common Log Format.

The fields are: remotehost rfc931 authuser [date] "request" status bytes

field meaning
remotehost Remote hostname (or IP number if DNS hostname is not available or if DNSLookup is off).
rfc931 The remote logname of the user, if present.
authuser The username of the remote user after authentication by the HTTP server.
[date] Date and time of the request.
"request" The request, exactly as it came from the browser or client.
status The HTTP status code the server sent back to the client.
bytes The number of bytes (Content-Length) transferred to the client.

We will need some specific techniques to parse, match, and extract these attributes from the log data.
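
As a quick illustration (a sketch only; the sections below build and test one pattern per field), a single combined regular expression can split a sample line from the output above into the fields in this table:

# Hypothetical combined Common Log Format pattern, shown only for illustration
clf_pattern = r'^(\S+) (\S+) (\S+) \[(.*?)\] "(.*?)" (\d{3}) (\S+)$'

line = '199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245'
match = re.match(clf_pattern, line)
if match:
    remotehost, rfc931, authuser, date, request, status, bytes_sent = match.groups()
    print(remotehost, date, request, status, bytes_sent)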

Data Parsing and Extraction with Regular Expressions

Next, we have to parse the log data into individual columns. We'll use the built-in regexp_extract() function to do the parsing. It matches a column against a regular expression with one or more capture groups and lets you extract one of the matched groups. We'll use one regular expression for each field we wish to extract.

You have probably seen or used a fair number of regular expressions by now. If you find them confusing (and they certainly can be) and want to learn more, we recommend the RegexOne website. You might also find Regular Expressions Cookbook, by Goyvaerts and Levithan, useful as a reference.
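
Before applying regexp_extract() to the full dataset, here is a minimal sketch of how it behaves on a tiny, throwaway DataFrame (the toy data and pattern here are only for illustration):

from pyspark.sql.functions import regexp_extract

# One-column toy DataFrame mimicking the 'value' column produced by spark.read.text()
toy_df = spark.createDataFrame([("GET /ksc.html HTTP/1.0",)], ['value'])

# Group 2 of the pattern captures the second whitespace-separated token, i.e. the endpoint
toy_df.select(regexp_extract('value', r'^(\S+)\s(\S+)\s(\S+)$', 2).alias('endpoint')).show()
# expect a single row containing /ksc.html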

Let's take a look at our dataset dimensions

In [12]:
print((base_df.count(), len(base_df.columns)))
(3461613, 1)

Let's extract and take a look at some sample log messages

In [13]:
sample_logs = [item['value'] for item in base_df.take(15)]
sample_logs
Out[13]:
['199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245',
 'unicomp6.unicomp.net - - [01/Jul/1995:00:00:06 -0400] "GET /shuttle/countdown/ HTTP/1.0" 200 3985',
 '199.120.110.21 - - [01/Jul/1995:00:00:09 -0400] "GET /shuttle/missions/sts-73/mission-sts-73.html HTTP/1.0" 200 4085',
 'burger.letters.com - - [01/Jul/1995:00:00:11 -0400] "GET /shuttle/countdown/liftoff.html HTTP/1.0" 304 0',
 '199.120.110.21 - - [01/Jul/1995:00:00:11 -0400] "GET /shuttle/missions/sts-73/sts-73-patch-small.gif HTTP/1.0" 200 4179',
 'burger.letters.com - - [01/Jul/1995:00:00:12 -0400] "GET /images/NASA-logosmall.gif HTTP/1.0" 304 0',
 'burger.letters.com - - [01/Jul/1995:00:00:12 -0400] "GET /shuttle/countdown/video/livevideo.gif HTTP/1.0" 200 0',
 '205.212.115.106 - - [01/Jul/1995:00:00:12 -0400] "GET /shuttle/countdown/countdown.html HTTP/1.0" 200 3985',
 'd104.aa.net - - [01/Jul/1995:00:00:13 -0400] "GET /shuttle/countdown/ HTTP/1.0" 200 3985',
 '129.94.144.152 - - [01/Jul/1995:00:00:13 -0400] "GET / HTTP/1.0" 200 7074',
 'unicomp6.unicomp.net - - [01/Jul/1995:00:00:14 -0400] "GET /shuttle/countdown/count.gif HTTP/1.0" 200 40310',
 'unicomp6.unicomp.net - - [01/Jul/1995:00:00:14 -0400] "GET /images/NASA-logosmall.gif HTTP/1.0" 200 786',
 'unicomp6.unicomp.net - - [01/Jul/1995:00:00:14 -0400] "GET /images/KSC-logosmall.gif HTTP/1.0" 200 1204',
 'd104.aa.net - - [01/Jul/1995:00:00:15 -0400] "GET /shuttle/countdown/count.gif HTTP/1.0" 200 40310',
 'd104.aa.net - - [01/Jul/1995:00:00:15 -0400] "GET /images/NASA-logosmall.gif HTTP/1.0" 200 786']

Extracting host names

Let's try and write some regular expressions to extract the host name from the logs

In [14]:
host_pattern = r'(^\S+\.[\S+\.]+\S+)\s'
hosts = [re.search(host_pattern, item).group(1)
           if re.search(host_pattern, item)
           else 'no match'
           for item in sample_logs]
hosts
Out[14]:
['199.72.81.55',
 'unicomp6.unicomp.net',
 '199.120.110.21',
 'burger.letters.com',
 '199.120.110.21',
 'burger.letters.com',
 'burger.letters.com',
 '205.212.115.106',
 'd104.aa.net',
 '129.94.144.152',
 'unicomp6.unicomp.net',
 'unicomp6.unicomp.net',
 'unicomp6.unicomp.net',
 'd104.aa.net',
 'd104.aa.net']

Extracting timestamps

Let's now try and use regular expressions to extract the timestamp fields from the logs

In [15]:
ts_pattern = r'\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} -\d{4})]'
timestamps = [re.search(ts_pattern, item).group(1) for item in sample_logs]
timestamps
Out[15]:
['01/Jul/1995:00:00:01 -0400',
 '01/Jul/1995:00:00:06 -0400',
 '01/Jul/1995:00:00:09 -0400',
 '01/Jul/1995:00:00:11 -0400',
 '01/Jul/1995:00:00:11 -0400',
 '01/Jul/1995:00:00:12 -0400',
 '01/Jul/1995:00:00:12 -0400',
 '01/Jul/1995:00:00:12 -0400',
 '01/Jul/1995:00:00:13 -0400',
 '01/Jul/1995:00:00:13 -0400',
 '01/Jul/1995:00:00:14 -0400',
 '01/Jul/1995:00:00:14 -0400',
 '01/Jul/1995:00:00:14 -0400',
 '01/Jul/1995:00:00:15 -0400',
 '01/Jul/1995:00:00:15 -0400']

Extracting HTTP Request Method, URIs and Protocol

Let's now use regular expressions to extract the HTTP request method, URI, and protocol from the logs.

In [16]:
method_uri_protocol_pattern = r'\"(\S+)\s(\S+)\s*(\S*)\"'
method_uri_protocol = [re.search(method_uri_protocol_pattern, item).groups()
               if re.search(method_uri_protocol_pattern, item)
               else 'no match'
              for item in sample_logs]
method_uri_protocol
Out[16]:
[('GET', '/history/apollo/', 'HTTP/1.0'),
 ('GET', '/shuttle/countdown/', 'HTTP/1.0'),
 ('GET', '/shuttle/missions/sts-73/mission-sts-73.html', 'HTTP/1.0'),
 ('GET', '/shuttle/countdown/liftoff.html', 'HTTP/1.0'),
 ('GET', '/shuttle/missions/sts-73/sts-73-patch-small.gif', 'HTTP/1.0'),
 ('GET', '/images/NASA-logosmall.gif', 'HTTP/1.0'),
 ('GET', '/shuttle/countdown/video/livevideo.gif', 'HTTP/1.0'),
 ('GET', '/shuttle/countdown/countdown.html', 'HTTP/1.0'),
 ('GET', '/shuttle/countdown/', 'HTTP/1.0'),
 ('GET', '/', 'HTTP/1.0'),
 ('GET', '/shuttle/countdown/count.gif', 'HTTP/1.0'),
 ('GET', '/images/NASA-logosmall.gif', 'HTTP/1.0'),
 ('GET', '/images/KSC-logosmall.gif', 'HTTP/1.0'),
 ('GET', '/shuttle/countdown/count.gif', 'HTTP/1.0'),
 ('GET', '/images/NASA-logosmall.gif', 'HTTP/1.0')]

Extracting HTTP Status Codes

Let's now try and use regular expressions to extract the HTTP status codes from the logs

In [18]:
status_pattern = r'\s(\d{3})\s'
status = [re.search(status_pattern, item).group(1) for item in sample_logs]
print(status)
['200', '200', '200', '304', '200', '304', '200', '200', '200', '200', '200', '200', '200', '200', '200']

Extracting HTTP Response Content Size

Let's now try and use regular expressions to extract the HTTP response content size from the logs

In [19]:
content_size_pattern = r'\s(\d+)$'
content_size = [re.search(content_size_pattern, item).group(1) for item in sample_logs]
print(content_size)
['6245', '3985', '4085', '0', '4179', '0', '0', '3985', '3985', '7074', '40310', '786', '1204', '40310', '786']

Putting it all together

Let's now put all the regular expression patterns we previously built to work, using the regexp_extract() method to build our DataFrame with each log attribute neatly extracted into its own column.

In [20]:
from pyspark.sql.functions import regexp_extract

logs_df = base_df.select(regexp_extract('value', host_pattern, 1).alias('host'),
                         regexp_extract('value', ts_pattern, 1).alias('timestamp'),
                         regexp_extract('value', method_uri_protocol_pattern, 1).alias('method'),
                         regexp_extract('value', method_uri_protocol_pattern, 2).alias('endpoint'),
                         regexp_extract('value', method_uri_protocol_pattern, 3).alias('protocol'),
                         regexp_extract('value', status_pattern, 1).cast('integer').alias('status'),
                         regexp_extract('value', content_size_pattern, 1).cast('integer').alias('content_size'))
logs_df.show(10, truncate=True)
print((logs_df.count(), len(logs_df.columns)))
+--------------------+--------------------+------+--------------------+--------+------+------------+
|                host|           timestamp|method|            endpoint|protocol|status|content_size|
+--------------------+--------------------+------+--------------------+--------+------+------------+
|        199.72.81.55|01/Jul/1995:00:00...|   GET|    /history/apollo/|HTTP/1.0|   200|        6245|
|unicomp6.unicomp.net|01/Jul/1995:00:00...|   GET| /shuttle/countdown/|HTTP/1.0|   200|        3985|
|      199.120.110.21|01/Jul/1995:00:00...|   GET|/shuttle/missions...|HTTP/1.0|   200|        4085|
|  burger.letters.com|01/Jul/1995:00:00...|   GET|/shuttle/countdow...|HTTP/1.0|   304|           0|
|      199.120.110.21|01/Jul/1995:00:00...|   GET|/shuttle/missions...|HTTP/1.0|   200|        4179|
|  burger.letters.com|01/Jul/1995:00:00...|   GET|/images/NASA-logo...|HTTP/1.0|   304|           0|
|  burger.letters.com|01/Jul/1995:00:00...|   GET|/shuttle/countdow...|HTTP/1.0|   200|           0|
|     205.212.115.106|01/Jul/1995:00:00...|   GET|/shuttle/countdow...|HTTP/1.0|   200|        3985|
|         d104.aa.net|01/Jul/1995:00:00...|   GET| /shuttle/countdown/|HTTP/1.0|   200|        3985|
|      129.94.144.152|01/Jul/1995:00:00...|   GET|                   /|HTTP/1.0|   200|        7074|
+--------------------+--------------------+------+--------------------+--------+------+------------+
only showing top 10 rows

(3461613, 7)

Finding Missing Values

Missing and null values are the bane of data analysis and machine learning. Let's see how well our data parsing and extraction logic worked. First, let's verify that there are no null rows in the original dataframe.

In [21]:
(base_df
    .filter(base_df['value']
                .isNull())
    .count())
Out[21]:
0

If our data parsing and extraction logic worked properly, we should not have any rows with null values. Let's put that to the test!

In [22]:
bad_rows_df = logs_df.filter(logs_df['host'].isNull()| 
                             logs_df['timestamp'].isNull() | 
                             logs_df['method'].isNull() |
                             logs_df['endpoint'].isNull() |
                             logs_df['status'].isNull() |
                             logs_df['content_size'].isNull()|
                             logs_df['protocol'].isNull())
bad_rows_df.count()
Out[22]:
33905

Ouch! Looks like we have over 33K missing values in our data! Can we handle this?

Remember, this is not a regular pandas DataFrame that you can directly query to see which columns contain nulls. Our big dataset resides on disk and may be spread across multiple nodes in a Spark cluster. So how do we find out which columns have nulls?

Finding Null Counts

We can typically use the following technique to find out which columns have null values.

(Note: This approach is adapted from an excellent answer on StackOverflow.)

In [23]:
logs_df.columns
Out[23]:
['host',
 'timestamp',
 'method',
 'endpoint',
 'protocol',
 'status',
 'content_size']
In [24]:
from pyspark.sql.functions import col
from pyspark.sql.functions import sum as spark_sum

def count_null(col_name):
    return spark_sum(col(col_name).isNull().cast('integer')).alias(col_name)

# Build up a list of column expressions, one per column.
exprs = [count_null(col_name) for col_name in logs_df.columns]

# Run the aggregation. The *exprs converts the list of expressions into
# variable function arguments.
logs_df.agg(*exprs).show()
+----+---------+------+--------+--------+------+------------+
|host|timestamp|method|endpoint|protocol|status|content_size|
+----+---------+------+--------+--------+------+------------+
|   0|        0|     0|       0|       0|     1|       33905|
+----+---------+------+--------+--------+------+------------+

Well, looks like we have one missing value in the status column and everything else is in the content_size column. Let's see if we can figure out what's wrong!

Handling nulls in HTTP status

Our original parsing regular expression for the status column was:

regexp_extract('value', r'\s(\d{3})\s', 1).cast('integer').alias('status')

Could there be extra digits that break our regular expression, or is the data point itself bad? Let's find out!

Note: In the expression below, ~ means "not".

In [25]:
null_status_df = base_df.filter(~base_df['value'].rlike(r'\s(\d{3})\s'))
null_status_df.count()
Out[25]:
1
In [26]:
null_status_df.show(truncate=False)
+--------+
|value   |
+--------+
|alyssa.p|
+--------+

In [27]:
bad_status_df = null_status_df.select(regexp_extract('value', host_pattern, 1).alias('host'),
                                      regexp_extract('value', ts_pattern, 1).alias('timestamp'),
                                      regexp_extract('value', method_uri_protocol_pattern, 1).alias('method'),
                                      regexp_extract('value', method_uri_protocol_pattern, 2).alias('endpoint'),
                                      regexp_extract('value', method_uri_protocol_pattern, 3).alias('protocol'),
                                      regexp_extract('value', status_pattern, 1).cast('integer').alias('status'),
                                      regexp_extract('value', content_size_pattern, 1).cast('integer').alias('content_size'))
bad_status_df.show(truncate=False)
+----+---------+------+--------+--------+------+------------+
|host|timestamp|method|endpoint|protocol|status|content_size|
+----+---------+------+--------+--------+------+------------+
|    |         |      |        |        |null  |null        |
+----+---------+------+--------+--------+------+------------+

Looks like this is an incomplete record with no useful information; the best option is to drop it, as follows.

In [28]:
logs_df.count()
Out[28]:
3461613
In [29]:
logs_df = logs_df[logs_df['status'].isNotNull()] 
logs_df.count()
Out[29]:
3461612
In [30]:
exprs = [count_null(col_name) for col_name in logs_df.columns]
logs_df.agg(*exprs).show()
+----+---------+------+--------+--------+------+------------+
|host|timestamp|method|endpoint|protocol|status|content_size|
+----+---------+------+--------+--------+------+------------+
|   0|        0|     0|       0|       0|     0|       33904|
+----+---------+------+--------+--------+------+------------+

Handling nulls in HTTP content size

Our original parsing regular expression for the content_size column was:

regexp_extract('value', r'\s(\d+)$', 1).cast('integer').alias('content_size')

Could there be missing data in the original dataset itself? Let's find out!

Let's find the records in our base DataFrame with potentially missing content sizes.

In [31]:
null_content_size_df = base_df.filter(~base_df['value'].rlike(r'\s\d+$'))
null_content_size_df.count()
Out[31]:
33905

Display the top ten records of the DataFrame with missing content sizes

In [32]:
null_content_size_df.take(10)
Out[32]:
[Row(value='dd15-062.compuserve.com - - [01/Jul/1995:00:01:12 -0400] "GET /news/sci.space.shuttle/archive/sci-space-shuttle-22-apr-1995-40.txt HTTP/1.0" 404 -'),
 Row(value='dynip42.efn.org - - [01/Jul/1995:00:02:14 -0400] "GET /software HTTP/1.0" 302 -'),
 Row(value='ix-or10-06.ix.netcom.com - - [01/Jul/1995:00:02:40 -0400] "GET /software/winvn HTTP/1.0" 302 -'),
 Row(value='ix-or10-06.ix.netcom.com - - [01/Jul/1995:00:03:24 -0400] "GET /software HTTP/1.0" 302 -'),
 Row(value='link097.txdirect.net - - [01/Jul/1995:00:05:06 -0400] "GET /shuttle HTTP/1.0" 302 -'),
 Row(value='ix-war-mi1-20.ix.netcom.com - - [01/Jul/1995:00:05:13 -0400] "GET /shuttle/missions/sts-78/news HTTP/1.0" 302 -'),
 Row(value='ix-war-mi1-20.ix.netcom.com - - [01/Jul/1995:00:05:58 -0400] "GET /shuttle/missions/sts-72/news HTTP/1.0" 302 -'),
 Row(value='netport-27.iu.net - - [01/Jul/1995:00:10:19 -0400] "GET /pub/winvn/readme.txt HTTP/1.0" 404 -'),
 Row(value='netport-27.iu.net - - [01/Jul/1995:00:10:28 -0400] "GET /pub/winvn/readme.txt HTTP/1.0" 404 -'),
 Row(value='dynip38.efn.org - - [01/Jul/1995:00:10:50 -0400] "GET /software HTTP/1.0" 302 -')]

It is quite evident that the bad raw data records correspond to error responses, where no content was sent back and the server emitted a "-" for the content_size field.

Since we don't want to discard those rows from our analysis, let's impute them by filling the missing values with 0.

Fix the rows with null content_size

The easiest solution is to replace the null values in logs_df with 0 like we discussed earlier. The Spark DataFrame API provides a set of functions and fields specifically designed for working with null values, among them:

  • fillna(), which fills null values with specified non-null values.
  • na, which returns a DataFrameNaFunctions object with many functions for operating on null columns.

There are several ways to invoke this function. The simplest is to replace nulls in every column with a single known value, but for safety it's better to pass a Python dictionary containing (column_name, value) mappings. That's what we'll do. An example from the documentation is shown below:

>>> df4.na.fill({'age': 50, 'name': 'unknown'}).show()
+---+------+-------+
|age|height|   name|
+---+------+-------+
| 10|    80|  Alice|
|  5|  null|    Bob|
| 50|  null|    Tom|
| 50|  null|unknown|
+---+------+-------+
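
An equivalent approach (a sketch, not part of the original notebook) is to call fillna() directly with a value and a subset of columns, which avoids spelling out a dictionary:

# Equivalent sketch: fill nulls with 0 only in the content_size column
logs_df = logs_df.fillna(0, subset=['content_size'])

Both forms produce the same result; we'll use the dictionary style below.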

Now we use this function and fill all the missing values in the content_size field with 0!

In [33]:
logs_df = logs_df.na.fill({'content_size': 0})

Assuming everything we have done so far worked, we should have no missing or null values in our dataset. Let's verify!

In [34]:
exprs = [count_null(col_name) for col_name in logs_df.columns]
logs_df.agg(*exprs).show()
+----+---------+------+--------+--------+------+------------+
|host|timestamp|method|endpoint|protocol|status|content_size|
+----+---------+------+--------+--------+------+------------+
|   0|        0|     0|       0|       0|     0|           0|
+----+---------+------+--------+--------+------+------------+

Look at that, no missing values!

Handling Temporal Fields (Timestamp)

Now that we have a clean, parsed DataFrame, we have to parse the timestamp field into an actual timestamp. The Common Log Format time is somewhat non-standard. A User-Defined Function (UDF) is the most straightforward way to parse it.

In [35]:
from pyspark.sql.functions import udf

month_map = {
  'Jan': 1, 'Feb': 2, 'Mar':3, 'Apr':4, 'May':5, 'Jun':6, 'Jul':7,
  'Aug':8,  'Sep': 9, 'Oct':10, 'Nov': 11, 'Dec': 12
}

def parse_clf_time(text):
    """ Convert Common Log time format into a Python datetime object
    Args:
        text (str): date and time in Apache time format [dd/mmm/yyyy:hh:mm:ss (+/-)zzzz]
    Returns:
        a string suitable for passing to CAST('timestamp')
    """
    # NOTE: We're ignoring the time zones here, might need to be handled depending on the problem you are solving
    return "{0:04d}-{1:02d}-{2:02d} {3:02d}:{4:02d}:{5:02d}".format(
      int(text[7:11]),
      month_map[text[3:6]],
      int(text[0:2]),
      int(text[12:14]),
      int(text[15:17]),
      int(text[18:20])
    )
In [36]:
sample_ts = [item['timestamp'] for item in logs_df.select('timestamp').take(5)]
sample_ts
Out[36]:
['01/Jul/1995:00:00:01 -0400',
 '01/Jul/1995:00:00:06 -0400',
 '01/Jul/1995:00:00:09 -0400',
 '01/Jul/1995:00:00:11 -0400',
 '01/Jul/1995:00:00:11 -0400']
In [37]:
[parse_clf_time(item) for item in sample_ts]
Out[37]:
['1995-07-01 00:00:01',
 '1995-07-01 00:00:06',
 '1995-07-01 00:00:09',
 '1995-07-01 00:00:11',
 '1995-07-01 00:00:11']
In [38]:
udf_parse_time = udf(parse_clf_time)

logs_df = logs_df.select('*', udf_parse_time(logs_df['timestamp']).cast('timestamp').alias('time')).drop('timestamp')
logs_df.show(10, truncate=True)
+--------------------+------+--------------------+--------+------+------------+-------------------+
|                host|method|            endpoint|protocol|status|content_size|               time|
+--------------------+------+--------------------+--------+------+------------+-------------------+
|        199.72.81.55|   GET|    /history/apollo/|HTTP/1.0|   200|        6245|1995-07-01 00:00:01|
|unicomp6.unicomp.net|   GET| /shuttle/countdown/|HTTP/1.0|   200|        3985|1995-07-01 00:00:06|
|      199.120.110.21|   GET|/shuttle/missions...|HTTP/1.0|   200|        4085|1995-07-01 00:00:09|
|  burger.letters.com|   GET|/shuttle/countdow...|HTTP/1.0|   304|           0|1995-07-01 00:00:11|
|      199.120.110.21|   GET|/shuttle/missions...|HTTP/1.0|   200|        4179|1995-07-01 00:00:11|
|  burger.letters.com|   GET|/images/NASA-logo...|HTTP/1.0|   304|           0|1995-07-01 00:00:12|
|  burger.letters.com|   GET|/shuttle/countdow...|HTTP/1.0|   200|           0|1995-07-01 00:00:12|
|     205.212.115.106|   GET|/shuttle/countdow...|HTTP/1.0|   200|        3985|1995-07-01 00:00:12|
|         d104.aa.net|   GET| /shuttle/countdown/|HTTP/1.0|   200|        3985|1995-07-01 00:00:13|
|      129.94.144.152|   GET|                   /|HTTP/1.0|   200|        7074|1995-07-01 00:00:13|
+--------------------+------+--------------------+--------+------+------------+-------------------+
only showing top 10 rows

In [39]:
logs_df.printSchema()
root
 |-- host: string (nullable = true)
 |-- method: string (nullable = true)
 |-- endpoint: string (nullable = true)
 |-- protocol: string (nullable = true)
 |-- status: integer (nullable = true)
 |-- content_size: integer (nullable = false)
 |-- time: timestamp (nullable = true)

In [40]:
logs_df.limit(5).toPandas()
Out[40]:
host method endpoint protocol status content_size time
0 199.72.81.55 GET /history/apollo/ HTTP/1.0 200 6245 1995-07-01 00:00:01
1 unicomp6.unicomp.net GET /shuttle/countdown/ HTTP/1.0 200 3985 1995-07-01 00:00:06
2 199.120.110.21 GET /shuttle/missions/sts-73/mission-sts-73.html HTTP/1.0 200 4085 1995-07-01 00:00:09
3 burger.letters.com GET /shuttle/countdown/liftoff.html HTTP/1.0 304 0 1995-07-01 00:00:11
4 199.120.110.21 GET /shuttle/missions/sts-73/sts-73-patch-small.gif HTTP/1.0 200 4179 1995-07-01 00:00:11
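
As an aside (this is not part of the original notebook, and format-string behaviour can differ between Spark versions), recent Spark versions can usually parse this timestamp format without a UDF by using the built-in to_timestamp function:

from pyspark.sql import functions as F

# Sketch only: 'dd/MMM/yyyy:HH:mm:ss Z' follows Java date-format conventions; verify on your Spark version.
# Unlike the UDF above, this keeps the time zone offset, so the resulting timestamps may be shifted
# into the Spark session time zone rather than matching the UDF output exactly.
alt_time_df = base_df.select(
    F.to_timestamp(regexp_extract('value', ts_pattern, 1), 'dd/MMM/yyyy:HH:mm:ss Z').alias('time'))
alt_time_df.show(3, truncate=False)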

Let's now cache logs_df since we will be using it extensively for our data analysis section in the next part!

In [41]:
logs_df.cache()
Out[41]:
DataFrame[host: string, method: string, endpoint: string, protocol: string, status: int, content_size: int, time: timestamp]

Part 4 - Data Analysis on our Web Logs

Now that we have a DataFrame containing the parsed and cleaned log data, we can perform some interesting exploratory data analysis (EDA).

Content Size Statistics

Let's compute some statistics about the sizes of the content returned by the web server. In particular, we'd like to know the average, minimum, and maximum content sizes.

We can compute the statistics by calling .describe() on the content_size column of logs_df. The .describe() function returns the count, mean, stddev, min, and max of a given column.

In [42]:
content_size_summary_df = logs_df.describe(['content_size'])
content_size_summary_df.toPandas()
Out[42]:
summary content_size
0 count 3461612
1 mean 18928.844398216785
2 stddev 73031.47260949228
3 min 0
4 max 6823936

Alternatively, we can calculate these statistics directly using the aggregate functions in the pyspark.sql.functions module; the documentation describes many more useful functions there.

After we apply the .agg() function, we call toPandas() to extract and convert the result into a pandas DataFrame, which displays more nicely in Jupyter notebooks.

In [43]:
from pyspark.sql import functions as F

(logs_df.agg(F.min(logs_df['content_size']).alias('min_content_size'),
             F.max(logs_df['content_size']).alias('max_content_size'),
             F.mean(logs_df['content_size']).alias('mean_content_size'),
             F.stddev(logs_df['content_size']).alias('std_content_size'),
             F.count(logs_df['content_size']).alias('count_content_size'))
        .toPandas())
Out[43]:
min_content_size max_content_size mean_content_size std_content_size count_content_size
0 0 6823936 18928.844398 73031.472609 3461612

HTTP Status Code Analysis

Next, let's look at the status code values that appear in the log: which values occur, and how many times each one appears.

We again start with logs_df, then group by the status column, apply the .count() aggregation function, and sort by the status column.

In [44]:
status_freq_df = (logs_df
                     .groupBy('status')
                     .count()
                     .sort('status')
                     .cache())
In [45]:
print('Total distinct HTTP Status Codes:', status_freq_df.count())
Total distinct HTTP Status Codes: 8
In [46]:
status_freq_pd_df = (status_freq_df
                         .toPandas()
                         .sort_values(by=['count'],
                                      ascending=False))
status_freq_pd_df
Out[46]:
status count
0 200 3100524
2 304 266773
1 302 73070
5 404 20899
4 403 225
6 500 65
7 501 41
3 400 15
In [47]:
!pip install -U seaborn
Requirement already up-to-date: seaborn in /usr/local/anaconda/lib/python3.6/site-packages (0.9.0)
In [48]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
%matplotlib inline

sns.catplot(x='status', y='count', data=status_freq_pd_df, 
            kind='bar', order=status_freq_pd_df['status'])
Out[48]:
<seaborn.axisgrid.FacetGrid at 0x7fb13c15ea90>
In [49]:
log_freq_df = status_freq_df.withColumn('log(count)', F.log(status_freq_df['count']))
log_freq_df.show()
+------+-------+------------------+
|status|  count|        log(count)|
+------+-------+------------------+
|   200|3100524|14.947081687429097|
|   302|  73070|11.199173164785263|
|   304| 266773|12.494153388502301|
|   400|     15|  2.70805020110221|
|   403|    225|  5.41610040220442|
|   404|  20899| 9.947456589918252|
|   500|     65| 4.174387269895637|
|   501|     41| 3.713572066704308|
+------+-------+------------------+

In [50]:
log_freq_pd_df = (log_freq_df
                    .toPandas()
                    .sort_values(by=['log(count)'],
                                 ascending=False))
sns.catplot(x='status', y='log(count)', data=log_freq_pd_df, 
            kind='bar', order=status_freq_pd_df['status'])
Out[50]:
<seaborn.axisgrid.FacetGrid at 0x7fb137dfb4e0>

Analyzing Frequent Hosts

Let's look at hosts that have accessed the server frequently. We will count the total accesses by each host, sort by those counts, and display only the ten most frequent hosts.

In [51]:
host_sum_df =(logs_df
               .groupBy('host')
               .count()
               .sort('count', ascending=False).limit(10))

host_sum_df.show(truncate=False)
+--------------------+-----+
|host                |count|
+--------------------+-----+
|piweba3y.prodigy.com|21988|
|piweba4y.prodigy.com|16437|
|piweba1y.prodigy.com|12825|
|edams.ksc.nasa.gov  |11964|
|163.206.89.4        |9697 |
|news.ti.com         |8161 |
|www-d1.proxy.aol.com|8047 |
|alyssa.prodigy.com  |8037 |
|                    |7660 |
|siltb10.orl.mmc.com |7573 |
+--------------------+-----+

In [52]:
host_sum_pd_df = host_sum_df.toPandas()
host_sum_pd_df.iloc[8]['host']
Out[52]:
''

Looks like an empty string is one of the top host names! This teaches us a valuable lesson: when wrangling data, check not just for nulls but also for empty strings.
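
Following the same idea as count_null() above, here is a hedged sketch for counting empty strings per column (count_empty is a hypothetical helper, not part of the original notebook):

# Hypothetical helper: counts rows where a string column equals the empty string
def count_empty(col_name):
    return spark_sum((col(col_name) == '').cast('integer')).alias(col_name)

string_cols = ['host', 'method', 'endpoint', 'protocol']
logs_df.agg(*[count_empty(c) for c in string_cols]).show()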

Display the Top 20 Most Frequent Endpoints

Now, let's visualize the number of hits to endpoints (URIs) in the log. To perform this task, we start with logs_df, group by the endpoint column, aggregate by count, and sort in descending order, as in the previous example.

In [53]:
paths_df = (logs_df
            .groupBy('endpoint')
            .count()
            .sort('count', ascending=False).limit(20))
In [54]:
paths_pd_df = paths_df.toPandas()
paths_pd_df
Out[54]:
endpoint count
0 /images/NASA-logosmall.gif 208714
1 /images/KSC-logosmall.gif 164970
2 /images/MOSAIC-logosmall.gif 127908
3 /images/USA-logosmall.gif 127074
4 /images/WORLD-logosmall.gif 125925
5 /images/ksclogo-medium.gif 121572
6 /ksc.html 83909
7 /images/launch-logo.gif 76006
8 /history/apollo/images/apollo-logo1.gif 68896
9 /shuttle/countdown/ 64736
10 / 63171
11 /images/ksclogosmall.gif 61393
12 /shuttle/missions/missions.html 47315
13 /images/launchmedium.gif 40687
14 /htbin/cdt_main.pl 39871
15 /shuttle/missions/sts-69/mission-sts-69.html 31574
16 /shuttle/countdown/liftoff.html 29865
17 /icons/menu.xbm 29190
18 /shuttle/missions/sts-69/sts-69-patch-small.gif 29118
19 /icons/blank.xbm 28852

Top Ten Error Endpoints

What are the top ten endpoints requested which did not have return code 200 (HTTP Status OK)?

We create a sorted list containing the endpoints and the number of times that they were accessed with a non-200 return code and show the top ten.

In [55]:
not200_df = (logs_df
               .filter(logs_df['status'] != 200))

error_endpoints_freq_df = (not200_df
                               .groupBy('endpoint')
                               .count()
                               .sort('count', ascending=False)
                               .limit(10)
                          )
In [56]:
error_endpoints_freq_df.show(truncate=False)
+---------------------------------------+-----+
|endpoint                               |count|
+---------------------------------------+-----+
|/images/NASA-logosmall.gif             |40082|
|/images/KSC-logosmall.gif              |23763|
|/images/MOSAIC-logosmall.gif           |15245|
|/images/USA-logosmall.gif              |15142|
|/images/WORLD-logosmall.gif            |14773|
|/images/ksclogo-medium.gif             |13559|
|/images/launch-logo.gif                |8806 |
|/history/apollo/images/apollo-logo1.gif|7489 |
|/                                      |6296 |
|/images/ksclogosmall.gif               |5669 |
+---------------------------------------+-----+

Total number of Unique Hosts

What was the total number of unique hosts that visited the NASA website in these two months? We can find out with a few transformations.

In [57]:
unique_host_count = (logs_df
                     .select('host')
                     .distinct()
                     .count())
unique_host_count
Out[57]:
137933

Number of Unique Daily Hosts

For an advanced example, let's look at a way to determine the number of unique hosts in the entire log on a day-by-day basis. This computation will give us counts of the number of unique daily hosts.

We'd like a DataFrame sorted by increasing day of the month which includes the day of the month and the associated number of unique hosts for that day.

Think about the steps you need to perform to count the number of distinct hosts that make requests each day. Note that the log actually spans two months, so grouping only by day of month merges the same calendar day from July and August; we accept that simplification here (a sketch that keeps the months separate appears after the daily-hosts plot below). You may want to use the dayofmonth function in the pyspark.sql.functions module (which we have already imported as F).

host_day_df

A DataFrame with two columns

column explanation
host the host name
day the day of the month

There will be one row in this DataFrame for each row in logs_df. Essentially, we are just transforming each row of logs_df. For example, for this row in logs_df:

unicomp6.unicomp.net - - [01/Aug/1995:00:35:41 -0400] "GET /shuttle/missions/sts-73/news HTTP/1.0" 302 -

your host_day_df should have:

unicomp6.unicomp.net 1
In [58]:
host_day_df = logs_df.select(logs_df.host, 
                             F.dayofmonth('time').alias('day'))
host_day_df.show(5, truncate=False)
+--------------------+---+
|host                |day|
+--------------------+---+
|199.72.81.55        |1  |
|unicomp6.unicomp.net|1  |
|199.120.110.21      |1  |
|burger.letters.com  |1  |
|199.120.110.21      |1  |
+--------------------+---+
only showing top 5 rows

host_day_distinct_df

This DataFrame has the same columns as host_day_df, but with duplicate (day, host) rows removed.

In [59]:
host_day_distinct_df = (host_day_df
                          .dropDuplicates())
host_day_distinct_df.show(5, truncate=False)
+-----------------------+---+
|host                   |day|
+-----------------------+---+
|129.94.144.152         |1  |
|slip1.yab.com          |1  |
|205.184.190.47         |1  |
|204.120.34.71          |1  |
|ppp3_130.bekkoame.or.jp|1  |
+-----------------------+---+
only showing top 5 rows

daily_hosts_df

A DataFrame with two columns:

column explanation
day the day of the month
count the number of unique requesting hosts for that day
In [60]:
def_mr = pd.get_option('max_rows')
pd.set_option('max_rows', 10)

daily_hosts_df = (host_day_distinct_df
                     .groupBy('day')
                     .count()
                     .sort("day"))

daily_hosts_df = daily_hosts_df.toPandas()
daily_hosts_df
Out[60]:
day count
0 1 7609
1 2 4858
2 3 10238
3 4 9411
4 5 9640
... ... ...
26 27 6846
27 28 6090
28 29 4825
29 30 5265
30 31 5913

31 rows × 2 columns

In [61]:
c = sns.catplot(x='day', y='count', 
                data=daily_hosts_df, 
                kind='point', height=5, 
                aspect=1.5)
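
Because the logs span July and August, grouping only by day of month merges the same calendar day from both months. Here is a sketch (not part of the original notebook) that keeps the months separate by grouping on (month, day):

# Sketch: distinct hosts per (month, day) instead of per day of month alone
daily_hosts_by_month_df = (logs_df
                           .select('host',
                                   F.month('time').alias('month'),
                                   F.dayofmonth('time').alias('day'))
                           .dropDuplicates()
                           .groupBy('month', 'day')
                           .count()
                           .sort('month', 'day'))
daily_hosts_by_month_df.show(5)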

Average Number of Daily Requests per Host

In the previous example, we determined the number of unique hosts in the entire log on a day-by-day basis. Let's now find the average number of requests made per host to the NASA website per day, based on our logs.

We'd like a DataFrame sorted by increasing day of the month which includes the day of the month and the average number of requests per host for that day.

In [62]:
daily_hosts_df = (host_day_distinct_df
                     .groupBy('day')
                     .count()
                     .select(col("day"), 
                                      col("count").alias("total_hosts")))

total_daily_reqests_df = (logs_df
                              .select(F.dayofmonth("time")
                                          .alias("day"))
                              .groupBy("day")
                              .count()
                              .select(col("day"), 
                                      col("count").alias("total_reqs")))

avg_daily_reqests_per_host_df = total_daily_reqests_df.join(daily_hosts_df, 'day')
avg_daily_reqests_per_host_df = (avg_daily_reqests_per_host_df
                                    .withColumn('avg_reqs', col('total_reqs') / col('total_hosts'))
                                    .sort("day"))
avg_daily_reqests_per_host_df = avg_daily_reqests_per_host_df.toPandas()
avg_daily_reqests_per_host_df
Out[62]:
day total_reqs total_hosts avg_reqs
0 1 98710 7609 12.972795
1 2 60265 4858 12.405311
2 3 130972 10238 12.792733
3 4 130009 9411 13.814579
4 5 126468 9640 13.119087
... ... ... ... ...
26 27 94503 6846 13.804119
27 28 82617 6090 13.566010
28 29 67988 4825 14.090777
29 30 80641 5265 15.316429
30 31 90125 5913 15.241840

31 rows × 4 columns

In [63]:
c = sns.catplot(x='day', y='avg_reqs', 
                data=avg_daily_reqests_per_host_df, 
                kind='point', height=5, aspect=1.5)

Counting 404 Response Codes

Create a DataFrame containing only log records with a 404 status code (Not Found).

We make sure to cache() the not_found_df dataframe as we will use it in the rest of the examples here.

How many 404 records are in the log?

In [64]:
not_found_df = logs_df.filter(logs_df["status"] == 404).cache()
print(('Total 404 responses: {}').format(not_found_df.count()))
Total 404 responses: 20899

Listing the Top Twenty 404 Response Code Endpoints

Using the DataFrame containing only log records with a 404 response code that we cached earlier, we will now print out a list of the top twenty endpoints that generate the most 404 errors.

Remember, top endpoints should be in sorted order

In [65]:
endpoints_404_count_df = (not_found_df
                          .groupBy("endpoint")
                          .count()
                          .sort("count", ascending=False)
                          .limit(20))

endpoints_404_count_df.show(truncate=False)
+-----------------------------------------------------------------+-----+
|endpoint                                                         |count|
+-----------------------------------------------------------------+-----+
|/pub/winvn/readme.txt                                            |2004 |
|/pub/winvn/release.txt                                           |1732 |
|/shuttle/missions/STS-69/mission-STS-69.html                     |683  |
|/shuttle/missions/sts-68/ksc-upclose.gif                         |428  |
|/history/apollo/a-001/a-001-patch-small.gif                      |384  |
|/history/apollo/sa-1/sa-1-patch-small.gif                        |383  |
|/://spacelink.msfc.nasa.gov                                      |381  |
|/images/crawlerway-logo.gif                                      |374  |
|/elv/DELTA/uncons.htm                                            |372  |
|/history/apollo/pad-abort-test-1/pad-abort-test-1-patch-small.gif|359  |
|/images/nasa-logo.gif                                            |319  |
|/shuttle/resources/orbiters/atlantis.gif                         |314  |
|/history/apollo/apollo-13.html                                   |304  |
|/shuttle/resources/orbiters/discovery.gif                        |263  |
|/shuttle/missions/sts-71/images/KSC-95EC-0916.txt                |190  |
|/shuttle/resources/orbiters/challenger.gif                       |170  |
|/shuttle/missions/technology/sts-newsref/stsref-toc.html         |158  |
|/history/apollo/images/little-joe.jpg                            |150  |
|/images/lf-logo.gif                                              |143  |
|/history/apollo/publications/sp-350/sp-350.txt~                  |140  |
+-----------------------------------------------------------------+-----+

Listing the Top Twenty 404 Response Code Hosts

Using the DataFrame containing only log records with a 404 response code that we cached earlier, we will now print out a list of the top twenty hosts that generate the most 404 errors.

Remember, top hosts should be in sorted order

In [66]:
hosts_404_count_df = (not_found_df
                          .groupBy("host")
                          .count()
                          .sort("count", ascending=False)
                          .limit(20))

hosts_404_count_df.show(truncate=False)
+---------------------------+-----+
|host                       |count|
+---------------------------+-----+
|hoohoo.ncsa.uiuc.edu       |251  |
|piweba3y.prodigy.com       |157  |
|jbiagioni.npt.nuwc.navy.mil|132  |
|piweba1y.prodigy.com       |114  |
|                           |112  |
|www-d4.proxy.aol.com       |91   |
|piweba4y.prodigy.com       |86   |
|scooter.pa-x.dec.com       |69   |
|www-d1.proxy.aol.com       |64   |
|phaelon.ksc.nasa.gov       |64   |
|dialip-217.den.mmc.com     |62   |
|www-b4.proxy.aol.com       |62   |
|www-b3.proxy.aol.com       |61   |
|www-a2.proxy.aol.com       |60   |
|www-d2.proxy.aol.com       |59   |
|piweba2y.prodigy.com       |59   |
|alyssa.prodigy.com         |56   |
|monarch.eng.buffalo.edu    |56   |
|www-b2.proxy.aol.com       |53   |
|www-c4.proxy.aol.com       |53   |
+---------------------------+-----+

Visualizing 404 Errors per Day

Let's explore our 404 records temporally (by time) now. Similar to the example showing the number of unique daily hosts, we will break down the 404 requests by day and get the daily counts sorted by day in errors_by_date_sorted_df.

In [67]:
errors_by_date_sorted_df = (not_found_df
                                .groupBy(F.dayofmonth('time').alias('day'))
                                .count()
                                .sort("day"))

errors_by_date_sorted_pd_df = errors_by_date_sorted_df.toPandas()
errors_by_date_sorted_pd_df
Out[67]:
day count
0 1 559
1 2 291
2 3 778
3 4 705
4 5 733
... ... ...
26 27 706
27 28 504
28 29 420
29 30 571
30 31 526

31 rows × 2 columns

In [68]:
c = sns.catplot(x='day', y='count', 
                data=errors_by_date_sorted_pd_df, 
                kind='point', height=5, aspect=1.5)

Top Three Days for 404 Errors

What are the top three days of the month with the most 404 errors? We can leverage our previously created errors_by_date_sorted_df for this.

In [69]:
(errors_by_date_sorted_df
    .sort("count", ascending=False)
    .show(3))
+---+-----+
|day|count|
+---+-----+
|  7| 1107|
|  6| 1013|
| 25|  876|
+---+-----+
only showing top 3 rows

Visualizing Hourly 404 Errors

Using the not_found_df DataFrame we cached earlier, we will now group and sort by hour of the day in increasing order to create a DataFrame containing the total number of 404 responses for each hour of the day (midnight starts at hour 0).

In [70]:
hourly_avg_errors_sorted_df = (not_found_df
                                   .groupBy(F.hour('time')
                                             .alias('hour'))
                                   .count()
                                   .sort('hour'))
hourly_avg_errors_sorted_pd_df = hourly_avg_errors_sorted_df.toPandas()
In [71]:
c = sns.catplot(x='hour', y='count', 
                data=hourly_avg_errors_sorted_pd_df, 
                kind='bar', height=5, aspect=1.5)

Reset the max rows displayed in pandas

In [72]:
pd.set_option('max_rows', def_mr)