Scalable Web Server Log Analytics with Apache Spark¶

Apache Spark is an excellent and ideal framework for wrangling, analyzing and modeling on structured and unstructured data - at scale! In this tutorial, we will be focusing on one of the most popular case studies in the industry - log analytics.

Typically, server logs are a very common data source in enterprises and often contain a gold mine of actionable insights and information. Log data comes from many sources in an enterprise, such as the web, client and compute servers, applications, user-generated content, flat files. They can be used for monitoring servers, improving business and customer intelligence, building recommendation systems, fraud detection, and much more.

Spark allows you to dump and store your logs in files on disk cheaply, while still providing rich APIs to perform data analysis at scale. This hands-on will show you how to use Apache Spark on real-world production logs from NASA and learn data wrangling and basic yet powerful techniques in exploratory data analysis.

Part 1 - Setting up Dependencies¶

In [1]:

spark

Out[1]:

SparkSession - hive

SparkContext

Spark UI

Version: v2.4.0
Master: local[*]
AppName: PySparkShell

In [2]:

sqlContext

Out[2]:

<pyspark.sql.context.SQLContext at 0x7fb1577b6400>

In [3]:

from pyspark.context import SparkContext
from pyspark.sql.context import SQLContext
from pyspark.sql.session import SparkSession
    
sc = SparkContext()
sqlContext = SQLContext(sc)
spark = SparkSession(sc)

In [4]:

import re
import pandas as pd

Basic Regular Expressions¶

In [5]:

m = re.finditer(r'.*?(spark).*?', "I'm searching for a spark in PySpark", re.I)
for match in m:
    print(match, match.start(), match.end())

<_sre.SRE_Match object; span=(0, 25), match="I'm searching for a spark"> 0 25
<_sre.SRE_Match object; span=(25, 36), match=' in PySpark'> 25 36

In this case study, we will analyze log datasets from NASA Kennedy Space Center web server in Florida. The full data set is freely available for download here.

These two datasets contain two months' worth of all HTTP requests to the NASA Kennedy Space Center WWW server in Florida. You can head over to the website and download the following files as needed.

Jul 01 to Jul 31, ASCII format, 20.7 MB gzip compressed, 205.2 MB uncompressed: ftp://ita.ee.lbl.gov/traces/NASA_access_log_Jul95.gz
Aug 04 to Aug 31, ASCII format, 21.8 MB gzip compressed, 167.8 MB uncompressed: ftp://ita.ee.lbl.gov/traces/NASA_access_log_Aug95.gz

Make sure both the files are in the same directory as this notebook.

Part 2 - Loading and Viewing the NASA Log Dataset¶

Given that our data is stored in the following mentioned path, let's load it into a DataFrame. We'll do this in steps. First, we'll use sqlContext.read.text() or spark.read.text() to read the text file. This will produce a DataFrame with a single string column called value.

In [6]:

import glob

raw_data_files = glob.glob('*.gz')
raw_data_files

Out[6]:

['NASA_access_log_Jul95.gz', 'NASA_access_log_Aug95.gz']

Taking a look at the metadata of our dataframe¶

In [7]:

base_df = spark.read.text(raw_data_files)
base_df.printSchema()

root
 |-- value: string (nullable = true)

In [8]:

type(base_df)

Out[8]:

pyspark.sql.dataframe.DataFrame

You can also convert a dataframe to an RDD if needed

In [9]:

base_df_rdd = base_df.rdd
type(base_df_rdd)

Out[9]:

pyspark.rdd.RDD

Viewing sample data in our dataframe¶

Looks like it needs to be wrangled and parsed!

In [10]:

base_df.show(10, truncate=False)

+-----------------------------------------------------------------------------------------------------------------------+
|value                                                                                                                  |
+-----------------------------------------------------------------------------------------------------------------------+
|199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245                                 |
|unicomp6.unicomp.net - - [01/Jul/1995:00:00:06 -0400] "GET /shuttle/countdown/ HTTP/1.0" 200 3985                      |
|199.120.110.21 - - [01/Jul/1995:00:00:09 -0400] "GET /shuttle/missions/sts-73/mission-sts-73.html HTTP/1.0" 200 4085   |
|burger.letters.com - - [01/Jul/1995:00:00:11 -0400] "GET /shuttle/countdown/liftoff.html HTTP/1.0" 304 0               |
|199.120.110.21 - - [01/Jul/1995:00:00:11 -0400] "GET /shuttle/missions/sts-73/sts-73-patch-small.gif HTTP/1.0" 200 4179|
|burger.letters.com - - [01/Jul/1995:00:00:12 -0400] "GET /images/NASA-logosmall.gif HTTP/1.0" 304 0                    |
|burger.letters.com - - [01/Jul/1995:00:00:12 -0400] "GET /shuttle/countdown/video/livevideo.gif HTTP/1.0" 200 0        |
|205.212.115.106 - - [01/Jul/1995:00:00:12 -0400] "GET /shuttle/countdown/countdown.html HTTP/1.0" 200 3985             |
|d104.aa.net - - [01/Jul/1995:00:00:13 -0400] "GET /shuttle/countdown/ HTTP/1.0" 200 3985                               |
|129.94.144.152 - - [01/Jul/1995:00:00:13 -0400] "GET / HTTP/1.0" 200 7074                                              |
+-----------------------------------------------------------------------------------------------------------------------+
only showing top 10 rows

Getting data from an RDD is slightly different. You can see how the data representation is different in the following RDD

In [11]:

base_df_rdd.take(10)

Out[11]:

[Row(value='199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245'),
 Row(value='unicomp6.unicomp.net - - [01/Jul/1995:00:00:06 -0400] "GET /shuttle/countdown/ HTTP/1.0" 200 3985'),
 Row(value='199.120.110.21 - - [01/Jul/1995:00:00:09 -0400] "GET /shuttle/missions/sts-73/mission-sts-73.html HTTP/1.0" 200 4085'),
 Row(value='burger.letters.com - - [01/Jul/1995:00:00:11 -0400] "GET /shuttle/countdown/liftoff.html HTTP/1.0" 304 0'),
 Row(value='199.120.110.21 - - [01/Jul/1995:00:00:11 -0400] "GET /shuttle/missions/sts-73/sts-73-patch-small.gif HTTP/1.0" 200 4179'),
 Row(value='burger.letters.com - - [01/Jul/1995:00:00:12 -0400] "GET /images/NASA-logosmall.gif HTTP/1.0" 304 0'),
 Row(value='burger.letters.com - - [01/Jul/1995:00:00:12 -0400] "GET /shuttle/countdown/video/livevideo.gif HTTP/1.0" 200 0'),
 Row(value='205.212.115.106 - - [01/Jul/1995:00:00:12 -0400] "GET /shuttle/countdown/countdown.html HTTP/1.0" 200 3985'),
 Row(value='d104.aa.net - - [01/Jul/1995:00:00:13 -0400] "GET /shuttle/countdown/ HTTP/1.0" 200 3985'),
 Row(value='129.94.144.152 - - [01/Jul/1995:00:00:13 -0400] "GET / HTTP/1.0" 200 7074')]

Part 3 - Data Wrangling¶

In this section, we will try and clean and parse our log dataset to really extract structured attributes with meaningful information from each log message.

Data understanding¶

If you're familiar with web server logs, you'll recognize that the above displayed data is in Common Log Format.

The fields are: remotehost rfc931 authuser [date] "request" status bytes

field	meaning
remotehost	Remote hostname (or IP number if DNS hostname is not available or if DNSLookup is off).
rfc931	The remote logname of the user if at all it is present.
authuser	The username of the remote user after authentication by the HTTP server.
[date]	Date and time of the request.
"request"	The request, exactly as it came from the browser or client.
status	The HTTP status code the server sent back to the client.
bytes	The number of bytes (`Content-Length`) transferred to the client.

We will need to use some specific techniques to parse, match and extract these attributes from the log data

Data Parsing and Extraction with Regular Expressions¶

Next, we have to parse it into individual columns. We'll use the special built-in regexp_extract() function to do the parsing. This function matches a column against a regular expression with one or more capture groups and allows you to extract one of the matched groups. We'll use one regular expression for each field we wish to extract.

You must have heard or used a fair bit of regular expressions by now. If you find regular expressions confusing (and they certainly can be), and you want to learn more about them, we recommend checking out the RegexOne web site. You might also find Regular Expressions Cookbook, by Goyvaerts and Levithan, to be useful as a reference.

Let's take a look at our dataset dimensions¶

In [12]:

print((base_df.count(), len(base_df.columns)))

(3461613, 1)

Let's extract and take a look at some sample log messages

In [13]:

sample_logs = [item['value'] for item in base_df.take(15)]
sample_logs

Out[13]:

['199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245',
 'unicomp6.unicomp.net - - [01/Jul/1995:00:00:06 -0400] "GET /shuttle/countdown/ HTTP/1.0" 200 3985',
 '199.120.110.21 - - [01/Jul/1995:00:00:09 -0400] "GET /shuttle/missions/sts-73/mission-sts-73.html HTTP/1.0" 200 4085',
 'burger.letters.com - - [01/Jul/1995:00:00:11 -0400] "GET /shuttle/countdown/liftoff.html HTTP/1.0" 304 0',
 '199.120.110.21 - - [01/Jul/1995:00:00:11 -0400] "GET /shuttle/missions/sts-73/sts-73-patch-small.gif HTTP/1.0" 200 4179',
 'burger.letters.com - - [01/Jul/1995:00:00:12 -0400] "GET /images/NASA-logosmall.gif HTTP/1.0" 304 0',
 'burger.letters.com - - [01/Jul/1995:00:00:12 -0400] "GET /shuttle/countdown/video/livevideo.gif HTTP/1.0" 200 0',
 '205.212.115.106 - - [01/Jul/1995:00:00:12 -0400] "GET /shuttle/countdown/countdown.html HTTP/1.0" 200 3985',
 'd104.aa.net - - [01/Jul/1995:00:00:13 -0400] "GET /shuttle/countdown/ HTTP/1.0" 200 3985',
 '129.94.144.152 - - [01/Jul/1995:00:00:13 -0400] "GET / HTTP/1.0" 200 7074',
 'unicomp6.unicomp.net - - [01/Jul/1995:00:00:14 -0400] "GET /shuttle/countdown/count.gif HTTP/1.0" 200 40310',
 'unicomp6.unicomp.net - - [01/Jul/1995:00:00:14 -0400] "GET /images/NASA-logosmall.gif HTTP/1.0" 200 786',
 'unicomp6.unicomp.net - - [01/Jul/1995:00:00:14 -0400] "GET /images/KSC-logosmall.gif HTTP/1.0" 200 1204',
 'd104.aa.net - - [01/Jul/1995:00:00:15 -0400] "GET /shuttle/countdown/count.gif HTTP/1.0" 200 40310',
 'd104.aa.net - - [01/Jul/1995:00:00:15 -0400] "GET /images/NASA-logosmall.gif HTTP/1.0" 200 786']

Extracting host names¶

Let's try and write some regular expressions to extract the host name from the logs

In [14]:

host_pattern = r'(^\S+\.[\S+\.]+\S+)\s'
hosts = [re.search(host_pattern, item).group(1)
           if re.search(host_pattern, item)
           else 'no match'
           for item in sample_logs]
hosts

Out[14]:

['199.72.81.55',
 'unicomp6.unicomp.net',
 '199.120.110.21',
 'burger.letters.com',
 '199.120.110.21',
 'burger.letters.com',
 'burger.letters.com',
 '205.212.115.106',
 'd104.aa.net',
 '129.94.144.152',
 'unicomp6.unicomp.net',
 'unicomp6.unicomp.net',
 'unicomp6.unicomp.net',
 'd104.aa.net',
 'd104.aa.net']

Extracting timestamps¶

Let's now try and use regular expressions to extract the timestamp fields from the logs

In [15]:

ts_pattern = r'\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} -\d{4})]'
timestamps = [re.search(ts_pattern, item).group(1) for item in sample_logs]
timestamps

Out[15]:

['01/Jul/1995:00:00:01 -0400',
 '01/Jul/1995:00:00:06 -0400',
 '01/Jul/1995:00:00:09 -0400',
 '01/Jul/1995:00:00:11 -0400',
 '01/Jul/1995:00:00:11 -0400',
 '01/Jul/1995:00:00:12 -0400',
 '01/Jul/1995:00:00:12 -0400',
 '01/Jul/1995:00:00:12 -0400',
 '01/Jul/1995:00:00:13 -0400',
 '01/Jul/1995:00:00:13 -0400',
 '01/Jul/1995:00:00:14 -0400',
 '01/Jul/1995:00:00:14 -0400',
 '01/Jul/1995:00:00:14 -0400',
 '01/Jul/1995:00:00:15 -0400',
 '01/Jul/1995:00:00:15 -0400']

Extracting HTTP Request Method, URIs and Protocol¶

Let's now try and use regular expressions to extract the HTTP request methods, URIs and Protocol patterns fields from the logs

In [16]:

method_uri_protocol_pattern = r'\"(\S+)\s(\S+)\s*(\S*)\"'
method_uri_protocol = [re.search(method_uri_protocol_pattern, item).groups()
               if re.search(method_uri_protocol_pattern, item)
               else 'no match'
              for item in sample_logs]
method_uri_protocol

Out[16]:

[('GET', '/history/apollo/', 'HTTP/1.0'),
 ('GET', '/shuttle/countdown/', 'HTTP/1.0'),
 ('GET', '/shuttle/missions/sts-73/mission-sts-73.html', 'HTTP/1.0'),
 ('GET', '/shuttle/countdown/liftoff.html', 'HTTP/1.0'),
 ('GET', '/shuttle/missions/sts-73/sts-73-patch-small.gif', 'HTTP/1.0'),
 ('GET', '/images/NASA-logosmall.gif', 'HTTP/1.0'),
 ('GET', '/shuttle/countdown/video/livevideo.gif', 'HTTP/1.0'),
 ('GET', '/shuttle/countdown/countdown.html', 'HTTP/1.0'),
 ('GET', '/shuttle/countdown/', 'HTTP/1.0'),
 ('GET', '/', 'HTTP/1.0'),
 ('GET', '/shuttle/countdown/count.gif', 'HTTP/1.0'),
 ('GET', '/images/NASA-logosmall.gif', 'HTTP/1.0'),
 ('GET', '/images/KSC-logosmall.gif', 'HTTP/1.0'),
 ('GET', '/shuttle/countdown/count.gif', 'HTTP/1.0'),
 ('GET', '/images/NASA-logosmall.gif', 'HTTP/1.0')]

Extracting HTTP Status Codes¶

Let's now try and use regular expressions to extract the HTTP status codes from the logs

In [18]:

status_pattern = r'\s(\d{3})\s'
status = [re.search(status_pattern, item).group(1) for item in sample_logs]
print(status)

['200', '200', '200', '304', '200', '304', '200', '200', '200', '200', '200', '200', '200', '200', '200']

Extracting HTTP Response Content Size¶

Let's now try and use regular expressions to extract the HTTP response content size from the logs

In [19]:

content_size_pattern = r'\s(\d+)$'
content_size = [re.search(content_size_pattern, item).group(1) for item in sample_logs]
print(content_size)

['6245', '3985', '4085', '0', '4179', '0', '0', '3985', '3985', '7074', '40310', '786', '1204', '40310', '786']

Putting it all together¶

Let's now try and leverage all the regular expression patterns we previously built and use the regexp_extract(...) method to build our dataframe with all the log attributes neatly extracted in their own separate columns.

In [20]:

from pyspark.sql.functions import regexp_extract

logs_df = base_df.select(regexp_extract('value', host_pattern, 1).alias('host'),
                         regexp_extract('value', ts_pattern, 1).alias('timestamp'),
                         regexp_extract('value', method_uri_protocol_pattern, 1).alias('method'),
                         regexp_extract('value', method_uri_protocol_pattern, 2).alias('endpoint'),
                         regexp_extract('value', method_uri_protocol_pattern, 3).alias('protocol'),
                         regexp_extract('value', status_pattern, 1).cast('integer').alias('status'),
                         regexp_extract('value', content_size_pattern, 1).cast('integer').alias('content_size'))
logs_df.show(10, truncate=True)
print((logs_df.count(), len(logs_df.columns)))

+--------------------+--------------------+------+--------------------+--------+------+------------+
|                host|           timestamp|method|            endpoint|protocol|status|content_size|
+--------------------+--------------------+------+--------------------+--------+------+------------+
|        199.72.81.55|01/Jul/1995:00:00...|   GET|    /history/apollo/|HTTP/1.0|   200|        6245|
|unicomp6.unicomp.net|01/Jul/1995:00:00...|   GET| /shuttle/countdown/|HTTP/1.0|   200|        3985|
|      199.120.110.21|01/Jul/1995:00:00...|   GET|/shuttle/missions...|HTTP/1.0|   200|        4085|
|  burger.letters.com|01/Jul/1995:00:00...|   GET|/shuttle/countdow...|HTTP/1.0|   304|           0|
|      199.120.110.21|01/Jul/1995:00:00...|   GET|/shuttle/missions...|HTTP/1.0|   200|        4179|
|  burger.letters.com|01/Jul/1995:00:00...|   GET|/images/NASA-logo...|HTTP/1.0|   304|           0|
|  burger.letters.com|01/Jul/1995:00:00...|   GET|/shuttle/countdow...|HTTP/1.0|   200|           0|
|     205.212.115.106|01/Jul/1995:00:00...|   GET|/shuttle/countdow...|HTTP/1.0|   200|        3985|
|         d104.aa.net|01/Jul/1995:00:00...|   GET| /shuttle/countdown/|HTTP/1.0|   200|        3985|
|      129.94.144.152|01/Jul/1995:00:00...|   GET|                   /|HTTP/1.0|   200|        7074|
+--------------------+--------------------+------+--------------------+--------+------+------------+
only showing top 10 rows

(3461613, 7)

Finding Missing Values¶

Missing and null values are the bane of data analysis and machine learning. Let's see how well our data parsing and extraction logic worked. First, let's verify that there are no null rows in the original dataframe.

In [21]:

(base_df
    .filter(base_df['value']
                .isNull())
    .count())

Out[21]:

If our data parsing and extraction worked properly, we should not have any rows with potential null values. Let's try and put that to test!

In [22]:

bad_rows_df = logs_df.filter(logs_df['host'].isNull()| 
                             logs_df['timestamp'].isNull() | 
                             logs_df['method'].isNull() |
                             logs_df['endpoint'].isNull() |
                             logs_df['status'].isNull() |
                             logs_df['content_size'].isNull()|
                             logs_df['protocol'].isNull())
bad_rows_df.count()

Out[22]:

Ouch! Looks like we have over 33K missing values in our data! Can we handle this?

Do remember, this is not a regular pandas dataframe which you can directly query and get which columns have null. Our so-called big dataset is residing on disk which can potentially be present in multiple nodes in a spark cluster. So how do we find out which columns have potential nulls?

Finding Null Counts¶

We can typically use the following technique to find out which columns have null values.

(Note: This approach is adapted from an excellent answer on StackOverflow.)

In [23]:

logs_df.columns

Out[23]:

['host',
 'timestamp',
 'method',
 'endpoint',
 'protocol',
 'status',
 'content_size']

In [24]:

from pyspark.sql.functions import col
from pyspark.sql.functions import sum as spark_sum

def count_null(col_name):
    return spark_sum(col(col_name).isNull().cast('integer')).alias(col_name)

# Build up a list of column expressions, one per column.
exprs = [count_null(col_name) for col_name in logs_df.columns]

# Run the aggregation. The *exprs converts the list of expressions into
# variable function arguments.
logs_df.agg(*exprs).show()

+----+---------+------+--------+--------+------+------------+
|host|timestamp|method|endpoint|protocol|status|content_size|
+----+---------+------+--------+--------+------+------------+
|   0|        0|     0|       0|       0|     1|       33905|
+----+---------+------+--------+--------+------+------------+

Well, looks like we have one missing value in the status column and everything else is in the content_size column. Let's see if we can figure out what's wrong!

Handling nulls in HTTP status¶

Our original parsing regular expression for the status column was:

__``` regexp_extract('value', r'\s(\d{3})\s', 1).cast('integer').alias('status')

__

Could it be that there are more digits making our regular expression wrong? or is the data point itself bad? Let's try and find out!

**Note**: In the expression below, `~` means "not".

In [25]:

null_status_df = base_df.filter(~base_df['value'].rlike(r'\s(\d{3})\s'))
null_status_df.count()

Out[25]:

In [26]:

null_status_df.show(truncate=False)

+--------+
|value   |
+--------+
|alyssa.p|
+--------+

In [27]:

bad_status_df = null_status_df.select(regexp_extract('value', host_pattern, 1).alias('host'),
                                      regexp_extract('value', ts_pattern, 1).alias('timestamp'),
                                      regexp_extract('value', method_uri_protocol_pattern, 1).alias('method'),
                                      regexp_extract('value', method_uri_protocol_pattern, 2).alias('endpoint'),
                                      regexp_extract('value', method_uri_protocol_pattern, 3).alias('protocol'),
                                      regexp_extract('value', status_pattern, 1).cast('integer').alias('status'),
                                      regexp_extract('value', content_size_pattern, 1).cast('integer').alias('content_size'))
bad_status_df.show(truncate=False)

+----+---------+------+--------+--------+------+------------+
|host|timestamp|method|endpoint|protocol|status|content_size|
+----+---------+------+--------+--------+------+------------+
|    |         |      |        |        |null  |null        |
+----+---------+------+--------+--------+------+------------+

Looks like the record itself is an incomplete record with no useful information, the best option would be to drop this record as follows!

In [28]:

logs_df.count()

Out[28]:

In [29]:

logs_df = logs_df[logs_df['status'].isNotNull()] 
logs_df.count()

Out[29]:

In [30]:

exprs = [count_null(col_name) for col_name in logs_df.columns]
logs_df.agg(*exprs).show()

+----+---------+------+--------+--------+------+------------+
|host|timestamp|method|endpoint|protocol|status|content_size|
+----+---------+------+--------+--------+------+------------+
|   0|        0|     0|       0|       0|     0|       33904|
+----+---------+------+--------+--------+------+------------+

Handling nulls in HTTP content size¶

Based on our previous regular expression, our original parsing regular expression for the content_size column was:

__``` regexp_extract('value', r'\s(\d+)$', 1).cast('integer').alias('content_size')

__

Could there be missing data in our original dataset itself? Let's try and find out!

Find out the records in our base data frame with potential missing content sizes¶

In [31]:

null_content_size_df = base_df.filter(~base_df['value'].rlike(r'\s\d+$'))
null_content_size_df.count()

Out[31]:

Display the top ten records of your data frame having missing content sizes¶

In [32]:

null_content_size_df.take(10)

Out[32]:

[Row(value='dd15-062.compuserve.com - - [01/Jul/1995:00:01:12 -0400] "GET /news/sci.space.shuttle/archive/sci-space-shuttle-22-apr-1995-40.txt HTTP/1.0" 404 -'),
 Row(value='dynip42.efn.org - - [01/Jul/1995:00:02:14 -0400] "GET /software HTTP/1.0" 302 -'),
 Row(value='ix-or10-06.ix.netcom.com - - [01/Jul/1995:00:02:40 -0400] "GET /software/winvn HTTP/1.0" 302 -'),
 Row(value='ix-or10-06.ix.netcom.com - - [01/Jul/1995:00:03:24 -0400] "GET /software HTTP/1.0" 302 -'),
 Row(value='link097.txdirect.net - - [01/Jul/1995:00:05:06 -0400] "GET /shuttle HTTP/1.0" 302 -'),
 Row(value='ix-war-mi1-20.ix.netcom.com - - [01/Jul/1995:00:05:13 -0400] "GET /shuttle/missions/sts-78/news HTTP/1.0" 302 -'),
 Row(value='ix-war-mi1-20.ix.netcom.com - - [01/Jul/1995:00:05:58 -0400] "GET /shuttle/missions/sts-72/news HTTP/1.0" 302 -'),
 Row(value='netport-27.iu.net - - [01/Jul/1995:00:10:19 -0400] "GET /pub/winvn/readme.txt HTTP/1.0" 404 -'),
 Row(value='netport-27.iu.net - - [01/Jul/1995:00:10:28 -0400] "GET /pub/winvn/readme.txt HTTP/1.0" 404 -'),
 Row(value='dynip38.efn.org - - [01/Jul/1995:00:10:50 -0400] "GET /software HTTP/1.0" 302 -')]

It is quite evident that the bad raw data records correspond to error responses, where no content was sent back and the server emitted a "-" for the content_size field.

Since we don't want to discard those rows from our analysis, let's impute or fill them to 0.

Fix the rows with null content_size¶

The easiest solution is to replace the null values in logs_df with 0 like we discussed earlier. The Spark DataFrame API provides a set of functions and fields specifically designed for working with null values, among them:

fillna(), which fills null values with specified non-null values.
na, which returns a DataFrameNaFunctions object with many functions for operating on null columns.

There are several ways to invoke this function. The easiest is just to replace all null columns with known values. But, for safety, it's better to pass a Python dictionary containing (column_name, value) mappings. That's what we'll do. A sample example from the documentation is depicted below

>>> df4.na.fill({'age': 50, 'name': 'unknown'}).show()
+---+------+-------+
|age|height|   name|
+---+------+-------+
| 10|    80|  Alice|
|  5|  null|    Bob|
| 50|  null|    Tom|
| 50|  null|unknown|
+---+------+-------+

Now we use this function and fill all the missing values in the content_size field with 0!

In [33]:

logs_df = logs_df.na.fill({'content_size': 0})

Now assuming everything we have done so far worked, we should have no missing values \ nulls in our dataset. Let's verify this!

In [34]:

exprs = [count_null(col_name) for col_name in logs_df.columns]
logs_df.agg(*exprs).show()

+----+---------+------+--------+--------+------+------------+
|host|timestamp|method|endpoint|protocol|status|content_size|
+----+---------+------+--------+--------+------+------------+
|   0|        0|     0|       0|       0|     0|           0|
+----+---------+------+--------+--------+------+------------+

Look at that, no missing values!

Handling Temporal Fields (Timestamp)¶

Now that we have a clean, parsed DataFrame, we have to parse the timestamp field into an actual timestamp. The Common Log Format time is somewhat non-standard. A User-Defined Function (UDF) is the most straightforward way to parse it.

In [35]:

from pyspark.sql.functions import udf

month_map = {
  'Jan': 1, 'Feb': 2, 'Mar':3, 'Apr':4, 'May':5, 'Jun':6, 'Jul':7,
  'Aug':8,  'Sep': 9, 'Oct':10, 'Nov': 11, 'Dec': 12
}

def parse_clf_time(text):
    """ Convert Common Log time format into a Python datetime object
    Args:
        text (str): date and time in Apache time format [dd/mmm/yyyy:hh:mm:ss (+/-)zzzz]
    Returns:
        a string suitable for passing to CAST('timestamp')
    """
    # NOTE: We're ignoring the time zones here, might need to be handled depending on the problem you are solving
    return "{0:04d}-{1:02d}-{2:02d} {3:02d}:{4:02d}:{5:02d}".format(
      int(text[7:11]),
      month_map[text[3:6]],
      int(text[0:2]),
      int(text[12:14]),
      int(text[15:17]),
      int(text[18:20])
    )

In [36]:

sample_ts = [item['timestamp'] for item in logs_df.select('timestamp').take(5)]
sample_ts

Out[36]:

['01/Jul/1995:00:00:01 -0400',
 '01/Jul/1995:00:00:06 -0400',
 '01/Jul/1995:00:00:09 -0400',
 '01/Jul/1995:00:00:11 -0400',
 '01/Jul/1995:00:00:11 -0400']

In [37]:

[parse_clf_time(item) for item in sample_ts]

Out[37]:

['1995-07-01 00:00:01',
 '1995-07-01 00:00:06',
 '1995-07-01 00:00:09',
 '1995-07-01 00:00:11',
 '1995-07-01 00:00:11']

In [38]:

udf_parse_time = udf(parse_clf_time)

logs_df = logs_df.select('*', udf_parse_time(logs_df['timestamp']).cast('timestamp').alias('time')).drop('timestamp')
logs_df.show(10, truncate=True)

+--------------------+------+--------------------+--------+------+------------+-------------------+
|                host|method|            endpoint|protocol|status|content_size|               time|
+--------------------+------+--------------------+--------+------+------------+-------------------+
|        199.72.81.55|   GET|    /history/apollo/|HTTP/1.0|   200|        6245|1995-07-01 00:00:01|
|unicomp6.unicomp.net|   GET| /shuttle/countdown/|HTTP/1.0|   200|        3985|1995-07-01 00:00:06|
|      199.120.110.21|   GET|/shuttle/missions...|HTTP/1.0|   200|        4085|1995-07-01 00:00:09|
|  burger.letters.com|   GET|/shuttle/countdow...|HTTP/1.0|   304|           0|1995-07-01 00:00:11|
|      199.120.110.21|   GET|/shuttle/missions...|HTTP/1.0|   200|        4179|1995-07-01 00:00:11|
|  burger.letters.com|   GET|/images/NASA-logo...|HTTP/1.0|   304|           0|1995-07-01 00:00:12|
|  burger.letters.com|   GET|/shuttle/countdow...|HTTP/1.0|   200|           0|1995-07-01 00:00:12|
|     205.212.115.106|   GET|/shuttle/countdow...|HTTP/1.0|   200|        3985|1995-07-01 00:00:12|
|         d104.aa.net|   GET| /shuttle/countdown/|HTTP/1.0|   200|        3985|1995-07-01 00:00:13|
|      129.94.144.152|   GET|                   /|HTTP/1.0|   200|        7074|1995-07-01 00:00:13|
+--------------------+------+--------------------+--------+------+------------+-------------------+
only showing top 10 rows

In [39]:

logs_df.printSchema()

root
 |-- host: string (nullable = true)
 |-- method: string (nullable = true)
 |-- endpoint: string (nullable = true)
 |-- protocol: string (nullable = true)
 |-- status: integer (nullable = true)
 |-- content_size: integer (nullable = false)
 |-- time: timestamp (nullable = true)

In [40]:

logs_df.limit(5).toPandas()

Out[40]:

	host	method	endpoint	protocol	status	content_size	time
0	199.72.81.55	GET	/history/apollo/	HTTP/1.0	200	6245	1995-07-01 00:00:01
1	unicomp6.unicomp.net	GET	/shuttle/countdown/	HTTP/1.0	200	3985	1995-07-01 00:00:06
2	199.120.110.21	GET	/shuttle/missions/sts-73/mission-sts-73.html	HTTP/1.0	200	4085	1995-07-01 00:00:09
3	burger.letters.com	GET	/shuttle/countdown/liftoff.html	HTTP/1.0	304	0	1995-07-01 00:00:11
4	199.120.110.21	GET	/shuttle/missions/sts-73/sts-73-patch-small.gif	HTTP/1.0	200	4179	1995-07-01 00:00:11

Let's now cache logs_df since we will be using it extensively for our data analysis section in the next part!

In [41]:

logs_df.cache()

Out[41]:

DataFrame[host: string, method: string, endpoint: string, protocol: string, status: int, content_size: int, time: timestamp]

Part 4 - Data Analysis on our Web Logs¶

Now that we have a DataFrame containing the parsed log file as a data frame, we can perform some interesting exploratory data analysis (EDA)

Content Size Statistics¶

Let's compute some statistics about the sizes of content being returned by the web server. In particular, we'd like to know what are the average, minimum, and maximum content sizes.

We can compute the statistics by calling .describe() on the content_size column of logs_df. The .describe() function returns the count, mean, stddev, min, and max of a given column.

In [42]:

content_size_summary_df = logs_df.describe(['content_size'])
content_size_summary_df.toPandas()

Out[42]:

	summary	content_size
0	count	3461612
1	mean	18928.844398216785
2	stddev	73031.47260949228
3	min	0
4	max	6823936

Alternatively, we can use SQL to directly calculate these statistics. You can explore many useful functions within the pyspark.sql.functions module in the documentation.

After we apply the .agg() function, we call toPandas() to extract and convert the result into a pandas dataframe which has better formatting on Jupyter notebooks

In [43]:

from pyspark.sql import functions as F

(logs_df.agg(F.min(logs_df['content_size']).alias('min_content_size'),
             F.max(logs_df['content_size']).alias('max_content_size'),
             F.mean(logs_df['content_size']).alias('mean_content_size'),
             F.stddev(logs_df['content_size']).alias('std_content_size'),
             F.count(logs_df['content_size']).alias('count_content_size'))
        .toPandas())

Out[43]:

	min_content_size	max_content_size	mean_content_size	std_content_size	count_content_size
0	0	6823936	18928.844398	73031.472609	3461612

HTTP Status Code Analysis¶

Next, let's look at the status code values that appear in the log. We want to know which status code values appear in the data and how many times.

We again start with logs_df, then group by the status column, apply the .count() aggregation function, and sort by the status column.

In [44]:

status_freq_df = (logs_df
                     .groupBy('status')
                     .count()
                     .sort('status')
                     .cache())

In [45]:

print('Total distinct HTTP Status Codes:', status_freq_df.count())

Total distinct HTTP Status Codes: 8

In [46]:

status_freq_pd_df = (status_freq_df
                         .toPandas()
                         .sort_values(by=['count'],
                                      ascending=False))
status_freq_pd_df

Out[46]:

	status	count
0	200	3100524
2	304	266773
1	302	73070
5	404	20899
4	403	225
6	500	65
7	501	41
3	400	15

In [47]:

!pip install -U seaborn

Requirement already up-to-date: seaborn in /usr/local/anaconda/lib/python3.6/site-packages (0.9.0)
Requirement not upgraded as not directly required: matplotlib>=1.4.3 in /usr/local/anaconda/lib/python3.6/site-packages (from seaborn) (2.2.2)
Requirement not upgraded as not directly required: numpy>=1.9.3 in /usr/local/anaconda/lib/python3.6/site-packages (from seaborn) (1.13.3)
Requirement not upgraded as not directly required: scipy>=0.14.0 in /usr/local/anaconda/lib/python3.6/site-packages (from seaborn) (1.1.0)
Requirement not upgraded as not directly required: pandas>=0.15.2 in /usr/local/anaconda/lib/python3.6/site-packages (from seaborn) (0.20.3)
Requirement not upgraded as not directly required: cycler>=0.10 in /usr/local/anaconda/lib/python3.6/site-packages (from matplotlib>=1.4.3->seaborn) (0.10.0)
Requirement not upgraded as not directly required: pytz in /usr/local/anaconda/lib/python3.6/site-packages (from matplotlib>=1.4.3->seaborn) (2018.4)
Requirement not upgraded as not directly required: six>=1.10 in /usr/local/anaconda/lib/python3.6/site-packages (from matplotlib>=1.4.3->seaborn) (1.11.0)
Requirement not upgraded as not directly required: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /usr/local/anaconda/lib/python3.6/site-packages (from matplotlib>=1.4.3->seaborn) (2.2.0)
Requirement not upgraded as not directly required: kiwisolver>=1.0.1 in /usr/local/anaconda/lib/python3.6/site-packages (from matplotlib>=1.4.3->seaborn) (1.0.1)
Requirement not upgraded as not directly required: python-dateutil>=2.1 in /usr/local/anaconda/lib/python3.6/site-packages (from matplotlib>=1.4.3->seaborn) (2.7.3)
Requirement not upgraded as not directly required: setuptools in /usr/local/anaconda/lib/python3.6/site-packages (from kiwisolver>=1.0.1->matplotlib>=1.4.3->seaborn) (39.2.0)
pyspark 2.4.0 requires py4j==0.10.7, which is not installed.
You are using pip version 10.0.1, however version 19.0.3 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.

In [48]:

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
%matplotlib inline

sns.catplot(x='status', y='count', data=status_freq_pd_df, 
            kind='bar', order=status_freq_pd_df['status'])

Out[48]:

<seaborn.axisgrid.FacetGrid at 0x7fb13c15ea90>

In [49]:

log_freq_df = status_freq_df.withColumn('log(count)', F.log(status_freq_df['count']))
log_freq_df.show()

+------+-------+------------------+
|status|  count|        log(count)|
+------+-------+------------------+
|   200|3100524|14.947081687429097|
|   302|  73070|11.199173164785263|
|   304| 266773|12.494153388502301|
|   400|     15|  2.70805020110221|
|   403|    225|  5.41610040220442|
|   404|  20899| 9.947456589918252|
|   500|     65| 4.174387269895637|
|   501|     41| 3.713572066704308|
+------+-------+------------------+

In [50]:

log_freq_pd_df = (log_freq_df
                    .toPandas()
                    .sort_values(by=['log(count)'],
                                 ascending=False))
sns.catplot(x='status', y='log(count)', data=log_freq_pd_df, 
            kind='bar', order=status_freq_pd_df['status'])

Out[50]:

<seaborn.axisgrid.FacetGrid at 0x7fb137dfb4e0>

Analyzing Frequent Hosts¶

Let's look at hosts that have accessed the server frequently. We will try to get the count of total accesses by each host and then sort by the counts and display only the top ten most frequent hosts.

In [51]:

host_sum_df =(logs_df
               .groupBy('host')
               .count()
               .sort('count', ascending=False).limit(10))

host_sum_df.show(truncate=False)

+--------------------+-----+
|host                |count|
+--------------------+-----+
|piweba3y.prodigy.com|21988|
|piweba4y.prodigy.com|16437|
|piweba1y.prodigy.com|12825|
|edams.ksc.nasa.gov  |11964|
|163.206.89.4        |9697 |
|news.ti.com         |8161 |
|www-d1.proxy.aol.com|8047 |
|alyssa.prodigy.com  |8037 |
|                    |7660 |
|siltb10.orl.mmc.com |7573 |
+--------------------+-----+

In [52]:

host_sum_pd_df = host_sum_df.toPandas()
host_sum_pd_df.iloc[8]['host']

Out[52]:

''

Looks like we have some empty strings as one of the top host names! This teaches us a valuable lesson to not just check for nulls but also potentially empty strings when data wrangling.

Display the Top 20 Frequent EndPoints¶

Now, let's visualize the number of hits to endpoints (URIs) in the log. To perform this task, we start with our logs_df and group by the endpoint column, aggregate by count, and sort in descending order like the previous question.

In [53]:

paths_df = (logs_df
            .groupBy('endpoint')
            .count()
            .sort('count', ascending=False).limit(20))

In [54]:

paths_pd_df = paths_df.toPandas()
paths_pd_df

Out[54]:

	endpoint	count
0	/images/NASA-logosmall.gif	208714
1	/images/KSC-logosmall.gif	164970
2	/images/MOSAIC-logosmall.gif	127908
3	/images/USA-logosmall.gif	127074
4	/images/WORLD-logosmall.gif	125925
5	/images/ksclogo-medium.gif	121572
6	/ksc.html	83909
7	/images/launch-logo.gif	76006
8	/history/apollo/images/apollo-logo1.gif	68896
9	/shuttle/countdown/	64736
10	/	63171
11	/images/ksclogosmall.gif	61393
12	/shuttle/missions/missions.html	47315
13	/images/launchmedium.gif	40687
14	/htbin/cdt_main.pl	39871
15	/shuttle/missions/sts-69/mission-sts-69.html	31574
16	/shuttle/countdown/liftoff.html	29865
17	/icons/menu.xbm	29190
18	/shuttle/missions/sts-69/sts-69-patch-small.gif	29118
19	/icons/blank.xbm	28852

Top Ten Error Endpoints¶

What are the top ten endpoints requested which did not have return code 200 (HTTP Status OK)?

We create a sorted list containing the endpoints and the number of times that they were accessed with a non-200 return code and show the top ten.

In [55]:

not200_df = (logs_df
               .filter(logs_df['status'] != 200))

error_endpoints_freq_df = (not200_df
                               .groupBy('endpoint')
                               .count()
                               .sort('count', ascending=False)
                               .limit(10)
                          )

In [56]:

error_endpoints_freq_df.show(truncate=False)

+---------------------------------------+-----+
|endpoint                               |count|
+---------------------------------------+-----+
|/images/NASA-logosmall.gif             |40082|
|/images/KSC-logosmall.gif              |23763|
|/images/MOSAIC-logosmall.gif           |15245|
|/images/USA-logosmall.gif              |15142|
|/images/WORLD-logosmall.gif            |14773|
|/images/ksclogo-medium.gif             |13559|
|/images/launch-logo.gif                |8806 |
|/history/apollo/images/apollo-logo1.gif|7489 |
|/                                      |6296 |
|/images/ksclogosmall.gif               |5669 |
+---------------------------------------+-----+

Total number of Unique Hosts¶

What were the total number of unique hosts who visited the NASA website in these two months? We can find this out with a few transformations.

In [57]:

unique_host_count = (logs_df
                     .select('host')
                     .distinct()
                     .count())
unique_host_count

Out[57]:

Number of Unique Daily Hosts¶

For an advanced example, let's look at a way to determine the number of unique hosts in the entire log on a day-by-day basis. This computation will give us counts of the number of unique daily hosts.

We'd like a DataFrame sorted by increasing day of the month which includes the day of the month and the associated number of unique hosts for that day.

Think about the steps that you need to perform to count the number of different hosts that make requests each day. Since the log only covers a single month, you can ignore the month. You may want to use the dayofmonth function in the pyspark.sql.functions module (which we have already imported as F.

host_day_df

A DataFrame with two columns

column	explanation
`host`	the host name
`day`	the day of the month

There will be one row in this DataFrame for each row in logs_df. Essentially, we are just transforming each row of logs_df. For example, for this row in logs_df:

unicomp6.unicomp.net - - [01/Aug/1995:00:35:41 -0400] "GET /shuttle/missions/sts-73/news HTTP/1.0" 302 -

your host_day_df should have:

unicomp6.unicomp.net 1

In [58]:

host_day_df = logs_df.select(logs_df.host, 
                             F.dayofmonth('time').alias('day'))
host_day_df.show(5, truncate=False)

+--------------------+---+
|host                |day|
+--------------------+---+
|199.72.81.55        |1  |
|unicomp6.unicomp.net|1  |
|199.120.110.21      |1  |
|burger.letters.com  |1  |
|199.120.110.21      |1  |
+--------------------+---+
only showing top 5 rows

host_day_distinct_df

This DataFrame has the same columns as host_day_df, but with duplicate (day, host) rows removed.

In [59]:

host_day_distinct_df = (host_day_df
                          .dropDuplicates())
host_day_distinct_df.show(5, truncate=False)

+-----------------------+---+
|host                   |day|
+-----------------------+---+
|129.94.144.152         |1  |
|slip1.yab.com          |1  |
|205.184.190.47         |1  |
|204.120.34.71          |1  |
|ppp3_130.bekkoame.or.jp|1  |
+-----------------------+---+
only showing top 5 rows

daily_unique_hosts_df

A DataFrame with two columns:

column	explanation
`day`	the day of the month
`count`	the number of unique requesting hosts for that day

In [60]:

def_mr = pd.get_option('max_rows')
pd.set_option('max_rows', 10)

daily_hosts_df = (host_day_distinct_df
                     .groupBy('day')
                     .count()
                     .sort("day"))

daily_hosts_df = daily_hosts_df.toPandas()
daily_hosts_df

Out[60]:

	day	count
0	1	7609
1	2	4858
2	3	10238
3	4	9411
4	5	9640
...	...	...
26	27	6846
27	28	6090
28	29	4825
29	30	5265
30	31	5913

31 rows × 2 columns

In [61]:

c = sns.catplot(x='day', y='count', 
                data=daily_hosts_df, 
                kind='point', height=5, 
                aspect=1.5)

Average Number of Daily Requests per Host¶

In the previous example, we looked at a way to determine the number of unique hosts in the entire log on a day-by-day basis. Let's now try and find the average number of requests being made per Host to the NASA website per day based on our logs.

We'd like a DataFrame sorted by increasing day of the month which includes the day of the month and the associated number of average requests made for that day per Host.

In [62]:

daily_hosts_df = (host_day_distinct_df
                     .groupBy('day')
                     .count()
                     .select(col("day"), 
                                      col("count").alias("total_hosts")))

total_daily_reqests_df = (logs_df
                              .select(F.dayofmonth("time")
                                          .alias("day"))
                              .groupBy("day")
                              .count()
                              .select(col("day"), 
                                      col("count").alias("total_reqs")))

avg_daily_reqests_per_host_df = total_daily_reqests_df.join(daily_hosts_df, 'day')
avg_daily_reqests_per_host_df = (avg_daily_reqests_per_host_df
                                    .withColumn('avg_reqs', col('total_reqs') / col('total_hosts'))
                                    .sort("day"))
avg_daily_reqests_per_host_df = avg_daily_reqests_per_host_df.toPandas()
avg_daily_reqests_per_host_df

Out[62]:

	day	total_reqs	total_hosts	avg_reqs
0	1	98710	7609	12.972795
1	2	60265	4858	12.405311
2	3	130972	10238	12.792733
3	4	130009	9411	13.814579
4	5	126468	9640	13.119087
...	...	...	...	...
26	27	94503	6846	13.804119
27	28	82617	6090	13.566010
28	29	67988	4825	14.090777
29	30	80641	5265	15.316429
30	31	90125	5913	15.241840

31 rows × 4 columns

In [63]:

c = sns.catplot(x='day', y='avg_reqs', 
                data=avg_daily_reqests_per_host_df, 
                kind='point', height=5, aspect=1.5)

Counting 404 Response Codes¶

Create a DataFrame containing only log records with a 404 status code (Not Found).

We make sure to cache() the not_found_df dataframe as we will use it in the rest of the examples here.

How many 404 records are in the log?

In [64]:

not_found_df = logs_df.filter(logs_df["status"] == 404).cache()
print(('Total 404 responses: {}').format(not_found_df.count()))

Total 404 responses: 20899

Listing the Top Twenty 404 Response Code Endpoints¶

Using the DataFrame containing only log records with a 404 response code that we cached earlier, we will now print out a list of the top twenty endpoints that generate the most 404 errors.

Remember, top endpoints should be in sorted order

In [65]:

endpoints_404_count_df = (not_found_df
                          .groupBy("endpoint")
                          .count()
                          .sort("count", ascending=False)
                          .limit(20))

endpoints_404_count_df.show(truncate=False)

+-----------------------------------------------------------------+-----+
|endpoint                                                         |count|
+-----------------------------------------------------------------+-----+
|/pub/winvn/readme.txt                                            |2004 |
|/pub/winvn/release.txt                                           |1732 |
|/shuttle/missions/STS-69/mission-STS-69.html                     |683  |
|/shuttle/missions/sts-68/ksc-upclose.gif                         |428  |
|/history/apollo/a-001/a-001-patch-small.gif                      |384  |
|/history/apollo/sa-1/sa-1-patch-small.gif                        |383  |
|/://spacelink.msfc.nasa.gov                                      |381  |
|/images/crawlerway-logo.gif                                      |374  |
|/elv/DELTA/uncons.htm                                            |372  |
|/history/apollo/pad-abort-test-1/pad-abort-test-1-patch-small.gif|359  |
|/images/nasa-logo.gif                                            |319  |
|/shuttle/resources/orbiters/atlantis.gif                         |314  |
|/history/apollo/apollo-13.html                                   |304  |
|/shuttle/resources/orbiters/discovery.gif                        |263  |
|/shuttle/missions/sts-71/images/KSC-95EC-0916.txt                |190  |
|/shuttle/resources/orbiters/challenger.gif                       |170  |
|/shuttle/missions/technology/sts-newsref/stsref-toc.html         |158  |
|/history/apollo/images/little-joe.jpg                            |150  |
|/images/lf-logo.gif                                              |143  |
|/history/apollo/publications/sp-350/sp-350.txt~                  |140  |
+-----------------------------------------------------------------+-----+

Listing the Top Twenty 404 Response Code Hosts¶

Using the DataFrame containing only log records with a 404 response code that we cached earlier, we will now print out a list of the top twenty hosts that generate the most 404 errors.

Remember, top hosts should be in sorted order

In [66]:

hosts_404_count_df = (not_found_df
                          .groupBy("host")
                          .count()
                          .sort("count", ascending=False)
                          .limit(20))

hosts_404_count_df.show(truncate=False)

+---------------------------+-----+
|host                       |count|
+---------------------------+-----+
|hoohoo.ncsa.uiuc.edu       |251  |
|piweba3y.prodigy.com       |157  |
|jbiagioni.npt.nuwc.navy.mil|132  |
|piweba1y.prodigy.com       |114  |
|                           |112  |
|www-d4.proxy.aol.com       |91   |
|piweba4y.prodigy.com       |86   |
|scooter.pa-x.dec.com       |69   |
|www-d1.proxy.aol.com       |64   |
|phaelon.ksc.nasa.gov       |64   |
|dialip-217.den.mmc.com     |62   |
|www-b4.proxy.aol.com       |62   |
|www-b3.proxy.aol.com       |61   |
|www-a2.proxy.aol.com       |60   |
|www-d2.proxy.aol.com       |59   |
|piweba2y.prodigy.com       |59   |
|alyssa.prodigy.com         |56   |
|monarch.eng.buffalo.edu    |56   |
|www-b2.proxy.aol.com       |53   |
|www-c4.proxy.aol.com       |53   |
+---------------------------+-----+

Visualizing 404 Errors per Day¶

Let's explore our 404 records temporally (by time) now. Similar to the example showing the number of unique daily hosts, we will break down the 404 requests by day and get the daily counts sorted by day in errors_by_date_sorted_df.

In [67]:

errors_by_date_sorted_df = (not_found_df
                                .groupBy(F.dayofmonth('time').alias('day'))
                                .count()
                                .sort("day"))

errors_by_date_sorted_pd_df = errors_by_date_sorted_df.toPandas()
errors_by_date_sorted_pd_df

Out[67]:

	day	count
0	1	559
1	2	291
2	3	778
3	4	705
4	5	733
...	...	...
26	27	706
27	28	504
28	29	420
29	30	571
30	31	526

31 rows × 2 columns

In [68]:

c = sns.catplot(x='day', y='count', 
                data=errors_by_date_sorted_pd_df, 
                kind='point', height=5, aspect=1.5)

Top Three Days for 404 Errors¶

What are the top three days of the month having the most 404 errors, we can leverage our previously created errors_by_date_sorted_df for this.

In [69]:

(errors_by_date_sorted_df
    .sort("count", ascending=False)
    .show(3))

+---+-----+
|day|count|
+---+-----+
|  7| 1107|
|  6| 1013|
| 25|  876|
+---+-----+
only showing top 3 rows

Visualizing Hourly 404 Errors¶

Using the DataFrame not_found_df we cached earlier, we will now group and sort by hour of the day in increasing order, to create a DataFrame containing the total number of 404 responses for HTTP requests for each hour of the day (midnight starts at 0)

In [70]:

hourly_avg_errors_sorted_df = (not_found_df
                                   .groupBy(F.hour('time')
                                             .alias('hour'))
                                   .count()
                                   .sort('hour'))
hourly_avg_errors_sorted_pd_df = hourly_avg_errors_sorted_df.toPandas()

In [71]:

c = sns.catplot(x='hour', y='count', 
                data=hourly_avg_errors_sorted_pd_df, 
                kind='bar', height=5, aspect=1.5)

Reset the max rows displayed in pandas¶

In [72]:

pd.set_option('max_rows', def_mr)

	day	count
0	1	559
1	2	291
2	3	778
3	4	705
4	5	733
...	...	...
26	27	706
27	28	504
28	29	420
29	30	571
30	31	526

	day	count
0	1	559
1	2	291
2	3	778
3	4	705
4	5	733
...	...	...
26	27	706
27	28	504
28	29	420
29	30	571
30	31	526

	day	count
0	1	559
1	2	291
2	3	778
3	4	705
4	5	733
...	...	...
26	27	706
27	28	504
28	29	420
29	30	571
30	31	526