Apache Spark is an excellent and ideal framework for wrangling, analyzing and modeling on structured and unstructured data - at scale! In this tutorial, we will be focusing on one of the most popular case studies in the industry - log analytics.
Typically, server logs are a very common data source in enterprises and often contain a gold mine of actionable insights and information. Log data comes from many sources in an enterprise, such as the web, client and compute servers, applications, user-generated content, flat files. They can be used for monitoring servers, improving business and customer intelligence, building recommendation systems, fraud detection, and much more.
Spark allows you to dump and store your logs in files on disk cheaply, while still providing rich APIs to perform data analysis at scale. This hands-on will show you how to use Apache Spark on real-world production logs from NASA and learn data wrangling and basic yet powerful techniques in exploratory data analysis.
spark
SparkSession - hive
sqlContext
<pyspark.sql.context.SQLContext at 0x7fb1577b6400>
from pyspark.context import SparkContext
from pyspark.sql.context import SQLContext
from pyspark.sql.session import SparkSession
sc = SparkContext()
sqlContext = SQLContext(sc)
spark = SparkSession(sc)
import re
import pandas as pd
m = re.finditer(r'.*?(spark).*?', "I'm searching for a spark in PySpark", re.I)
for match in m:
print(match, match.start(), match.end())
<_sre.SRE_Match object; span=(0, 25), match="I'm searching for a spark"> 0 25 <_sre.SRE_Match object; span=(25, 36), match=' in PySpark'> 25 36
In this case study, we will analyze log datasets from NASA Kennedy Space Center web server in Florida. The full data set is freely available for download here.
These two datasets contain two months' worth of all HTTP requests to the NASA Kennedy Space Center WWW server in Florida. You can head over to the website and download the following files as needed.
Make sure both the files are in the same directory as this notebook.
Given that our data is stored in the following mentioned path, let's load it into a DataFrame. We'll do this in steps. First, we'll use sqlContext.read.text()
or spark.read.text()
to read the text file. This will produce a DataFrame with a single string column called value
.
import glob
raw_data_files = glob.glob('*.gz')
raw_data_files
['NASA_access_log_Jul95.gz', 'NASA_access_log_Aug95.gz']
base_df = spark.read.text(raw_data_files)
base_df.printSchema()
root |-- value: string (nullable = true)
type(base_df)
pyspark.sql.dataframe.DataFrame
You can also convert a dataframe to an RDD if needed
base_df_rdd = base_df.rdd
type(base_df_rdd)
pyspark.rdd.RDD
Looks like it needs to be wrangled and parsed!
base_df.show(10, truncate=False)
+-----------------------------------------------------------------------------------------------------------------------+ |value | +-----------------------------------------------------------------------------------------------------------------------+ |199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245 | |unicomp6.unicomp.net - - [01/Jul/1995:00:00:06 -0400] "GET /shuttle/countdown/ HTTP/1.0" 200 3985 | |199.120.110.21 - - [01/Jul/1995:00:00:09 -0400] "GET /shuttle/missions/sts-73/mission-sts-73.html HTTP/1.0" 200 4085 | |burger.letters.com - - [01/Jul/1995:00:00:11 -0400] "GET /shuttle/countdown/liftoff.html HTTP/1.0" 304 0 | |199.120.110.21 - - [01/Jul/1995:00:00:11 -0400] "GET /shuttle/missions/sts-73/sts-73-patch-small.gif HTTP/1.0" 200 4179| |burger.letters.com - - [01/Jul/1995:00:00:12 -0400] "GET /images/NASA-logosmall.gif HTTP/1.0" 304 0 | |burger.letters.com - - [01/Jul/1995:00:00:12 -0400] "GET /shuttle/countdown/video/livevideo.gif HTTP/1.0" 200 0 | |205.212.115.106 - - [01/Jul/1995:00:00:12 -0400] "GET /shuttle/countdown/countdown.html HTTP/1.0" 200 3985 | |d104.aa.net - - [01/Jul/1995:00:00:13 -0400] "GET /shuttle/countdown/ HTTP/1.0" 200 3985 | |129.94.144.152 - - [01/Jul/1995:00:00:13 -0400] "GET / HTTP/1.0" 200 7074 | +-----------------------------------------------------------------------------------------------------------------------+ only showing top 10 rows
Getting data from an RDD is slightly different. You can see how the data representation is different in the following RDD
base_df_rdd.take(10)
[Row(value='199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245'), Row(value='unicomp6.unicomp.net - - [01/Jul/1995:00:00:06 -0400] "GET /shuttle/countdown/ HTTP/1.0" 200 3985'), Row(value='199.120.110.21 - - [01/Jul/1995:00:00:09 -0400] "GET /shuttle/missions/sts-73/mission-sts-73.html HTTP/1.0" 200 4085'), Row(value='burger.letters.com - - [01/Jul/1995:00:00:11 -0400] "GET /shuttle/countdown/liftoff.html HTTP/1.0" 304 0'), Row(value='199.120.110.21 - - [01/Jul/1995:00:00:11 -0400] "GET /shuttle/missions/sts-73/sts-73-patch-small.gif HTTP/1.0" 200 4179'), Row(value='burger.letters.com - - [01/Jul/1995:00:00:12 -0400] "GET /images/NASA-logosmall.gif HTTP/1.0" 304 0'), Row(value='burger.letters.com - - [01/Jul/1995:00:00:12 -0400] "GET /shuttle/countdown/video/livevideo.gif HTTP/1.0" 200 0'), Row(value='205.212.115.106 - - [01/Jul/1995:00:00:12 -0400] "GET /shuttle/countdown/countdown.html HTTP/1.0" 200 3985'), Row(value='d104.aa.net - - [01/Jul/1995:00:00:13 -0400] "GET /shuttle/countdown/ HTTP/1.0" 200 3985'), Row(value='129.94.144.152 - - [01/Jul/1995:00:00:13 -0400] "GET / HTTP/1.0" 200 7074')]
In this section, we will try and clean and parse our log dataset to really extract structured attributes with meaningful information from each log message.
If you're familiar with web server logs, you'll recognize that the above displayed data is in Common Log Format.
The fields are:
remotehost rfc931 authuser [date] "request" status bytes
field | meaning |
---|---|
remotehost | Remote hostname (or IP number if DNS hostname is not available or if DNSLookup is off). |
rfc931 | The remote logname of the user if at all it is present. |
authuser | The username of the remote user after authentication by the HTTP server. |
[date] | Date and time of the request. |
"request" | The request, exactly as it came from the browser or client. |
status | The HTTP status code the server sent back to the client. |
bytes | The number of bytes (Content-Length ) transferred to the client. |
We will need to use some specific techniques to parse, match and extract these attributes from the log data
Next, we have to parse it into individual columns. We'll use the special built-in regexp_extract() function to do the parsing. This function matches a column against a regular expression with one or more capture groups and allows you to extract one of the matched groups. We'll use one regular expression for each field we wish to extract.
You must have heard or used a fair bit of regular expressions by now. If you find regular expressions confusing (and they certainly can be), and you want to learn more about them, we recommend checking out the RegexOne web site. You might also find Regular Expressions Cookbook, by Goyvaerts and Levithan, to be useful as a reference.
print((base_df.count(), len(base_df.columns)))
(3461613, 1)
Let's extract and take a look at some sample log messages
sample_logs = [item['value'] for item in base_df.take(15)]
sample_logs
['199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245', 'unicomp6.unicomp.net - - [01/Jul/1995:00:00:06 -0400] "GET /shuttle/countdown/ HTTP/1.0" 200 3985', '199.120.110.21 - - [01/Jul/1995:00:00:09 -0400] "GET /shuttle/missions/sts-73/mission-sts-73.html HTTP/1.0" 200 4085', 'burger.letters.com - - [01/Jul/1995:00:00:11 -0400] "GET /shuttle/countdown/liftoff.html HTTP/1.0" 304 0', '199.120.110.21 - - [01/Jul/1995:00:00:11 -0400] "GET /shuttle/missions/sts-73/sts-73-patch-small.gif HTTP/1.0" 200 4179', 'burger.letters.com - - [01/Jul/1995:00:00:12 -0400] "GET /images/NASA-logosmall.gif HTTP/1.0" 304 0', 'burger.letters.com - - [01/Jul/1995:00:00:12 -0400] "GET /shuttle/countdown/video/livevideo.gif HTTP/1.0" 200 0', '205.212.115.106 - - [01/Jul/1995:00:00:12 -0400] "GET /shuttle/countdown/countdown.html HTTP/1.0" 200 3985', 'd104.aa.net - - [01/Jul/1995:00:00:13 -0400] "GET /shuttle/countdown/ HTTP/1.0" 200 3985', '129.94.144.152 - - [01/Jul/1995:00:00:13 -0400] "GET / HTTP/1.0" 200 7074', 'unicomp6.unicomp.net - - [01/Jul/1995:00:00:14 -0400] "GET /shuttle/countdown/count.gif HTTP/1.0" 200 40310', 'unicomp6.unicomp.net - - [01/Jul/1995:00:00:14 -0400] "GET /images/NASA-logosmall.gif HTTP/1.0" 200 786', 'unicomp6.unicomp.net - - [01/Jul/1995:00:00:14 -0400] "GET /images/KSC-logosmall.gif HTTP/1.0" 200 1204', 'd104.aa.net - - [01/Jul/1995:00:00:15 -0400] "GET /shuttle/countdown/count.gif HTTP/1.0" 200 40310', 'd104.aa.net - - [01/Jul/1995:00:00:15 -0400] "GET /images/NASA-logosmall.gif HTTP/1.0" 200 786']
Let's try and write some regular expressions to extract the host name from the logs
host_pattern = r'(^\S+\.[\S+\.]+\S+)\s'
hosts = [re.search(host_pattern, item).group(1)
if re.search(host_pattern, item)
else 'no match'
for item in sample_logs]
hosts
['199.72.81.55', 'unicomp6.unicomp.net', '199.120.110.21', 'burger.letters.com', '199.120.110.21', 'burger.letters.com', 'burger.letters.com', '205.212.115.106', 'd104.aa.net', '129.94.144.152', 'unicomp6.unicomp.net', 'unicomp6.unicomp.net', 'unicomp6.unicomp.net', 'd104.aa.net', 'd104.aa.net']
Let's now try and use regular expressions to extract the timestamp fields from the logs
ts_pattern = r'\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} -\d{4})]'
timestamps = [re.search(ts_pattern, item).group(1) for item in sample_logs]
timestamps
['01/Jul/1995:00:00:01 -0400', '01/Jul/1995:00:00:06 -0400', '01/Jul/1995:00:00:09 -0400', '01/Jul/1995:00:00:11 -0400', '01/Jul/1995:00:00:11 -0400', '01/Jul/1995:00:00:12 -0400', '01/Jul/1995:00:00:12 -0400', '01/Jul/1995:00:00:12 -0400', '01/Jul/1995:00:00:13 -0400', '01/Jul/1995:00:00:13 -0400', '01/Jul/1995:00:00:14 -0400', '01/Jul/1995:00:00:14 -0400', '01/Jul/1995:00:00:14 -0400', '01/Jul/1995:00:00:15 -0400', '01/Jul/1995:00:00:15 -0400']
Let's now try and use regular expressions to extract the HTTP request methods, URIs and Protocol patterns fields from the logs
method_uri_protocol_pattern = r'\"(\S+)\s(\S+)\s*(\S*)\"'
method_uri_protocol = [re.search(method_uri_protocol_pattern, item).groups()
if re.search(method_uri_protocol_pattern, item)
else 'no match'
for item in sample_logs]
method_uri_protocol
[('GET', '/history/apollo/', 'HTTP/1.0'), ('GET', '/shuttle/countdown/', 'HTTP/1.0'), ('GET', '/shuttle/missions/sts-73/mission-sts-73.html', 'HTTP/1.0'), ('GET', '/shuttle/countdown/liftoff.html', 'HTTP/1.0'), ('GET', '/shuttle/missions/sts-73/sts-73-patch-small.gif', 'HTTP/1.0'), ('GET', '/images/NASA-logosmall.gif', 'HTTP/1.0'), ('GET', '/shuttle/countdown/video/livevideo.gif', 'HTTP/1.0'), ('GET', '/shuttle/countdown/countdown.html', 'HTTP/1.0'), ('GET', '/shuttle/countdown/', 'HTTP/1.0'), ('GET', '/', 'HTTP/1.0'), ('GET', '/shuttle/countdown/count.gif', 'HTTP/1.0'), ('GET', '/images/NASA-logosmall.gif', 'HTTP/1.0'), ('GET', '/images/KSC-logosmall.gif', 'HTTP/1.0'), ('GET', '/shuttle/countdown/count.gif', 'HTTP/1.0'), ('GET', '/images/NASA-logosmall.gif', 'HTTP/1.0')]
Let's now try and use regular expressions to extract the HTTP status codes from the logs
status_pattern = r'\s(\d{3})\s'
status = [re.search(status_pattern, item).group(1) for item in sample_logs]
print(status)
['200', '200', '200', '304', '200', '304', '200', '200', '200', '200', '200', '200', '200', '200', '200']
Let's now try and use regular expressions to extract the HTTP response content size from the logs
content_size_pattern = r'\s(\d+)$'
content_size = [re.search(content_size_pattern, item).group(1) for item in sample_logs]
print(content_size)
['6245', '3985', '4085', '0', '4179', '0', '0', '3985', '3985', '7074', '40310', '786', '1204', '40310', '786']
Let's now try and leverage all the regular expression patterns we previously built and use the regexp_extract(...)
method to build our dataframe with all the log attributes neatly extracted in their own separate columns.
from pyspark.sql.functions import regexp_extract
logs_df = base_df.select(regexp_extract('value', host_pattern, 1).alias('host'),
regexp_extract('value', ts_pattern, 1).alias('timestamp'),
regexp_extract('value', method_uri_protocol_pattern, 1).alias('method'),
regexp_extract('value', method_uri_protocol_pattern, 2).alias('endpoint'),
regexp_extract('value', method_uri_protocol_pattern, 3).alias('protocol'),
regexp_extract('value', status_pattern, 1).cast('integer').alias('status'),
regexp_extract('value', content_size_pattern, 1).cast('integer').alias('content_size'))
logs_df.show(10, truncate=True)
print((logs_df.count(), len(logs_df.columns)))
+--------------------+--------------------+------+--------------------+--------+------+------------+ | host| timestamp|method| endpoint|protocol|status|content_size| +--------------------+--------------------+------+--------------------+--------+------+------------+ | 199.72.81.55|01/Jul/1995:00:00...| GET| /history/apollo/|HTTP/1.0| 200| 6245| |unicomp6.unicomp.net|01/Jul/1995:00:00...| GET| /shuttle/countdown/|HTTP/1.0| 200| 3985| | 199.120.110.21|01/Jul/1995:00:00...| GET|/shuttle/missions...|HTTP/1.0| 200| 4085| | burger.letters.com|01/Jul/1995:00:00...| GET|/shuttle/countdow...|HTTP/1.0| 304| 0| | 199.120.110.21|01/Jul/1995:00:00...| GET|/shuttle/missions...|HTTP/1.0| 200| 4179| | burger.letters.com|01/Jul/1995:00:00...| GET|/images/NASA-logo...|HTTP/1.0| 304| 0| | burger.letters.com|01/Jul/1995:00:00...| GET|/shuttle/countdow...|HTTP/1.0| 200| 0| | 205.212.115.106|01/Jul/1995:00:00...| GET|/shuttle/countdow...|HTTP/1.0| 200| 3985| | d104.aa.net|01/Jul/1995:00:00...| GET| /shuttle/countdown/|HTTP/1.0| 200| 3985| | 129.94.144.152|01/Jul/1995:00:00...| GET| /|HTTP/1.0| 200| 7074| +--------------------+--------------------+------+--------------------+--------+------+------------+ only showing top 10 rows (3461613, 7)
Missing and null values are the bane of data analysis and machine learning. Let's see how well our data parsing and extraction logic worked. First, let's verify that there are no null rows in the original dataframe.
(base_df
.filter(base_df['value']
.isNull())
.count())
0
If our data parsing and extraction worked properly, we should not have any rows with potential null values. Let's try and put that to test!
bad_rows_df = logs_df.filter(logs_df['host'].isNull()|
logs_df['timestamp'].isNull() |
logs_df['method'].isNull() |
logs_df['endpoint'].isNull() |
logs_df['status'].isNull() |
logs_df['content_size'].isNull()|
logs_df['protocol'].isNull())
bad_rows_df.count()
33905
Ouch! Looks like we have over 33K missing values in our data! Can we handle this?
Do remember, this is not a regular pandas dataframe which you can directly query and get which columns have null. Our so-called big dataset is residing on disk which can potentially be present in multiple nodes in a spark cluster. So how do we find out which columns have potential nulls?
We can typically use the following technique to find out which columns have null values.
(Note: This approach is adapted from an excellent answer on StackOverflow.)
logs_df.columns
['host', 'timestamp', 'method', 'endpoint', 'protocol', 'status', 'content_size']
from pyspark.sql.functions import col
from pyspark.sql.functions import sum as spark_sum
def count_null(col_name):
return spark_sum(col(col_name).isNull().cast('integer')).alias(col_name)
# Build up a list of column expressions, one per column.
exprs = [count_null(col_name) for col_name in logs_df.columns]
# Run the aggregation. The *exprs converts the list of expressions into
# variable function arguments.
logs_df.agg(*exprs).show()
+----+---------+------+--------+--------+------+------------+ |host|timestamp|method|endpoint|protocol|status|content_size| +----+---------+------+--------+--------+------+------------+ | 0| 0| 0| 0| 0| 1| 33905| +----+---------+------+--------+--------+------+------------+
Well, looks like we have one missing value in the status
column and everything else is in the content_size
column.
Let's see if we can figure out what's wrong!
Our original parsing regular expression for the status
column was:
__``` regexp_extract('value', r'\s(\d{3})\s', 1).cast('integer').alias('status')
__
Could it be that there are more digits making our regular expression wrong? or is the data point itself bad? Let's try and find out!
**Note**: In the expression below, `~` means "not".
null_status_df = base_df.filter(~base_df['value'].rlike(r'\s(\d{3})\s'))
null_status_df.count()
1
null_status_df.show(truncate=False)
+--------+ |value | +--------+ |alyssa.p| +--------+
bad_status_df = null_status_df.select(regexp_extract('value', host_pattern, 1).alias('host'),
regexp_extract('value', ts_pattern, 1).alias('timestamp'),
regexp_extract('value', method_uri_protocol_pattern, 1).alias('method'),
regexp_extract('value', method_uri_protocol_pattern, 2).alias('endpoint'),
regexp_extract('value', method_uri_protocol_pattern, 3).alias('protocol'),
regexp_extract('value', status_pattern, 1).cast('integer').alias('status'),
regexp_extract('value', content_size_pattern, 1).cast('integer').alias('content_size'))
bad_status_df.show(truncate=False)
+----+---------+------+--------+--------+------+------------+ |host|timestamp|method|endpoint|protocol|status|content_size| +----+---------+------+--------+--------+------+------------+ | | | | | |null |null | +----+---------+------+--------+--------+------+------------+
Looks like the record itself is an incomplete record with no useful information, the best option would be to drop this record as follows!
logs_df.count()
3461613
logs_df = logs_df[logs_df['status'].isNotNull()]
logs_df.count()
3461612
exprs = [count_null(col_name) for col_name in logs_df.columns]
logs_df.agg(*exprs).show()
+----+---------+------+--------+--------+------+------------+ |host|timestamp|method|endpoint|protocol|status|content_size| +----+---------+------+--------+--------+------+------------+ | 0| 0| 0| 0| 0| 0| 33904| +----+---------+------+--------+--------+------+------------+
Based on our previous regular expression, our original parsing regular expression for the content_size
column was:
__``` regexp_extract('value', r'\s(\d+)$', 1).cast('integer').alias('content_size')
__
Could there be missing data in our original dataset itself? Let's try and find out!
null_content_size_df = base_df.filter(~base_df['value'].rlike(r'\s\d+$'))
null_content_size_df.count()
33905
null_content_size_df.take(10)
[Row(value='dd15-062.compuserve.com - - [01/Jul/1995:00:01:12 -0400] "GET /news/sci.space.shuttle/archive/sci-space-shuttle-22-apr-1995-40.txt HTTP/1.0" 404 -'), Row(value='dynip42.efn.org - - [01/Jul/1995:00:02:14 -0400] "GET /software HTTP/1.0" 302 -'), Row(value='ix-or10-06.ix.netcom.com - - [01/Jul/1995:00:02:40 -0400] "GET /software/winvn HTTP/1.0" 302 -'), Row(value='ix-or10-06.ix.netcom.com - - [01/Jul/1995:00:03:24 -0400] "GET /software HTTP/1.0" 302 -'), Row(value='link097.txdirect.net - - [01/Jul/1995:00:05:06 -0400] "GET /shuttle HTTP/1.0" 302 -'), Row(value='ix-war-mi1-20.ix.netcom.com - - [01/Jul/1995:00:05:13 -0400] "GET /shuttle/missions/sts-78/news HTTP/1.0" 302 -'), Row(value='ix-war-mi1-20.ix.netcom.com - - [01/Jul/1995:00:05:58 -0400] "GET /shuttle/missions/sts-72/news HTTP/1.0" 302 -'), Row(value='netport-27.iu.net - - [01/Jul/1995:00:10:19 -0400] "GET /pub/winvn/readme.txt HTTP/1.0" 404 -'), Row(value='netport-27.iu.net - - [01/Jul/1995:00:10:28 -0400] "GET /pub/winvn/readme.txt HTTP/1.0" 404 -'), Row(value='dynip38.efn.org - - [01/Jul/1995:00:10:50 -0400] "GET /software HTTP/1.0" 302 -')]
It is quite evident that the bad raw data records correspond to error responses, where no content was sent back and the server emitted a "-
" for the content_size
field.
Since we don't want to discard those rows from our analysis, let's impute or fill them to 0.
The easiest solution is to replace the null values in logs_df
with 0 like we discussed earlier. The Spark DataFrame API provides a set of functions and fields specifically designed for working with null values, among them:
There are several ways to invoke this function. The easiest is just to replace all null columns with known values. But, for safety, it's better to pass a Python dictionary containing (column_name, value) mappings. That's what we'll do. A sample example from the documentation is depicted below
>>> df4.na.fill({'age': 50, 'name': 'unknown'}).show()
+---+------+-------+
|age|height| name|
+---+------+-------+
| 10| 80| Alice|
| 5| null| Bob|
| 50| null| Tom|
| 50| null|unknown|
+---+------+-------+
Now we use this function and fill all the missing values in the content_size
field with 0!
logs_df = logs_df.na.fill({'content_size': 0})
Now assuming everything we have done so far worked, we should have no missing values \ nulls in our dataset. Let's verify this!
exprs = [count_null(col_name) for col_name in logs_df.columns]
logs_df.agg(*exprs).show()
+----+---------+------+--------+--------+------+------------+ |host|timestamp|method|endpoint|protocol|status|content_size| +----+---------+------+--------+--------+------+------------+ | 0| 0| 0| 0| 0| 0| 0| +----+---------+------+--------+--------+------+------------+
Look at that, no missing values!
Now that we have a clean, parsed DataFrame, we have to parse the timestamp field into an actual timestamp. The Common Log Format time is somewhat non-standard. A User-Defined Function (UDF) is the most straightforward way to parse it.
from pyspark.sql.functions import udf
month_map = {
'Jan': 1, 'Feb': 2, 'Mar':3, 'Apr':4, 'May':5, 'Jun':6, 'Jul':7,
'Aug':8, 'Sep': 9, 'Oct':10, 'Nov': 11, 'Dec': 12
}
def parse_clf_time(text):
""" Convert Common Log time format into a Python datetime object
Args:
text (str): date and time in Apache time format [dd/mmm/yyyy:hh:mm:ss (+/-)zzzz]
Returns:
a string suitable for passing to CAST('timestamp')
"""
# NOTE: We're ignoring the time zones here, might need to be handled depending on the problem you are solving
return "{0:04d}-{1:02d}-{2:02d} {3:02d}:{4:02d}:{5:02d}".format(
int(text[7:11]),
month_map[text[3:6]],
int(text[0:2]),
int(text[12:14]),
int(text[15:17]),
int(text[18:20])
)
sample_ts = [item['timestamp'] for item in logs_df.select('timestamp').take(5)]
sample_ts
['01/Jul/1995:00:00:01 -0400', '01/Jul/1995:00:00:06 -0400', '01/Jul/1995:00:00:09 -0400', '01/Jul/1995:00:00:11 -0400', '01/Jul/1995:00:00:11 -0400']
[parse_clf_time(item) for item in sample_ts]
['1995-07-01 00:00:01', '1995-07-01 00:00:06', '1995-07-01 00:00:09', '1995-07-01 00:00:11', '1995-07-01 00:00:11']
udf_parse_time = udf(parse_clf_time)
logs_df = logs_df.select('*', udf_parse_time(logs_df['timestamp']).cast('timestamp').alias('time')).drop('timestamp')
logs_df.show(10, truncate=True)
+--------------------+------+--------------------+--------+------+------------+-------------------+ | host|method| endpoint|protocol|status|content_size| time| +--------------------+------+--------------------+--------+------+------------+-------------------+ | 199.72.81.55| GET| /history/apollo/|HTTP/1.0| 200| 6245|1995-07-01 00:00:01| |unicomp6.unicomp.net| GET| /shuttle/countdown/|HTTP/1.0| 200| 3985|1995-07-01 00:00:06| | 199.120.110.21| GET|/shuttle/missions...|HTTP/1.0| 200| 4085|1995-07-01 00:00:09| | burger.letters.com| GET|/shuttle/countdow...|HTTP/1.0| 304| 0|1995-07-01 00:00:11| | 199.120.110.21| GET|/shuttle/missions...|HTTP/1.0| 200| 4179|1995-07-01 00:00:11| | burger.letters.com| GET|/images/NASA-logo...|HTTP/1.0| 304| 0|1995-07-01 00:00:12| | burger.letters.com| GET|/shuttle/countdow...|HTTP/1.0| 200| 0|1995-07-01 00:00:12| | 205.212.115.106| GET|/shuttle/countdow...|HTTP/1.0| 200| 3985|1995-07-01 00:00:12| | d104.aa.net| GET| /shuttle/countdown/|HTTP/1.0| 200| 3985|1995-07-01 00:00:13| | 129.94.144.152| GET| /|HTTP/1.0| 200| 7074|1995-07-01 00:00:13| +--------------------+------+--------------------+--------+------+------------+-------------------+ only showing top 10 rows
logs_df.printSchema()
root |-- host: string (nullable = true) |-- method: string (nullable = true) |-- endpoint: string (nullable = true) |-- protocol: string (nullable = true) |-- status: integer (nullable = true) |-- content_size: integer (nullable = false) |-- time: timestamp (nullable = true)
logs_df.limit(5).toPandas()
host | method | endpoint | protocol | status | content_size | time | |
---|---|---|---|---|---|---|---|
0 | 199.72.81.55 | GET | /history/apollo/ | HTTP/1.0 | 200 | 6245 | 1995-07-01 00:00:01 |
1 | unicomp6.unicomp.net | GET | /shuttle/countdown/ | HTTP/1.0 | 200 | 3985 | 1995-07-01 00:00:06 |
2 | 199.120.110.21 | GET | /shuttle/missions/sts-73/mission-sts-73.html | HTTP/1.0 | 200 | 4085 | 1995-07-01 00:00:09 |
3 | burger.letters.com | GET | /shuttle/countdown/liftoff.html | HTTP/1.0 | 304 | 0 | 1995-07-01 00:00:11 |
4 | 199.120.110.21 | GET | /shuttle/missions/sts-73/sts-73-patch-small.gif | HTTP/1.0 | 200 | 4179 | 1995-07-01 00:00:11 |
Let's now cache logs_df
since we will be using it extensively for our data analysis section in the next part!
logs_df.cache()
DataFrame[host: string, method: string, endpoint: string, protocol: string, status: int, content_size: int, time: timestamp]
Now that we have a DataFrame containing the parsed log file as a data frame, we can perform some interesting exploratory data analysis (EDA)
Let's compute some statistics about the sizes of content being returned by the web server. In particular, we'd like to know what are the average, minimum, and maximum content sizes.
We can compute the statistics by calling .describe()
on the content_size
column of logs_df
. The .describe()
function returns the count, mean, stddev, min, and max of a given column.
content_size_summary_df = logs_df.describe(['content_size'])
content_size_summary_df.toPandas()
summary | content_size | |
---|---|---|
0 | count | 3461612 |
1 | mean | 18928.844398216785 |
2 | stddev | 73031.47260949228 |
3 | min | 0 |
4 | max | 6823936 |
Alternatively, we can use SQL to directly calculate these statistics. You can explore many useful functions within the pyspark.sql.functions
module in the documentation.
After we apply the .agg()
function, we call toPandas()
to extract and convert the result into a pandas
dataframe which has better formatting on Jupyter notebooks
from pyspark.sql import functions as F
(logs_df.agg(F.min(logs_df['content_size']).alias('min_content_size'),
F.max(logs_df['content_size']).alias('max_content_size'),
F.mean(logs_df['content_size']).alias('mean_content_size'),
F.stddev(logs_df['content_size']).alias('std_content_size'),
F.count(logs_df['content_size']).alias('count_content_size'))
.toPandas())
min_content_size | max_content_size | mean_content_size | std_content_size | count_content_size | |
---|---|---|---|---|---|
0 | 0 | 6823936 | 18928.844398 | 73031.472609 | 3461612 |
Next, let's look at the status code values that appear in the log. We want to know which status code values appear in the data and how many times.
We again start with logs_df
, then group by the status
column, apply the .count()
aggregation function, and sort by the status
column.
status_freq_df = (logs_df
.groupBy('status')
.count()
.sort('status')
.cache())
print('Total distinct HTTP Status Codes:', status_freq_df.count())
Total distinct HTTP Status Codes: 8
status_freq_pd_df = (status_freq_df
.toPandas()
.sort_values(by=['count'],
ascending=False))
status_freq_pd_df
status | count | |
---|---|---|
0 | 200 | 3100524 |
2 | 304 | 266773 |
1 | 302 | 73070 |
5 | 404 | 20899 |
4 | 403 | 225 |
6 | 500 | 65 |
7 | 501 | 41 |
3 | 400 | 15 |
!pip install -U seaborn
Requirement already up-to-date: seaborn in /usr/local/anaconda/lib/python3.6/site-packages (0.9.0) Requirement not upgraded as not directly required: matplotlib>=1.4.3 in /usr/local/anaconda/lib/python3.6/site-packages (from seaborn) (2.2.2) Requirement not upgraded as not directly required: numpy>=1.9.3 in /usr/local/anaconda/lib/python3.6/site-packages (from seaborn) (1.13.3) Requirement not upgraded as not directly required: scipy>=0.14.0 in /usr/local/anaconda/lib/python3.6/site-packages (from seaborn) (1.1.0) Requirement not upgraded as not directly required: pandas>=0.15.2 in /usr/local/anaconda/lib/python3.6/site-packages (from seaborn) (0.20.3) Requirement not upgraded as not directly required: cycler>=0.10 in /usr/local/anaconda/lib/python3.6/site-packages (from matplotlib>=1.4.3->seaborn) (0.10.0) Requirement not upgraded as not directly required: pytz in /usr/local/anaconda/lib/python3.6/site-packages (from matplotlib>=1.4.3->seaborn) (2018.4) Requirement not upgraded as not directly required: six>=1.10 in /usr/local/anaconda/lib/python3.6/site-packages (from matplotlib>=1.4.3->seaborn) (1.11.0) Requirement not upgraded as not directly required: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /usr/local/anaconda/lib/python3.6/site-packages (from matplotlib>=1.4.3->seaborn) (2.2.0) Requirement not upgraded as not directly required: kiwisolver>=1.0.1 in /usr/local/anaconda/lib/python3.6/site-packages (from matplotlib>=1.4.3->seaborn) (1.0.1) Requirement not upgraded as not directly required: python-dateutil>=2.1 in /usr/local/anaconda/lib/python3.6/site-packages (from matplotlib>=1.4.3->seaborn) (2.7.3) Requirement not upgraded as not directly required: setuptools in /usr/local/anaconda/lib/python3.6/site-packages (from kiwisolver>=1.0.1->matplotlib>=1.4.3->seaborn) (39.2.0) pyspark 2.4.0 requires py4j==0.10.7, which is not installed. You are using pip version 10.0.1, however version 19.0.3 is available. You should consider upgrading via the 'pip install --upgrade pip' command.
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
%matplotlib inline
sns.catplot(x='status', y='count', data=status_freq_pd_df,
kind='bar', order=status_freq_pd_df['status'])
<seaborn.axisgrid.FacetGrid at 0x7fb13c15ea90>
log_freq_df = status_freq_df.withColumn('log(count)', F.log(status_freq_df['count']))
log_freq_df.show()
+------+-------+------------------+ |status| count| log(count)| +------+-------+------------------+ | 200|3100524|14.947081687429097| | 302| 73070|11.199173164785263| | 304| 266773|12.494153388502301| | 400| 15| 2.70805020110221| | 403| 225| 5.41610040220442| | 404| 20899| 9.947456589918252| | 500| 65| 4.174387269895637| | 501| 41| 3.713572066704308| +------+-------+------------------+
log_freq_pd_df = (log_freq_df
.toPandas()
.sort_values(by=['log(count)'],
ascending=False))
sns.catplot(x='status', y='log(count)', data=log_freq_pd_df,
kind='bar', order=status_freq_pd_df['status'])
<seaborn.axisgrid.FacetGrid at 0x7fb137dfb4e0>
Let's look at hosts that have accessed the server frequently. We will try to get the count of total accesses by each host
and then sort by the counts and display only the top ten most frequent hosts.
host_sum_df =(logs_df
.groupBy('host')
.count()
.sort('count', ascending=False).limit(10))
host_sum_df.show(truncate=False)
+--------------------+-----+ |host |count| +--------------------+-----+ |piweba3y.prodigy.com|21988| |piweba4y.prodigy.com|16437| |piweba1y.prodigy.com|12825| |edams.ksc.nasa.gov |11964| |163.206.89.4 |9697 | |news.ti.com |8161 | |www-d1.proxy.aol.com|8047 | |alyssa.prodigy.com |8037 | | |7660 | |siltb10.orl.mmc.com |7573 | +--------------------+-----+
host_sum_pd_df = host_sum_df.toPandas()
host_sum_pd_df.iloc[8]['host']
''
Looks like we have some empty strings as one of the top host names! This teaches us a valuable lesson to not just check for nulls but also potentially empty strings when data wrangling.
Now, let's visualize the number of hits to endpoints (URIs) in the log. To perform this task, we start with our logs_df
and group by the endpoint
column, aggregate by count, and sort in descending order like the previous question.
paths_df = (logs_df
.groupBy('endpoint')
.count()
.sort('count', ascending=False).limit(20))
paths_pd_df = paths_df.toPandas()
paths_pd_df
endpoint | count | |
---|---|---|
0 | /images/NASA-logosmall.gif | 208714 |
1 | /images/KSC-logosmall.gif | 164970 |
2 | /images/MOSAIC-logosmall.gif | 127908 |
3 | /images/USA-logosmall.gif | 127074 |
4 | /images/WORLD-logosmall.gif | 125925 |
5 | /images/ksclogo-medium.gif | 121572 |
6 | /ksc.html | 83909 |
7 | /images/launch-logo.gif | 76006 |
8 | /history/apollo/images/apollo-logo1.gif | 68896 |
9 | /shuttle/countdown/ | 64736 |
10 | / | 63171 |
11 | /images/ksclogosmall.gif | 61393 |
12 | /shuttle/missions/missions.html | 47315 |
13 | /images/launchmedium.gif | 40687 |
14 | /htbin/cdt_main.pl | 39871 |
15 | /shuttle/missions/sts-69/mission-sts-69.html | 31574 |
16 | /shuttle/countdown/liftoff.html | 29865 |
17 | /icons/menu.xbm | 29190 |
18 | /shuttle/missions/sts-69/sts-69-patch-small.gif | 29118 |
19 | /icons/blank.xbm | 28852 |
What are the top ten endpoints requested which did not have return code 200 (HTTP Status OK)?
We create a sorted list containing the endpoints and the number of times that they were accessed with a non-200 return code and show the top ten.
not200_df = (logs_df
.filter(logs_df['status'] != 200))
error_endpoints_freq_df = (not200_df
.groupBy('endpoint')
.count()
.sort('count', ascending=False)
.limit(10)
)
error_endpoints_freq_df.show(truncate=False)
+---------------------------------------+-----+ |endpoint |count| +---------------------------------------+-----+ |/images/NASA-logosmall.gif |40082| |/images/KSC-logosmall.gif |23763| |/images/MOSAIC-logosmall.gif |15245| |/images/USA-logosmall.gif |15142| |/images/WORLD-logosmall.gif |14773| |/images/ksclogo-medium.gif |13559| |/images/launch-logo.gif |8806 | |/history/apollo/images/apollo-logo1.gif|7489 | |/ |6296 | |/images/ksclogosmall.gif |5669 | +---------------------------------------+-----+
What were the total number of unique hosts who visited the NASA website in these two months? We can find this out with a few transformations.
unique_host_count = (logs_df
.select('host')
.distinct()
.count())
unique_host_count
137933
For an advanced example, let's look at a way to determine the number of unique hosts in the entire log on a day-by-day basis. This computation will give us counts of the number of unique daily hosts.
We'd like a DataFrame sorted by increasing day of the month which includes the day of the month and the associated number of unique hosts for that day.
Think about the steps that you need to perform to count the number of different hosts that make requests each day.
Since the log only covers a single month, you can ignore the month. You may want to use the dayofmonth
function in the pyspark.sql.functions
module (which we have already imported as F
.
host_day_df
A DataFrame with two columns
column | explanation |
---|---|
host |
the host name |
day |
the day of the month |
There will be one row in this DataFrame for each row in logs_df
. Essentially, we are just transforming each row of logs_df
. For example, for this row in logs_df
:
unicomp6.unicomp.net - - [01/Aug/1995:00:35:41 -0400] "GET /shuttle/missions/sts-73/news HTTP/1.0" 302 -
your host_day_df
should have:
unicomp6.unicomp.net 1
host_day_df = logs_df.select(logs_df.host,
F.dayofmonth('time').alias('day'))
host_day_df.show(5, truncate=False)
+--------------------+---+ |host |day| +--------------------+---+ |199.72.81.55 |1 | |unicomp6.unicomp.net|1 | |199.120.110.21 |1 | |burger.letters.com |1 | |199.120.110.21 |1 | +--------------------+---+ only showing top 5 rows
host_day_distinct_df
This DataFrame has the same columns as host_day_df
, but with duplicate (day
, host
) rows removed.
host_day_distinct_df = (host_day_df
.dropDuplicates())
host_day_distinct_df.show(5, truncate=False)
+-----------------------+---+ |host |day| +-----------------------+---+ |129.94.144.152 |1 | |slip1.yab.com |1 | |205.184.190.47 |1 | |204.120.34.71 |1 | |ppp3_130.bekkoame.or.jp|1 | +-----------------------+---+ only showing top 5 rows
daily_unique_hosts_df
A DataFrame with two columns:
column | explanation |
---|---|
day |
the day of the month |
count |
the number of unique requesting hosts for that day |
def_mr = pd.get_option('max_rows')
pd.set_option('max_rows', 10)
daily_hosts_df = (host_day_distinct_df
.groupBy('day')
.count()
.sort("day"))
daily_hosts_df = daily_hosts_df.toPandas()
daily_hosts_df
day | count | |
---|---|---|
0 | 1 | 7609 |
1 | 2 | 4858 |
2 | 3 | 10238 |
3 | 4 | 9411 |
4 | 5 | 9640 |
... | ... | ... |
26 | 27 | 6846 |
27 | 28 | 6090 |
28 | 29 | 4825 |
29 | 30 | 5265 |
30 | 31 | 5913 |
31 rows × 2 columns
c = sns.catplot(x='day', y='count',
data=daily_hosts_df,
kind='point', height=5,
aspect=1.5)
In the previous example, we looked at a way to determine the number of unique hosts in the entire log on a day-by-day basis. Let's now try and find the average number of requests being made per Host to the NASA website per day based on our logs.
We'd like a DataFrame sorted by increasing day of the month which includes the day of the month and the associated number of average requests made for that day per Host.
daily_hosts_df = (host_day_distinct_df
.groupBy('day')
.count()
.select(col("day"),
col("count").alias("total_hosts")))
total_daily_reqests_df = (logs_df
.select(F.dayofmonth("time")
.alias("day"))
.groupBy("day")
.count()
.select(col("day"),
col("count").alias("total_reqs")))
avg_daily_reqests_per_host_df = total_daily_reqests_df.join(daily_hosts_df, 'day')
avg_daily_reqests_per_host_df = (avg_daily_reqests_per_host_df
.withColumn('avg_reqs', col('total_reqs') / col('total_hosts'))
.sort("day"))
avg_daily_reqests_per_host_df = avg_daily_reqests_per_host_df.toPandas()
avg_daily_reqests_per_host_df
day | total_reqs | total_hosts | avg_reqs | |
---|---|---|---|---|
0 | 1 | 98710 | 7609 | 12.972795 |
1 | 2 | 60265 | 4858 | 12.405311 |
2 | 3 | 130972 | 10238 | 12.792733 |
3 | 4 | 130009 | 9411 | 13.814579 |
4 | 5 | 126468 | 9640 | 13.119087 |
... | ... | ... | ... | ... |
26 | 27 | 94503 | 6846 | 13.804119 |
27 | 28 | 82617 | 6090 | 13.566010 |
28 | 29 | 67988 | 4825 | 14.090777 |
29 | 30 | 80641 | 5265 | 15.316429 |
30 | 31 | 90125 | 5913 | 15.241840 |
31 rows × 4 columns
c = sns.catplot(x='day', y='avg_reqs',
data=avg_daily_reqests_per_host_df,
kind='point', height=5, aspect=1.5)
Create a DataFrame containing only log records with a 404 status code (Not Found).
We make sure to cache()
the not_found_df
dataframe as we will use it in the rest of the examples here.
How many 404 records are in the log?
not_found_df = logs_df.filter(logs_df["status"] == 404).cache()
print(('Total 404 responses: {}').format(not_found_df.count()))
Total 404 responses: 20899
Using the DataFrame containing only log records with a 404 response code that we cached earlier, we will now print out a list of the top twenty endpoints that generate the most 404 errors.
Remember, top endpoints should be in sorted order
endpoints_404_count_df = (not_found_df
.groupBy("endpoint")
.count()
.sort("count", ascending=False)
.limit(20))
endpoints_404_count_df.show(truncate=False)
+-----------------------------------------------------------------+-----+ |endpoint |count| +-----------------------------------------------------------------+-----+ |/pub/winvn/readme.txt |2004 | |/pub/winvn/release.txt |1732 | |/shuttle/missions/STS-69/mission-STS-69.html |683 | |/shuttle/missions/sts-68/ksc-upclose.gif |428 | |/history/apollo/a-001/a-001-patch-small.gif |384 | |/history/apollo/sa-1/sa-1-patch-small.gif |383 | |/://spacelink.msfc.nasa.gov |381 | |/images/crawlerway-logo.gif |374 | |/elv/DELTA/uncons.htm |372 | |/history/apollo/pad-abort-test-1/pad-abort-test-1-patch-small.gif|359 | |/images/nasa-logo.gif |319 | |/shuttle/resources/orbiters/atlantis.gif |314 | |/history/apollo/apollo-13.html |304 | |/shuttle/resources/orbiters/discovery.gif |263 | |/shuttle/missions/sts-71/images/KSC-95EC-0916.txt |190 | |/shuttle/resources/orbiters/challenger.gif |170 | |/shuttle/missions/technology/sts-newsref/stsref-toc.html |158 | |/history/apollo/images/little-joe.jpg |150 | |/images/lf-logo.gif |143 | |/history/apollo/publications/sp-350/sp-350.txt~ |140 | +-----------------------------------------------------------------+-----+
Using the DataFrame containing only log records with a 404 response code that we cached earlier, we will now print out a list of the top twenty hosts that generate the most 404 errors.
Remember, top hosts should be in sorted order
hosts_404_count_df = (not_found_df
.groupBy("host")
.count()
.sort("count", ascending=False)
.limit(20))
hosts_404_count_df.show(truncate=False)
+---------------------------+-----+ |host |count| +---------------------------+-----+ |hoohoo.ncsa.uiuc.edu |251 | |piweba3y.prodigy.com |157 | |jbiagioni.npt.nuwc.navy.mil|132 | |piweba1y.prodigy.com |114 | | |112 | |www-d4.proxy.aol.com |91 | |piweba4y.prodigy.com |86 | |scooter.pa-x.dec.com |69 | |www-d1.proxy.aol.com |64 | |phaelon.ksc.nasa.gov |64 | |dialip-217.den.mmc.com |62 | |www-b4.proxy.aol.com |62 | |www-b3.proxy.aol.com |61 | |www-a2.proxy.aol.com |60 | |www-d2.proxy.aol.com |59 | |piweba2y.prodigy.com |59 | |alyssa.prodigy.com |56 | |monarch.eng.buffalo.edu |56 | |www-b2.proxy.aol.com |53 | |www-c4.proxy.aol.com |53 | +---------------------------+-----+
Let's explore our 404 records temporally (by time) now. Similar to the example showing the number of unique daily hosts, we will break down the 404 requests by day and get the daily counts sorted by day in errors_by_date_sorted_df
.
errors_by_date_sorted_df = (not_found_df
.groupBy(F.dayofmonth('time').alias('day'))
.count()
.sort("day"))
errors_by_date_sorted_pd_df = errors_by_date_sorted_df.toPandas()
errors_by_date_sorted_pd_df
day | count | |
---|---|---|
0 | 1 | 559 |
1 | 2 | 291 |
2 | 3 | 778 |
3 | 4 | 705 |
4 | 5 | 733 |
... | ... | ... |
26 | 27 | 706 |
27 | 28 | 504 |
28 | 29 | 420 |
29 | 30 | 571 |
30 | 31 | 526 |
31 rows × 2 columns
c = sns.catplot(x='day', y='count',
data=errors_by_date_sorted_pd_df,
kind='point', height=5, aspect=1.5)
What are the top three days of the month having the most 404 errors, we can leverage our previously created errors_by_date_sorted_df
for this.
(errors_by_date_sorted_df
.sort("count", ascending=False)
.show(3))
+---+-----+ |day|count| +---+-----+ | 7| 1107| | 6| 1013| | 25| 876| +---+-----+ only showing top 3 rows
Using the DataFrame not_found_df
we cached earlier, we will now group and sort by hour of the day in increasing order, to create a DataFrame containing the total number of 404 responses for HTTP requests for each hour of the day (midnight starts at 0)
hourly_avg_errors_sorted_df = (not_found_df
.groupBy(F.hour('time')
.alias('hour'))
.count()
.sort('hour'))
hourly_avg_errors_sorted_pd_df = hourly_avg_errors_sorted_df.toPandas()
c = sns.catplot(x='hour', y='count',
data=hourly_avg_errors_sorted_pd_df,
kind='bar', height=5, aspect=1.5)
pd.set_option('max_rows', def_mr)