#!/usr/bin/env python
# coding: utf-8

# # Scalable Web Server Log Analytics with Apache Spark
#
# Apache Spark is an excellent framework for wrangling, analyzing and modeling structured and unstructured data at scale! In this tutorial, we will focus on one of the most popular industry case studies: log analytics.
#
# Server logs are a very common data source in enterprises and often contain a gold mine of actionable insights and information. Log data comes from many sources, such as web, client and compute servers, applications, user-generated content, and flat files. It can be used for monitoring servers, improving business and customer intelligence, building recommendation systems, detecting fraud, and much more.
#
# Spark allows you to store your logs in files on disk cheaply, while still providing rich APIs to perform data analysis at scale. This hands-on tutorial shows you how to use Apache Spark on real-world production logs from NASA, covering data wrangling and basic yet powerful techniques for exploratory data analysis.

# # Part 1 - Setting up Dependencies
#
# If you are running this in an environment where a Spark session already exists (for example, Databricks or a `pyspark` shell), the first two cells simply display the existing `spark` and `sqlContext` objects; otherwise, they are created in the third cell.

# In[1]:

spark

# In[2]:

sqlContext

# In[3]:

from pyspark.context import SparkContext
from pyspark.sql.context import SQLContext
from pyspark.sql.session import SparkSession

sc = SparkContext()
sqlContext = SQLContext(sc)
spark = SparkSession(sc)

# In[4]:

import re
import pandas as pd

# ## Basic Regular Expressions

# In[5]:

m = re.finditer(r'.*?(spark).*?', "I'm searching for a spark in PySpark", re.I)
for match in m:
    print(match, match.start(), match.end())

# In this case study, we will analyze log datasets from the NASA Kennedy Space Center web server in Florida. The full data set is freely available for download [__here__](http://ita.ee.lbl.gov/html/contrib/NASA-HTTP.html).
#
# These two datasets contain two months' worth of all HTTP requests to the NASA Kennedy Space Center WWW server in Florida. You can head over to the [__website__](http://ita.ee.lbl.gov/html/contrib/NASA-HTTP.html) and download the following files as needed.
#
# - Jul 01 to Jul 31, ASCII format, 20.7 MB gzip compressed, 205.2 MB uncompressed: [ftp://ita.ee.lbl.gov/traces/NASA_access_log_Jul95.gz](ftp://ita.ee.lbl.gov/traces/NASA_access_log_Jul95.gz)
# - Aug 04 to Aug 31, ASCII format, 21.8 MB gzip compressed, 167.8 MB uncompressed: [ftp://ita.ee.lbl.gov/traces/NASA_access_log_Aug95.gz](ftp://ita.ee.lbl.gov/traces/NASA_access_log_Aug95.gz)
#
# Make sure both files are in the same directory as this notebook.

# # Part 2 - Loading and Viewing the NASA Log Dataset
#
# Given that our data files are stored in the directory mentioned above, let's load them into a DataFrame. We'll do this in steps. First, we'll use `sqlContext.read.text()` or `spark.read.text()` to read the text files. This produces a DataFrame with a single string column called `value`.

# In[6]:

import glob

raw_data_files = glob.glob('*.gz')
raw_data_files

# ### Taking a look at the metadata of our dataframe

# In[7]:

base_df = spark.read.text(raw_data_files)
base_df.printSchema()

# In[8]:

type(base_df)

# You can also convert a dataframe to an RDD if needed.

# In[9]:

base_df_rdd = base_df.rdd
type(base_df_rdd)

# ### Viewing sample data in our dataframe
#
# Looks like it needs to be wrangled and parsed!

# In[10]:

base_df.show(10, truncate=False)
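# Since we globbed multiple compressed files into a single DataFrame, a quick sanity check is to count how many raw log lines came from each source file. This is a small sketch using the built-in `input_file_name()` function; the exact counts depend on which files you downloaded.

from pyspark.sql.functions import input_file_name

# Count the number of raw log lines contributed by each input file
(base_df
 .withColumn('source_file', input_file_name())
 .groupBy('source_file')
 .count()
 .show(truncate=False))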
# Getting data from an RDD is slightly different. You can see how the data representation differs in the following RDD.

# In[11]:

base_df_rdd.take(10)

# # Part 3 - Data Wrangling
#
# In this section, we will clean and parse our log dataset to extract structured attributes with meaningful information from each log message.
#
# ### Data understanding
#
# If you're familiar with web server logs, you'll recognize that the data displayed above is in [Common Log Format](https://www.w3.org/Daemon/User/Config/Logging.html#common-logfile-format).
#
# The fields are:
#
# __`remotehost rfc931 authuser [date] "request" status bytes`__
#
# | field         | meaning                                                                  |
# | ------------- | ------------------------------------------------------------------------ |
# | _remotehost_  | Remote hostname (or IP number if the DNS hostname is not available or if [DNSLookup](https://www.w3.org/Daemon/User/Config/General.html#DNSLookup) is off). |
# | _rfc931_      | The remote logname of the user, if present.                              |
# | _authuser_    | The username of the remote user after authentication by the HTTP server. |
# | _[date]_      | Date and time of the request.                                            |
# | _"request"_   | The request, exactly as it came from the browser or client.              |
# | _status_      | The [HTTP status code](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes) the server sent back to the client. |
# | _bytes_       | The number of bytes (`Content-Length`) transferred to the client.        |
#
# We will need some specific techniques to parse, match and extract these attributes from the log data.

# ## Data Parsing and Extraction with Regular Expressions
#
# Next, we have to parse the log data into individual columns. We'll use the built-in [regexp_extract()](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.regexp_extract) function to do the parsing. This function matches a column against a regular expression with one or more [capture groups](http://regexone.com/lesson/capturing_groups) and allows you to extract one of the matched groups. We'll use one regular expression for each field we wish to extract.
#
# You must have heard of or used a fair bit of regular expressions by now. If you find regular expressions confusing (and they certainly _can_ be) and you want to learn more about them, we recommend checking out the [RegexOne web site](http://regexone.com/). You might also find [_Regular Expressions Cookbook_](http://shop.oreilly.com/product/0636920023630.do), by Goyvaerts and Levithan, to be useful as a reference.
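# Before applying `regexp_extract()` to the real log lines, here is a minimal sketch of how it behaves on a toy DataFrame. The sample string and pattern are made up purely for illustration; the function returns the requested capture group, or an empty string when the pattern does not match.

from pyspark.sql.functions import regexp_extract

demo_df = spark.createDataFrame([("GET /index.html HTTP/1.0",)], ['request'])
# Capture group 2 is the URI portion of the request string
(demo_df
 .select(regexp_extract('request', r'^(\S+)\s(\S+)\s(\S+)$', 2).alias('uri'))
 .show(truncate=False))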
# #### Let's take a look at our dataset dimensions

# In[12]:

print((base_df.count(), len(base_df.columns)))

# Let's extract and take a look at some sample log messages.

# In[13]:

sample_logs = [item['value'] for item in base_df.take(15)]
sample_logs

# ### Extracting host names
#
# Let's write some regular expressions to extract the host name from the logs.

# In[14]:

host_pattern = r'(^\S+\.[\S+\.]+\S+)\s'
hosts = [re.search(host_pattern, item).group(1)
           if re.search(host_pattern, item)
           else 'no match'
           for item in sample_logs]
hosts

# ### Extracting timestamps
#
# Let's now use regular expressions to extract the timestamp fields from the logs.

# In[15]:

ts_pattern = r'\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} -\d{4})]'
timestamps = [re.search(ts_pattern, item).group(1) for item in sample_logs]
timestamps

# ### Extracting HTTP Request Method, URIs and Protocol
#
# Let's now use regular expressions to extract the HTTP request method, URI and protocol fields from the logs.

# In[16]:

method_uri_protocol_pattern = r'\"(\S+)\s(\S+)\s*(\S*)\"'
method_uri_protocol = [re.search(method_uri_protocol_pattern, item).groups()
                         if re.search(method_uri_protocol_pattern, item)
                         else 'no match'
                         for item in sample_logs]
method_uri_protocol

# ### Extracting HTTP Status Codes
#
# Let's now use regular expressions to extract the HTTP status codes from the logs.

# In[18]:

status_pattern = r'\s(\d{3})\s'
status = [re.search(status_pattern, item).group(1) for item in sample_logs]
print(status)

# ### Extracting HTTP Response Content Size
#
# Let's now use regular expressions to extract the HTTP response content size from the logs.

# In[19]:

content_size_pattern = r'\s(\d+)$'
content_size = [re.search(content_size_pattern, item).group(1) for item in sample_logs]
print(content_size)

# ## Putting it all together
#
# Let's now leverage all the regular expression patterns we previously built and use the `regexp_extract(...)` method to build our dataframe with all the log attributes neatly extracted into their own separate columns.

# In[20]:

from pyspark.sql.functions import regexp_extract

logs_df = base_df.select(regexp_extract('value', host_pattern, 1).alias('host'),
                         regexp_extract('value', ts_pattern, 1).alias('timestamp'),
                         regexp_extract('value', method_uri_protocol_pattern, 1).alias('method'),
                         regexp_extract('value', method_uri_protocol_pattern, 2).alias('endpoint'),
                         regexp_extract('value', method_uri_protocol_pattern, 3).alias('protocol'),
                         regexp_extract('value', status_pattern, 1).cast('integer').alias('status'),
                         regexp_extract('value', content_size_pattern, 1).cast('integer').alias('content_size'))
logs_df.show(10, truncate=True)
print((logs_df.count(), len(logs_df.columns)))

# ## Finding Missing Values
#
# Missing and null values are the bane of data analysis and machine learning. Let's see how well our data parsing and extraction logic worked. First, let's verify that there are no null rows in the original dataframe.

# In[21]:

(base_df
    .filter(base_df['value']
                .isNull())
    .count())

# If our data parsing and extraction worked properly, we should not have any rows with potential null values. Let's put that to the test!

# In[22]:

bad_rows_df = logs_df.filter(logs_df['host'].isNull() |
                             logs_df['timestamp'].isNull() |
                             logs_df['method'].isNull() |
                             logs_df['endpoint'].isNull() |
                             logs_df['status'].isNull() |
                             logs_df['content_size'].isNull() |
                             logs_df['protocol'].isNull())
bad_rows_df.count()

# Ouch! Looks like we have over 33K rows with missing values in our data! Can we handle this?
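# Before working out which columns are responsible, it can help to eyeball a few of these problematic rows. This is just a quick sketch using the `bad_rows_df` we built above.

# Show a handful of parsed rows that contain at least one null column
bad_rows_df.show(5, truncate=False)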
# Do remember, this is not a regular pandas dataframe that you can directly query to see which columns contain nulls. Our so-called _big dataset_ resides on disk and can potentially be spread across multiple nodes in a Spark cluster. So how do we find out which columns have potential nulls?
#
# ### Finding Null Counts
#
# We can typically use the following technique to find out which columns have null values.
#
# (__Note:__ This approach is adapted from an [excellent answer](http://stackoverflow.com/a/33901312) on StackOverflow.)

# In[23]:

logs_df.columns

# In[24]:

from pyspark.sql.functions import col
from pyspark.sql.functions import sum as spark_sum

def count_null(col_name):
    return spark_sum(col(col_name).isNull().cast('integer')).alias(col_name)

# Build up a list of column expressions, one per column.
exprs = [count_null(col_name) for col_name in logs_df.columns]

# Run the aggregation. The *exprs converts the list of expressions into
# variable function arguments.
logs_df.agg(*exprs).show()

# Well, it looks like we have one missing value in the `status` column and everything else is in the `content_size` column. Let's see if we can figure out what's wrong!
#
# ### Handling nulls in HTTP status
#
# Our original parsing regular expression for the `status` column was:
#
# ```
# regexp_extract('value', r'\s(\d{3})\s', 1).cast('integer').alias('status')
# ```
#
# Could it be that there are more digits, making our regular expression wrong? Or is the data point itself bad? Let's find out!
#
# **Note**: In the expression below, `~` means "not".

# In[25]:

null_status_df = base_df.filter(~base_df['value'].rlike(r'\s(\d{3})\s'))
null_status_df.count()

# In[26]:

null_status_df.show(truncate=False)

# In[27]:

bad_status_df = null_status_df.select(regexp_extract('value', host_pattern, 1).alias('host'),
                                      regexp_extract('value', ts_pattern, 1).alias('timestamp'),
                                      regexp_extract('value', method_uri_protocol_pattern, 1).alias('method'),
                                      regexp_extract('value', method_uri_protocol_pattern, 2).alias('endpoint'),
                                      regexp_extract('value', method_uri_protocol_pattern, 3).alias('protocol'),
                                      regexp_extract('value', status_pattern, 1).cast('integer').alias('status'),
                                      regexp_extract('value', content_size_pattern, 1).cast('integer').alias('content_size'))
bad_status_df.show(truncate=False)

# The record is incomplete, with no useful information, so the best option is to drop it as follows.

# In[28]:

logs_df.count()

# In[29]:

logs_df = logs_df[logs_df['status'].isNotNull()]
logs_df.count()

# In[30]:

exprs = [count_null(col_name) for col_name in logs_df.columns]
logs_df.agg(*exprs).show()

# ### Handling nulls in HTTP content size
#
# Our original parsing regular expression for the `content_size` column was:
#
# ```
# regexp_extract('value', r'\s(\d+)$', 1).cast('integer').alias('content_size')
# ```
#
# Could there be missing data in the original dataset itself? Let's find out!
#
# ### Find the records in our base data frame with potentially missing content sizes

# In[31]:

null_content_size_df = base_df.filter(~base_df['value'].rlike(r'\s\d+$'))
null_content_size_df.count()

# ### Display the top ten records of the data frame having missing content sizes

# In[32]:

null_content_size_df.take(10)

# It is quite evident that these raw records correspond to error responses, where no content was sent back and the server emitted a "`-`" for the `content_size` field.
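# As a quick check on that hypothesis, we can count how many of these records do not end with a "-". This is only a sketch: a result of 0 would mean every row with a missing content size carries the "-" placeholder, while a small nonzero count points at the handful of genuinely malformed lines we saw earlier.

# Count rows with a missing content size that do NOT end in "-"
null_content_size_df.filter(~null_content_size_df['value'].rlike(r'-$')).count()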
# Since we don't want to discard those rows from our analysis, let's impute them by filling the missing values with 0.
#
# ### Fix the rows with null content_size
#
# The easiest solution is to replace the null values in `logs_df` with 0, as we discussed earlier. The Spark DataFrame API provides a set of functions and fields specifically designed for working with null values, among them:
#
# * [fillna()](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.fillna), which fills null values with specified non-null values.
# * [na](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.na), which returns a [DataFrameNaFunctions](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameNaFunctions) object with many functions for operating on null columns.
#
# There are several ways to invoke this function. The easiest is to replace _all_ null columns with known values. But, for safety, it's better to pass a Python dictionary containing (column_name, value) mappings. That's what we'll do. A sample from the documentation is shown below.
#
# ```
# >>> df4.na.fill({'age': 50, 'name': 'unknown'}).show()
# +---+------+-------+
# |age|height|   name|
# +---+------+-------+
# | 10|    80|  Alice|
# |  5|  null|    Bob|
# | 50|  null|    Tom|
# | 50|  null|unknown|
# +---+------+-------+
# ```
#
# Now we use this function to fill all the missing values in the `content_size` field with 0!

# In[33]:

logs_df = logs_df.na.fill({'content_size': 0})

# Assuming everything we have done so far worked, we should now have no missing values/nulls in our dataset. Let's verify this!

# In[34]:

exprs = [count_null(col_name) for col_name in logs_df.columns]
logs_df.agg(*exprs).show()

# Look at that, no missing values!

# ## Handling Temporal Fields (Timestamp)
#
# Now that we have a clean, parsed DataFrame, we have to parse the timestamp field into an actual timestamp. The Common Log Format time is somewhat non-standard. A User-Defined Function (UDF) is the most straightforward way to parse it.

# In[35]:

from pyspark.sql.functions import udf

month_map = {
  'Jan': 1, 'Feb': 2, 'Mar': 3, 'Apr': 4, 'May': 5, 'Jun': 6, 'Jul': 7,
  'Aug': 8, 'Sep': 9, 'Oct': 10, 'Nov': 11, 'Dec': 12
}

def parse_clf_time(text):
    """ Convert Common Log time format into a standard timestamp string

    Args:
        text (str): date and time in Apache time format [dd/mmm/yyyy:hh:mm:ss (+/-)zzzz]
    Returns:
        a string suitable for passing to CAST('timestamp')
    """
    # NOTE: We're ignoring the time zones here; they might need to be handled
    # depending on the problem you are solving
    return "{0:04d}-{1:02d}-{2:02d} {3:02d}:{4:02d}:{5:02d}".format(
      int(text[7:11]),
      month_map[text[3:6]],
      int(text[0:2]),
      int(text[12:14]),
      int(text[15:17]),
      int(text[18:20])
    )

# In[36]:

sample_ts = [item['timestamp'] for item in logs_df.select('timestamp').take(5)]
sample_ts

# In[37]:

[parse_clf_time(item) for item in sample_ts]

# In[38]:

udf_parse_time = udf(parse_clf_time)

logs_df = logs_df.select('*', udf_parse_time(logs_df['timestamp']).cast('timestamp').alias('time')).drop('timestamp')
logs_df.show(10, truncate=True)

# In[39]:

logs_df.printSchema()

# In[40]:

logs_df.limit(5).toPandas()
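# As an aside, recent Spark versions can often parse this format without a UDF by using `to_timestamp` with a datetime pattern. This is only a sketch of that alternative, not what we use in the rest of this notebook; unlike our UDF, it respects the zone offset, and depending on your Spark version you may need to set `spark.sql.legacy.timeParserPolicy` to `LEGACY` for the textual month (`MMM`) to parse correctly.

from pyspark.sql.functions import to_timestamp

# Parse '01/Jul/1995:00:00:01 -0400' style strings directly on a toy DataFrame
demo_time_df = spark.createDataFrame([('01/Jul/1995:00:00:01 -0400',)], ['timestamp'])
demo_time_df.select(to_timestamp('timestamp', 'dd/MMM/yyyy:HH:mm:ss Z').alias('time')).show(truncate=False)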
# Let's now cache `logs_df`, since we will be using it extensively in the data analysis section that follows!

# In[41]:

logs_df.cache()

# # Part 4 - Data Analysis on our Web Logs
#
# Now that we have a DataFrame containing the parsed and cleaned log data, we can perform some interesting exploratory data analysis (EDA).
#
# ## Content Size Statistics
#
# Let's compute some statistics about the sizes of content being returned by the web server. In particular, we'd like to know the average, minimum, and maximum content sizes.
#
# We can compute the statistics by calling `.describe()` on the `content_size` column of `logs_df`. The `.describe()` function returns the count, mean, stddev, min, and max of a given column.

# In[42]:

content_size_summary_df = logs_df.describe(['content_size'])
content_size_summary_df.toPandas()

# Alternatively, we can use Spark SQL functions to calculate these statistics directly. You can explore many useful functions within the `pyspark.sql.functions` module in the [documentation](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions).
#
# After we apply the `.agg()` function, we call `toPandas()` to extract and convert the result into a `pandas` dataframe, which displays more nicely in Jupyter notebooks.

# In[43]:

from pyspark.sql import functions as F

(logs_df.agg(F.min(logs_df['content_size']).alias('min_content_size'),
             F.max(logs_df['content_size']).alias('max_content_size'),
             F.mean(logs_df['content_size']).alias('mean_content_size'),
             F.stddev(logs_df['content_size']).alias('std_content_size'),
             F.count(logs_df['content_size']).alias('count_content_size'))
        .toPandas())

# ## HTTP Status Code Analysis
#
# Next, let's look at the status code values that appear in the log. We want to know which status code values appear in the data and how many times.
#
# We again start with `logs_df`, then group by the `status` column, apply the `.count()` aggregation function, and sort by the `status` column.

# In[44]:

status_freq_df = (logs_df
                     .groupBy('status')
                     .count()
                     .sort('status')
                     .cache())

# In[45]:

print('Total distinct HTTP Status Codes:', status_freq_df.count())

# In[46]:

status_freq_pd_df = (status_freq_df
                         .toPandas()
                         .sort_values(by=['count'], ascending=False))
status_freq_pd_df

# In[47]:

get_ipython().system('pip install -U seaborn')

# In[48]:

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

get_ipython().run_line_magic('matplotlib', 'inline')

sns.catplot(x='status', y='count', data=status_freq_pd_df,
            kind='bar', order=status_freq_pd_df['status'])

# In[49]:

log_freq_df = status_freq_df.withColumn('log(count)', F.log(status_freq_df['count']))
log_freq_df.show()

# In[50]:

log_freq_pd_df = (log_freq_df
                      .toPandas()
                      .sort_values(by=['log(count)'], ascending=False))

sns.catplot(x='status', y='log(count)', data=log_freq_pd_df,
            kind='bar', order=status_freq_pd_df['status'])

# ## Analyzing Frequent Hosts
#
# Let's look at hosts that have accessed the server frequently. We will get the total access count for each `host`, sort by these counts, and display only the top ten most frequent hosts.

# In[51]:

host_sum_df = (logs_df
                  .groupBy('host')
                  .count()
                  .sort('count', ascending=False).limit(10))

host_sum_df.show(truncate=False)

# In[52]:

host_sum_pd_df = host_sum_df.toPandas()
host_sum_pd_df.iloc[8]['host']

# Looks like we have an empty string as one of the top host names! This teaches us a valuable lesson: when wrangling data, check not just for nulls but also for potentially empty strings.
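# As a quick sketch, we can re-run the same aggregation while filtering out the empty host names to get a cleaner top-ten list.

# Top ten hosts, excluding records where the host could not be parsed
(logs_df
 .filter(logs_df['host'] != '')
 .groupBy('host')
 .count()
 .sort('count', ascending=False)
 .show(10, truncate=False))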
# ## Display the Top 20 Frequent EndPoints
#
# Now, let's visualize the number of hits to endpoints (URIs) in the log. To perform this task, we start with our `logs_df`, group by the `endpoint` column, aggregate by count, and sort in descending order as in the previous example.

# In[53]:

paths_df = (logs_df
               .groupBy('endpoint')
               .count()
               .sort('count', ascending=False).limit(20))

# In[54]:

paths_pd_df = paths_df.toPandas()
paths_pd_df

# ## Top Ten Error Endpoints
#
# What are the top ten requested endpoints that did not have return code 200 (HTTP Status OK)?
#
# We create a sorted list containing the endpoints and the number of times they were accessed with a non-200 return code, and show the top ten.

# In[55]:

not200_df = (logs_df
                .filter(logs_df['status'] != 200))

error_endpoints_freq_df = (not200_df
                               .groupBy('endpoint')
                               .count()
                               .sort('count', ascending=False)
                               .limit(10))

# In[56]:

error_endpoints_freq_df.show(truncate=False)

# ## Total number of Unique Hosts
#
# What was the total number of unique hosts that visited the NASA website in these two months? We can find this out with a few transformations.

# In[57]:

unique_host_count = (logs_df
                        .select('host')
                        .distinct()
                        .count())
unique_host_count

# ## Number of Unique Daily Hosts
#
# For a more advanced example, let's look at a way to determine the number of unique hosts in the entire log on a day-by-day basis. This computation will give us counts of the number of unique daily hosts.
#
# We'd like a DataFrame sorted by increasing day of the month which includes the day of the month and the associated number of unique hosts for that day.
#
# Think about the steps that you need to perform to count the number of different hosts that make requests *each* day.
# *Note that our logs cover two months, so grouping by day of the month will combine days from July and August that share the same day number.* You may want to use the [`dayofmonth` function](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.dayofmonth) in the `pyspark.sql.functions` module (which we have already imported as __`F`__).
#
# **`host_day_df`**
#
# A DataFrame with two columns:
#
# | column | explanation          |
# | ------ | -------------------- |
# | `host` | the host name        |
# | `day`  | the day of the month |
#
# There will be one row in this DataFrame for each row in `logs_df`. Essentially, we are just transforming each row of `logs_df`. For example, for this row in `logs_df`:
#
# ```
# unicomp6.unicomp.net - - [01/Aug/1995:00:35:41 -0400] "GET /shuttle/missions/sts-73/news HTTP/1.0" 302 -
# ```
#
# your `host_day_df` should have:
#
# ```
# unicomp6.unicomp.net 1
# ```

# In[58]:

host_day_df = logs_df.select(logs_df.host, F.dayofmonth('time').alias('day'))
host_day_df.show(5, truncate=False)
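# As an aside, the day-by-day unique host counts we are about to build up step by step could also be computed in a single aggregation with `F.countDistinct`. This is just a sketch of that shortcut; the step-by-step version below is more instructive.

# One-step alternative: distinct hosts per day of month
(logs_df
 .groupBy(F.dayofmonth('time').alias('day'))
 .agg(F.countDistinct('host').alias('count'))
 .sort('day')
 .show(5))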
# **`host_day_distinct_df`**
#
# This DataFrame has the same columns as `host_day_df`, but with duplicate (`day`, `host`) rows removed.

# In[59]:

host_day_distinct_df = (host_day_df
                           .dropDuplicates())
host_day_distinct_df.show(5, truncate=False)

# **`daily_unique_hosts_df`**
#
# A DataFrame with two columns:
#
# | column  | explanation                                        |
# | ------- | -------------------------------------------------- |
# | `day`   | the day of the month                               |
# | `count` | the number of unique requesting hosts for that day |

# In[60]:

def_mr = pd.get_option('display.max_rows')
pd.set_option('display.max_rows', 10)

daily_hosts_df = (host_day_distinct_df
                     .groupBy('day')
                     .count()
                     .sort("day"))

daily_hosts_df = daily_hosts_df.toPandas()
daily_hosts_df

# In[61]:

c = sns.catplot(x='day', y='count',
                data=daily_hosts_df,
                kind='point', height=5, aspect=1.5)

# ## Average Number of Daily Requests per Host
#
# In the previous example, we looked at a way to determine the number of unique hosts in the entire log on a day-by-day basis. Let's now find the average number of requests made per host to the NASA website each day, based on our logs.
#
# We'd like a DataFrame sorted by increasing day of the month which includes the day of the month and the associated average number of requests made per host for that day.

# In[62]:

daily_hosts_df = (host_day_distinct_df
                     .groupBy('day')
                     .count()
                     .select(col("day"),
                             col("count").alias("total_hosts")))

total_daily_requests_df = (logs_df
                              .select(F.dayofmonth("time").alias("day"))
                              .groupBy("day")
                              .count()
                              .select(col("day"),
                                      col("count").alias("total_reqs")))

avg_daily_requests_per_host_df = total_daily_requests_df.join(daily_hosts_df, 'day')
avg_daily_requests_per_host_df = (avg_daily_requests_per_host_df
                                     .withColumn('avg_reqs', col('total_reqs') / col('total_hosts'))
                                     .sort("day"))
avg_daily_requests_per_host_df = avg_daily_requests_per_host_df.toPandas()
avg_daily_requests_per_host_df

# In[63]:

c = sns.catplot(x='day', y='avg_reqs',
                data=avg_daily_requests_per_host_df,
                kind='point', height=5, aspect=1.5)

# ## Counting 404 Response Codes
#
# Create a DataFrame containing only log records with a 404 status code (Not Found).
#
# We make sure to `cache()` the `not_found_df` dataframe, as we will use it in the rest of the examples here.
#
# __How many 404 records are in the log?__

# In[64]:

not_found_df = logs_df.filter(logs_df["status"] == 404).cache()
print(('Total 404 responses: {}').format(not_found_df.count()))

# ## Listing the Top Twenty 404 Response Code Endpoints
#
# Using the DataFrame containing only log records with a 404 response code that we cached earlier, we will now print out a list of the top twenty endpoints that generate the most 404 errors.
#
# *Remember, top endpoints should be in sorted order.*

# In[65]:

endpoints_404_count_df = (not_found_df
                             .groupBy("endpoint")
                             .count()
                             .sort("count", ascending=False)
                             .limit(20))

endpoints_404_count_df.show(truncate=False)

# ## Listing the Top Twenty 404 Response Code Hosts
#
# Using the DataFrame containing only log records with a 404 response code that we cached earlier, we will now print out a list of the top twenty hosts that generate the most 404 errors.
#
# *Remember, top hosts should be in sorted order.*

# In[66]:

hosts_404_count_df = (not_found_df
                         .groupBy("host")
                         .count()
                         .sort("count", ascending=False)
                         .limit(20))

hosts_404_count_df.show(truncate=False)
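# To put these 404 counts in context, here is a quick sketch that computes the overall share of requests that resulted in a 404 response.

# Fraction of all (cleaned) requests that returned 404
total_requests = logs_df.count()
total_404 = not_found_df.count()
print('404 responses as a share of all requests: {:.2%}'.format(total_404 / total_requests))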
# ## Visualizing 404 Errors per Day
#
# Let's now explore our 404 records temporally (by time). Similar to the example showing the number of unique daily hosts, we will break down the 404 requests by day and get the daily counts sorted by day in `errors_by_date_sorted_df`.

# In[67]:

errors_by_date_sorted_df = (not_found_df
                               .groupBy(F.dayofmonth('time').alias('day'))
                               .count()
                               .sort("day"))

errors_by_date_sorted_pd_df = errors_by_date_sorted_df.toPandas()
errors_by_date_sorted_pd_df

# In[68]:

c = sns.catplot(x='day', y='count',
                data=errors_by_date_sorted_pd_df,
                kind='point', height=5, aspect=1.5)

# ## Top Three Days for 404 Errors
#
# What are the top three days of the month with the most 404 errors? We can leverage our previously created __`errors_by_date_sorted_df`__ for this.

# In[69]:

(errors_by_date_sorted_df
    .sort("count", ascending=False)
    .show(3))

# ## Visualizing Hourly 404 Errors
#
# Using the DataFrame `not_found_df` we cached earlier, we will now group and sort by hour of the day in increasing order, to create a DataFrame containing the total number of 404 responses for HTTP requests for each hour of the day (midnight starts at 0).

# In[70]:

hourly_avg_errors_sorted_df = (not_found_df
                                  .groupBy(F.hour('time').alias('hour'))
                                  .count()
                                  .sort('hour'))
hourly_avg_errors_sorted_pd_df = hourly_avg_errors_sorted_df.toPandas()

# In[71]:

c = sns.catplot(x='hour', y='count',
                data=hourly_avg_errors_sorted_pd_df,
                kind='bar', height=5, aspect=1.5)

# ### Reset the max rows displayed in pandas

# In[72]:

pd.set_option('display.max_rows', def_mr)
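# Finally, once the analysis is done, it is good practice to release the DataFrames we cached and, if you created the session yourself, stop Spark. This is an optional sketch; leave `spark.stop()` commented out if you are in a managed environment that owns the session.

# Release cached DataFrames and (optionally) shut down the Spark session
logs_df.unpersist()
status_freq_df.unpersist()
not_found_df.unpersist()
# spark.stop()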