#!/usr/bin/env python
# coding: utf-8

# # Case Study: Air Quality Dataset

#
# This notebook uses a dataset from the UC Irvine Machine Learning Repository. We will describe the dataset further as we explore it using *pandas*.
#
# ## Download the Dataset
#
# Please note that **you will need to download the dataset** from the UC Irvine Machine Learning Repository.
#
# You can find the data at http://archive.ics.uci.edu/ml/machine-learning-databases/00360/. Unzip the file at a filepath of your choice.
#
# Instructions for unzipping a file on Windows: https://support.microsoft.com/en-us/help/14200/windows-compress-uncompress-zip-files.
# On a Mac, simply double-click the file.

# ## Use Pandas to Read the Dataset
#

#
# In this notebook, we will be using a single CSV file:
# * **AirQualityUCI.csv**
#
# The dataset contains over 9000 measurements of air quality taken over the course of a year.
#
# The following measures are recorded:
# Date, Time, CO(GT), PT08.S1(CO), NMHC(GT), C6H6(GT), PT08.S2(NMHC), NOx(GT), PT08.S3(NOx), NO2(GT), PT08.S4(NO2), PT08.S5(O3), T, RH, AH
#
# Using the *read_csv* function in pandas, we will load this information into our code.

# In[25]:

# import pandas and load the data
import pandas as pd

filepath = './AirQualityUCI.csv'
airquality = pd.read_csv(filepath, sep=';')

# In[26]:

# Now that we have the dataset, let's start to get a feel for its layout
airquality.head(5)

# Our dataset is loaded and looks OK, but some cleaning still needs to be done. Notice how the last two columns are unnamed, and how the C6H6(GT) and T columns have values with commas. In this case, the commas are not delimiters but substitutes for decimal points. However, the following code block shows why pandas is really useful.

# In[27]:

airquality_comma = pd.read_csv(filepath, delimiter=';|,', engine='python', index_col=False)
airquality_comma.head(5)

# We've made our data a lot more usable in just one line of code. Some pointers:
# - the delimiter parameter in read_csv can take a **regular expression**, which here matches either of two delimiter characters (the semicolon OR the comma)
# - the 'python' engine is required in order to use a regular-expression delimiter
# - index_col is set to False because the header line of column names does not end in a delimiter
#
# We're not going to use airquality_comma, because we know the comma is not a delimiter here, but we've figured out how to manipulate delimiters in the event that it was.
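# To see the regex-delimiter mechanism in isolation, here is a minimal sketch on a tiny inline sample (hypothetical values, not the real file). One header name deliberately contains a comma, so the regex ';|,' splits it in two:

```python
import io

import pandas as pd

# Hypothetical sample: ';' is the field separator, and one header
# name happens to contain a ','; the regex ';|,' splits on either.
sample = "Date;CO,C6H6\n10/03/2004;2,6\n"

df = pd.read_csv(io.StringIO(sample), delimiter=';|,', engine='python',
                 index_col=False)
print(list(df.columns))  # ['Date', 'CO', 'C6H6']
print(df.shape)          # (1, 3)
```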
# In[28]:

del airquality_comma

# You might still be wondering how to fix the issue with commas as decimal points. It turns out it's just one more parameter in read_csv. We'll also delete the last two columns pandas has picked up, because they contain no information.

# In[29]:

airquality = pd.read_csv(filepath, sep=';', decimal=',')
del airquality['Unnamed: 15']
airquality.pop('Unnamed: 16')  # there are a couple of ways to get rid of a column
airquality.head(5)
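# The effect of decimal=',' can be checked on a small inline sample (hypothetical values): without it the comma-decimal values load as strings, with it they load as floats.

```python
import io

import pandas as pd

# Hypothetical sample in the same style as AirQualityUCI.csv:
# ';' separates fields, ',' marks the decimal point.
sample = "CO(GT);C6H6(GT)\n2,6;11,9\n2,0;9,4\n"

as_text = pd.read_csv(io.StringIO(sample), sep=';')
print(as_text['CO(GT)'].dtype)     # object -- '2,6' stays a string

as_float = pd.read_csv(io.StringIO(sample), sep=';', decimal=',')
print(as_float['CO(GT)'].dtype)    # float64
print(as_float['CO(GT)'].iloc[0])  # 2.6
```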

# ## Data Structures

#

# ### Series

#
# Next, let's observe how pandas interprets our data as Series and DataFrame objects, and how we can leverage that to get information about the data.

# In[32]:

# Extract the 0th row: notice that it is in fact a Series
row_0 = airquality.iloc[0]
type(row_0)

# In[33]:

print(row_0)

# In[34]:

row_0.index

# In[35]:

# value of 'Time' in row_0
row_0['Time']

# In[36]:

# is 'RH' in row_0?
'RH' in row_0

# In[37]:

# how can we reference row_0?
row_0.name

# In[38]:

# let's change the name of row_0
row_0 = row_0.rename('first_row')
row_0.name
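# The same Series behaviour can be sketched on a hand-built Series with hypothetical values, independent of the dataset:

```python
import pandas as pd

# A hypothetical one-row measurement, built by hand
row = pd.Series({'Time': '18.00.00', 'CO(GT)': 2.6, 'RH': 48.9})

print(row['Time'])  # label-based lookup
print('RH' in row)  # membership tests the index labels -> True
print(row.name)     # None until a name is assigned

row = row.rename('first_row')  # a scalar argument sets the name
print(row.name)                # 'first_row'
```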

# ### DataFrames

# In[39]:

# Let's take a look at our dataset again
airquality.head()

# In[40]:

airquality.index

# In[41]:

airquality.columns

# In[42]:

# Extract rows 0, 11, and 2000 from the DataFrame
airquality.iloc[[0, 11, 2000]]
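# The positional indexing that iloc provides can be sketched on a hypothetical miniature frame:

```python
import pandas as pd

# Hypothetical stand-in for the dataset
df = pd.DataFrame({'CO(GT)': [2.6, 2.0, 2.2, 1.6],
                   'T': [13.6, 13.3, 11.9, 11.0]})

# A list of positions returns a DataFrame with just those rows
subset = df.iloc[[0, 2]]
print(subset.shape)  # (2, 2)

# A single position returns a Series instead
print(type(df.iloc[0]).__name__)  # Series
```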

# ## Descriptive Statistics

#
# Pandas also provides some basic quantitative functions for understanding our data.

# In[43]:

airquality['NO2(GT)'].describe()

# In[44]:

airquality.describe()

# In[45]:

airquality['NMHC(GT)'].mean()

# In[46]:

airquality.mean(numeric_only=True)  # skip the non-numeric Date and Time columns

# In[47]:

airquality['C6H6(GT)'].min()

# In[48]:

airquality['NOx(GT)'].max()

# In[49]:

airquality['NOx(GT)'].std()

# In[50]:

airquality['T'].mode()

# In[51]:

airquality.corr(numeric_only=True)

# We can also filter information conditionally.

# In[53]:

filter_1 = airquality['CO(GT)'] > 3.0
print(filter_1)
filter_1.any()  # does ANY row satisfy the condition?

# In[54]:

filter_2 = airquality['CO(GT)'] > 3.0
filter_2.all()  # do ALL rows satisfy the condition?
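# The any/all distinction is easy to check on a short hypothetical Series of readings:

```python
import pandas as pd

# Hypothetical CO readings
co = pd.Series([1.2, 3.5, 0.9, 4.1])

mask = co > 3.0           # a boolean Series, one entry per row
print(mask.any())         # True  -- at least one reading exceeds 3.0
print(mask.all())         # False -- not every reading does

# The mask can also select the matching rows directly
print(co[mask].tolist())  # [3.5, 4.1]
```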

# ## Handling Missing Data

#
# If you looked at the numbers, you may have noticed that a lot of them were -200.0. That doesn't seem like a reasonable pollutant concentration, so the creators of the dataset presumably used -200.0 to mark missing values. To stay consistent with that convention, let's fill all NULL values with -200.0.

# In[55]:

airquality.shape

# In[56]:

# check whether each column contains any Null values
airquality.isnull().any()

# In[57]:

airquality_filled = airquality.fillna(value=-200.0)

# In[58]:

airquality_filled.isnull().any()

# The NULL values are gone, and now our data is consistent. If we want to keep rows that have some NULL columns, we can use airquality_filled, where the NULL values have been replaced with -200.0. Alternatively, we can drop all samples with NULL columns from the original DataFrame if we want to use only complete observations.

# In[59]:

airquality = airquality.dropna()

# In[60]:

airquality.isnull().any()

# In[61]:

airquality.shape

# That's nice! No NULL rows! Any observation with a NULL value has been removed, so the number of rows has been reduced.
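# The fillna/dropna trade-off can be sketched on a hypothetical frame with a couple of gaps:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with missing readings in two different rows
df = pd.DataFrame({'CO(GT)': [2.6, np.nan, 2.2],
                   'T': [13.6, 13.3, np.nan]})

filled = df.fillna(value=-200.0)    # keep every row, mark gaps as -200.0
print(filled.isnull().any().any())  # False -- no gaps remain
print(filled.shape)                 # (3, 2)

complete = df.dropna()              # keep only complete observations
print(complete.shape)               # (1, 2) -- two rows had a gap
```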