Case Study: Air Quality Dataset


This notebook uses a dataset from the UC Irvine website. We will describe the dataset further as we explore with it using pandas.

Download the Dataset

Please note that you will need to download the dataset from the UC Irvine Machine Learning Dataset Repository.

You can find the data at http://archive.ics.uci.edu/ml/machine-learning-databases/00360/. Please unzip the file at a filepath of your choice.

Here are instructions on how to unzip a file in Windows: https://support.microsoft.com/en-us/help/14200/windows-compress-uncompress-zip-files. For Macs, simply double-click on the file.

Use Pandas to Read the Dataset


In this notebook, we will be using a CSV file:

  • AirQualityUCI.csv :

The dataset contains over 9000 measurements of air quality taken over a year.

The following are the measures taken: Date Time CO(GT) PT08.S1(CO) NMHC(GT) C6H6(GT) PT08.S2(NMHC) NOx(GT) PT08.S3(NOx) NO2(GT) PT08.S4(NO2) PT08.S5(O3) T RH AH

Using the read_csv function in pandas, we will transfer this information into our code.

In [25]:
# import pandas and load data
import pandas as pd

filepath = './AirQualityUCI.csv'
airquality = pd.read_csv(filepath,sep=',',delimiter=';')

airquality = pd.read_csv('AirQualityUCI.csv',sep=',',delimiter=';')
In [26]:
# Now that we have the dataset we will start to get a feeling for its layout
airquality.head(5)
Out[26]:
Date Time CO(GT) PT08.S1(CO) NMHC(GT) C6H6(GT) PT08.S2(NMHC) NOx(GT) PT08.S3(NOx) NO2(GT) PT08.S4(NO2) PT08.S5(O3) T RH AH Unnamed: 15 Unnamed: 16
0 10/03/2004 18.00.00 2,6 1360.0 150.0 11,9 1046.0 166.0 1056.0 113.0 1692.0 1268.0 13,6 48,9 0,7578 NaN NaN
1 10/03/2004 19.00.00 2 1292.0 112.0 9,4 955.0 103.0 1174.0 92.0 1559.0 972.0 13,3 47,7 0,7255 NaN NaN
2 10/03/2004 20.00.00 2,2 1402.0 88.0 9,0 939.0 131.0 1140.0 114.0 1555.0 1074.0 11,9 54,0 0,7502 NaN NaN
3 10/03/2004 21.00.00 2,2 1376.0 80.0 9,2 948.0 172.0 1092.0 122.0 1584.0 1203.0 11,0 60,0 0,7867 NaN NaN
4 10/03/2004 22.00.00 1,6 1272.0 51.0 6,5 836.0 131.0 1205.0 116.0 1490.0 1110.0 11,2 59,6 0,7888 NaN NaN

Our dataset is loaded and looks ok, but it looks like there's some cleaning that needs to be done. Notice how the last two columns are unnamed and how the C6H6(GT) and T columns have values with commas. In this case, the commas are not delimiters but are instead subsitutes for decimal points. However, the following code block shows why Python is really useful.

In [27]:
airquality_comma = pd.read_csv(filepath, delimiter=';|,',engine='python',index_col=False)
airquality_comma.head(5)
Out[27]:
Date Time CO(GT) PT08.S1(CO) NMHC(GT) C6H6(GT) PT08.S2(NMHC) NOx(GT) PT08.S3(NOx) NO2(GT) PT08.S4(NO2) PT08.S5(O3) T RH AH Unnamed: 15 Unnamed: 16
0 10/03/2004 18.00.00 2.0 6.0 1360.0 150.0 11.0 9.0 1046.0 166.0 1056.0 113.0 1692.0 1268.0 13.0 6.0 48.0
1 10/03/2004 19.00.00 2.0 1292.0 112.0 9.0 4.0 955.0 103.0 1174.0 92.0 1559.0 972.0 13.0 3.0 47.0 7.0
2 10/03/2004 20.00.00 2.0 2.0 1402.0 88.0 9.0 0.0 939.0 131.0 1140.0 114.0 1555.0 1074.0 11.0 9.0 54.0
3 10/03/2004 21.00.00 2.0 2.0 1376.0 80.0 9.0 2.0 948.0 172.0 1092.0 122.0 1584.0 1203.0 11.0 0.0 60.0
4 10/03/2004 22.00.00 1.0 6.0 1272.0 51.0 6.0 5.0 836.0 131.0 1205.0 116.0 1490.0 1110.0 11.0 2.0 59.0

We've made our data a lot more usable in just one line of code. Some pointers:

  • the delimiter parameter in read_csv is capable of taking a "regex", which allows you to, in this case, insert a boolean expression for which characters are delimiters (the semicolon OR the comma)
  • the 'python' engine parameter enables us to use a regular expression delimiter
  • index_col is set to False because our first line of column names does not terminate in a delimiter.

We're not going to use airquality_comma because we know that the comma is not a delimiter, but we've figured out how to manipulate delimiters somewhat in the event that it was.

In [28]:
del (airquality_comma)

You might still be wondering how to fix the issue with the commas for decimals. It turns out it's just one more parameter in read_csv. We'll also delete the last two columns pandas has interpreted, because they contain no information.

In [29]:
airquality = pd.read_csv(filepath,sep=',',delimiter=';',decimal=',')
del (airquality['Unnamed: 15'])
airquality.pop('Unnamed: 16') #There are a couple ways to get rid of the columns
airquality.head(5)
Out[29]:
Date Time CO(GT) PT08.S1(CO) NMHC(GT) C6H6(GT) PT08.S2(NMHC) NOx(GT) PT08.S3(NOx) NO2(GT) PT08.S4(NO2) PT08.S5(O3) T RH AH
0 10/03/2004 18.00.00 2.6 1360.0 150.0 11.9 1046.0 166.0 1056.0 113.0 1692.0 1268.0 13.6 48.9 0.7578
1 10/03/2004 19.00.00 2.0 1292.0 112.0 9.4 955.0 103.0 1174.0 92.0 1559.0 972.0 13.3 47.7 0.7255
2 10/03/2004 20.00.00 2.2 1402.0 88.0 9.0 939.0 131.0 1140.0 114.0 1555.0 1074.0 11.9 54.0 0.7502
3 10/03/2004 21.00.00 2.2 1376.0 80.0 9.2 948.0 172.0 1092.0 122.0 1584.0 1203.0 11.0 60.0 0.7867
4 10/03/2004 22.00.00 1.6 1272.0 51.0 6.5 836.0 131.0 1205.0 116.0 1490.0 1110.0 11.2 59.6 0.7888

Data Structures

Series

Next, let's observe how pandas interprets our data as Series and DataFrame objects, and how we can leverage that to get information about the data.

In [32]:
#Extract 0th row: notice that it is in fact a Series

row_0 = airquality.iloc[0]
type(row_0)
Out[32]:
pandas.core.series.Series
In [33]:
print(row_0)
Date             10/03/2004
Time               18.00.00
CO(GT)                  2.6
PT08.S1(CO)            1360
NMHC(GT)                150
C6H6(GT)               11.9
PT08.S2(NMHC)          1046
NOx(GT)                 166
PT08.S3(NOx)           1056
NO2(GT)                 113
PT08.S4(NO2)           1692
PT08.S5(O3)            1268
T                      13.6
RH                     48.9
AH                   0.7578
Name: 0, dtype: object
In [34]:
row_0.index
Out[34]:
Index(['Date', 'Time', 'CO(GT)', 'PT08.S1(CO)', 'NMHC(GT)', 'C6H6(GT)',
       'PT08.S2(NMHC)', 'NOx(GT)', 'PT08.S3(NOx)', 'NO2(GT)', 'PT08.S4(NO2)',
       'PT08.S5(O3)', 'T', 'RH', 'AH'],
      dtype='object')
In [35]:
# value of 'Time' in row_0
row_0['Time']
Out[35]:
'18.00.00'
In [36]:
# is 'RH' in row_0?
'RH' in row_0
Out[36]:
True
In [37]:
# how can we reference row_0?
row_0.name
Out[37]:
0
In [38]:
# let's change the name of row_0
row_0 = row_0.rename('first_row')
row_0.name
Out[38]:
'first_row'

DataFrames

In [39]:
# Let's take a look at our dataset again
airquality.head()
Out[39]:
Date Time CO(GT) PT08.S1(CO) NMHC(GT) C6H6(GT) PT08.S2(NMHC) NOx(GT) PT08.S3(NOx) NO2(GT) PT08.S4(NO2) PT08.S5(O3) T RH AH
0 10/03/2004 18.00.00 2.6 1360.0 150.0 11.9 1046.0 166.0 1056.0 113.0 1692.0 1268.0 13.6 48.9 0.7578
1 10/03/2004 19.00.00 2.0 1292.0 112.0 9.4 955.0 103.0 1174.0 92.0 1559.0 972.0 13.3 47.7 0.7255
2 10/03/2004 20.00.00 2.2 1402.0 88.0 9.0 939.0 131.0 1140.0 114.0 1555.0 1074.0 11.9 54.0 0.7502
3 10/03/2004 21.00.00 2.2 1376.0 80.0 9.2 948.0 172.0 1092.0 122.0 1584.0 1203.0 11.0 60.0 0.7867
4 10/03/2004 22.00.00 1.6 1272.0 51.0 6.5 836.0 131.0 1205.0 116.0 1490.0 1110.0 11.2 59.6 0.7888
In [40]:
airquality.index
Out[40]:
RangeIndex(start=0, stop=9471, step=1)
In [41]:
airquality.columns
Out[41]:
Index(['Date', 'Time', 'CO(GT)', 'PT08.S1(CO)', 'NMHC(GT)', 'C6H6(GT)',
       'PT08.S2(NMHC)', 'NOx(GT)', 'PT08.S3(NOx)', 'NO2(GT)', 'PT08.S4(NO2)',
       'PT08.S5(O3)', 'T', 'RH', 'AH'],
      dtype='object')
In [42]:
# Extract row 0, 11, 2000 from DataFrame

airquality.iloc[ [0,11,2000] ]
Out[42]:
Date Time CO(GT) PT08.S1(CO) NMHC(GT) C6H6(GT) PT08.S2(NMHC) NOx(GT) PT08.S3(NOx) NO2(GT) PT08.S4(NO2) PT08.S5(O3) T RH AH
0 10/03/2004 18.00.00 2.6 1360.0 150.0 11.9 1046.0 166.0 1056.0 113.0 1692.0 1268.0 13.6 48.9 0.7578
11 11/03/2004 05.00.00 0.7 1066.0 8.0 1.1 512.0 16.0 1918.0 28.0 1182.0 422.0 11.0 56.2 0.7366
2000 02/06/2004 02.00.00 -200.0 887.0 -200.0 4.9 758.0 43.0 1094.0 38.0 1622.0 804.0 19.1 69.0 1.5043

Descriptive Statistics

Pandas also provides some basic quantitative functions to understand our data.

In [43]:
airquality['NO2(GT)'].describe()
Out[43]:
count    9357.000000
mean       58.148873
std       126.940455
min      -200.000000
25%        53.000000
50%        96.000000
75%       133.000000
max       340.000000
Name: NO2(GT), dtype: float64
In [44]:
airquality.describe()
Out[44]:
CO(GT) PT08.S1(CO) NMHC(GT) C6H6(GT) PT08.S2(NMHC) NOx(GT) PT08.S3(NOx) NO2(GT) PT08.S4(NO2) PT08.S5(O3) T RH AH
count 9357.000000 9357.000000 9357.000000 9357.000000 9357.000000 9357.000000 9357.000000 9357.000000 9357.000000 9357.000000 9357.000000 9357.000000 9357.000000
mean -34.207524 1048.990061 -159.090093 1.865683 894.595276 168.616971 794.990168 58.148873 1391.479641 975.072032 9.778305 39.485380 -6.837604
std 77.657170 329.832710 139.789093 41.380206 342.333252 257.433866 321.993552 126.940455 467.210125 456.938184 43.203623 51.216145 38.976670
min -200.000000 -200.000000 -200.000000 -200.000000 -200.000000 -200.000000 -200.000000 -200.000000 -200.000000 -200.000000 -200.000000 -200.000000 -200.000000
25% 0.600000 921.000000 -200.000000 4.000000 711.000000 50.000000 637.000000 53.000000 1185.000000 700.000000 10.900000 34.100000 0.692300
50% 1.500000 1053.000000 -200.000000 7.900000 895.000000 141.000000 794.000000 96.000000 1446.000000 942.000000 17.200000 48.600000 0.976800
75% 2.600000 1221.000000 -200.000000 13.600000 1105.000000 284.000000 960.000000 133.000000 1662.000000 1255.000000 24.100000 61.900000 1.296200
max 11.900000 2040.000000 1189.000000 63.700000 2214.000000 1479.000000 2683.000000 340.000000 2775.000000 2523.000000 44.600000 88.700000 2.231000
In [45]:
airquality['NMHC(GT)'].mean()
Out[45]:
-159.09009297851875
In [46]:
airquality.mean()
Out[46]:
CO(GT)            -34.207524
PT08.S1(CO)      1048.990061
NMHC(GT)         -159.090093
C6H6(GT)            1.865683
PT08.S2(NMHC)     894.595276
NOx(GT)           168.616971
PT08.S3(NOx)      794.990168
NO2(GT)            58.148873
PT08.S4(NO2)     1391.479641
PT08.S5(O3)       975.072032
T                   9.778305
RH                 39.485380
AH                 -6.837604
dtype: float64
In [47]:
airquality['C6H6(GT)'].min()
Out[47]:
-200.0
In [48]:
airquality['NOx(GT)'].max()
Out[48]:
1479.0
In [49]:
airquality['NOx(GT)'].std()
Out[49]:
257.43386629107437
In [50]:
airquality['T'].mode()
Out[50]:
0   -200.0
dtype: float64
In [51]:
airquality.corr()
Out[51]:
CO(GT) PT08.S1(CO) NMHC(GT) C6H6(GT) PT08.S2(NMHC) NOx(GT) PT08.S3(NOx) NO2(GT) PT08.S4(NO2) PT08.S5(O3) T RH AH
CO(GT) 1.000000 0.041411 0.128351 -0.031378 0.029926 0.526451 -0.089981 0.671127 -0.073724 0.080310 -0.068939 -0.048227 -0.045892
PT08.S1(CO) 0.041411 1.000000 0.170007 0.852687 0.933102 0.277993 0.087019 0.154030 0.845149 0.892434 0.754844 0.745375 0.764903
NMHC(GT) 0.128351 0.170007 1.000000 0.037323 0.110104 -0.004427 0.048821 0.103307 0.162680 0.101185 -0.000009 0.008284 0.012500
C6H6(GT) -0.031378 0.852687 0.037323 1.000000 0.767433 -0.001174 0.512193 -0.010992 0.774673 0.641334 0.971375 0.925062 0.984555
PT08.S2(NMHC) 0.029926 0.933102 0.110104 0.767433 1.000000 0.331272 -0.073667 0.176488 0.874782 0.909905 0.669025 0.585803 0.646572
NOx(GT) 0.526451 0.277993 -0.004427 -0.001174 0.331272 1.000000 -0.436084 0.817139 0.035546 0.461889 -0.138452 -0.053009 -0.095847
PT08.S3(NOx) -0.089981 0.087019 0.048821 0.512193 -0.073667 -0.436084 1.000000 -0.256232 0.122734 -0.208865 0.588111 0.573549 0.621618
NO2(GT) 0.671127 0.154030 0.103307 -0.010992 0.176488 0.817139 -0.256232 1.000000 -0.022174 0.253439 -0.084104 -0.081305 -0.060440
PT08.S4(NO2) -0.073724 0.845149 0.162680 0.774673 0.874782 0.035546 0.122734 -0.022174 1.000000 0.723690 0.755060 0.640707 0.691913
PT08.S5(O3) 0.080310 0.892434 0.101185 0.641334 0.909905 0.461889 -0.208865 0.253439 0.723690 1.000000 0.503700 0.524955 0.519467
T -0.068939 0.754844 -0.000009 0.971375 0.669025 -0.138452 0.588111 -0.084104 0.755060 0.503700 1.000000 0.885911 0.981001
RH -0.048227 0.745375 0.008284 0.925062 0.585803 -0.053009 0.573549 -0.081305 0.640707 0.524955 0.885911 1.000000 0.943995
AH -0.045892 0.764903 0.012500 0.984555 0.646572 -0.095847 0.621618 -0.060440 0.691913 0.519467 0.981001 0.943995 1.000000

We can also filter information conditionally.

In [53]:
filter_1 = airquality['CO(GT)'] > 3.0
print(filter_1)
filter_1.any()
0       False
1       False
2       False
3       False
4       False
5       False
6       False
7       False
8       False
9       False
10      False
11      False
12      False
13      False
14      False
15      False
16      False
17      False
18      False
19      False
20      False
21      False
22      False
23      False
24       True
25       True
26       True
27       True
28      False
29      False
        ...  
9441    False
9442    False
9443    False
9444    False
9445    False
9446    False
9447    False
9448    False
9449    False
9450    False
9451    False
9452    False
9453    False
9454    False
9455    False
9456    False
9457    False
9458    False
9459    False
9460    False
9461    False
9462    False
9463    False
9464    False
9465    False
9466    False
9467    False
9468    False
9469    False
9470    False
Name: CO(GT), Length: 9471, dtype: bool
Out[53]:
True
In [54]:
filter_2 = airquality['CO(GT)'] > 3.0
filter_2.all()
Out[54]:
False

Handling Missing Data

If you looked at the numbers, you may have noticed that a lot of them were -200.0. That doesn't seem like a reasonable pollutant concentration, so perhaps the creators of the dataset intended that to be a NULL value. Assuming this is the case, let's fill all Null values to be -200.0.

In [55]:
airquality.shape
Out[55]:
(9471, 15)
In [56]:
#Check if there are Null values in each row
airquality.isnull().any()
Out[56]:
Date             True
Time             True
CO(GT)           True
PT08.S1(CO)      True
NMHC(GT)         True
C6H6(GT)         True
PT08.S2(NMHC)    True
NOx(GT)          True
PT08.S3(NOx)     True
NO2(GT)          True
PT08.S4(NO2)     True
PT08.S5(O3)      True
T                True
RH               True
AH               True
dtype: bool
In [57]:
airquality_filled = airquality.fillna(value=-200.0)
In [58]:
airquality_filled.isnull().any()
Out[58]:
Date             False
Time             False
CO(GT)           False
PT08.S1(CO)      False
NMHC(GT)         False
C6H6(GT)         False
PT08.S2(NMHC)    False
NOx(GT)          False
PT08.S3(NOx)     False
NO2(GT)          False
PT08.S4(NO2)     False
PT08.S5(O3)      False
T                False
RH               False
AH               False
dtype: bool

The NULL values are gone, and now our data is consistent. We can use rows which have some NULL columns if we want by using airquality_filled, where the NULL values are replaced with the value -200.0. However, we can get rid all samples with NULL columns in the original DataFrame if we want to use only complete observations.

In [59]:
airquality = airquality.dropna()
In [60]:
airquality.isnull().any()
Out[60]:
Date             False
Time             False
CO(GT)           False
PT08.S1(CO)      False
NMHC(GT)         False
C6H6(GT)         False
PT08.S2(NMHC)    False
NOx(GT)          False
PT08.S3(NOx)     False
NO2(GT)          False
PT08.S4(NO2)     False
PT08.S5(O3)      False
T                False
RH               False
AH               False
dtype: bool
In [61]:
airquality.shape
Out[61]:
(9357, 15)

Thats nice! No NULL rows! Any observation with a NULL value has been removed. Because of this, the number of rows as been reduced.