Case Study: Air Quality Dataset

This notebook uses a dataset from the UC Irvine website. We will describe the dataset further as we explore with it using *pandas*.

Download the Dataset¶

Please note that you will need to download the dataset from the UC Irvine Machine Learning Dataset Repository.

You can find the data at http://archive.ics.uci.edu/ml/machine-learning-databases/00360/. Please unzip the file at a filepath of your choice.

Here are instructions on how to unzip a file in Windows: https://support.microsoft.com/en-us/help/14200/windows-compress-uncompress-zip-files. For Macs, simply double-click on the file.

Use Pandas to Read the Dataset

In this notebook, we will be using a CSV file: * **AirQualityUCI.csv :**

The dataset contains over 9000 measurements of air quality taken over a year.

The following are the measures taken: Date Time CO(GT) PT08.S1(CO) NMHC(GT) C6H6(GT) PT08.S2(NMHC) NOx(GT) PT08.S3(NOx) NO2(GT) PT08.S4(NO2) PT08.S5(O3) T RH AH

Using the read_csv function in pandas, we will transfer this information into our code.

In [25]:

# import pandas and load data
import pandas as pd

filepath = './AirQualityUCI.csv'
airquality = pd.read_csv(filepath,sep=',',delimiter=';')

airquality = pd.read_csv('AirQualityUCI.csv',sep=',',delimiter=';')

In [26]:

# Now that we have the dataset we will start to get a feeling for its layout
airquality.head(5)

Out[26]:

	Date	Time	CO(GT)	PT08.S1(CO)	NMHC(GT)	C6H6(GT)	PT08.S2(NMHC)	NOx(GT)	PT08.S3(NOx)	NO2(GT)	PT08.S4(NO2)	PT08.S5(O3)	T	RH	AH	Unnamed: 15	Unnamed: 16
0	10/03/2004	18.00.00	2,6	1360.0	150.0	11,9	1046.0	166.0	1056.0	113.0	1692.0	1268.0	13,6	48,9	0,7578	NaN	NaN
1	10/03/2004	19.00.00	2	1292.0	112.0	9,4	955.0	103.0	1174.0	92.0	1559.0	972.0	13,3	47,7	0,7255	NaN	NaN
2	10/03/2004	20.00.00	2,2	1402.0	88.0	9,0	939.0	131.0	1140.0	114.0	1555.0	1074.0	11,9	54,0	0,7502	NaN	NaN
3	10/03/2004	21.00.00	2,2	1376.0	80.0	9,2	948.0	172.0	1092.0	122.0	1584.0	1203.0	11,0	60,0	0,7867	NaN	NaN
4	10/03/2004	22.00.00	1,6	1272.0	51.0	6,5	836.0	131.0	1205.0	116.0	1490.0	1110.0	11,2	59,6	0,7888	NaN	NaN

Our dataset is loaded and looks ok, but it looks like there's some cleaning that needs to be done. Notice how the last two columns are unnamed and how the C6H6(GT) and T columns have values with commas. In this case, the commas are not delimiters but are instead subsitutes for decimal points. However, the following code block shows why Python is really useful.

In [27]:

airquality_comma = pd.read_csv(filepath, delimiter=';|,',engine='python',index_col=False)
airquality_comma.head(5)

Out[27]:

	Date	Time	CO(GT)	PT08.S1(CO)	NMHC(GT)	C6H6(GT)	PT08.S2(NMHC)	NOx(GT)	PT08.S3(NOx)	NO2(GT)	PT08.S4(NO2)	PT08.S5(O3)	T	RH	AH	Unnamed: 15	Unnamed: 16
0	10/03/2004	18.00.00	2.0	6.0	1360.0	150.0	11.0	9.0	1046.0	166.0	1056.0	113.0	1692.0	1268.0	13.0	6.0	48.0
1	10/03/2004	19.00.00	2.0	1292.0	112.0	9.0	4.0	955.0	103.0	1174.0	92.0	1559.0	972.0	13.0	3.0	47.0	7.0
2	10/03/2004	20.00.00	2.0	2.0	1402.0	88.0	9.0	0.0	939.0	131.0	1140.0	114.0	1555.0	1074.0	11.0	9.0	54.0
3	10/03/2004	21.00.00	2.0	2.0	1376.0	80.0	9.0	2.0	948.0	172.0	1092.0	122.0	1584.0	1203.0	11.0	0.0	60.0
4	10/03/2004	22.00.00	1.0	6.0	1272.0	51.0	6.0	5.0	836.0	131.0	1205.0	116.0	1490.0	1110.0	11.0	2.0	59.0

We've made our data a lot more usable in just one line of code. Some pointers:

the delimiter parameter in read_csv is capable of taking a "regex", which allows you to, in this case, insert a boolean expression for which characters are delimiters (the semicolon OR the comma)
the 'python' engine parameter enables us to use a regular expression delimiter
index_col is set to False because our first line of column names does not terminate in a delimiter.

We're not going to use airquality_comma because we know that the comma is not a delimiter, but we've figured out how to manipulate delimiters somewhat in the event that it was.

In [28]:

del (airquality_comma)

You might still be wondering how to fix the issue with the commas for decimals. It turns out it's just one more parameter in read_csv. We'll also delete the last two columns pandas has interpreted, because they contain no information.

In [29]:

airquality = pd.read_csv(filepath,sep=',',delimiter=';',decimal=',')
del (airquality['Unnamed: 15'])
airquality.pop('Unnamed: 16') #There are a couple ways to get rid of the columns
airquality.head(5)

Out[29]:

	Date	Time	CO(GT)	PT08.S1(CO)	NMHC(GT)	C6H6(GT)	PT08.S2(NMHC)	NOx(GT)	PT08.S3(NOx)	NO2(GT)	PT08.S4(NO2)	PT08.S5(O3)	T	RH	AH
0	10/03/2004	18.00.00	2.6	1360.0	150.0	11.9	1046.0	166.0	1056.0	113.0	1692.0	1268.0	13.6	48.9	0.7578
1	10/03/2004	19.00.00	2.0	1292.0	112.0	9.4	955.0	103.0	1174.0	92.0	1559.0	972.0	13.3	47.7	0.7255
2	10/03/2004	20.00.00	2.2	1402.0	88.0	9.0	939.0	131.0	1140.0	114.0	1555.0	1074.0	11.9	54.0	0.7502
3	10/03/2004	21.00.00	2.2	1376.0	80.0	9.2	948.0	172.0	1092.0	122.0	1584.0	1203.0	11.0	60.0	0.7867
4	10/03/2004	22.00.00	1.6	1272.0	51.0	6.5	836.0	131.0	1205.0	116.0	1490.0	1110.0	11.2	59.6	0.7888

Data Structures

Series

Next, let's observe how pandas interprets our data as Series and DataFrame objects, and how we can leverage that to get information about the data.

In [32]:

#Extract 0th row: notice that it is in fact a Series

row_0 = airquality.iloc[0]
type(row_0)

Out[32]:

pandas.core.series.Series

In [33]:

print(row_0)

Date             10/03/2004
Time               18.00.00
CO(GT)                  2.6
PT08.S1(CO)            1360
NMHC(GT)                150
C6H6(GT)               11.9
PT08.S2(NMHC)          1046
NOx(GT)                 166
PT08.S3(NOx)           1056
NO2(GT)                 113
PT08.S4(NO2)           1692
PT08.S5(O3)            1268
T                      13.6
RH                     48.9
AH                   0.7578
Name: 0, dtype: object

In [34]:

row_0.index

Out[34]:

Index(['Date', 'Time', 'CO(GT)', 'PT08.S1(CO)', 'NMHC(GT)', 'C6H6(GT)',
       'PT08.S2(NMHC)', 'NOx(GT)', 'PT08.S3(NOx)', 'NO2(GT)', 'PT08.S4(NO2)',
       'PT08.S5(O3)', 'T', 'RH', 'AH'],
      dtype='object')

In [35]:

# value of 'Time' in row_0
row_0['Time']

Out[35]:

'18.00.00'

In [36]:

# is 'RH' in row_0?
'RH' in row_0

Out[36]:

True

In [37]:

# how can we reference row_0?
row_0.name

Out[37]:

In [38]:

# let's change the name of row_0
row_0 = row_0.rename('first_row')
row_0.name

Out[38]:

'first_row'

DataFrames

In [39]:

# Let's take a look at our dataset again
airquality.head()

Out[39]:

	Date	Time	CO(GT)	PT08.S1(CO)	NMHC(GT)	C6H6(GT)	PT08.S2(NMHC)	NOx(GT)	PT08.S3(NOx)	NO2(GT)	PT08.S4(NO2)	PT08.S5(O3)	T	RH	AH
0	10/03/2004	18.00.00	2.6	1360.0	150.0	11.9	1046.0	166.0	1056.0	113.0	1692.0	1268.0	13.6	48.9	0.7578
1	10/03/2004	19.00.00	2.0	1292.0	112.0	9.4	955.0	103.0	1174.0	92.0	1559.0	972.0	13.3	47.7	0.7255
2	10/03/2004	20.00.00	2.2	1402.0	88.0	9.0	939.0	131.0	1140.0	114.0	1555.0	1074.0	11.9	54.0	0.7502
3	10/03/2004	21.00.00	2.2	1376.0	80.0	9.2	948.0	172.0	1092.0	122.0	1584.0	1203.0	11.0	60.0	0.7867
4	10/03/2004	22.00.00	1.6	1272.0	51.0	6.5	836.0	131.0	1205.0	116.0	1490.0	1110.0	11.2	59.6	0.7888

In [40]:

airquality.index

Out[40]:

RangeIndex(start=0, stop=9471, step=1)

In [41]:

airquality.columns

Out[41]:

Index(['Date', 'Time', 'CO(GT)', 'PT08.S1(CO)', 'NMHC(GT)', 'C6H6(GT)',
       'PT08.S2(NMHC)', 'NOx(GT)', 'PT08.S3(NOx)', 'NO2(GT)', 'PT08.S4(NO2)',
       'PT08.S5(O3)', 'T', 'RH', 'AH'],
      dtype='object')

In [42]:

# Extract row 0, 11, 2000 from DataFrame

airquality.iloc[ [0,11,2000] ]

Out[42]:

	Date	Time	CO(GT)	PT08.S1(CO)	NMHC(GT)	C6H6(GT)	PT08.S2(NMHC)	NOx(GT)	PT08.S3(NOx)	NO2(GT)	PT08.S4(NO2)	PT08.S5(O3)	T	RH	AH
0	10/03/2004	18.00.00	2.6	1360.0	150.0	11.9	1046.0	166.0	1056.0	113.0	1692.0	1268.0	13.6	48.9	0.7578
11	11/03/2004	05.00.00	0.7	1066.0	8.0	1.1	512.0	16.0	1918.0	28.0	1182.0	422.0	11.0	56.2	0.7366
2000	02/06/2004	02.00.00	-200.0	887.0	-200.0	4.9	758.0	43.0	1094.0	38.0	1622.0	804.0	19.1	69.0	1.5043

Descriptive Statistics

Pandas also provides some basic quantitative functions to understand our data.

In [43]:

airquality['NO2(GT)'].describe()

Out[43]:

count    9357.000000
mean       58.148873
std       126.940455
min      -200.000000
25%        53.000000
50%        96.000000
75%       133.000000
max       340.000000
Name: NO2(GT), dtype: float64

In [44]:

airquality.describe()

Out[44]:

	CO(GT)	PT08.S1(CO)	NMHC(GT)	C6H6(GT)	PT08.S2(NMHC)	NOx(GT)	PT08.S3(NOx)	NO2(GT)	PT08.S4(NO2)	PT08.S5(O3)	T	RH	AH
count	9357.000000	9357.000000	9357.000000	9357.000000	9357.000000	9357.000000	9357.000000	9357.000000	9357.000000	9357.000000	9357.000000	9357.000000	9357.000000
mean	-34.207524	1048.990061	-159.090093	1.865683	894.595276	168.616971	794.990168	58.148873	1391.479641	975.072032	9.778305	39.485380	-6.837604
std	77.657170	329.832710	139.789093	41.380206	342.333252	257.433866	321.993552	126.940455	467.210125	456.938184	43.203623	51.216145	38.976670
min	-200.000000	-200.000000	-200.000000	-200.000000	-200.000000	-200.000000	-200.000000	-200.000000	-200.000000	-200.000000	-200.000000	-200.000000	-200.000000
25%	0.600000	921.000000	-200.000000	4.000000	711.000000	50.000000	637.000000	53.000000	1185.000000	700.000000	10.900000	34.100000	0.692300
50%	1.500000	1053.000000	-200.000000	7.900000	895.000000	141.000000	794.000000	96.000000	1446.000000	942.000000	17.200000	48.600000	0.976800
75%	2.600000	1221.000000	-200.000000	13.600000	1105.000000	284.000000	960.000000	133.000000	1662.000000	1255.000000	24.100000	61.900000	1.296200
max	11.900000	2040.000000	1189.000000	63.700000	2214.000000	1479.000000	2683.000000	340.000000	2775.000000	2523.000000	44.600000	88.700000	2.231000

In [45]:

airquality['NMHC(GT)'].mean()

Out[45]:

-159.09009297851875

In [46]:

airquality.mean()

Out[46]:

CO(GT)            -34.207524
PT08.S1(CO)      1048.990061
NMHC(GT)         -159.090093
C6H6(GT)            1.865683
PT08.S2(NMHC)     894.595276
NOx(GT)           168.616971
PT08.S3(NOx)      794.990168
NO2(GT)            58.148873
PT08.S4(NO2)     1391.479641
PT08.S5(O3)       975.072032
T                   9.778305
RH                 39.485380
AH                 -6.837604
dtype: float64

In [47]:

airquality['C6H6(GT)'].min()

Out[47]:

-200.0

In [48]:

airquality['NOx(GT)'].max()

Out[48]:

1479.0

In [49]:

airquality['NOx(GT)'].std()

Out[49]:

257.43386629107437

In [50]:

airquality['T'].mode()

Out[50]:

0   -200.0
dtype: float64

In [51]:

airquality.corr()

Out[51]:

	CO(GT)	PT08.S1(CO)	NMHC(GT)	C6H6(GT)	PT08.S2(NMHC)	NOx(GT)	PT08.S3(NOx)	NO2(GT)	PT08.S4(NO2)	PT08.S5(O3)	T	RH	AH
CO(GT)	1.000000	0.041411	0.128351	-0.031378	0.029926	0.526451	-0.089981	0.671127	-0.073724	0.080310	-0.068939	-0.048227	-0.045892
PT08.S1(CO)	0.041411	1.000000	0.170007	0.852687	0.933102	0.277993	0.087019	0.154030	0.845149	0.892434	0.754844	0.745375	0.764903
NMHC(GT)	0.128351	0.170007	1.000000	0.037323	0.110104	-0.004427	0.048821	0.103307	0.162680	0.101185	-0.000009	0.008284	0.012500
C6H6(GT)	-0.031378	0.852687	0.037323	1.000000	0.767433	-0.001174	0.512193	-0.010992	0.774673	0.641334	0.971375	0.925062	0.984555
PT08.S2(NMHC)	0.029926	0.933102	0.110104	0.767433	1.000000	0.331272	-0.073667	0.176488	0.874782	0.909905	0.669025	0.585803	0.646572
NOx(GT)	0.526451	0.277993	-0.004427	-0.001174	0.331272	1.000000	-0.436084	0.817139	0.035546	0.461889	-0.138452	-0.053009	-0.095847
PT08.S3(NOx)	-0.089981	0.087019	0.048821	0.512193	-0.073667	-0.436084	1.000000	-0.256232	0.122734	-0.208865	0.588111	0.573549	0.621618
NO2(GT)	0.671127	0.154030	0.103307	-0.010992	0.176488	0.817139	-0.256232	1.000000	-0.022174	0.253439	-0.084104	-0.081305	-0.060440
PT08.S4(NO2)	-0.073724	0.845149	0.162680	0.774673	0.874782	0.035546	0.122734	-0.022174	1.000000	0.723690	0.755060	0.640707	0.691913
PT08.S5(O3)	0.080310	0.892434	0.101185	0.641334	0.909905	0.461889	-0.208865	0.253439	0.723690	1.000000	0.503700	0.524955	0.519467
T	-0.068939	0.754844	-0.000009	0.971375	0.669025	-0.138452	0.588111	-0.084104	0.755060	0.503700	1.000000	0.885911	0.981001
RH	-0.048227	0.745375	0.008284	0.925062	0.585803	-0.053009	0.573549	-0.081305	0.640707	0.524955	0.885911	1.000000	0.943995
AH	-0.045892	0.764903	0.012500	0.984555	0.646572	-0.095847	0.621618	-0.060440	0.691913	0.519467	0.981001	0.943995	1.000000

We can also filter information conditionally.

In [53]:

filter_1 = airquality['CO(GT)'] > 3.0
print(filter_1)
filter_1.any()

0       False
1       False
2       False
3       False
4       False
5       False
6       False
7       False
8       False
9       False
10      False
11      False
12      False
13      False
14      False
15      False
16      False
17      False
18      False
19      False
20      False
21      False
22      False
23      False
24       True
25       True
26       True
27       True
28      False
29      False
        ...  
9441    False
9442    False
9443    False
9444    False
9445    False
9446    False
9447    False
9448    False
9449    False
9450    False
9451    False
9452    False
9453    False
9454    False
9455    False
9456    False
9457    False
9458    False
9459    False
9460    False
9461    False
9462    False
9463    False
9464    False
9465    False
9466    False
9467    False
9468    False
9469    False
9470    False
Name: CO(GT), Length: 9471, dtype: bool

Out[53]:

True

In [54]:

filter_2 = airquality['CO(GT)'] > 3.0
filter_2.all()

Out[54]:

False

Handling Missing Data

If you looked at the numbers, you may have noticed that a lot of them were -200.0. That doesn't seem like a reasonable pollutant concentration, so perhaps the creators of the dataset intended that to be a NULL value. Assuming this is the case, let's fill all Null values to be -200.0.

In [55]:

airquality.shape

Out[55]:

(9471, 15)

In [56]:

#Check if there are Null values in each row
airquality.isnull().any()

Out[56]:

Date             True
Time             True
CO(GT)           True
PT08.S1(CO)      True
NMHC(GT)         True
C6H6(GT)         True
PT08.S2(NMHC)    True
NOx(GT)          True
PT08.S3(NOx)     True
NO2(GT)          True
PT08.S4(NO2)     True
PT08.S5(O3)      True
T                True
RH               True
AH               True
dtype: bool

In [57]:

airquality_filled = airquality.fillna(value=-200.0)

In [58]:

airquality_filled.isnull().any()

Out[58]:

Date             False
Time             False
CO(GT)           False
PT08.S1(CO)      False
NMHC(GT)         False
C6H6(GT)         False
PT08.S2(NMHC)    False
NOx(GT)          False
PT08.S3(NOx)     False
NO2(GT)          False
PT08.S4(NO2)     False
PT08.S5(O3)      False
T                False
RH               False
AH               False
dtype: bool

The NULL values are gone, and now our data is consistent. We can use rows which have some NULL columns if we want by using airquality_filled, where the NULL values are replaced with the value -200.0. However, we can get rid all samples with NULL columns in the original DataFrame if we want to use only complete observations.

In [59]:

airquality = airquality.dropna()

In [60]:

airquality.isnull().any()

Out[60]:

Date             False
Time             False
CO(GT)           False
PT08.S1(CO)      False
NMHC(GT)         False
C6H6(GT)         False
PT08.S2(NMHC)    False
NOx(GT)          False
PT08.S3(NOx)     False
NO2(GT)          False
PT08.S4(NO2)     False
PT08.S5(O3)      False
T                False
RH               False
AH               False
dtype: bool

In [61]:

airquality.shape

Out[61]:

(9357, 15)

Thats nice! No NULL rows! Any observation with a NULL value has been removed. Because of this, the number of rows as been reduced.