Case Study: Air Quality Dataset
Please note that you will need to download the dataset from the UC Irvine Machine Learning Dataset Repository.
You can find the data at http://archive.ics.uci.edu/ml/machine-learning-databases/00360/. Please unzip the file at a filepath of your choice.
Here are instructions on how to unzip a file in Windows: https://support.microsoft.com/en-us/help/14200/windows-compress-uncompress-zip-files. For Macs, simply double-click on the file.
Use Pandas to Read the Dataset
The dataset contains over 9000 measurements of air quality taken over a year.
The following are the measures taken: Date Time CO(GT) PT08.S1(CO) NMHC(GT) C6H6(GT) PT08.S2(NMHC) NOx(GT) PT08.S3(NOx) NO2(GT) PT08.S4(NO2) PT08.S5(O3) T RH AH
Using the read_csv function in pandas, we will transfer this information into our code.
# import pandas and load data
import pandas as pd
filepath = './AirQualityUCI.csv'
airquality = pd.read_csv(filepath,sep=',',delimiter=';')
airquality = pd.read_csv('AirQualityUCI.csv',sep=',',delimiter=';')
# Now that we have the dataset we will start to get a feeling for its layout
airquality.head(5)
Date | Time | CO(GT) | PT08.S1(CO) | NMHC(GT) | C6H6(GT) | PT08.S2(NMHC) | NOx(GT) | PT08.S3(NOx) | NO2(GT) | PT08.S4(NO2) | PT08.S5(O3) | T | RH | AH | Unnamed: 15 | Unnamed: 16 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 10/03/2004 | 18.00.00 | 2,6 | 1360.0 | 150.0 | 11,9 | 1046.0 | 166.0 | 1056.0 | 113.0 | 1692.0 | 1268.0 | 13,6 | 48,9 | 0,7578 | NaN | NaN |
1 | 10/03/2004 | 19.00.00 | 2 | 1292.0 | 112.0 | 9,4 | 955.0 | 103.0 | 1174.0 | 92.0 | 1559.0 | 972.0 | 13,3 | 47,7 | 0,7255 | NaN | NaN |
2 | 10/03/2004 | 20.00.00 | 2,2 | 1402.0 | 88.0 | 9,0 | 939.0 | 131.0 | 1140.0 | 114.0 | 1555.0 | 1074.0 | 11,9 | 54,0 | 0,7502 | NaN | NaN |
3 | 10/03/2004 | 21.00.00 | 2,2 | 1376.0 | 80.0 | 9,2 | 948.0 | 172.0 | 1092.0 | 122.0 | 1584.0 | 1203.0 | 11,0 | 60,0 | 0,7867 | NaN | NaN |
4 | 10/03/2004 | 22.00.00 | 1,6 | 1272.0 | 51.0 | 6,5 | 836.0 | 131.0 | 1205.0 | 116.0 | 1490.0 | 1110.0 | 11,2 | 59,6 | 0,7888 | NaN | NaN |
Our dataset is loaded and looks ok, but it looks like there's some cleaning that needs to be done. Notice how the last two columns are unnamed and how the C6H6(GT) and T columns have values with commas. In this case, the commas are not delimiters but are instead subsitutes for decimal points. However, the following code block shows why Python is really useful.
airquality_comma = pd.read_csv(filepath, delimiter=';|,',engine='python',index_col=False)
airquality_comma.head(5)
Date | Time | CO(GT) | PT08.S1(CO) | NMHC(GT) | C6H6(GT) | PT08.S2(NMHC) | NOx(GT) | PT08.S3(NOx) | NO2(GT) | PT08.S4(NO2) | PT08.S5(O3) | T | RH | AH | Unnamed: 15 | Unnamed: 16 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 10/03/2004 | 18.00.00 | 2.0 | 6.0 | 1360.0 | 150.0 | 11.0 | 9.0 | 1046.0 | 166.0 | 1056.0 | 113.0 | 1692.0 | 1268.0 | 13.0 | 6.0 | 48.0 |
1 | 10/03/2004 | 19.00.00 | 2.0 | 1292.0 | 112.0 | 9.0 | 4.0 | 955.0 | 103.0 | 1174.0 | 92.0 | 1559.0 | 972.0 | 13.0 | 3.0 | 47.0 | 7.0 |
2 | 10/03/2004 | 20.00.00 | 2.0 | 2.0 | 1402.0 | 88.0 | 9.0 | 0.0 | 939.0 | 131.0 | 1140.0 | 114.0 | 1555.0 | 1074.0 | 11.0 | 9.0 | 54.0 |
3 | 10/03/2004 | 21.00.00 | 2.0 | 2.0 | 1376.0 | 80.0 | 9.0 | 2.0 | 948.0 | 172.0 | 1092.0 | 122.0 | 1584.0 | 1203.0 | 11.0 | 0.0 | 60.0 |
4 | 10/03/2004 | 22.00.00 | 1.0 | 6.0 | 1272.0 | 51.0 | 6.0 | 5.0 | 836.0 | 131.0 | 1205.0 | 116.0 | 1490.0 | 1110.0 | 11.0 | 2.0 | 59.0 |
We've made our data a lot more usable in just one line of code. Some pointers:
We're not going to use airquality_comma because we know that the comma is not a delimiter, but we've figured out how to manipulate delimiters somewhat in the event that it was.
del (airquality_comma)
You might still be wondering how to fix the issue with the commas for decimals. It turns out it's just one more parameter in read_csv. We'll also delete the last two columns pandas has interpreted, because they contain no information.
airquality = pd.read_csv(filepath,sep=',',delimiter=';',decimal=',')
del (airquality['Unnamed: 15'])
airquality.pop('Unnamed: 16') #There are a couple ways to get rid of the columns
airquality.head(5)
Date | Time | CO(GT) | PT08.S1(CO) | NMHC(GT) | C6H6(GT) | PT08.S2(NMHC) | NOx(GT) | PT08.S3(NOx) | NO2(GT) | PT08.S4(NO2) | PT08.S5(O3) | T | RH | AH | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 10/03/2004 | 18.00.00 | 2.6 | 1360.0 | 150.0 | 11.9 | 1046.0 | 166.0 | 1056.0 | 113.0 | 1692.0 | 1268.0 | 13.6 | 48.9 | 0.7578 |
1 | 10/03/2004 | 19.00.00 | 2.0 | 1292.0 | 112.0 | 9.4 | 955.0 | 103.0 | 1174.0 | 92.0 | 1559.0 | 972.0 | 13.3 | 47.7 | 0.7255 |
2 | 10/03/2004 | 20.00.00 | 2.2 | 1402.0 | 88.0 | 9.0 | 939.0 | 131.0 | 1140.0 | 114.0 | 1555.0 | 1074.0 | 11.9 | 54.0 | 0.7502 |
3 | 10/03/2004 | 21.00.00 | 2.2 | 1376.0 | 80.0 | 9.2 | 948.0 | 172.0 | 1092.0 | 122.0 | 1584.0 | 1203.0 | 11.0 | 60.0 | 0.7867 |
4 | 10/03/2004 | 22.00.00 | 1.6 | 1272.0 | 51.0 | 6.5 | 836.0 | 131.0 | 1205.0 | 116.0 | 1490.0 | 1110.0 | 11.2 | 59.6 | 0.7888 |
Next, let's observe how pandas interprets our data as Series and DataFrame objects, and how we can leverage that to get information about the data.
#Extract 0th row: notice that it is in fact a Series
row_0 = airquality.iloc[0]
type(row_0)
pandas.core.series.Series
print(row_0)
Date 10/03/2004 Time 18.00.00 CO(GT) 2.6 PT08.S1(CO) 1360 NMHC(GT) 150 C6H6(GT) 11.9 PT08.S2(NMHC) 1046 NOx(GT) 166 PT08.S3(NOx) 1056 NO2(GT) 113 PT08.S4(NO2) 1692 PT08.S5(O3) 1268 T 13.6 RH 48.9 AH 0.7578 Name: 0, dtype: object
row_0.index
Index(['Date', 'Time', 'CO(GT)', 'PT08.S1(CO)', 'NMHC(GT)', 'C6H6(GT)', 'PT08.S2(NMHC)', 'NOx(GT)', 'PT08.S3(NOx)', 'NO2(GT)', 'PT08.S4(NO2)', 'PT08.S5(O3)', 'T', 'RH', 'AH'], dtype='object')
# value of 'Time' in row_0
row_0['Time']
'18.00.00'
# is 'RH' in row_0?
'RH' in row_0
True
# how can we reference row_0?
row_0.name
0
# let's change the name of row_0
row_0 = row_0.rename('first_row')
row_0.name
'first_row'
# Let's take a look at our dataset again
airquality.head()
Date | Time | CO(GT) | PT08.S1(CO) | NMHC(GT) | C6H6(GT) | PT08.S2(NMHC) | NOx(GT) | PT08.S3(NOx) | NO2(GT) | PT08.S4(NO2) | PT08.S5(O3) | T | RH | AH | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 10/03/2004 | 18.00.00 | 2.6 | 1360.0 | 150.0 | 11.9 | 1046.0 | 166.0 | 1056.0 | 113.0 | 1692.0 | 1268.0 | 13.6 | 48.9 | 0.7578 |
1 | 10/03/2004 | 19.00.00 | 2.0 | 1292.0 | 112.0 | 9.4 | 955.0 | 103.0 | 1174.0 | 92.0 | 1559.0 | 972.0 | 13.3 | 47.7 | 0.7255 |
2 | 10/03/2004 | 20.00.00 | 2.2 | 1402.0 | 88.0 | 9.0 | 939.0 | 131.0 | 1140.0 | 114.0 | 1555.0 | 1074.0 | 11.9 | 54.0 | 0.7502 |
3 | 10/03/2004 | 21.00.00 | 2.2 | 1376.0 | 80.0 | 9.2 | 948.0 | 172.0 | 1092.0 | 122.0 | 1584.0 | 1203.0 | 11.0 | 60.0 | 0.7867 |
4 | 10/03/2004 | 22.00.00 | 1.6 | 1272.0 | 51.0 | 6.5 | 836.0 | 131.0 | 1205.0 | 116.0 | 1490.0 | 1110.0 | 11.2 | 59.6 | 0.7888 |
airquality.index
RangeIndex(start=0, stop=9471, step=1)
airquality.columns
Index(['Date', 'Time', 'CO(GT)', 'PT08.S1(CO)', 'NMHC(GT)', 'C6H6(GT)', 'PT08.S2(NMHC)', 'NOx(GT)', 'PT08.S3(NOx)', 'NO2(GT)', 'PT08.S4(NO2)', 'PT08.S5(O3)', 'T', 'RH', 'AH'], dtype='object')
# Extract row 0, 11, 2000 from DataFrame
airquality.iloc[ [0,11,2000] ]
Date | Time | CO(GT) | PT08.S1(CO) | NMHC(GT) | C6H6(GT) | PT08.S2(NMHC) | NOx(GT) | PT08.S3(NOx) | NO2(GT) | PT08.S4(NO2) | PT08.S5(O3) | T | RH | AH | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 10/03/2004 | 18.00.00 | 2.6 | 1360.0 | 150.0 | 11.9 | 1046.0 | 166.0 | 1056.0 | 113.0 | 1692.0 | 1268.0 | 13.6 | 48.9 | 0.7578 |
11 | 11/03/2004 | 05.00.00 | 0.7 | 1066.0 | 8.0 | 1.1 | 512.0 | 16.0 | 1918.0 | 28.0 | 1182.0 | 422.0 | 11.0 | 56.2 | 0.7366 |
2000 | 02/06/2004 | 02.00.00 | -200.0 | 887.0 | -200.0 | 4.9 | 758.0 | 43.0 | 1094.0 | 38.0 | 1622.0 | 804.0 | 19.1 | 69.0 | 1.5043 |
Pandas also provides some basic quantitative functions to understand our data.
airquality['NO2(GT)'].describe()
count 9357.000000 mean 58.148873 std 126.940455 min -200.000000 25% 53.000000 50% 96.000000 75% 133.000000 max 340.000000 Name: NO2(GT), dtype: float64
airquality.describe()
CO(GT) | PT08.S1(CO) | NMHC(GT) | C6H6(GT) | PT08.S2(NMHC) | NOx(GT) | PT08.S3(NOx) | NO2(GT) | PT08.S4(NO2) | PT08.S5(O3) | T | RH | AH | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 9357.000000 | 9357.000000 | 9357.000000 | 9357.000000 | 9357.000000 | 9357.000000 | 9357.000000 | 9357.000000 | 9357.000000 | 9357.000000 | 9357.000000 | 9357.000000 | 9357.000000 |
mean | -34.207524 | 1048.990061 | -159.090093 | 1.865683 | 894.595276 | 168.616971 | 794.990168 | 58.148873 | 1391.479641 | 975.072032 | 9.778305 | 39.485380 | -6.837604 |
std | 77.657170 | 329.832710 | 139.789093 | 41.380206 | 342.333252 | 257.433866 | 321.993552 | 126.940455 | 467.210125 | 456.938184 | 43.203623 | 51.216145 | 38.976670 |
min | -200.000000 | -200.000000 | -200.000000 | -200.000000 | -200.000000 | -200.000000 | -200.000000 | -200.000000 | -200.000000 | -200.000000 | -200.000000 | -200.000000 | -200.000000 |
25% | 0.600000 | 921.000000 | -200.000000 | 4.000000 | 711.000000 | 50.000000 | 637.000000 | 53.000000 | 1185.000000 | 700.000000 | 10.900000 | 34.100000 | 0.692300 |
50% | 1.500000 | 1053.000000 | -200.000000 | 7.900000 | 895.000000 | 141.000000 | 794.000000 | 96.000000 | 1446.000000 | 942.000000 | 17.200000 | 48.600000 | 0.976800 |
75% | 2.600000 | 1221.000000 | -200.000000 | 13.600000 | 1105.000000 | 284.000000 | 960.000000 | 133.000000 | 1662.000000 | 1255.000000 | 24.100000 | 61.900000 | 1.296200 |
max | 11.900000 | 2040.000000 | 1189.000000 | 63.700000 | 2214.000000 | 1479.000000 | 2683.000000 | 340.000000 | 2775.000000 | 2523.000000 | 44.600000 | 88.700000 | 2.231000 |
airquality['NMHC(GT)'].mean()
-159.09009297851875
airquality.mean()
CO(GT) -34.207524 PT08.S1(CO) 1048.990061 NMHC(GT) -159.090093 C6H6(GT) 1.865683 PT08.S2(NMHC) 894.595276 NOx(GT) 168.616971 PT08.S3(NOx) 794.990168 NO2(GT) 58.148873 PT08.S4(NO2) 1391.479641 PT08.S5(O3) 975.072032 T 9.778305 RH 39.485380 AH -6.837604 dtype: float64
airquality['C6H6(GT)'].min()
-200.0
airquality['NOx(GT)'].max()
1479.0
airquality['NOx(GT)'].std()
257.43386629107437
airquality['T'].mode()
0 -200.0 dtype: float64
airquality.corr()
CO(GT) | PT08.S1(CO) | NMHC(GT) | C6H6(GT) | PT08.S2(NMHC) | NOx(GT) | PT08.S3(NOx) | NO2(GT) | PT08.S4(NO2) | PT08.S5(O3) | T | RH | AH | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
CO(GT) | 1.000000 | 0.041411 | 0.128351 | -0.031378 | 0.029926 | 0.526451 | -0.089981 | 0.671127 | -0.073724 | 0.080310 | -0.068939 | -0.048227 | -0.045892 |
PT08.S1(CO) | 0.041411 | 1.000000 | 0.170007 | 0.852687 | 0.933102 | 0.277993 | 0.087019 | 0.154030 | 0.845149 | 0.892434 | 0.754844 | 0.745375 | 0.764903 |
NMHC(GT) | 0.128351 | 0.170007 | 1.000000 | 0.037323 | 0.110104 | -0.004427 | 0.048821 | 0.103307 | 0.162680 | 0.101185 | -0.000009 | 0.008284 | 0.012500 |
C6H6(GT) | -0.031378 | 0.852687 | 0.037323 | 1.000000 | 0.767433 | -0.001174 | 0.512193 | -0.010992 | 0.774673 | 0.641334 | 0.971375 | 0.925062 | 0.984555 |
PT08.S2(NMHC) | 0.029926 | 0.933102 | 0.110104 | 0.767433 | 1.000000 | 0.331272 | -0.073667 | 0.176488 | 0.874782 | 0.909905 | 0.669025 | 0.585803 | 0.646572 |
NOx(GT) | 0.526451 | 0.277993 | -0.004427 | -0.001174 | 0.331272 | 1.000000 | -0.436084 | 0.817139 | 0.035546 | 0.461889 | -0.138452 | -0.053009 | -0.095847 |
PT08.S3(NOx) | -0.089981 | 0.087019 | 0.048821 | 0.512193 | -0.073667 | -0.436084 | 1.000000 | -0.256232 | 0.122734 | -0.208865 | 0.588111 | 0.573549 | 0.621618 |
NO2(GT) | 0.671127 | 0.154030 | 0.103307 | -0.010992 | 0.176488 | 0.817139 | -0.256232 | 1.000000 | -0.022174 | 0.253439 | -0.084104 | -0.081305 | -0.060440 |
PT08.S4(NO2) | -0.073724 | 0.845149 | 0.162680 | 0.774673 | 0.874782 | 0.035546 | 0.122734 | -0.022174 | 1.000000 | 0.723690 | 0.755060 | 0.640707 | 0.691913 |
PT08.S5(O3) | 0.080310 | 0.892434 | 0.101185 | 0.641334 | 0.909905 | 0.461889 | -0.208865 | 0.253439 | 0.723690 | 1.000000 | 0.503700 | 0.524955 | 0.519467 |
T | -0.068939 | 0.754844 | -0.000009 | 0.971375 | 0.669025 | -0.138452 | 0.588111 | -0.084104 | 0.755060 | 0.503700 | 1.000000 | 0.885911 | 0.981001 |
RH | -0.048227 | 0.745375 | 0.008284 | 0.925062 | 0.585803 | -0.053009 | 0.573549 | -0.081305 | 0.640707 | 0.524955 | 0.885911 | 1.000000 | 0.943995 |
AH | -0.045892 | 0.764903 | 0.012500 | 0.984555 | 0.646572 | -0.095847 | 0.621618 | -0.060440 | 0.691913 | 0.519467 | 0.981001 | 0.943995 | 1.000000 |
We can also filter information conditionally.
filter_1 = airquality['CO(GT)'] > 3.0
print(filter_1)
filter_1.any()
0 False 1 False 2 False 3 False 4 False 5 False 6 False 7 False 8 False 9 False 10 False 11 False 12 False 13 False 14 False 15 False 16 False 17 False 18 False 19 False 20 False 21 False 22 False 23 False 24 True 25 True 26 True 27 True 28 False 29 False ... 9441 False 9442 False 9443 False 9444 False 9445 False 9446 False 9447 False 9448 False 9449 False 9450 False 9451 False 9452 False 9453 False 9454 False 9455 False 9456 False 9457 False 9458 False 9459 False 9460 False 9461 False 9462 False 9463 False 9464 False 9465 False 9466 False 9467 False 9468 False 9469 False 9470 False Name: CO(GT), Length: 9471, dtype: bool
True
filter_2 = airquality['CO(GT)'] > 3.0
filter_2.all()
False
If you looked at the numbers, you may have noticed that a lot of them were -200.0. That doesn't seem like a reasonable pollutant concentration, so perhaps the creators of the dataset intended that to be a NULL value. Assuming this is the case, let's fill all Null values to be -200.0.
airquality.shape
(9471, 15)
#Check if there are Null values in each row
airquality.isnull().any()
Date True Time True CO(GT) True PT08.S1(CO) True NMHC(GT) True C6H6(GT) True PT08.S2(NMHC) True NOx(GT) True PT08.S3(NOx) True NO2(GT) True PT08.S4(NO2) True PT08.S5(O3) True T True RH True AH True dtype: bool
airquality_filled = airquality.fillna(value=-200.0)
airquality_filled.isnull().any()
Date False Time False CO(GT) False PT08.S1(CO) False NMHC(GT) False C6H6(GT) False PT08.S2(NMHC) False NOx(GT) False PT08.S3(NOx) False NO2(GT) False PT08.S4(NO2) False PT08.S5(O3) False T False RH False AH False dtype: bool
The NULL values are gone, and now our data is consistent. We can use rows which have some NULL columns if we want by using airquality_filled, where the NULL values are replaced with the value -200.0. However, we can get rid all samples with NULL columns in the original DataFrame if we want to use only complete observations.
airquality = airquality.dropna()
airquality.isnull().any()
Date False Time False CO(GT) False PT08.S1(CO) False NMHC(GT) False C6H6(GT) False PT08.S2(NMHC) False NOx(GT) False PT08.S3(NOx) False NO2(GT) False PT08.S4(NO2) False PT08.S5(O3) False T False RH False AH False dtype: bool
airquality.shape
(9357, 15)
Thats nice! No NULL rows! Any observation with a NULL value has been removed. Because of this, the number of rows as been reduced.