Exercise notebook 2: Cleaning up our act¶

This Jupyter notebook, for the second part of The Open University's Learn to code for Data Analysis course, has code examples and coding activities for you. Remember to run the code in this notebook before you start.

You'll come across steps in the course directing you to this notebook. Once you've done each exercise, go back to the corresponding step and mark it as complete.

In [1]:

import warnings
warnings.simplefilter('ignore', FutureWarning)

from pandas import *

Exercise 1: Dataframes and CSV files¶

To read a CSV file into a dataframe you need to call the pandas function called read_csv(). The simplest usage of this function is with a single argument, a string that holds the name of the CSV file, for example.

In [2]:

df = read_csv('WHO POP TB all.csv')

Dataframe attributes¶

A dataframe attribute is like a variable that can only be accessed in the context of a dataframe. One such attribute is columns which holds a dataframe's column names.

In [3]:

df.columns

Out[3]:

Index(['Country', 'Population (1000s)', 'TB deaths'], dtype='object')

Dataframe rows¶

A dataframe has a default integer index for its rows, which starts at zero. The iloc attribute can be used to obtain the row at the given index.

In [4]:

df.iloc[0] # first row, index 0

Out[4]:

Country               Afghanistan
Population (1000s)          30552
TB deaths                   13000
Name: 0, dtype: object

In [5]:

df.iloc[2] # third row, index 2

Out[5]:

Country               Algeria
Population (1000s)      39208
TB deaths                5100
Name: 2, dtype: object

The `head()` method¶

The head() method returns a dataframe with the first rows, as many as given in the argument. By default, if the argument is missing, it returns the first five rows.

In [6]:

df.head() # first five rows

Out[6]:

	Country	Population (1000s)	TB deaths
0	Afghanistan	30552	13000.00
1	Albania	3173	20.00
2	Algeria	39208	5100.00
3	Andorra	79	0.26
4	Angola	21472	6900.00

In [7]:

df.head(7) # first seven rows

Out[7]:

	Country	Population (1000s)	TB deaths
0	Afghanistan	30552	13000.00
1	Albania	3173	20.00
2	Algeria	39208	5100.00
3	Andorra	79	0.26
4	Angola	21472	6900.00
5	Antigua and Barbuda	90	1.20
6	Argentina	41446	570.00

The `tail()` method¶

The tail() method is similar to the head() method. If no argument is used, the last five rows of the dataframe are returned, otherwise the number of rows returned is dependent on the argument.

In [8]:

df.tail() # last five rows

Out[8]:

	Country	Population (1000s)	TB deaths
189	Venezuela (Bolivarian Republic of)	30405	480.0
190	Viet Nam	91680	17000.0
191	Yemen	24407	990.0
192	Zambia	14539	3600.0
193	Zimbabwe	14150	5700.0

Selecting and displaying columns in a dataframe¶

You found in Week 1 that you can select and display the values for a single dataframe column by puting the name of the column (in quotes) within square brackets immediately after the dataframe's name. However you can also select and display the values for multiple columns too.

To get multiple columns you need to use a list. A list in Python is a number of items separated by commas within square brackets, for example ['Country', 'Population (1000s)']. This list is then put within square brackets immediately after the dataframe's name. The resulting expression represents a new dataframe, just with those columns, and therefore any dataframe method can be applied to it, like for example head():

In [9]:

df[['Country', 'Population (1000s)']].head()

Out[9]:

	Country	Population (1000s)
0	Afghanistan	30552
1	Albania	3173
2	Algeria	39208
3	Andorra	79
4	Angola	21472

Applying methods to a dataframe column¶

The iloc attribute and the head() and tail() methods discussed above can be used with single columns.

In [10]:

df['TB deaths'].iloc[2] # third value of deaths column

Out[10]:

5100.0

In [11]:

df['Population (1000s)'].tail() # last five values of population column      

Out[11]:

189    30405
190    91680
191    24407
192    14539
193    14150
Name: Population (1000s), dtype: int64

Tasks¶

In the code cell below, write the code to get and display the 55th row in the dataframe df.

In [ ]:

In the code cell below write the code to display the first 10 rows of the dataframe df.

In [ ]:

In the code cell below, select and display the first eight rows from the 'Country' and 'TB deaths' columns.

In [ ]:

Now go back to the course.

Exercise 2: Comparison operators¶

Python has the following comparison operators:

== (equals)
!= (not equal)
< (less than)
> (greater than)
<= (less than or equal to)
>= (greater than or equal to)

The following code will get and display all the rows in df where it is True that the value in the 'Population (1000s)' column is greater than 80000.

In [12]:

df[df['Population (1000s)'] > 80000]         

Out[12]:

	Country	Population (1000s)	TB deaths
13	Bangladesh	156595	80000.0
23	Brazil	200362	4400.0
36	China	1393337	41000.0
53	Egypt	82056	550.0
58	Ethiopia	94101	30000.0
65	Germany	82727	300.0
77	India	1252140	240000.0
78	Indonesia	249866	64000.0
85	Japan	127144	2100.0
109	Mexico	122332	2200.0
124	Nigeria	173615	160000.0
128	Pakistan	182143	49000.0
134	Philippines	98394	27000.0
141	Russian Federation	142834	17000.0
185	United States of America	320051	490.0
190	Viet Nam	91680	17000.0

Task¶

In the code cell below write code to find all the rows in df where TB deaths exceed 10000.

In [ ]:

Now go back to the course.

Exercise 3: Bitwise operators¶

Pandas has two operators to make more complicated queries. Use the operator & (means 'and') to select rows where two conditions are both true. Use the operator | (means 'or') to select rows where at least one condition is true. Don't forget to put parentheses around each comparison. For example, the following expression selects only countries with a population over 80 million inhabitants and with more that 10 thousand deaths.

In [13]:

df[(df['Population (1000s)'] > 80000) & (df['TB deaths'] > 10000)]

Out[13]:

	Country	Population (1000s)	TB deaths
13	Bangladesh	156595	80000.0
36	China	1393337	41000.0
58	Ethiopia	94101	30000.0
77	India	1252140	240000.0
78	Indonesia	249866	64000.0
124	Nigeria	173615	160000.0
128	Pakistan	182143	49000.0
134	Philippines	98394	27000.0
141	Russian Federation	142834	17000.0
190	Viet Nam	91680	17000.0

If the same columns will be used repeatedly in the program, the code becomes more readable if written as follows:

In [14]:

population = df['Population (1000s)']
deaths = df['TB deaths']
df[(population > 80000) & (deaths > 10000)]  

Out[14]:

	Country	Population (1000s)	TB deaths
13	Bangladesh	156595	80000.0
36	China	1393337	41000.0
58	Ethiopia	94101	30000.0
77	India	1252140	240000.0
78	Indonesia	249866	64000.0
124	Nigeria	173615	160000.0
128	Pakistan	182143	49000.0
134	Philippines	98394	27000.0
141	Russian Federation	142834	17000.0
190	Viet Nam	91680	17000.0

Task¶

In the code cell below find all the countries where the Population (1000s) is less than or equal to 50000 or TB deaths are greater than or equal to 20000.

In [ ]:

Now go back to the course.

Exercise 4: Display rows from dataframe¶

You have downloaded the file London_2014.csv from our website, it can now be read into a dataframe.

In [15]:

london = read_csv('London_2014.csv')
london.head()

Out[15]:

	GMT	Max TemperatureC	Mean TemperatureC	Min TemperatureC	Dew PointC	MeanDew PointC	Min DewpointC	Max Humidity	Mean Humidity	Min Humidity	...	Max VisibilityKm	Mean VisibilityKm	Min VisibilitykM	Max Wind SpeedKm/h	Mean Wind SpeedKm/h	Max Gust SpeedKm/h	Precipitationmm	CloudCover	Events	WindDirDegrees<br />
0	2014-1-1	11	8	6	9	7	4	94	86	73	...	31	11	2	40	26	66.0	9.91	4.0	Rain	186<br />
1	2014-1-2	11	9	6	9	6	4	94	81	60	...	31	15	5	32	21	50.0	1.02	2.0	Rain	214<br />
2	2014-1-3	11	8	6	9	5	2	94	76	54	...	31	12	0	50	29	69.0	7.11	2.0	Rain-Thunderstorm	219<br />
3	2014-1-4	9	6	3	8	6	2	93	85	65	...	31	12	5	35	21	39.0	9.91	4.0	Rain	211<br />
4	2014-1-5	12	7	1	11	4	-1	100	88	77	...	31	14	5	35	16	NaN	0.25	5.0	Rain	199<br />

5 rows × 23 columns

Removing initial spaces¶

There are too many columns for the dataframe to fit horizontally in this notebook, but they can be displayed separately.

In [16]:

london.columns

Out[16]:

Index(['GMT', 'Max TemperatureC', 'Mean TemperatureC', 'Min TemperatureC',
       'Dew PointC', 'MeanDew PointC', 'Min DewpointC', 'Max Humidity',
       ' Mean Humidity', ' Min Humidity', ' Max Sea Level PressurehPa',
       ' Mean Sea Level PressurehPa', ' Min Sea Level PressurehPa',
       ' Max VisibilityKm', ' Mean VisibilityKm', ' Min VisibilitykM',
       ' Max Wind SpeedKm/h', ' Mean Wind SpeedKm/h', ' Max Gust SpeedKm/h',
       'Precipitationmm', ' CloudCover', ' Events', 'WindDirDegrees<br />'],
      dtype='object')

This shows that ' Max Wind SpeedKm/h' is prefixed by a space, as are other columm names such as ' Mean Humidity' and ' Max Sea Level PressurehPa'.

The read_csv() function has interpreted spaces after commas as being part of the next value. This can be rectified easily by adding another argument to the read_csv() function to skip the initial spaces after a comma.

In [17]:

london = read_csv('London_2014.csv', skipinitialspace=True)

Removing extra characters¶

Another problem shown above is that the final column is called 'WindDirDegrees< br />'.

When the dataset was exported from the Weather Underground web site, HTML line breaks were automatically added to each line in the file which read_csv() has interpreted as part of the column name and its values. This can be seen more clearly by looking at more values in the final column:

In [18]:

london['WindDirDegrees<br />'].head()

Out[18]:

0    186<br />
1    214<br />
2    219<br />
3    211<br />
4    199<br />
Name: WindDirDegrees<br />, dtype: object

'WindDirDegrees< br />' can be changed to 'WindDirDegrees' with the rename() method as follows:

In [19]:

london = london.rename(columns={'WindDirDegrees<br />' : 'WindDirDegrees'})

To remove the '< br />' html line breaks from the values in the 'WindDirDegrees' column you need to use the string method rstrip() which is used to remove characters from the end or 'rear' of a string:

In [20]:

london['WindDirDegrees'] = london['WindDirDegrees'].str.rstrip('<br />')

Display the first few rows of the 'WindDirDegrees' to confirm the change:

In [21]:

london['WindDirDegrees'].head()

Out[21]:

0    186
1    214
2    219
3    211
4    199
Name: WindDirDegrees, dtype: object

Missing values¶

Missing (also called null or not available) values are marked as NaN (not a number) in dataframes.

In [22]:

london['Events'].tail()

Out[22]:

360    Rain
361     NaN
362     NaN
363     NaN
364     Fog
Name: Events, dtype: object

The isnull() method returns True for each row in a column that has a null value. The method can be used to select and display those rows. Scroll the table below to the right to check that the events column is only showing missing values.

In [23]:

london[london['Events'].isnull()]

Out[23]:

	GMT	Max TemperatureC	Mean TemperatureC	Min TemperatureC	Dew PointC	MeanDew PointC	Min DewpointC	Max Humidity	Mean Humidity	Min Humidity	...	Max VisibilityKm	Mean VisibilityKm	Min VisibilitykM	Max Wind SpeedKm/h	Mean Wind SpeedKm/h	Max Gust SpeedKm/h	Precipitationmm	CloudCover	Events	WindDirDegrees
32	2014-2-2	10	8	5	4	3	1	93	71	43	...	31	16	10	37	24	47.0	0.00	2.0	NaN	216
33	2014-2-3	8	6	3	5	3	2	93	79	55	...	31	15	10	32	18	NaN	0.00	4.0	NaN	133
43	2014-2-13	8	6	3	2	1	-1	81	66	45	...	31	15	10	42	27	48.0	0.00	2.0	NaN	236
52	2014-2-22	12	8	4	5	3	1	87	68	38	...	31	16	10	29	21	NaN	0.00	2.0	NaN	228
63	2014-3-5	13	7	0	4	2	-1	100	72	40	...	31	15	8	21	6	NaN	0.25	1.0	NaN	248
64	2014-3-6	13	9	5	7	5	3	93	77	50	...	31	15	7	32	14	42.0	0.00	2.0	NaN	206
66	2014-3-8	16	11	7	7	4	2	93	65	29	...	31	11	6	27	18	NaN	0.00	6.0	NaN	156
67	2014-3-9	19	12	6	7	4	3	81	58	28	...	26	9	5	24	10	34.0	0.00	NaN	NaN	170
68	2014-3-10	13	8	4	8	4	2	87	73	57	...	31	12	6	23	8	NaN	0.00	6.0	NaN	27
70	2014-3-12	16	9	2	8	4	1	93	76	40	...	19	9	3	11	8	NaN	0.00	4.0	NaN	31
73	2014-3-15	18	12	7	6	5	2	87	64	24	...	31	18	5	27	16	NaN	0.00	1.0	NaN	277
74	2014-3-16	19	13	7	8	6	3	87	63	28	...	31	19	7	24	16	NaN	0.00	NaN	NaN	266
75	2014-3-17	14	9	5	7	6	4	93	74	47	...	31	13	5	21	14	NaN	0.00	3.0	NaN	263
77	2014-3-19	16	11	7	7	5	4	93	68	35	...	31	18	8	26	18	NaN	0.00	2.0	NaN	241
87	2014-3-29	19	13	8	7	4	-1	93	57	18	...	31	9	3	34	16	NaN	0.00	NaN	NaN	99
92	2014-4-3	18	14	11	11	9	9	88	75	52	...	21	7	3	26	11	NaN	0.00	5.0	NaN	96
93	2014-4-4	15	12	9	10	8	6	88	71	46	...	31	11	2	24	14	NaN	0.00	5.0	NaN	253
98	2014-4-9	17	10	4	7	6	3	93	68	38	...	31	14	6	23	13	NaN	0.00	2.0	NaN	261
99	2014-4-10	16	11	6	6	5	3	93	62	31	...	31	14	9	19	10	NaN	0.00	3.0	NaN	278
100	2014-4-11	16	12	9	5	4	2	71	53	28	...	31	18	10	13	8	NaN	0.00	1.0	NaN	351
101	2014-4-12	14	10	6	7	5	3	87	66	42	...	31	16	10	29	11	NaN	0.00	1.0	NaN	262
102	2014-4-13	16	12	8	8	5	2	93	63	33	...	31	14	9	26	14	NaN	0.00	2.0	NaN	283
103	2014-4-14	16	11	6	5	3	0	87	57	25	...	31	17	10	21	13	NaN	0.00	1.0	NaN	330
104	2014-4-15	14	9	3	4	2	0	81	58	26	...	31	20	10	21	10	NaN	0.00	1.0	NaN	64
105	2014-4-16	17	10	3	4	1	-6	87	54	15	...	31	20	10	16	11	NaN	0.00	1.0	NaN	139
106	2014-4-17	17	11	6	7	4	1	81	59	27	...	31	17	10	21	11	NaN	0.00	2.0	NaN	279
107	2014-4-18	12	9	6	5	2	-1	81	57	28	...	31	15	10	19	13	NaN	0.00	3.0	NaN	21
108	2014-4-19	14	9	3	4	2	0	81	59	33	...	31	14	10	23	11	NaN	0.00	4.0	NaN	56
122	2014-5-3	14	8	2	3	1	-1	87	56	26	...	31	16	10	14	8	NaN	0.00	1.0	NaN	39
123	2014-5-4	16	10	4	6	4	2	87	62	32	...	31	17	10	19	10	40.0	0.00	1.0	NaN	196
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
252	2014-9-10	21	16	11	12	10	8	94	68	38	...	23	13	9	21	8	NaN	0.00	2.0	NaN	92
253	2014-9-11	18	14	10	11	10	8	94	70	45	...	13	8	5	11	6	NaN	0.00	6.0	NaN	42
254	2014-9-12	23	18	13	14	12	9	94	70	33	...	31	11	5	19	11	NaN	0.00	3.0	NaN	76
255	2014-9-13	21	17	12	14	12	11	100	75	45	...	31	12	3	19	11	NaN	0.00	4.0	NaN	56
257	2014-9-15	22	17	12	14	12	9	100	74	35	...	26	12	5	14	6	NaN	0.00	3.0	NaN	71
258	2014-9-16	23	18	13	16	14	11	100	77	39	...	19	6	2	19	8	NaN	0.00	5.0	NaN	70
259	2014-9-17	21	18	16	15	14	13	94	78	59	...	11	6	3	19	13	NaN	0.00	5.0	NaN	58
260	2014-9-18	26	20	14	16	15	13	94	74	43	...	19	6	2	19	11	35.0	0.00	5.0	NaN	71
264	2014-9-22	19	13	7	9	8	6	100	68	35	...	19	10	7	11	6	NaN	0.25	1.0	NaN	309
267	2014-9-25	19	13	7	12	9	5	93	71	48	...	31	17	10	26	13	NaN	0.00	5.0	NaN	249
269	2014-9-27	21	17	12	14	12	11	94	71	43	...	31	15	10	13	5	NaN	0.00	6.0	NaN	146
270	2014-9-28	24	19	14	15	14	13	94	72	41	...	27	13	6	16	6	NaN	0.00	4.0	NaN	116
272	2014-9-30	22	18	14	15	13	12	94	78	44	...	27	10	4	19	13	32.0	0.00	4.0	NaN	222
274	2014-10-2	22	19	14	14	13	11	94	75	46	...	31	15	6	21	8	35.0	0.00	3.0	NaN	203
292	2014-10-20	17	14	12	11	11	9	88	76	50	...	31	15	10	27	19	NaN	0.25	3.0	NaN	238
295	2014-10-23	16	13	11	14	10	7	94	76	58	...	31	14	9	23	16	NaN	0.00	5.0	NaN	214
297	2014-10-25	16	11	7	9	8	6	100	75	41	...	31	13	10	26	13	NaN	0.00	4.0	NaN	237
298	2014-10-26	16	13	12	9	8	7	77	67	45	...	31	12	10	27	18	40.0	0.00	5.0	NaN	217
299	2014-10-27	18	16	12	13	12	9	94	80	55	...	31	14	8	29	16	40.0	0.00	3.0	NaN	192
302	2014-10-30	18	16	13	15	14	12	100	89	72	...	13	6	1	16	10	NaN	0.00	5.0	NaN	170
303	2014-10-31	23	17	12	15	13	11	100	78	40	...	31	13	3	23	8	48.0	0.00	1.0	NaN	170
322	2014-11-19	11	9	7	9	7	5	100	88	70	...	10	5	3	14	10	NaN	0.00	2.0	NaN	93
323	2014-11-20	11	8	6	8	6	5	93	87	66	...	13	6	3	16	8	NaN	0.00	3.0	NaN	69
339	2014-12-6	6	2	-1	3	1	-1	100	85	60	...	31	15	10	14	8	NaN	0.25	4.0	NaN	265
346	2014-12-13	7	3	0	2	1	-1	100	83	53	...	31	19	9	13	8	NaN	0.00	2.0	NaN	278
351	2014-12-18	13	12	11	11	11	10	94	85	74	...	31	12	10	37	29	60.0	0.00	6.0	NaN	245
358	2014-12-25	8	6	2	3	2	1	93	79	51	...	31	19	10	24	16	NaN	0.00	1.0	NaN	275
361	2014-12-28	6	2	-2	1	-2	-3	100	76	45	...	31	17	8	16	11	NaN	0.00	1.0	NaN	5
362	2014-12-29	6	1	-4	2	-1	-4	100	87	62	...	24	10	5	13	8	NaN	0.25	1.0	NaN	289
363	2014-12-30	6	1	-4	2	-1	-4	100	90	68	...	27	9	3	14	6	NaN	0.25	3.0	NaN	258

114 rows × 23 columns

One way to deal with missing values is to replace them by some value. The column method fillna() fills all not available value cells with the value given as argument. In the example below, each missing event is replaced by the empty string.

In [24]:

london['Events'] = london['Events'].fillna('')
london[london['Events'].isnull()]

Out[24]:

	GMT	Max TemperatureC	Mean TemperatureC	Min TemperatureC	Dew PointC	MeanDew PointC	Min DewpointC	Max Humidity	Mean Humidity	Min Humidity	...	Max VisibilityKm	Mean VisibilityKm	Min VisibilitykM	Max Wind SpeedKm/h	Mean Wind SpeedKm/h	Max Gust SpeedKm/h	Precipitationmm	CloudCover	Events	WindDirDegrees

0 rows × 23 columns

The empty dataframe (no rows) confirms there are no more missing event values.

Another way to deal with missing values is to ignore rows with them. The dropna() dataframe method returns a new dataframe where all rows with at least one non-available value have been removed.

In [25]:

london.dropna()

Out[25]:

	GMT	Max TemperatureC	Mean TemperatureC	Min TemperatureC	Dew PointC	MeanDew PointC	Min DewpointC	Max Humidity	Mean Humidity	Min Humidity	...	Max VisibilityKm	Mean VisibilityKm	Min VisibilitykM	Max Wind SpeedKm/h	Mean Wind SpeedKm/h	Max Gust SpeedKm/h	Precipitationmm	CloudCover	Events	WindDirDegrees
0	2014-1-1	11	8	6	9	7	4	94	86	73	...	31	11	2	40	26	66.0	9.91	4.0	Rain	186
1	2014-1-2	11	9	6	9	6	4	94	81	60	...	31	15	5	32	21	50.0	1.02	2.0	Rain	214
2	2014-1-3	11	8	6	9	5	2	94	76	54	...	31	12	0	50	29	69.0	7.11	2.0	Rain-Thunderstorm	219
3	2014-1-4	9	6	3	8	6	2	93	85	65	...	31	12	5	35	21	39.0	9.91	4.0	Rain	211
5	2014-1-6	12	10	8	11	8	6	94	78	60	...	31	12	5	50	29	61.0	2.03	4.0	Rain	219
6	2014-1-7	12	10	8	8	7	6	87	78	63	...	31	13	5	39	27	55.0	5.08	3.0	Rain	219
7	2014-1-8	12	9	7	11	7	6	94	85	69	...	27	14	6	24	18	37.0	1.02	3.0	Rain	197
8	2014-1-9	11	8	5	11	5	2	94	76	53	...	31	13	5	40	26	60.0	0.00	4.0	Rain	247
9	2014-1-10	11	7	2	7	4	2	93	85	66	...	31	17	10	24	13	32.0	0.25	3.0	Rain	216
15	2014-1-16	11	8	6	9	6	4	94	85	67	...	31	14	3	27	18	47.0	3.05	3.0	Rain	188
16	2014-1-17	9	7	5	7	6	4	93	86	68	...	31	13	5	29	19	47.0	4.06	3.0	Rain	192
21	2014-1-22	11	8	6	7	6	4	100	87	62	...	31	13	6	23	14	32.0	0.25	5.0	Rain	221
22	2014-1-23	8	6	2	6	3	0	100	81	50	...	31	14	5	26	13	45.0	1.02	2.0	Rain	265
24	2014-1-25	12	9	5	8	6	0	100	83	48	...	31	10	2	40	13	61.0	2.03	4.0	Fog-Rain	249
25	2014-1-26	9	6	2	8	3	0	93	80	62	...	31	14	3	37	19	39.0	5.08	4.0	Rain	232
26	2014-1-27	8	5	2	4	2	-1	87	80	61	...	31	16	9	34	23	47.0	2.03	3.0	Rain	228
30	2014-1-31	9	5	1	8	4	1	100	90	72	...	23	7	0	37	13	50.0	7.11	5.0	Fog-Rain	173
31	2014-2-1	9	7	5	9	3	0	100	73	46	...	31	13	3	50	24	71.0	0.00	2.0	Rain	220
32	2014-2-2	10	8	5	4	3	1	93	71	43	...	31	16	10	37	24	47.0	0.00	2.0		216
34	2014-2-4	9	7	4	4	3	1	87	76	50	...	31	14	6	39	21	53.0	0.51	4.0	Rain	176
35	2014-2-5	10	8	5	7	4	2	93	75	56	...	26	13	6	45	34	66.0	2.03	4.0	Rain	192
36	2014-2-6	10	8	6	9	6	3	100	82	62	...	29	12	2	35	26	40.0	7.11	4.0	Rain	190
37	2014-2-7	9	7	4	9	4	0	100	78	46	...	31	12	3	39	19	58.0	0.51	4.0	Rain	230
38	2014-2-8	9	8	7	8	4	0	93	74	50	...	31	14	3	45	32	69.0	2.03	3.0	Rain	208
39	2014-2-9	8	7	4	3	1	-2	81	65	42	...	31	14	10	47	34	66.0	0.00	3.0	Rain	228
41	2014-2-11	8	5	2	6	2	-1	93	75	43	...	31	14	6	40	19	58.0	2.03	3.0	Rain	222
42	2014-2-12	9	6	2	6	2	-1	93	74	49	...	31	14	5	52	26	76.0	7.87	3.0	Rain-Thunderstorm	211
43	2014-2-13	8	6	3	2	1	-1	81	66	45	...	31	15	10	42	27	48.0	0.00	2.0		236
44	2014-2-14	12	8	3	10	4	0	94	81	60	...	31	14	5	63	24	90.0	8.89	5.0	Rain	180
45	2014-2-15	11	8	5	5	3	2	87	67	51	...	31	15	9	66	47	85.0	0.51	4.0	Rain	237
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
281	2014-10-9	18	14	11	11	10	8	88	77	48	...	31	15	3	39	21	53.0	2.03	2.0	Rain-Thunderstorm	209
282	2014-10-10	18	13	9	12	10	8	94	80	45	...	31	17	10	26	14	42.0	3.05	2.0	Rain	207
284	2014-10-12	14	10	6	11	8	5	100	84	57	...	31	14	6	23	6	37.0	2.03	2.0	Fog-Rain	53
288	2014-10-16	18	14	11	14	12	11	100	87	58	...	31	12	6	21	10	34.0	0.25	2.0	Rain	214
289	2014-10-17	19	16	13	15	14	12	100	84	62	...	29	14	8	24	13	42.0	0.51	2.0	Rain	188
291	2014-10-19	21	17	14	16	13	11	94	78	48	...	31	14	5	34	21	47.0	0.00	3.0	Rain	225
293	2014-10-21	14	11	8	13	6	1	94	65	37	...	31	14	3	47	27	69.0	3.05	3.0	Rain	268
298	2014-10-26	16	13	12	9	8	7	77	67	45	...	31	12	10	27	18	40.0	0.00	5.0		217
299	2014-10-27	18	16	12	13	12	9	94	80	55	...	31	14	8	29	16	40.0	0.00	3.0		192
300	2014-10-28	18	14	9	14	12	9	100	83	47	...	31	13	0	24	11	37.0	0.00	3.0	Fog-Rain	181
303	2014-10-31	23	17	12	15	13	11	100	78	40	...	31	13	3	23	8	48.0	0.00	1.0		170
304	2014-11-1	18	16	13	15	11	6	94	74	39	...	29	13	9	32	18	48.0	0.25	3.0	Rain	211
305	2014-11-2	16	13	10	13	11	8	94	82	64	...	29	14	0	27	23	50.0	11.94	4.0	Rain	207
306	2014-11-3	10	8	6	9	7	4	94	84	60	...	31	14	2	23	14	34.0	5.08	3.0	Rain	208
310	2014-11-7	13	10	8	12	8	5	95	80	52	...	31	11	2	42	23	42.0	5.08	4.0	Rain	207
311	2014-11-8	14	11	7	11	8	5	94	84	64	...	31	14	5	34	18	52.0	2.03	3.0	Rain	192
312	2014-11-9	13	10	6	9	7	4	100	83	47	...	31	11	0	14	6	32.0	0.00	2.0	Fog	212
336	2014-12-3	8	6	4	5	3	1	93	78	56	...	31	15	9	23	16	37.0	0.51	4.0	Rain	17
342	2014-12-9	12	6	-1	11	3	-1	100	86	64	...	31	15	3	37	14	52.0	0.25	5.0	Rain	224
345	2014-12-12	9	6	2	9	4	-1	94	79	59	...	31	11	3	40	26	55.0	1.02	3.0	Rain	253
347	2014-12-14	10	4	-1	9	3	-1	100	84	62	...	31	16	9	29	14	45.0	0.25	3.0	Rain	215
350	2014-12-17	13	9	4	11	10	6	100	90	76	...	31	12	6	34	21	55.0	0.25	5.0	Rain	252
351	2014-12-18	13	12	11	11	11	10	94	85	74	...	31	12	10	37	29	60.0	0.00	6.0		245
353	2014-12-20	9	7	4	4	3	2	87	75	53	...	31	15	10	32	19	45.0	0.00	3.0	Rain	251
354	2014-12-21	12	8	3	9	6	3	93	80	65	...	31	12	10	35	23	47.0	0.00	6.0	Rain	237
355	2014-12-22	13	12	10	11	9	7	94	77	63	...	31	14	8	40	32	55.0	0.00	5.0	Rain	240
356	2014-12-23	13	11	10	9	7	5	82	72	58	...	31	12	10	42	32	61.0	0.00	4.0	Rain	238
357	2014-12-24	12	8	4	10	4	-2	94	71	38	...	31	13	10	34	26	45.0	0.25	5.0	Rain	262
360	2014-12-27	4	3	2	8	1	-1	93	83	64	...	31	13	9	42	26	61.0	0.00	4.0	Rain	330
364	2014-12-31	8	2	-4	5	2	-4	100	89	70	...	19	9	0	23	8	37.0	0.25	4.0	Fog	201

128 rows × 23 columns

Note that the table above has fewer than 251 of the original 365 rows, so there must be further null values besides the 114 missing events.

Changing the value type of a column¶

The type of every column in a dataframe can be determined by looking at the dataframe's dtypes attribute, like this:

In [26]:

london.dtypes

Out[26]:

GMT                            object
Max TemperatureC                int64
Mean TemperatureC               int64
Min TemperatureC                int64
Dew PointC                      int64
MeanDew PointC                  int64
Min DewpointC                   int64
Max Humidity                    int64
Mean Humidity                   int64
Min Humidity                    int64
Max Sea Level PressurehPa       int64
Mean Sea Level PressurehPa      int64
Min Sea Level PressurehPa       int64
Max VisibilityKm                int64
Mean VisibilityKm               int64
Min VisibilitykM                int64
Max Wind SpeedKm/h              int64
Mean Wind SpeedKm/h             int64
Max Gust SpeedKm/h            float64
Precipitationmm               float64
CloudCover                    float64
Events                         object
WindDirDegrees                 object
dtype: object

The type of all the values in a column can be changed using the astype() method. The following code will change the values in the 'WindDirDegrees' column from strings (object) to integers (int64).

In [27]:

london['WindDirDegrees'] = london['WindDirDegrees'].astype('int64')   

The function to_datetime() is needed to change the values in the 'GMT' column from strings (object) to dates (datetime64):

In [28]:

london['GMT'] = to_datetime(london['GMT'])
london.dtypes

Out[28]:

GMT                           datetime64[ns]
Max TemperatureC                       int64
Mean TemperatureC                      int64
Min TemperatureC                       int64
Dew PointC                             int64
MeanDew PointC                         int64
Min DewpointC                          int64
Max Humidity                           int64
Mean Humidity                          int64
Min Humidity                           int64
Max Sea Level PressurehPa              int64
Mean Sea Level PressurehPa             int64
Min Sea Level PressurehPa              int64
Max VisibilityKm                       int64
Mean VisibilityKm                      int64
Min VisibilitykM                       int64
Max Wind SpeedKm/h                     int64
Mean Wind SpeedKm/h                    int64
Max Gust SpeedKm/h                   float64
Precipitationmm                      float64
CloudCover                           float64
Events                                object
WindDirDegrees                         int64
dtype: object

Values of type datetime64 can be created using the datetime() function, provided by the package of the same name. The first integer argument is the year, the second the month and the third the day. The code below will get and display the row in the dataframe whose 'GMT' value is 4th June 2014.

In [29]:

from datetime import datetime

london[london['GMT'] == datetime(2014, 6, 4)] 

Out[29]:

	GMT	Max TemperatureC	Mean TemperatureC	Min TemperatureC	Dew PointC	MeanDew PointC	Min DewpointC	Max Humidity	Mean Humidity	Min Humidity	...	Max VisibilityKm	Mean VisibilityKm	Min VisibilitykM	Max Wind SpeedKm/h	Mean Wind SpeedKm/h	Max Gust SpeedKm/h	Precipitationmm	CloudCover	Events	WindDirDegrees
154	2014-06-04	14	11	8	11	9	3	94	78	47	...	31	12	7	47	13	64.0	2.03	4.0	Rain-Thunderstorm	230

1 rows × 23 columns

Queries such as 'Return all the rows where the date is between 8 December and 12 December' can be made:

In [30]:

dates = london['GMT']
start = datetime(2014, 12, 8)
end = datetime(2014, 12, 12)
london[(dates >= start) & (dates <= end)]

Out[30]:

	GMT	Max TemperatureC	Mean TemperatureC	Min TemperatureC	Dew PointC	MeanDew PointC	Min DewpointC	Max Humidity	Mean Humidity	Min Humidity	...	Max VisibilityKm	Mean VisibilityKm	Min VisibilitykM	Max Wind SpeedKm/h	Mean Wind SpeedKm/h	Max Gust SpeedKm/h	Precipitationmm	CloudCover	Events	WindDirDegrees
341	2014-12-08	7	4	1	2	1	-1	93	76	51	...	31	17	10	23	13	NaN	0.00	2.0	Rain	270
342	2014-12-09	12	6	-1	11	3	-1	100	86	64	...	31	15	3	37	14	52.0	0.25	5.0	Rain	224
343	2014-12-10	9	7	6	11	3	1	90	70	44	...	31	18	3	40	27	NaN	0.00	2.0	Rain	246
344	2014-12-11	10	8	5	6	3	2	82	72	51	...	31	15	10	40	31	NaN	0.25	2.0	Rain	246
345	2014-12-12	9	6	2	9	4	-1	94	79	59	...	31	11	3	40	26	55.0	1.02	3.0	Rain	253

5 rows × 23 columns

Tasks¶

Now that the wind direction is given by a number, write code to select all days that had a northerly wind. Hint: select the rows where the direction is greater than or equal to 350 or smaller than or equal to 10, as the compass rose shows.

In [ ]:

In the code cell below, write code to get and display all the rows in the dataframe that are beween 1 April 2014 and 11 April 2014.

In [ ]:

In the cell below, write two lines of code to display the first five rows that have a missing value in the 'Max Gust SpeedKm/h' column. Hint: first select the missing value rows and store them in a new dataframe, then display the first five rows of the new dataframe.

In [ ]:

Now go back to the course.

Exercise 5: Every picture tells a story¶

The following line of code tells Jupyter to display inside this notebook any graph that is created.

In [31]:

%matplotlib inline

The plot() method can make a graph of the values in a column. Gridlines are turned on by the grid argument.

In [32]:

london['Max Wind SpeedKm/h'].plot(grid=True)

Out[32]:

<matplotlib.axes._subplots.AxesSubplot at 0x109ea33c8>

The graph can be made bigger by giving the method a figsize=(x,y) argument where x and y are integers that determine the length of the x-axis and y-axis.

In [33]:

london['Max Wind SpeedKm/h'].plot(grid=True, figsize=(10,5))

Out[33]:

<matplotlib.axes._subplots.AxesSubplot at 0x10a366550>

Multiple lines can be plotted by selecting multiple columns.

In [34]:

london[['Max Wind SpeedKm/h', 'Mean Wind SpeedKm/h']].plot(grid=True, figsize=(10,5))

Out[34]:

<matplotlib.axes._subplots.AxesSubplot at 0x112637160>

Task¶

In the cell below, write code to plot the minimum, mean, and maximum temperature during 2014 in London.

In [ ]:

Now go back to the course.

Exercise 6: Changing a dataframe's index¶

Changing the dataframe's index from the default to datetime64 values is done by assigning to the dataframe's index attribute the contents of the 'GMT' column, like this:

In [35]:

london.index = london['GMT']
london.head(2)

Out[35]:

	GMT	Max TemperatureC	Mean TemperatureC	Min TemperatureC	Dew PointC	MeanDew PointC	Min DewpointC	Max Humidity	Mean Humidity	Min Humidity	...	Max VisibilityKm	Mean VisibilityKm	Min VisibilitykM	Max Wind SpeedKm/h	Mean Wind SpeedKm/h	Max Gust SpeedKm/h	Precipitationmm	CloudCover	Events	WindDirDegrees
GMT
2014-01-01	2014-01-01	11	8	6	9	7	4	94	86	73	...	31	11	2	40	26	66.0	9.91	4.0	Rain	186
2014-01-02	2014-01-02	11	9	6	9	6	4	94	81	60	...	31	15	5	32	21	50.0	1.02	2.0	Rain	214

2 rows × 23 columns

The iloc attribute can still be used to get and display rows by number, but now you can now also use the datetime64 index to get a row by date, using the dataframe's loc attribute, like this:

In [36]:

london.loc[datetime(2014, 1, 1)]

Out[36]:

GMT                           2014-01-01 00:00:00
Max TemperatureC                               11
Mean TemperatureC                               8
Min TemperatureC                                6
Dew PointC                                      9
MeanDew PointC                                  7
Min DewpointC                                   4
Max Humidity                                   94
Mean Humidity                                  86
Min Humidity                                   73
Max Sea Level PressurehPa                    1002
Mean Sea Level PressurehPa                    993
Min Sea Level PressurehPa                     984
Max VisibilityKm                               31
Mean VisibilityKm                              11
Min VisibilitykM                                2
Max Wind SpeedKm/h                             40
Mean Wind SpeedKm/h                            26
Max Gust SpeedKm/h                             66
Precipitationmm                              9.91
CloudCover                                      4
Events                                       Rain
WindDirDegrees                                186
Name: 2014-01-01 00:00:00, dtype: object

A query such as 'Return all the rows where the date is between December 8th and December 12th' can now be done succinctly like this:

In [37]:

london.loc[datetime(2014,12,8) : datetime(2014,12,12)]

#The meaning of the above code is get the rows beween and including 
#the indices datetime(2014,12,8) and datetime(2014,12,12)

Out[37]:

	GMT	Max TemperatureC	Mean TemperatureC	Min TemperatureC	Dew PointC	MeanDew PointC	Min DewpointC	Max Humidity	Mean Humidity	Min Humidity	...	Max VisibilityKm	Mean VisibilityKm	Min VisibilitykM	Max Wind SpeedKm/h	Mean Wind SpeedKm/h	Max Gust SpeedKm/h	Precipitationmm	CloudCover	Events	WindDirDegrees
GMT
2014-12-08	2014-12-08	7	4	1	2	1	-1	93	76	51	...	31	17	10	23	13	NaN	0.00	2.0	Rain	270
2014-12-09	2014-12-09	12	6	-1	11	3	-1	100	86	64	...	31	15	3	37	14	52.0	0.25	5.0	Rain	224
2014-12-10	2014-12-10	9	7	6	11	3	1	90	70	44	...	31	18	3	40	27	NaN	0.00	2.0	Rain	246
2014-12-11	2014-12-11	10	8	5	6	3	2	82	72	51	...	31	15	10	40	31	NaN	0.25	2.0	Rain	246
2014-12-12	2014-12-12	9	6	2	9	4	-1	94	79	59	...	31	11	3	40	26	55.0	1.02	3.0	Rain	253

5 rows × 23 columns

Now we have a datetime64 index, let's plot 'Max Wind SpeedKm/h' again:

In [38]:

london['Max Wind SpeedKm/h'].plot(grid=True, figsize=(10,5))

Out[38]:

<matplotlib.axes._subplots.AxesSubplot at 0x10a4414e0>

Now it is much clearer that the worst winds were in mid February.

Task¶

Use the code cell below to plot the values of 'Mean Humidity' during spring (full months of March, April and May).

In [ ]:

Now go back to the course.

Exercise notebook 2: Cleaning up our act¶

Exercise 1: Dataframes and CSV files¶

Dataframe attributes¶

Dataframe rows¶

The head() method¶

The tail() method¶

Selecting and displaying columns in a dataframe¶

Applying methods to a dataframe column¶

Tasks¶

Exercise 2: Comparison operators¶

Task¶

Exercise 3: Bitwise operators¶

Task¶

Exercise 4: Display rows from dataframe¶

Removing initial spaces¶

Removing extra characters¶

Missing values¶

Changing the value type of a column¶

Tasks¶

Exercise 5: Every picture tells a story¶

Task¶

Exercise 6: Changing a dataframe's index¶

Task¶

The `head()` method¶

The `tail()` method¶