Problem 1: Healthy Breakfast revisited¶

In this exercise we will do some simple analyses on a data set that contains information about 77 types of breakfast cereal. You can read about the data set at CMU's website. Download the files cereal.csv from the course website.

Today you will be using pandas to read the data and do the analysis!

In [5]:

%matplotlib inline

In [6]:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

1.1 Import the data using pandas¶

In [27]:

col_labels = ["name", "mfr", "type", "calories (number)", "protein (g)", "fat (g)", "sodium (mg)", 
              "dietary fiber (g)", "complex carbohydrates (g)", "sugars (g)", 
              "shelf", "potassium (mg)", 
              "vitamins and minerals", 
            "weight (in ounces) of one serving (serving size)", "cups per serving"]

df = pd.read_csv('cereal.txt', sep=' ', header=None, names=col_labels)

In [28]:

df

Out[28]:

	name	mfr	type	calories (number)	protein (g)	fat (g)	sodium (mg)	dietary fiber (g)	complex carbohydrates (g)	sugars (g)	shelf	potassium (mg)	vitamins and minerals	weight (in ounces) of one serving (serving size)	cups per serving
0	100%_Bran	N	C	70	4	1	130	10.0	5.0	6	3	280	25	1.00	0.33
1	100%_Natural_Bran	Q	C	120	3	5	15	2.0	8.0	8	3	135	0	1.00	-1.00
2	All-Bran	K	C	70	4	1	260	9.0	7.0	5	3	320	25	1.00	0.33
3	All-Bran_with_Extra_Fiber	K	C	50	4	0	140	14.0	8.0	0	3	330	25	1.00	0.50
4	Almond_Delight	R	C	110	2	2	200	1.0	14.0	8	3	-1	25	1.00	0.75
5	Apple_Cinnamon_Cheerios	G	C	110	2	2	180	1.5	10.5	10	1	70	25	1.00	0.75
6	Apple_Jacks	K	C	110	2	0	125	1.0	11.0	14	2	30	25	1.00	1.00
7	Basic_4	G	C	130	3	2	210	2.0	18.0	8	3	100	25	1.33	0.75
8	Bran_Chex	R	C	90	2	1	200	4.0	15.0	6	1	125	25	1.00	0.67
9	Bran_Flakes	P	C	90	3	0	210	5.0	13.0	5	3	190	25	1.00	0.67
10	Cap'n'Crunch	Q	C	120	1	2	220	0.0	12.0	12	2	35	25	1.00	0.75
11	Cheerios	G	C	110	6	2	290	2.0	17.0	1	1	105	25	1.00	1.25
12	Cinnamon_Toast_Crunch	G	C	120	1	3	210	0.0	13.0	9	2	45	25	1.00	0.75
13	Clusters	G	C	110	3	2	140	2.0	13.0	7	3	105	25	1.00	0.50
14	Cocoa_Puffs	G	C	110	1	1	180	0.0	12.0	13	2	55	25	1.00	1.00
15	Corn_Chex	R	C	110	2	0	280	0.0	22.0	3	1	25	25	1.00	1.00
16	Corn_Flakes	K	C	100	2	0	290	1.0	21.0	2	1	35	25	1.00	1.00
17	Corn_Pops	K	C	110	1	0	90	1.0	13.0	12	2	20	25	1.00	1.00
18	Count_Chocula	G	C	110	1	1	180	0.0	12.0	13	2	65	25	1.00	1.00
19	Cracklin'_Oat_Bran	K	C	110	3	3	140	4.0	10.0	7	3	160	25	1.00	0.50
20	Cream_of_Wheat_(Quick)	N	H	100	3	0	80	1.0	21.0	0	2	-1	0	1.00	1.00
21	Crispix	K	C	110	2	0	220	1.0	21.0	3	3	30	25	1.00	1.00
22	Crispy_Wheat_&_Raisins	G	C	100	2	1	140	2.0	11.0	10	3	120	25	1.00	0.75
23	Double_Chex	R	C	100	2	0	190	1.0	18.0	5	3	80	25	1.00	0.75
24	Froot_Loops	K	C	110	2	1	125	1.0	11.0	13	2	30	25	1.00	1.00
25	Frosted_Flakes	K	C	110	1	0	200	1.0	14.0	11	1	25	25	1.00	0.75
26	Frosted_Mini-Wheats	K	C	100	3	0	0	3.0	14.0	7	2	100	25	1.00	0.80
27	Fruit_&_Fibre_Dates,_Walnuts,_and_Oats	P	C	120	3	2	160	5.0	12.0	10	3	200	25	1.25	0.67
28	Fruitful_Bran	K	C	120	3	0	240	5.0	14.0	12	3	190	25	1.33	0.67
29	Fruity_Pebbles	P	C	110	1	1	135	0.0	13.0	12	2	25	25	1.00	0.75
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
47	Multi-Grain_Cheerios	G	C	100	2	1	220	2.0	15.0	6	1	90	25	1.00	1.00
48	Nut&Honey_Crunch	K	C	120	2	1	190	0.0	15.0	9	2	40	25	1.00	0.67
49	Nutri-Grain_Almond-Raisin	K	C	140	3	2	220	3.0	21.0	7	3	130	25	1.33	0.67
50	Nutri-grain_Wheat	K	C	90	3	0	170	3.0	18.0	2	3	90	25	1.00	-1.00
51	Oatmeal_Raisin_Crisp	G	C	130	3	2	170	1.5	13.5	10	3	120	25	1.25	0.50
52	Post_Nat._Raisin_Bran	P	C	120	3	1	200	6.0	11.0	14	3	260	25	1.33	0.67
53	Product_19	K	C	100	3	0	320	1.0	20.0	3	3	45	100	1.00	1.00
54	Puffed_Rice	Q	C	50	1	0	0	0.0	13.0	0	3	15	0	0.50	1.00
55	Puffed_Wheat	Q	C	50	2	0	0	1.0	10.0	0	3	50	0	0.50	-1.00
56	Quaker_Oat_Squares	Q	C	100	4	1	135	2.0	14.0	6	3	110	25	1.00	0.50
57	Quaker_Oatmeal	Q	H	100	5	2	0	2.7	-1.0	-1	1	110	0	1.00	0.67
58	Raisin_Bran	K	C	120	3	1	210	5.0	14.0	12	2	240	25	1.33	0.75
59	Raisin_Nut_Bran	G	C	100	3	2	140	2.5	10.5	8	3	140	25	1.00	0.50
60	Raisin_Squares	K	C	90	2	0	0	2.0	15.0	6	3	110	25	1.00	0.50
61	Rice_Chex	R	C	110	1	0	240	0.0	23.0	2	1	30	25	1.00	1.13
62	Rice_Krispies	K	C	110	2	0	290	0.0	22.0	3	1	35	25	1.00	1.00
63	Shredded_Wheat	N	C	80	2	0	0	3.0	16.0	0	1	95	0	0.83	-1.00
64	Shredded_Wheat_'n'Bran	N	C	90	3	0	0	4.0	19.0	0	1	140	0	1.00	0.67
65	Shredded_Wheat_spoon_size	N	C	90	3	0	0	3.0	20.0	0	1	120	0	1.00	0.67
66	Smacks	K	C	110	2	1	70	1.0	9.0	15	2	40	25	1.00	0.75
67	Special_K	K	C	110	6	0	230	1.0	16.0	3	1	55	25	1.00	1.00
68	Strawberry_Fruit_Wheats	N	C	90	2	0	15	3.0	15.0	5	2	90	25	1.00	-1.00
69	Total_Corn_Flakes	G	C	110	2	1	200	0.0	21.0	3	3	35	100	1.00	1.00
70	Total_Raisin_Bran	G	C	140	3	1	190	4.0	15.0	14	3	230	100	1.50	1.00
71	Total_Whole_Grain	G	C	100	3	1	200	3.0	16.0	3	3	110	100	1.00	1.00
72	Triples	G	C	110	2	1	250	0.0	21.0	3	3	60	25	1.00	0.75
73	Trix	G	C	110	1	1	140	0.0	13.0	12	2	25	25	1.00	1.00
74	Wheat_Chex	R	C	100	3	1	230	3.0	17.0	3	1	115	25	1.00	0.67
75	Wheaties	G	C	100	3	1	200	3.0	17.0	3	1	110	25	1.00	1.00
76	Wheaties_Honey_Gold	G	C	110	2	1	200	1.0	16.0	8	1	60	25	1.00	0.75

77 rows × 15 columns

1.2 Plot FatVsCalories¶

In [16]:

df.plot(x='fat (g)', y='calories (number)', kind='scatter')

Out[16]:

<matplotlib.axes._subplots.AxesSubplot at 0xf094506518>

1.3 Compute the average number of calories, as well as the average amount of complex carbohydrates and sugars in the data set.¶

In [17]:

#Start your answers here - add more cells as necessary!
avg_calories = df['calories (number)'].mean()
avg_cplx_carbs = df['complex carbohydrates (g)'].mean()
avg_sugars = df['sugars (g)'].mean()

In [18]:

avg_calories, avg_cplx_carbs, avg_sugars

Out[18]:

(106.88311688311688, 14.597402597402597, 6.922077922077922)

1.4 What is the name of the cereal that has the highest sugar content?¶

In [26]:

df['sugars (g)'].max()

Out[26]:

In [25]:

df[df['sugars (g)'] == df['sugars (g)'].max()].name

Out[25]:

30    Golden_Crisp
66          Smacks
Name: name, dtype: object

1.5 Calorie content by shelf¶

The 8th column of the data set contains the shelf at which the cereal is displayed in the supermarket (either 1, 2, or 3). Compute the mean number of calories of the cereals displayed in each shelf. If you do this correctly you should find the the cereals displayed on the middle shelf have a higher calorie content on average.

In [29]:

df.shelf

Out[29]:

0     3
1     3
2     3
3     3
4     3
5     1
6     2
7     3
8     1
9     3
10    2
11    1
12    2
13    3
14    2
15    1
16    1
17    2
18    2
19    3
20    2
21    3
22    3
23    3
24    2
25    1
26    2
27    3
28    3
29    2
     ..
47    1
48    2
49    3
50    3
51    3
52    3
53    3
54    3
55    3
56    3
57    1
58    2
59    3
60    3
61    1
62    1
63    1
64    1
65    1
66    2
67    1
68    2
69    3
70    3
71    3
72    3
73    2
74    1
75    1
76    1
Name: shelf, dtype: int64

Hint: remember .grouby ?

In [30]:

df.groupby('shelf').mean()

Out[30]:

	calories (number)	protein (g)	fat (g)	sodium (mg)	dietary fiber (g)	complex carbohydrates (g)	sugars (g)	potassium (mg)	vitamins and minerals	weight (in ounces) of one serving (serving size)	cups per serving
shelf
1	102.500000	2.650000	0.60	176.250000	1.685000	15.800000	4.800000	75.500000	20.000000	0.991500	0.797000
2	109.523810	1.904762	1.00	145.714286	0.904762	13.619048	9.619048	57.809524	23.809524	1.015714	0.720952
3	107.777778	2.861111	1.25	158.611111	3.138889	14.500000	6.527778	129.833333	35.416667	0.947778	0.392778

1.6 Make a box plot showing the sugar content of cereals grouped by shelf (recreating the plot that you'll find on the homepage for this data set).¶

In [33]:

df.boxplot(by='shelf', column='sugars (g)');

1.7 Is there a relationship between sugar content and fat content?¶

Make a scatter plot to explore.

In [34]:

df.plot(kind='scatter', x='sugars (g)', y="fat (g)");

1.8 Create a pivot table to analyse contents by manufacturer and type of cereal (hot/cold)¶

Let's focus on protein and sodium contents.

In [43]:

df.pivot_table(values=["protein (g)", "sodium (mg)"], index=['mfr', 'type'], margins=True)

Out[43]:

		protein (g)	sodium (mg)
mfr	type
A	H	4.000000	0.000000
G	C	2.318182	200.454545
K	C	2.652174	174.782609
N	C	2.800000	29.000000
N	H	3.000000	80.000000
P	C	2.444444	146.111111
Q	C	2.285714	105.714286
Q	H	5.000000	0.000000
R	C	2.500000	198.125000
All		2.545455	159.675325

1.9 Which cereal has the heighest protein per cup ratio?¶

In [48]:

proteinPerCup = df['protein (g)'] / df['cups per serving']

In [51]:

proteinPerCup.max()

Out[51]:

12.121212121212121

In [50]:

df[proteinPerCup == proteinPerCup.max()]

Out[50]:

	name	mfr	type	calories (number)	protein (g)	fat (g)	sodium (mg)	dietary fiber (g)	complex carbohydrates (g)	sugars (g)	shelf	potassium (mg)	vitamins and minerals	weight (in ounces) of one serving (serving size)	cups per serving
0	100%_Bran	N	C	70	4	1	130	10.0	5.0	6	3	280	25	1.0	0.33
2	All-Bran	K	C	70	4	1	260	9.0	7.0	5	3	320	25	1.0	0.33

Problem 2: Chocolate makes you faster?¶

Your colleague (a psychologist) has conducted an experiment where he measured subjects' performance on a memory and a reaction time task. He claims to have found a correlation between the two, but you are sceptical, because you know your colleague is not very good at clicking buttons in SPSS. Luckily, you can convince your colleague to give you his data file so you can assess his claims.

1. He gives you the file experiment.csv. It contains one line per subject, and each line contains the values for the following variables that he measured: height, weight, amount-chocolate, task-one-score, task-two-score. (He allowed the participants to eat chocolate during the experiment, and the third column shows the amount of chocolate they ate in grams.) Load the data into an array.

In [52]:

col_names = ["height", "weight", "amount-chocolate", "task-one-score", "task-two-score"]
df = pd.read_csv("experiment.csv", header=None, sep=' ', names=col_names)

2. As a sanity check, create a scatter plot of height vs. weight. These two should clearly be correlated and you should see a straight line relationship.

In [13]:

df.plot(kind='scatter', x='height', y='weight')

Out[13]:

<matplotlib.axes._subplots.AxesSubplot at 0x107b92150>

3. To assess your colleague's claim, first create a box plot of the scores in task 1 and task 2. Also, compute the correlation between the two (using Pearson's $r$) and check whether it is significant.

In [53]:

print("Pearson's r is {}".format(df['task-one-score'].corr(df['task-two-score'])))

df.boxplot(column=['task-one-score', 'task-two-score'],return_type='axes');

Pearson's r is 0.04231920574803165

4. When your colleague hears about your results, he is a bit disappointed. However, he already has a new hypothesis: maybe chocolate is acting as a promoter somehow and the amount of chocolate consumed has some effect? Create some plots and hypothesis tests to assess this new hypothesis.

In [15]:

df.plot(kind='scatter', x='task-one-score', y='amount-chocolate');
df.plot(kind='scatter', x='task-two-score', y='amount-chocolate');



print("Pearson's r for task one is {}".format(df['task-one-score'].corr( 
                                                       df['amount-chocolate'])))

print("Pearson's r for task two is {}".format(df['task-two-score'].corr( 
                                                       df['amount-chocolate'])))

Pearson's r for task one is 0.0116431816265
Pearson's r for task two is -0.00551700962886

5. After you come back with your results, your colleague has one last idea: Maybe you need to eat more than some critical amount of chocolate (which depends on your body mass index) in order for there to be a correlation between the two tasks? He guesses that you need to eat at least six times your BMI in chocolate in order for the correlation to show up. You think your colleague is crazy, but go ahead an test his claim. What do you find? Again, produce some figures, compute correlations and test for significance.

In [16]:

bmi = df['weight'] / (df['height'] ** 2) 
filtered_group = df.ix[df['amount-chocolate'] > bmi * 6]

filtered_group.plot(kind='scatter', x='task-one-score', y='amount-chocolate');
filtered_group.plot(kind='scatter', x='task-two-score',y='amount-chocolate');

print("Pearson's r for task one is {}".format(filtered_group['task-one-score'].corr( 
                                                       filtered_group['amount-chocolate'])))

print("Pearson's r for task two is {}".format(filtered_group['task-two-score'].corr(
                                                       filtered_group['amount-chocolate'])))

Pearson's r for task one is -0.167591510532
Pearson's r for task two is -0.207677642024

In [ ]: