In this exercise we will do some simple analyses on a data set that contains information about 77 types of breakfast cereal. You can read about the data set at CMU's website. Download the files cereal.csv from the course website.
Today you will be using pandas to read the data and do the analysis!
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
col_labels = ["name", "mfr", "type", "calories (number)", "protein (g)", "fat (g)", "sodium (mg)",
"dietary fiber (g)", "complex carbohydrates (g)", "sugars (g)",
"shelf", "potassium (mg)",
"vitamins and minerals",
"weight (in ounces) of one serving (serving size)", "cups per serving"]
df = pd.read_csv('cereal.txt', sep=' ', header=None, names=col_labels)
df
name | mfr | type | calories (number) | protein (g) | fat (g) | sodium (mg) | dietary fiber (g) | complex carbohydrates (g) | sugars (g) | shelf | potassium (mg) | vitamins and minerals | weight (in ounces) of one serving (serving size) | cups per serving | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 100%_Bran | N | C | 70 | 4 | 1 | 130 | 10.0 | 5.0 | 6 | 3 | 280 | 25 | 1.00 | 0.33 |
1 | 100%_Natural_Bran | Q | C | 120 | 3 | 5 | 15 | 2.0 | 8.0 | 8 | 3 | 135 | 0 | 1.00 | -1.00 |
2 | All-Bran | K | C | 70 | 4 | 1 | 260 | 9.0 | 7.0 | 5 | 3 | 320 | 25 | 1.00 | 0.33 |
3 | All-Bran_with_Extra_Fiber | K | C | 50 | 4 | 0 | 140 | 14.0 | 8.0 | 0 | 3 | 330 | 25 | 1.00 | 0.50 |
4 | Almond_Delight | R | C | 110 | 2 | 2 | 200 | 1.0 | 14.0 | 8 | 3 | -1 | 25 | 1.00 | 0.75 |
5 | Apple_Cinnamon_Cheerios | G | C | 110 | 2 | 2 | 180 | 1.5 | 10.5 | 10 | 1 | 70 | 25 | 1.00 | 0.75 |
6 | Apple_Jacks | K | C | 110 | 2 | 0 | 125 | 1.0 | 11.0 | 14 | 2 | 30 | 25 | 1.00 | 1.00 |
7 | Basic_4 | G | C | 130 | 3 | 2 | 210 | 2.0 | 18.0 | 8 | 3 | 100 | 25 | 1.33 | 0.75 |
8 | Bran_Chex | R | C | 90 | 2 | 1 | 200 | 4.0 | 15.0 | 6 | 1 | 125 | 25 | 1.00 | 0.67 |
9 | Bran_Flakes | P | C | 90 | 3 | 0 | 210 | 5.0 | 13.0 | 5 | 3 | 190 | 25 | 1.00 | 0.67 |
10 | Cap'n'Crunch | Q | C | 120 | 1 | 2 | 220 | 0.0 | 12.0 | 12 | 2 | 35 | 25 | 1.00 | 0.75 |
11 | Cheerios | G | C | 110 | 6 | 2 | 290 | 2.0 | 17.0 | 1 | 1 | 105 | 25 | 1.00 | 1.25 |
12 | Cinnamon_Toast_Crunch | G | C | 120 | 1 | 3 | 210 | 0.0 | 13.0 | 9 | 2 | 45 | 25 | 1.00 | 0.75 |
13 | Clusters | G | C | 110 | 3 | 2 | 140 | 2.0 | 13.0 | 7 | 3 | 105 | 25 | 1.00 | 0.50 |
14 | Cocoa_Puffs | G | C | 110 | 1 | 1 | 180 | 0.0 | 12.0 | 13 | 2 | 55 | 25 | 1.00 | 1.00 |
15 | Corn_Chex | R | C | 110 | 2 | 0 | 280 | 0.0 | 22.0 | 3 | 1 | 25 | 25 | 1.00 | 1.00 |
16 | Corn_Flakes | K | C | 100 | 2 | 0 | 290 | 1.0 | 21.0 | 2 | 1 | 35 | 25 | 1.00 | 1.00 |
17 | Corn_Pops | K | C | 110 | 1 | 0 | 90 | 1.0 | 13.0 | 12 | 2 | 20 | 25 | 1.00 | 1.00 |
18 | Count_Chocula | G | C | 110 | 1 | 1 | 180 | 0.0 | 12.0 | 13 | 2 | 65 | 25 | 1.00 | 1.00 |
19 | Cracklin'_Oat_Bran | K | C | 110 | 3 | 3 | 140 | 4.0 | 10.0 | 7 | 3 | 160 | 25 | 1.00 | 0.50 |
20 | Cream_of_Wheat_(Quick) | N | H | 100 | 3 | 0 | 80 | 1.0 | 21.0 | 0 | 2 | -1 | 0 | 1.00 | 1.00 |
21 | Crispix | K | C | 110 | 2 | 0 | 220 | 1.0 | 21.0 | 3 | 3 | 30 | 25 | 1.00 | 1.00 |
22 | Crispy_Wheat_&_Raisins | G | C | 100 | 2 | 1 | 140 | 2.0 | 11.0 | 10 | 3 | 120 | 25 | 1.00 | 0.75 |
23 | Double_Chex | R | C | 100 | 2 | 0 | 190 | 1.0 | 18.0 | 5 | 3 | 80 | 25 | 1.00 | 0.75 |
24 | Froot_Loops | K | C | 110 | 2 | 1 | 125 | 1.0 | 11.0 | 13 | 2 | 30 | 25 | 1.00 | 1.00 |
25 | Frosted_Flakes | K | C | 110 | 1 | 0 | 200 | 1.0 | 14.0 | 11 | 1 | 25 | 25 | 1.00 | 0.75 |
26 | Frosted_Mini-Wheats | K | C | 100 | 3 | 0 | 0 | 3.0 | 14.0 | 7 | 2 | 100 | 25 | 1.00 | 0.80 |
27 | Fruit_&_Fibre_Dates,_Walnuts,_and_Oats | P | C | 120 | 3 | 2 | 160 | 5.0 | 12.0 | 10 | 3 | 200 | 25 | 1.25 | 0.67 |
28 | Fruitful_Bran | K | C | 120 | 3 | 0 | 240 | 5.0 | 14.0 | 12 | 3 | 190 | 25 | 1.33 | 0.67 |
29 | Fruity_Pebbles | P | C | 110 | 1 | 1 | 135 | 0.0 | 13.0 | 12 | 2 | 25 | 25 | 1.00 | 0.75 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
47 | Multi-Grain_Cheerios | G | C | 100 | 2 | 1 | 220 | 2.0 | 15.0 | 6 | 1 | 90 | 25 | 1.00 | 1.00 |
48 | Nut&Honey_Crunch | K | C | 120 | 2 | 1 | 190 | 0.0 | 15.0 | 9 | 2 | 40 | 25 | 1.00 | 0.67 |
49 | Nutri-Grain_Almond-Raisin | K | C | 140 | 3 | 2 | 220 | 3.0 | 21.0 | 7 | 3 | 130 | 25 | 1.33 | 0.67 |
50 | Nutri-grain_Wheat | K | C | 90 | 3 | 0 | 170 | 3.0 | 18.0 | 2 | 3 | 90 | 25 | 1.00 | -1.00 |
51 | Oatmeal_Raisin_Crisp | G | C | 130 | 3 | 2 | 170 | 1.5 | 13.5 | 10 | 3 | 120 | 25 | 1.25 | 0.50 |
52 | Post_Nat._Raisin_Bran | P | C | 120 | 3 | 1 | 200 | 6.0 | 11.0 | 14 | 3 | 260 | 25 | 1.33 | 0.67 |
53 | Product_19 | K | C | 100 | 3 | 0 | 320 | 1.0 | 20.0 | 3 | 3 | 45 | 100 | 1.00 | 1.00 |
54 | Puffed_Rice | Q | C | 50 | 1 | 0 | 0 | 0.0 | 13.0 | 0 | 3 | 15 | 0 | 0.50 | 1.00 |
55 | Puffed_Wheat | Q | C | 50 | 2 | 0 | 0 | 1.0 | 10.0 | 0 | 3 | 50 | 0 | 0.50 | -1.00 |
56 | Quaker_Oat_Squares | Q | C | 100 | 4 | 1 | 135 | 2.0 | 14.0 | 6 | 3 | 110 | 25 | 1.00 | 0.50 |
57 | Quaker_Oatmeal | Q | H | 100 | 5 | 2 | 0 | 2.7 | -1.0 | -1 | 1 | 110 | 0 | 1.00 | 0.67 |
58 | Raisin_Bran | K | C | 120 | 3 | 1 | 210 | 5.0 | 14.0 | 12 | 2 | 240 | 25 | 1.33 | 0.75 |
59 | Raisin_Nut_Bran | G | C | 100 | 3 | 2 | 140 | 2.5 | 10.5 | 8 | 3 | 140 | 25 | 1.00 | 0.50 |
60 | Raisin_Squares | K | C | 90 | 2 | 0 | 0 | 2.0 | 15.0 | 6 | 3 | 110 | 25 | 1.00 | 0.50 |
61 | Rice_Chex | R | C | 110 | 1 | 0 | 240 | 0.0 | 23.0 | 2 | 1 | 30 | 25 | 1.00 | 1.13 |
62 | Rice_Krispies | K | C | 110 | 2 | 0 | 290 | 0.0 | 22.0 | 3 | 1 | 35 | 25 | 1.00 | 1.00 |
63 | Shredded_Wheat | N | C | 80 | 2 | 0 | 0 | 3.0 | 16.0 | 0 | 1 | 95 | 0 | 0.83 | -1.00 |
64 | Shredded_Wheat_'n'Bran | N | C | 90 | 3 | 0 | 0 | 4.0 | 19.0 | 0 | 1 | 140 | 0 | 1.00 | 0.67 |
65 | Shredded_Wheat_spoon_size | N | C | 90 | 3 | 0 | 0 | 3.0 | 20.0 | 0 | 1 | 120 | 0 | 1.00 | 0.67 |
66 | Smacks | K | C | 110 | 2 | 1 | 70 | 1.0 | 9.0 | 15 | 2 | 40 | 25 | 1.00 | 0.75 |
67 | Special_K | K | C | 110 | 6 | 0 | 230 | 1.0 | 16.0 | 3 | 1 | 55 | 25 | 1.00 | 1.00 |
68 | Strawberry_Fruit_Wheats | N | C | 90 | 2 | 0 | 15 | 3.0 | 15.0 | 5 | 2 | 90 | 25 | 1.00 | -1.00 |
69 | Total_Corn_Flakes | G | C | 110 | 2 | 1 | 200 | 0.0 | 21.0 | 3 | 3 | 35 | 100 | 1.00 | 1.00 |
70 | Total_Raisin_Bran | G | C | 140 | 3 | 1 | 190 | 4.0 | 15.0 | 14 | 3 | 230 | 100 | 1.50 | 1.00 |
71 | Total_Whole_Grain | G | C | 100 | 3 | 1 | 200 | 3.0 | 16.0 | 3 | 3 | 110 | 100 | 1.00 | 1.00 |
72 | Triples | G | C | 110 | 2 | 1 | 250 | 0.0 | 21.0 | 3 | 3 | 60 | 25 | 1.00 | 0.75 |
73 | Trix | G | C | 110 | 1 | 1 | 140 | 0.0 | 13.0 | 12 | 2 | 25 | 25 | 1.00 | 1.00 |
74 | Wheat_Chex | R | C | 100 | 3 | 1 | 230 | 3.0 | 17.0 | 3 | 1 | 115 | 25 | 1.00 | 0.67 |
75 | Wheaties | G | C | 100 | 3 | 1 | 200 | 3.0 | 17.0 | 3 | 1 | 110 | 25 | 1.00 | 1.00 |
76 | Wheaties_Honey_Gold | G | C | 110 | 2 | 1 | 200 | 1.0 | 16.0 | 8 | 1 | 60 | 25 | 1.00 | 0.75 |
77 rows × 15 columns
df.plot(x='fat (g)', y='calories (number)', kind='scatter')
<matplotlib.axes._subplots.AxesSubplot at 0xf094506518>
#Start your answers here - add more cells as necessary!
avg_calories = df['calories (number)'].mean()
avg_cplx_carbs = df['complex carbohydrates (g)'].mean()
avg_sugars = df['sugars (g)'].mean()
avg_calories, avg_cplx_carbs, avg_sugars
(106.88311688311688, 14.597402597402597, 6.922077922077922)
df['sugars (g)'].max()
15
df[df['sugars (g)'] == df['sugars (g)'].max()].name
30 Golden_Crisp 66 Smacks Name: name, dtype: object
The 8th column of the data set contains the shelf at which the cereal is displayed in the supermarket (either 1, 2, or 3). Compute the mean number of calories of the cereals displayed in each shelf. If you do this correctly you should find the the cereals displayed on the middle shelf have a higher calorie content on average.
df.shelf
0 3 1 3 2 3 3 3 4 3 5 1 6 2 7 3 8 1 9 3 10 2 11 1 12 2 13 3 14 2 15 1 16 1 17 2 18 2 19 3 20 2 21 3 22 3 23 3 24 2 25 1 26 2 27 3 28 3 29 2 .. 47 1 48 2 49 3 50 3 51 3 52 3 53 3 54 3 55 3 56 3 57 1 58 2 59 3 60 3 61 1 62 1 63 1 64 1 65 1 66 2 67 1 68 2 69 3 70 3 71 3 72 3 73 2 74 1 75 1 76 1 Name: shelf, dtype: int64
Hint: remember .grouby
?
df.groupby('shelf').mean()
calories (number) | protein (g) | fat (g) | sodium (mg) | dietary fiber (g) | complex carbohydrates (g) | sugars (g) | potassium (mg) | vitamins and minerals | weight (in ounces) of one serving (serving size) | cups per serving | |
---|---|---|---|---|---|---|---|---|---|---|---|
shelf | |||||||||||
1 | 102.500000 | 2.650000 | 0.60 | 176.250000 | 1.685000 | 15.800000 | 4.800000 | 75.500000 | 20.000000 | 0.991500 | 0.797000 |
2 | 109.523810 | 1.904762 | 1.00 | 145.714286 | 0.904762 | 13.619048 | 9.619048 | 57.809524 | 23.809524 | 1.015714 | 0.720952 |
3 | 107.777778 | 2.861111 | 1.25 | 158.611111 | 3.138889 | 14.500000 | 6.527778 | 129.833333 | 35.416667 | 0.947778 | 0.392778 |
df.boxplot(by='shelf', column='sugars (g)');
Make a scatter plot to explore.
df.plot(kind='scatter', x='sugars (g)', y="fat (g)");
Let's focus on protein and sodium contents.
df.pivot_table(values=["protein (g)", "sodium (mg)"], index=['mfr', 'type'], margins=True)
protein (g) | sodium (mg) | ||
---|---|---|---|
mfr | type | ||
A | H | 4.000000 | 0.000000 |
G | C | 2.318182 | 200.454545 |
K | C | 2.652174 | 174.782609 |
N | C | 2.800000 | 29.000000 |
H | 3.000000 | 80.000000 | |
P | C | 2.444444 | 146.111111 |
Q | C | 2.285714 | 105.714286 |
H | 5.000000 | 0.000000 | |
R | C | 2.500000 | 198.125000 |
All | 2.545455 | 159.675325 |
proteinPerCup = df['protein (g)'] / df['cups per serving']
proteinPerCup.max()
12.121212121212121
df[proteinPerCup == proteinPerCup.max()]
name | mfr | type | calories (number) | protein (g) | fat (g) | sodium (mg) | dietary fiber (g) | complex carbohydrates (g) | sugars (g) | shelf | potassium (mg) | vitamins and minerals | weight (in ounces) of one serving (serving size) | cups per serving | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 100%_Bran | N | C | 70 | 4 | 1 | 130 | 10.0 | 5.0 | 6 | 3 | 280 | 25 | 1.0 | 0.33 |
2 | All-Bran | K | C | 70 | 4 | 1 | 260 | 9.0 | 7.0 | 5 | 3 | 320 | 25 | 1.0 | 0.33 |
Your colleague (a psychologist) has conducted an experiment where he measured subjects' performance on a memory and a reaction time task. He claims to have found a correlation between the two, but you are sceptical, because you know your colleague is not very good at clicking buttons in SPSS. Luckily, you can convince your colleague to give you his data file so you can assess his claims.
1. He gives you the file experiment.csv. It contains one line per subject, and each line contains the values for the following variables that he measured:
height
,weight
,amount-chocolate
,task-one-score
,task-two-score
. (He allowed the participants to eat chocolate during the experiment, and the third column shows the amount of chocolate they ate in grams.) Load the data into an array.
col_names = ["height", "weight", "amount-chocolate", "task-one-score", "task-two-score"]
df = pd.read_csv("experiment.csv", header=None, sep=' ', names=col_names)
2. As a sanity check, create a scatter plot of height vs. weight. These two should clearly be correlated and you should see a straight line relationship.
df.plot(kind='scatter', x='height', y='weight')
<matplotlib.axes._subplots.AxesSubplot at 0x107b92150>
3. To assess your colleague's claim, first create a box plot of the scores in task 1 and task 2. Also, compute the correlation between the two (using Pearson's $r$) and check whether it is significant.
print("Pearson's r is {}".format(df['task-one-score'].corr(df['task-two-score'])))
df.boxplot(column=['task-one-score', 'task-two-score'],return_type='axes');
Pearson's r is 0.04231920574803165
4. When your colleague hears about your results, he is a bit disappointed. However, he already has a new hypothesis: maybe chocolate is acting as a promoter somehow and the amount of chocolate consumed has some effect? Create some plots and hypothesis tests to assess this new hypothesis.
df.plot(kind='scatter', x='task-one-score', y='amount-chocolate');
df.plot(kind='scatter', x='task-two-score', y='amount-chocolate');
print("Pearson's r for task one is {}".format(df['task-one-score'].corr(
df['amount-chocolate'])))
print("Pearson's r for task two is {}".format(df['task-two-score'].corr(
df['amount-chocolate'])))
Pearson's r for task one is 0.0116431816265 Pearson's r for task two is -0.00551700962886
5. After you come back with your results, your colleague has one last idea: Maybe you need to eat more than some critical amount of chocolate (which depends on your body mass index) in order for there to be a correlation between the two tasks? He guesses that you need to eat at least six times your BMI in chocolate in order for the correlation to show up. You think your colleague is crazy, but go ahead an test his claim. What do you find? Again, produce some figures, compute correlations and test for significance.
bmi = df['weight'] / (df['height'] ** 2)
filtered_group = df.ix[df['amount-chocolate'] > bmi * 6]
filtered_group.plot(kind='scatter', x='task-one-score', y='amount-chocolate');
filtered_group.plot(kind='scatter', x='task-two-score',y='amount-chocolate');
print("Pearson's r for task one is {}".format(filtered_group['task-one-score'].corr(
filtered_group['amount-chocolate'])))
print("Pearson's r for task two is {}".format(filtered_group['task-two-score'].corr(
filtered_group['amount-chocolate'])))
Pearson's r for task one is -0.167591510532 Pearson's r for task two is -0.207677642024