Examples and Exercises from Think Stats, 2nd Edition

MIT License: https://opensource.org/licenses/MIT

In [1]:

from __future__ import print_function, division

import nsfg  # module with helper functions for reading the NSFG data files

Examples from Chapter 1

Read NSFG data into a Pandas DataFrame.

In [2]:

preg = nsfg.ReadFemPreg()  # read the pregnancy file into a pandas DataFrame
preg.head()  # show the first five rows

Out[2]:

	caseid	pregordr	howpreg_n	howpreg_p	moscurrp	nowprgdk	pregend1	pregend2	nbrnaliv	multbrth	...	basewgt	adj_mod_basewgt	finalwgt	secu_p	sest	cmintvw	totalwgt_lb
0	1	1	NaN	NaN	NaN	NaN	6.0	NaN	1.0	NaN	...	3410.389399	3869.349602	6448.271112	2	9	NaN	8.8125
1	1	2	NaN	NaN	NaN	NaN	6.0	NaN	1.0	NaN	...	3410.389399	3869.349602	6448.271112	2	9	NaN	7.8750
2	2	1	NaN	NaN	NaN	NaN	5.0	NaN	3.0	5.0	...	7226.301740	8567.549110	12999.542264	2	12	NaN	9.1250
3	2	2	NaN	NaN	NaN	NaN	6.0	NaN	1.0	NaN	...	7226.301740	8567.549110	12999.542264	2	12	NaN	7.0000
4	2	3	NaN	NaN	NaN	NaN	6.0	NaN	1.0	NaN	...	7226.301740	8567.549110	12999.542264	2	12	NaN	6.1875

5 rows × 244 columns

Print the column names.

In [3]:

preg.columns  # Index object holding the column names

Out[3]:

Index(['caseid', 'pregordr', 'howpreg_n', 'howpreg_p', 'moscurrp', 'nowprgdk',
       'pregend1', 'pregend2', 'nbrnaliv', 'multbrth',
       ...
       'laborfor_i', 'religion_i', 'metro_i', 'basewgt', 'adj_mod_basewgt',
       'finalwgt', 'secu_p', 'sest', 'cmintvw', 'totalwgt_lb'],
      dtype='object', length=244)

Select a single column name.

In [4]:

preg.columns[1]

Out[4]:

'pregordr'

Select a column and check what type it is.

In [5]:

pregordr = preg['pregordr']
type(pregordr)

Out[5]:

pandas.core.series.Series

Print a column.

In [6]:

pregordr

Out[6]:

0        1
1        2
2        1
3        2
4        3
5        1
6        2
7        3
8        1
9        2
10       1
11       1
12       2
13       3
14       1
15       2
16       3
17       1
18       2
19       1
20       2
21       1
22       2
23       1
24       2
25       3
26       1
27       1
28       2
29       3
        ..
13563    2
13564    3
13565    1
13566    1
13567    1
13568    2
13569    1
13570    2
13571    3
13572    4
13573    1
13574    2
13575    1
13576    1
13577    2
13578    1
13579    2
13580    1
13581    2
13582    3
13583    1
13584    2
13585    1
13586    2
13587    3
13588    1
13589    2
13590    3
13591    4
13592    5
Name: pregordr, Length: 13593, dtype: int64

Select a single element from a column.

In [7]:

pregordr[0]

Out[7]:

1

Select a slice from a column.

In [8]:

pregordr[2:5]

Out[8]:

2    1
3    2
4    3
Name: pregordr, dtype: int64
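
The element lookup and the slice above rely on the default integer index, where positions and labels coincide. As a hedged illustration on a made-up Series (the values are not NSFG data), `.iloc` and `.loc` make it explicit whether a selection is by position or by label:

```python
import pandas as pd

# Hypothetical stand-in for a column; the values are made up.
s = pd.Series([10, 20, 30, 40, 50], name="example")

first = s.iloc[0]          # one element, selected by position
by_position = s.iloc[2:5]  # positions 2, 3 and 4 (end excluded)
by_label = s.loc[2:4]      # labels 2 through 4 (end included)

print(first)
print(by_position.tolist())
print(by_label.tolist())
```

Both slices return the same three elements here only because the labels are 0, 1, 2, and so on; with a different index they could differ.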

Select a column using dot notation.

In [9]:

pregordr = preg.pregordr

In [10]:

preg.outcome

Out[10]:

0        1
1        1
2        1
3        1
4        1
5        1
6        1
7        1
8        1
9        1
10       1
11       1
12       1
13       2
14       4
15       1
16       1
17       1
18       4
19       1
20       1
21       1
22       4
23       1
24       1
25       1
26       1
27       1
28       1
29       1
        ..
13563    1
13564    1
13565    1
13566    1
13567    2
13568    5
13569    1
13570    1
13571    1
13572    1
13573    1
13574    1
13575    2
13576    1
13577    6
13578    1
13579    1
13580    4
13581    1
13582    5
13583    2
13584    1
13585    2
13586    2
13587    2
13588    1
13589    2
13590    2
13591    1
13592    1
Name: outcome, Length: 13593, dtype: int64

Count the number of times each value occurs.

In [11]:

preg.outcome.value_counts()

Out[11]:

1    9148
4    1921
2    1862
6     352
5     190
3     120
Name: outcome, dtype: int64

In [12]:

preg.outcome.value_counts().sort_index()

Out[12]:

1    9148
2    1862
3     120
4    1921
5     190
6     352
Name: outcome, dtype: int64
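
The two cells above differ only in ordering: `value_counts` lists the most frequent value first, and `sort_index` reorders the result by the value itself. A small sketch with made-up outcome codes (not NSFG data):

```python
import pandas as pd

# Made-up outcome codes for illustration only.
outcomes = pd.Series([1, 4, 1, 2, 1, 4])

by_frequency = outcomes.value_counts()  # most frequent value first
by_value = by_frequency.sort_index()    # ordered by the value itself

print(by_frequency.index.tolist())
print(by_value.to_dict())
```

Sorting by index makes it easier to compare the counts against a codebook table, which is usually listed in code order.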

Check the values of another variable.

In [13]:

preg.birthwgt_lb.value_counts().sort_index()

Out[13]:

0.0        8
1.0       40
2.0       53
3.0       98
4.0      229
5.0      697
6.0     2223
7.0     3049
8.0     1889
9.0      623
10.0     132
11.0      26
12.0      10
13.0       3
14.0       3
15.0       1
Name: birthwgt_lb, dtype: int64

Make a dictionary that maps from each respondent's caseid to a list of indices into the pregnancy DataFrame. Use it to select the pregnancy outcomes for a single respondent.

In [14]:

caseid = 10229
preg_map = nsfg.MakePregMap(preg)
indices = preg_map[caseid]
preg.outcome[indices].values

Out[14]:

array([4, 4, 4, 4, 4, 4, 1])
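
For intuition, a map like this can be built with a `defaultdict` of lists. The sketch below is an assumption about the general technique, not necessarily how `nsfg.MakePregMap` is implemented; the function name `make_index_map` and the sample caseids are hypothetical.

```python
from collections import defaultdict


def make_index_map(caseids):
    """Map each caseid to the list of positions where it appears."""
    index_map = defaultdict(list)
    for index, caseid in enumerate(caseids):
        index_map[caseid].append(index)
    return index_map


# Made-up caseids: respondent 1 has two rows, respondent 2 has three.
caseids = [1, 1, 2, 2, 2, 7]
index_map = make_index_map(caseids)

print(index_map[2])
```

Each list of positions can then be used to pull that respondent's rows out of the pregnancy table, as the cell above does with `preg.outcome[indices]`.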

Exercises

Select the birthord column, print the value counts, and compare to results published in the codebook.

In [15]:

birthord = preg["birthord"]
print(birthord.value_counts())

1.0     4413
2.0     2874
3.0     1234
4.0      421
5.0      126
6.0       50
7.0       20
8.0        7
9.0        2
10.0       1
Name: birthord, dtype: int64

We can also use isnull to count the number of NaNs.

In [16]:

preg.birthord.isnull().sum()

Out[16]:

4445

Select the prglngth column, print the value counts, and compare to results published in the codebook.

In [17]:

preg.prglngth.value_counts().sort_index()

Out[17]:

0       15
1        9
2       78
3      151
4      412
5      181
6      543
7      175
8      409
9      594
10     137
11     202
12     170
13     446
14      29
15      39
16      44
17     253
18      17
19      34
20      18
21      37
22     147
23      12
24      31
25      15
26     117
27       8
28      38
29      23
30     198
31      29
32     122
33      50
34      60
35     357
36     329
37     457
38     609
39    4744
40    1120
41     591
42     328
43     148
44      46
45      10
46       1
47       1
48       7
50       2
Name: prglngth, dtype: int64
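
To look at only part of a long table of counts like this one, the sorted result can be sliced by label or filtered with a condition on its index. A minimal sketch on made-up pregnancy lengths (not NSFG data):

```python
import pandas as pd

# Made-up pregnancy lengths in weeks, for illustration only.
lengths = pd.Series([39, 39, 40, 38, 39, 41, 13, 40])
counts = lengths.value_counts().sort_index()

# .loc slices by label and includes both endpoints.
full_term = counts.loc[38:40]

# A boolean condition on the index keeps only the matching entries.
late = counts[counts.index > 40]

print(full_term.to_dict())
print(late.to_dict())
```

Label slicing with `.loc` works here because the index has been sorted first.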

To compute the mean of a column, you can invoke the mean method on a Series. For example, here is the mean birthweight in pounds:

In [18]:

preg.totalwgt_lb.mean()

Out[18]:

7.265628457623368
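
One detail worth knowing: the `mean` method skips NaN values by default, so missing entries do not affect the result. A short sketch with made-up weights:

```python
import pandas as pd

# Made-up weights with one missing value.
weights = pd.Series([6.0, float("nan"), 8.0])

mean_default = weights.mean()             # NaN is skipped
mean_strict = weights.mean(skipna=False)  # NaN propagates to the result

print(mean_default)
print(mean_strict)
```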

Create a new column named totalwgt_kg that contains birth weight in kilograms. Compute its mean. Remember that when you create a new column, you have to use dictionary syntax, not dot notation.

In [19]:

preg["totalwgt_kg"] = preg.totalwgt_lb / 2.2  # 1 kg is about 2.2 lb, so divide
preg.totalwgt_kg.mean()

Out[19]:

3.302558389828807
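
The reminder about dictionary syntax can be checked on a toy DataFrame. In this sketch the column names `weight_lb`, `weight_oz` and `weight_g` are hypothetical and the values are made up; it only shows that bracket assignment adds a column while dot assignment does not.

```python
import warnings

import pandas as pd

# Made-up weights in pounds.
df = pd.DataFrame({"weight_lb": [6.0, 7.5, 8.0]})

# Bracket (dictionary-style) assignment creates a real column.
df["weight_oz"] = df.weight_lb * 16

# Dot assignment only sets a Python attribute on the object;
# pandas emits a warning, which is silenced here.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    df.weight_g = df.weight_lb * 453.59237

print(df.columns.tolist())
```

Only `weight_oz` shows up in the column list; `weight_g` never becomes part of the table.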

nsfg.py also provides ReadFemResp, which reads the female respondents file and returns a DataFrame:

In [20]:

resp = nsfg.ReadFemResp()

DataFrame provides a method head that displays the first five rows:

In [21]:

resp.head()

Out[21]:

	caseid	rscrinf	rdormres	rostscrn	rscreenhisp	rscreenrace	age_a	age_r	cmbirth	agescrn	...	basewgt	adj_mod_basewgt	finalwgt	secu_r	sest	cmintvw	cmlstyr	screentime	intvlngth
0	2298	1	5	5	1	5.0	27	27	902	27	...	3247.916977	5123.759559	5556.717241	2	18	1234	1222	18:26:36	110.492667
1	5012	1	5	1	5	5.0	42	42	718	42	...	2335.279149	2846.799490	4744.191350	2	18	1233	1221	16:30:59	64.294000
2	11586	1	5	1	5	5.0	43	43	708	43	...	2335.279149	2846.799490	4744.191350	2	18	1234	1222	18:19:09	75.149167
3	6794	5	5	4	1	5.0	15	15	1042	15	...	3783.152221	5071.464231	5923.977368	2	18	1234	1222	15:54:43	28.642833
4	616	1	5	4	1	5.0	20	20	991	20	...	5341.329968	6437.335772	7229.128072	2	18	1233	1221	14:19:44	69.502667

5 rows × 3087 columns

Select the age_r column from resp and print the value counts. How old are the youngest and oldest respondents?

In [22]:

resp.age_r.value_counts().sort_index()
# The youngest respondents are 15 and the oldest are 44.

Out[22]:

15    217
16    223
17    234
18    235
19    241
20    258
21    267
22    287
23    282
24    269
25    267
26    260
27    255
28    252
29    262
30    292
31    278
32    273
33    257
34    255
35    262
36    266
37    271
38    256
39    215
40    256
41    250
42    215
43    253
44    235
Name: age_r, dtype: int64

We can use the caseid to match up rows from resp and preg. For example, comparing resp.caseid with 2298 produces a boolean Series, and using that Series as an index selects the matching row from resp:

In [25]:

resp.caseid==2298

Out[25]:

0        True
1       False
2       False
3       False
4       False
5       False
6       False
7       False
8       False
9       False
10      False
11      False
12      False
13      False
14      False
15      False
16      False
17      False
18      False
19      False
20      False
21      False
22      False
23      False
24      False
25      False
26      False
27      False
28      False
29      False
        ...  
7613    False
7614    False
7615    False
7616    False
7617    False
7618    False
7619    False
7620    False
7621    False
7622    False
7623    False
7624    False
7625    False
7626    False
7627    False
7628    False
7629    False
7630    False
7631    False
7632    False
7633    False
7634    False
7635    False
7636    False
7637    False
7638    False
7639    False
7640    False
7641    False
7642    False
Name: caseid, Length: 7643, dtype: bool

In [23]:

resp[resp.caseid==2298]  # the boolean Series acts as a mask: rows where it is True are kept

Out[23]:

	caseid	rscrinf	rdormres	rostscrn	rscreenhisp	rscreenrace	age_a	age_r	cmbirth	agescrn	...	pubassis_i	basewgt	adj_mod_basewgt	finalwgt	secu_r	sest	cmintvw	cmlstyr	screentime	intvlngth
0	2298	1	5	5	1	5.0	27	27	902	27	...	0	3247.916977	5123.759559	5556.717241	2	18	1234	1222	18:26:36	110.492667

1 rows × 3087 columns
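
The selection above is boolean indexing: the comparison yields one True or False per row, and indexing the DataFrame with that Series keeps only the True rows. A minimal sketch on a three-row toy table whose caseid and age_r values are copied from the head output shown earlier (`resp_demo` is a hypothetical name):

```python
import pandas as pd

# Tiny stand-in for resp, using three rows from the head output above.
resp_demo = pd.DataFrame({
    "caseid": [2298, 5012, 11586],
    "age_r": [27, 42, 43],
})

mask = resp_demo.caseid == 5012  # boolean Series, one entry per row
selected = resp_demo[mask]       # keeps the rows where mask is True

print(mask.tolist())
print(selected.age_r.tolist())
```

The mask is not a key in the dictionary sense; pandas treats a boolean Series inside the brackets as a row filter.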

And we can get the corresponding rows from preg like this:

In [24]:

preg[preg.caseid==2298]

Out[24]:

	caseid	pregordr	howpreg_n	howpreg_p	moscurrp	nowprgdk	pregend1	pregend2	nbrnaliv	multbrth	...	basewgt	adj_mod_basewgt	finalwgt	secu_p	sest	cmintvw	totalwgt_lb	totalwgt_kg
2610	2298	1	NaN	NaN	NaN	NaN	6.0	NaN	1.0	NaN	...	3247.916977	5123.759559	5556.717241	2	18	NaN	6.8750	3.125000
2611	2298	2	NaN	NaN	NaN	NaN	6.0	NaN	1.0	NaN	...	3247.916977	5123.759559	5556.717241	2	18	NaN	5.5000	2.500000
2612	2298	3	NaN	NaN	NaN	NaN	6.0	NaN	1.0	NaN	...	3247.916977	5123.759559	5556.717241	2	18	NaN	4.1875	1.903409
2613	2298	4	NaN	NaN	NaN	NaN	6.0	NaN	1.0	NaN	...	3247.916977	5123.759559	5556.717241	2	18	NaN	6.8750	3.125000

4 rows × 245 columns

How old is the respondent with caseid 1?

In [47]:

resp[resp.caseid==1].age_r

Out[47]:

1069    44
Name: age_r, dtype: int64

What are the pregnancy lengths for the respondent with caseid 2298?

In [29]:

preg[preg.caseid==2298].prglngth

Out[29]:

2610    40
2611    36
2612    30
2613    40
Name: prglngth, dtype: int64

What was the birthweight of the first baby born to the respondent with caseid 5012?

In [27]:

preg[preg.caseid==5012].totalwgt_lb

Out[27]:

5515    6.0
Name: totalwgt_lb, dtype: float64
