Examples and Exercises from Think Stats, 2nd Edition

MIT License: https://opensource.org/licenses/MIT

In [1]:

from __future__ import print_function, division

import nsfg

Examples from Chapter 1

Read NSFG data into a Pandas DataFrame.

In [2]:

preg = nsfg.ReadFemPreg()
preg.head()

Out[2]:

	caseid	pregordr	howpreg_n	howpreg_p	moscurrp	nowprgdk	pregend1	pregend2	nbrnaliv	multbrth	...	basewgt	adj_mod_basewgt	finalwgt	secu_p	sest	cmintvw	totalwgt_lb
0	1	1	NaN	NaN	NaN	NaN	6.0	NaN	1.0	NaN	...	3410.389399	3869.349602	6448.271112	2	9	NaN	8.8125
1	1	2	NaN	NaN	NaN	NaN	6.0	NaN	1.0	NaN	...	3410.389399	3869.349602	6448.271112	2	9	NaN	7.8750
2	2	1	NaN	NaN	NaN	NaN	5.0	NaN	3.0	5.0	...	7226.301740	8567.549110	12999.542264	2	12	NaN	9.1250
3	2	2	NaN	NaN	NaN	NaN	6.0	NaN	1.0	NaN	...	7226.301740	8567.549110	12999.542264	2	12	NaN	7.0000
4	2	3	NaN	NaN	NaN	NaN	6.0	NaN	1.0	NaN	...	7226.301740	8567.549110	12999.542264	2	12	NaN	6.1875

5 rows × 244 columns

Print the column names.

In [3]:

preg.columns

Out[3]:

Index(['caseid', 'pregordr', 'howpreg_n', 'howpreg_p', 'moscurrp', 'nowprgdk',
       'pregend1', 'pregend2', 'nbrnaliv', 'multbrth',
       ...
       'laborfor_i', 'religion_i', 'metro_i', 'basewgt', 'adj_mod_basewgt',
       'finalwgt', 'secu_p', 'sest', 'cmintvw', 'totalwgt_lb'],
      dtype='object', length=244)

Select a single column name.

In [4]:

preg.columns[1]

Out[4]:

'pregordr'

Select a column and check what type it is.

In [5]:

pregordr = preg['pregordr']
type(pregordr)

Out[5]:

pandas.core.series.Series

Print a column.

In [6]:

pregordr

Out[6]:

0        1
1        2
2        1
3        2
4        3
5        1
6        2
7        3
8        1
9        2
10       1
11       1
12       2
13       3
14       1
15       2
16       3
17       1
18       2
19       1
20       2
21       1
22       2
23       1
24       2
25       3
26       1
27       1
28       2
29       3
        ..
13563    2
13564    3
13565    1
13566    1
13567    1
13568    2
13569    1
13570    2
13571    3
13572    4
13573    1
13574    2
13575    1
13576    1
13577    2
13578    1
13579    2
13580    1
13581    2
13582    3
13583    1
13584    2
13585    1
13586    2
13587    3
13588    1
13589    2
13590    3
13591    4
13592    5
Name: pregordr, Length: 13593, dtype: int64

Select a single element from a column.

In [7]:

pregordr[0]

Out[7]:

1

Select a slice from a column.

In [8]:

pregordr[2:5]

Out[8]:

2    1
3    2
4    3
Name: pregordr, dtype: int64
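
The bracket selections above rely on the DataFrame's default integer index, where labels and positions coincide. As a minimal sketch on a made-up Series (the values and index labels below are illustrative, not NSFG data), `.iloc` selects by position and `.loc` selects by label, which keeps the intent explicit when the index is not 0, 1, 2, ...:

```python
import pandas as pd

# Hypothetical toy data; the index labels are deliberately not 0, 1, 2, ...
s = pd.Series([1, 2, 1, 2, 3], index=[10, 11, 12, 13, 14], name='pregordr')

first_by_position = s.iloc[0]     # first element, whatever its label is
value_at_label_12 = s.loc[12]     # element whose index label is 12
slice_by_position = s.iloc[2:5]   # positions 2, 3, 4 (end excluded)
slice_by_label = s.loc[12:14]     # labels 12 through 14 (end included)
```

Note that a position-based slice excludes its end point, while a label-based slice includes it.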

Select a column using dot notation.

In [9]:

pregordr = preg.pregordr

Count the number of times each value occurs.

In [10]:

preg.outcome.value_counts().sort_index()

Out[10]:

1    9148
2    1862
3     120
4    1921
5     190
6     352
Name: outcome, dtype: int64
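
To see why `sort_index` is chained after `value_counts`, here is a small sketch on made-up values (not NSFG data). By default `value_counts` orders its result by count, most frequent first; `sort_index` reorders it by the values themselves, which makes it easier to compare against a codebook table:

```python
import pandas as pd

# Hypothetical outcome codes for six pregnancies
outcome = pd.Series([1, 1, 4, 2, 1, 4])

counts = outcome.value_counts()   # ordered by count, most frequent first
by_value = counts.sort_index()    # reordered by the outcome code
```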

Check the values of another variable.

In [11]:

preg.birthwgt_lb.value_counts().sort_index()

Out[11]:

0.0        8
1.0       40
2.0       53
3.0       98
4.0      229
5.0      697
6.0     2223
7.0     3049
8.0     1889
9.0      623
10.0     132
11.0      26
12.0      10
13.0       3
14.0       3
15.0       1
Name: birthwgt_lb, dtype: int64

Make a dictionary that maps from each respondent's caseid to a list of indices into the pregnancy DataFrame. Use it to select the pregnancy outcomes for a single respondent.

In [12]:

caseid = 10229
preg_map = nsfg.MakePregMap(preg)
indices = preg_map[caseid]
preg.outcome[indices].values

Out[12]:

array([4, 4, 4, 4, 4, 4, 1])
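
The idea behind such a map can be sketched as follows. This is an illustration on a toy DataFrame, not necessarily how `nsfg.MakePregMap` is implemented; the function name `make_case_map` and the data are hypothetical:

```python
from collections import defaultdict

import pandas as pd


def make_case_map(df):
    """Map each caseid to the list of row labels where it appears."""
    d = defaultdict(list)
    for index, caseid in df['caseid'].items():
        d[caseid].append(index)
    return d


# Hypothetical data: respondent 1 has two pregnancies, respondent 2 has three
toy = pd.DataFrame({'caseid': [1, 1, 2, 2, 2],
                    'outcome': [1, 1, 4, 1, 1]})

case_map = make_case_map(toy)
outcomes_for_2 = toy.loc[case_map[2], 'outcome'].values
```

Building the map once makes repeated lookups by caseid cheap, since each lookup avoids scanning the whole column.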

Exercises¶

Select the birthord column, print the value counts, and compare to results published in the codebook.

In [13]:

# Solution goes here

We can also use isnull to count the number of NaN values.

In [14]:

preg.birthord.isnull().sum()

Out[14]:

Select the prglngth column, print the value counts, and compare to results published in the codebook.

In [15]:

# Solution goes here

To compute the mean of a column, you can invoke the mean method on a Series. For example, here is the mean birthweight in pounds:

In [16]:

preg.totalwgt_lb.mean()

Out[16]:

7.265628457623368
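
Columns like this one can contain NaN, so it helps to know how `mean` treats missing values. A minimal sketch on made-up weights (not NSFG data): by default the Series `mean` method skips NaN, and passing `skipna=False` makes the result NaN whenever any value is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical weights in pounds, with one missing value
weights = pd.Series([8.0, np.nan, 6.0, 7.0])

n_missing = weights.isnull().sum()          # number of NaN entries
mean_default = weights.mean()               # NaN skipped: (8 + 6 + 7) / 3
mean_strict = weights.mean(skipna=False)    # NaN, because one value is missing
```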

Create a new column named totalwgt_kg that contains birth weight in kilograms. Compute its mean. Remember that when you create a new column, you have to use dictionary syntax, not dot notation.

In [17]:

# Solution goes here
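
The note about dictionary syntax can be illustrated on a toy DataFrame with hypothetical column names, unrelated to the exercise. Assigning with brackets creates a column; assigning with dot notation only sets a Python attribute on the object, and pandas emits a warning (suppressed here) rather than adding a column:

```python
import warnings

import pandas as pd

# Hypothetical data: lengths in inches
df = pd.DataFrame({'length_in': [10.0, 20.0]})

# Dictionary syntax creates a new column
df['length_cm'] = df['length_in'] * 2.54

# Dot notation does NOT create a column; it sets a plain attribute instead
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    df.length_mm = df['length_in'] * 25.4

has_cm = 'length_cm' in df.columns   # True
has_mm = 'length_mm' in df.columns   # False
```

Dot notation is still fine for reading a column that already exists, as in the earlier cells.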

nsfg.py also provides ReadFemResp, which reads the female respondents file and returns a DataFrame:

In [18]:

resp = nsfg.ReadFemResp()

DataFrame provides a method head that displays the first five rows:

In [19]:

resp.head()

Out[19]:

	caseid	rscrinf	rdormres	rostscrn	rscreenhisp	rscreenrace	age_a	age_r	cmbirth	agescrn	...	basewgt	adj_mod_basewgt	finalwgt	secu_r	sest	cmintvw	cmlstyr	screentime	intvlngth
0	2298	1	5	5	1	5.0	27	27	902	27	...	3247.916977	5123.759559	5556.717241	2	18	1234	1222	18:26:36	110.492667
1	5012	1	5	1	5	5.0	42	42	718	42	...	2335.279149	2846.799490	4744.191350	2	18	1233	1221	16:30:59	64.294000
2	11586	1	5	1	5	5.0	43	43	708	43	...	2335.279149	2846.799490	4744.191350	2	18	1234	1222	18:19:09	75.149167
3	6794	5	5	4	1	5.0	15	15	1042	15	...	3783.152221	5071.464231	5923.977368	2	18	1234	1222	15:54:43	28.642833
4	616	1	5	4	1	5.0	20	20	991	20	...	5341.329968	6437.335772	7229.128072	2	18	1233	1221	14:19:44	69.502667

5 rows × 3087 columns

Select the age_r column from resp and print the value counts. How old are the youngest and oldest respondents?

In [20]:

# Solution goes here

We can use the caseid to match up rows from resp and preg. For example, we can select the row from resp for caseid 2298 like this:

In [21]:

resp[resp.caseid==2298]

Out[21]:

	caseid	rscrinf	rdormres	rostscrn	rscreenhisp	rscreenrace	age_a	age_r	cmbirth	agescrn	...	pubassis_i	basewgt	adj_mod_basewgt	finalwgt	secu_r	sest	cmintvw	cmlstyr	screentime	intvlngth
0	2298	1	5	5	1	5.0	27	27	902	27	...	0	3247.916977	5123.759559	5556.717241	2	18	1234	1222	18:26:36	110.492667

1 rows × 3087 columns

And we can get the corresponding rows from preg like this:

In [22]:

preg[preg.caseid==2298]

Out[22]:

	caseid	pregordr	howpreg_n	howpreg_p	moscurrp	nowprgdk	pregend1	pregend2	nbrnaliv	multbrth	...	basewgt	adj_mod_basewgt	finalwgt	secu_p	sest	cmintvw	totalwgt_lb
2610	2298	1	NaN	NaN	NaN	NaN	6.0	NaN	1.0	NaN	...	3247.916977	5123.759559	5556.717241	2	18	NaN	6.8750
2611	2298	2	NaN	NaN	NaN	NaN	6.0	NaN	1.0	NaN	...	3247.916977	5123.759559	5556.717241	2	18	NaN	5.5000
2612	2298	3	NaN	NaN	NaN	NaN	6.0	NaN	1.0	NaN	...	3247.916977	5123.759559	5556.717241	2	18	NaN	4.1875
2613	2298	4	NaN	NaN	NaN	NaN	6.0	NaN	1.0	NaN	...	3247.916977	5123.759559	5556.717241	2	18	NaN	6.8750

4 rows × 244 columns
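
Both selections above use a boolean mask: comparing a column to a value yields a Series of True/False, and indexing the DataFrame with it keeps the rows marked True. The sketch below shows this on toy tables with made-up values. It also shows `merge`, an alternative that this notebook does not use, which attaches respondent columns to every pregnancy row in one step:

```python
import pandas as pd

# Hypothetical stand-ins for resp and preg
resp_toy = pd.DataFrame({'caseid': [1, 2], 'age_r': [30, 25]})
preg_toy = pd.DataFrame({'caseid': [1, 1, 2], 'prglngth': [39, 40, 38]})

# Boolean mask: True where the row belongs to caseid 1
mask = preg_toy['caseid'] == 1
rows_for_1 = preg_toy[mask]

# Alternative: join respondent data onto each pregnancy row by caseid
merged = preg_toy.merge(resp_toy, on='caseid', how='left')
```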

How old is the respondent with caseid 1?

In [23]:

# Solution goes here

What are the pregnancy lengths for the respondent with caseid 2298?

In [24]:

# Solution goes here

What was the birthweight of the first baby born to the respondent with caseid 5012?

In [25]:

# Solution goes here
