Copyright 2016 Allen B. Downey
MIT License: https://opensource.org/licenses/MIT
from __future__ import print_function, division
import nsfg #importing the nsfg dataset
Read NSFG data into a Pandas DataFrame.
preg = nsfg.ReadFemPreg() #reading the data into a pandas data frame
preg.head() #shows the first 5 rows of the data
| | caseid | pregordr | howpreg_n | howpreg_p | moscurrp | nowprgdk | pregend1 | pregend2 | nbrnaliv | multbrth | ... | laborfor_i | religion_i | metro_i | basewgt | adj_mod_basewgt | finalwgt | secu_p | sest | cmintvw | totalwgt_lb |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | NaN | NaN | NaN | NaN | 6.0 | NaN | 1.0 | NaN | ... | 0 | 0 | 0 | 3410.389399 | 3869.349602 | 6448.271112 | 2 | 9 | NaN | 8.8125 |
1 | 1 | 2 | NaN | NaN | NaN | NaN | 6.0 | NaN | 1.0 | NaN | ... | 0 | 0 | 0 | 3410.389399 | 3869.349602 | 6448.271112 | 2 | 9 | NaN | 7.8750 |
2 | 2 | 1 | NaN | NaN | NaN | NaN | 5.0 | NaN | 3.0 | 5.0 | ... | 0 | 0 | 0 | 7226.301740 | 8567.549110 | 12999.542264 | 2 | 12 | NaN | 9.1250 |
3 | 2 | 2 | NaN | NaN | NaN | NaN | 6.0 | NaN | 1.0 | NaN | ... | 0 | 0 | 0 | 7226.301740 | 8567.549110 | 12999.542264 | 2 | 12 | NaN | 7.0000 |
4 | 2 | 3 | NaN | NaN | NaN | NaN | 6.0 | NaN | 1.0 | NaN | ... | 0 | 0 | 0 | 7226.301740 | 8567.549110 | 12999.542264 | 2 | 12 | NaN | 6.1875 |
5 rows × 244 columns
Print the column names.
preg.columns #index of column names
Index(['caseid', 'pregordr', 'howpreg_n', 'howpreg_p', 'moscurrp', 'nowprgdk', 'pregend1', 'pregend2', 'nbrnaliv', 'multbrth', ... 'laborfor_i', 'religion_i', 'metro_i', 'basewgt', 'adj_mod_basewgt', 'finalwgt', 'secu_p', 'sest', 'cmintvw', 'totalwgt_lb'], dtype='object', length=244)
Select a single column name.
preg.columns[1]
'pregordr'
Select a column and check what type it is.
pregordr = preg['pregordr']
type(pregordr)
pandas.core.series.Series
Print a column.
pregordr
0        1
1        2
2        1
3        2
4        3
5        1
6        2
7        3
8        1
9        2
10       1
11       1
12       2
13       3
14       1
15       2
16       3
17       1
18       2
19       1
20       2
21       1
22       2
23       1
24       2
25       3
26       1
27       1
28       2
29       3
..
13563    2
13564    3
13565    1
13566    1
13567    1
13568    2
13569    1
13570    2
13571    3
13572    4
13573    1
13574    2
13575    1
13576    1
13577    2
13578    1
13579    2
13580    1
13581    2
13582    3
13583    1
13584    2
13585    1
13586    2
13587    3
13588    1
13589    2
13590    3
13591    4
13592    5
Name: pregordr, Length: 13593, dtype: int64
Select a single element from a column.
pregordr[0]
1
Select a slice from a column.
pregordr[2:5]
2    1
3    2
4    3
Name: pregordr, dtype: int64
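Because preg uses the default 0, 1, 2, ... index, row labels and row positions coincide in the two cells above, so it is easy to miss that they are different things. The following is a minimal sketch on a made-up Series (the values and the index starting at 100 are assumptions for illustration, not NSFG data) showing how .loc selects by label and .iloc selects by position.

```python
import pandas as pd

# Toy Series whose index labels (100..103) differ from the positions (0..3).
# These values are invented for illustration only.
s = pd.Series([5, 6, 7, 8], index=[100, 101, 102, 103])

by_label = s.loc[101]            # element whose index label is 101
by_position = s.iloc[0]          # first element, whatever its label is
label_slice = s.loc[101:102]     # label slice: both endpoints are included
position_slice = s.iloc[1:2]     # positional slice: the stop position is excluded

print(by_label, by_position)
print(list(label_slice), list(position_slice))
```

Using .loc or .iloc makes the intent explicit, which matters once a DataFrame has been filtered or sorted and its labels no longer match the positions.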
Select a column using dot notation.
pregordr = preg.pregordr
preg.outcome
0        1
1        1
2        1
3        1
4        1
5        1
6        1
7        1
8        1
9        1
10       1
11       1
12       1
13       2
14       4
15       1
16       1
17       1
18       4
19       1
20       1
21       1
22       4
23       1
24       1
25       1
26       1
27       1
28       1
29       1
..
13563    1
13564    1
13565    1
13566    1
13567    2
13568    5
13569    1
13570    1
13571    1
13572    1
13573    1
13574    1
13575    2
13576    1
13577    6
13578    1
13579    1
13580    4
13581    1
13582    5
13583    2
13584    1
13585    2
13586    2
13587    2
13588    1
13589    2
13590    2
13591    1
13592    1
Name: outcome, Length: 13593, dtype: int64
Count the number of times each value occurs.
preg.outcome.value_counts()
#type(preg.outcome.value_counts())
1    9148
4    1921
2    1862
6     352
5     190
3     120
Name: outcome, dtype: int64
preg.outcome.value_counts().sort_index()
1    9148
2    1862
3     120
4    1921
5     190
6     352
Name: outcome, dtype: int64
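The two outputs above differ only in ordering: value_counts sorts by frequency, while sort_index reorders the result by the value being counted. Here is a small sketch on invented outcome codes (toy data, not the NSFG column) that also shows the normalize option for getting fractions instead of counts.

```python
import pandas as pd

# Invented outcome codes, used only to illustrate the ordering.
outcome = pd.Series([1, 4, 1, 2, 1, 4])

counts = outcome.value_counts()        # most frequent value first
by_value = counts.sort_index()         # ordered by the outcome code instead
fractions = outcome.value_counts(normalize=True).sort_index()

print(counts)
print(by_value)
print(fractions)
```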
Check the values of another variable.
preg.birthwgt_lb.value_counts().sort_index()
0.0        8
1.0       40
2.0       53
3.0       98
4.0      229
5.0      697
6.0     2223
7.0     3049
8.0     1889
9.0      623
10.0     132
11.0      26
12.0      10
13.0       3
14.0       3
15.0       1
Name: birthwgt_lb, dtype: int64
Make a dictionary that maps from each respondent's caseid to a list of indices into the pregnancy DataFrame. Use it to select the pregnancy outcomes for a single respondent.
caseid = 10229
preg_map = nsfg.MakePregMap(preg)
indices = preg_map[caseid]
preg.outcome[indices].values
array([4, 4, 4, 4, 4, 4, 1])
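To make the idea of a caseid-to-indices map concrete, here is one way such a dictionary could be built. This is a sketch under assumptions: make_case_map is a hypothetical helper written for this example, not necessarily how nsfg.MakePregMap is implemented, and the DataFrame is made-up toy data.

```python
from collections import defaultdict

import pandas as pd


def make_case_map(df):
    """Map each caseid to the list of row labels where it appears.

    Hypothetical helper for illustration; it only assumes that df
    has a column named "caseid".
    """
    case_map = defaultdict(list)
    for index, caseid in df["caseid"].items():
        case_map[caseid].append(index)
    return case_map


# Toy pregnancy table: respondent 1 has two rows, respondent 2 has three.
toy = pd.DataFrame({"caseid": [1, 1, 2, 2, 2],
                    "outcome": [1, 4, 1, 1, 2]})

toy_map = make_case_map(toy)
outcomes_for_2 = toy["outcome"].loc[toy_map[2]].values

print(toy_map[1])
print(outcomes_for_2)
```

Building the map once costs a single pass over the table; after that, each lookup by caseid avoids rescanning every row.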
Select the birthord column, print the value counts, and compare to results published in the codebook.
pregBirthOrder = preg["birthord"]
print(pregBirthOrder.value_counts())
1.0     4413
2.0     2874
3.0     1234
4.0      421
5.0      126
6.0       50
7.0       20
8.0        7
9.0        2
10.0       1
Name: birthord, dtype: int64
We can also use isnull to count the number of NaNs.
preg.birthord.isnull().sum()
4445
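By default value_counts leaves NaN out, which is why the birthord counts above sum to 9148 and the remaining 4445 of the 13593 rows show up only through isnull. A minimal sketch on a made-up Series (toy values, not NSFG data) shows both ways of accounting for missing values.

```python
import numpy as np
import pandas as pd

# Toy column with two missing values, invented for illustration.
birthord = pd.Series([1.0, 2.0, np.nan, 1.0, np.nan])

n_missing = birthord.isnull().sum()                     # count of NaN entries
counts_without_nan = birthord.value_counts()            # NaN is skipped by default
counts_with_nan = birthord.value_counts(dropna=False)   # NaN gets its own row

print(n_missing)
print(counts_without_nan.sum(), counts_with_nan.sum())
```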
Select the prglngth column, print the value counts, and compare to results published in the codebook.
preg.prglngth.value_counts().sort_index()  # sort by pregnancy length so the counts are easy to compare with the codebook
0       15
1        9
2       78
3      151
4      412
5      181
6      543
7      175
8      409
9      594
10     137
11     202
12     170
13     446
14      29
15      39
16      44
17     253
18      17
19      34
20      18
21      37
22     147
23      12
24      31
25      15
26     117
27       8
28      38
29      23
30     198
31      29
32     122
33      50
34      60
35     357
36     329
37     457
38     609
39    4744
40    1120
41     591
42     328
43     148
44      46
45      10
46       1
47       1
48       7
50       2
Name: prglngth, dtype: int64
To compute the mean of a column, you can invoke the mean method on a Series. For example, here is the mean birthweight in pounds:
preg.totalwgt_lb.mean()
7.265628457623368
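Note that totalwgt_lb contains NaN for some rows, and the mean method skips those by default. A short sketch on invented weights (not NSFG data) shows what the default does and how to turn it off with skipna.

```python
import numpy as np
import pandas as pd

# Invented weights in pounds with one missing value.
weights = pd.Series([7.5, np.nan, 6.0, 9.0])

mean_default = weights.mean()                            # NaN is ignored
mean_manual = weights.dropna().sum() / weights.count()   # same thing, spelled out
mean_strict = weights.mean(skipna=False)                 # NaN propagates to the result

print(mean_default, mean_manual, mean_strict)
```

The default mean is therefore an average over the rows that have a recorded weight, not over all rows.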
Create a new column named totalwgt_kg that contains birth weight in kilograms. Compute its mean. Remember that when you create a new column, you have to use dictionary syntax, not dot notation.
preg["totalwgt_kg"] = preg.totalwgt_lb / 2.2  # 1 kg is about 2.2 lb, so divide (not multiply) to convert pounds to kilograms
round(preg.totalwgt_kg.mean(), 2)
3.3
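The reason for the dictionary-syntax rule is that assigning to a new name with dot notation sets an ordinary Python attribute on the DataFrame object rather than adding a column. The sketch below uses a made-up DataFrame with hypothetical column names (weeks, days, months) to show the difference; pandas issues a warning on the dot assignment, which is silenced here.

```python
import warnings

import pandas as pd

# Toy data with hypothetical column names, invented for illustration.
df = pd.DataFrame({"weeks": [39, 40]})

df["days"] = df["weeks"] * 7          # bracket syntax creates a real column

with warnings.catch_warnings():
    warnings.simplefilter("ignore")   # pandas warns about the next line
    df.months = df["weeks"] / 4.35    # sets an attribute; no column is added

columns = list(df.columns)
print(columns)
```

Dot notation is still fine for reading an existing column, as in preg.totalwgt_lb.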
nsfg.py also provides ReadFemResp, which reads the female respondents file and returns a DataFrame:
resp = nsfg.ReadFemResp()
DataFrame provides a method head that displays the first five rows:
resp.head()
| | caseid | rscrinf | rdormres | rostscrn | rscreenhisp | rscreenrace | age_a | age_r | cmbirth | agescrn | ... | pubassis_i | basewgt | adj_mod_basewgt | finalwgt | secu_r | sest | cmintvw | cmlstyr | screentime | intvlngth |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2298 | 1 | 5 | 5 | 1 | 5.0 | 27 | 27 | 902 | 27 | ... | 0 | 3247.916977 | 5123.759559 | 5556.717241 | 2 | 18 | 1234 | 1222 | 18:26:36 | 110.492667 |
1 | 5012 | 1 | 5 | 1 | 5 | 5.0 | 42 | 42 | 718 | 42 | ... | 0 | 2335.279149 | 2846.799490 | 4744.191350 | 2 | 18 | 1233 | 1221 | 16:30:59 | 64.294000 |
2 | 11586 | 1 | 5 | 1 | 5 | 5.0 | 43 | 43 | 708 | 43 | ... | 0 | 2335.279149 | 2846.799490 | 4744.191350 | 2 | 18 | 1234 | 1222 | 18:19:09 | 75.149167 |
3 | 6794 | 5 | 5 | 4 | 1 | 5.0 | 15 | 15 | 1042 | 15 | ... | 0 | 3783.152221 | 5071.464231 | 5923.977368 | 2 | 18 | 1234 | 1222 | 15:54:43 | 28.642833 |
4 | 616 | 1 | 5 | 4 | 1 | 5.0 | 20 | 20 | 991 | 20 | ... | 0 | 5341.329968 | 6437.335772 | 7229.128072 | 2 | 18 | 1233 | 1221 | 14:19:44 | 69.502667 |
5 rows × 3087 columns
Select the age_r column from resp and print the value counts. How old are the youngest and oldest respondents?
resp.age_r.value_counts().sort_index()
# The youngest respondents are 15 and the oldest are 44.
15    217
16    223
17    234
18    235
19    241
20    258
21    267
22    287
23    282
24    269
25    267
26    260
27    255
28    252
29    262
30    292
31    278
32    273
33    257
34    255
35    262
36    266
37    271
38    256
39    215
40    256
41    250
42    215
43    253
44    235
Name: age_r, dtype: int64
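Reading the first and last rows of the sorted counts works, but the extremes can also be computed directly. This is a small sketch on made-up ages (toy data, not the resp table) showing that min and max agree with the ends of the sorted value counts.

```python
import pandas as pd

# Invented ages for illustration.
age_r = pd.Series([27, 42, 43, 15, 20, 44, 15])

youngest = age_r.min()
oldest = age_r.max()

# The same answers can be read off the ends of the sorted value counts.
counts = age_r.value_counts().sort_index()
first_age, last_age = counts.index[0], counts.index[-1]

print(youngest, oldest)
print(first_age, last_age)
```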
We can use the caseid to match up rows from resp and preg. For example, we can select the row from resp for caseid 2298 like this:
resp.caseid==2298
0        True
1       False
2       False
3       False
4       False
5       False
6       False
7       False
8       False
9       False
10      False
11      False
12      False
13      False
14      False
15      False
16      False
17      False
18      False
19      False
20      False
21      False
22      False
23      False
24      False
25      False
26      False
27      False
28      False
29      False
...
7613    False
7614    False
7615    False
7616    False
7617    False
7618    False
7619    False
7620    False
7621    False
7622    False
7623    False
7624    False
7625    False
7626    False
7627    False
7628    False
7629    False
7630    False
7631    False
7632    False
7633    False
7634    False
7635    False
7636    False
7637    False
7638    False
7639    False
7640    False
7641    False
7642    False
Name: caseid, Length: 7643, dtype: bool
resp[resp.caseid==2298]  # the comparison yields a boolean Series; indexing with it keeps only the rows where it is True
| | caseid | rscrinf | rdormres | rostscrn | rscreenhisp | rscreenrace | age_a | age_r | cmbirth | agescrn | ... | pubassis_i | basewgt | adj_mod_basewgt | finalwgt | secu_r | sest | cmintvw | cmlstyr | screentime | intvlngth |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2298 | 1 | 5 | 5 | 1 | 5.0 | 27 | 27 | 902 | 27 | ... | 0 | 3247.916977 | 5123.759559 | 5556.717241 | 2 | 18 | 1234 | 1222 | 18:26:36 | 110.492667 |
1 rows × 3087 columns
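The selection above is boolean indexing: the comparison is evaluated for every row, producing a Series of True and False values, and indexing the DataFrame with that Series keeps only the rows marked True. The sketch below replays the two steps on a three-row toy table; the caseid and age_r values are copied from the head output shown earlier, and the name toy_resp is an assumption made for this example.

```python
import pandas as pd

# Three rows taken from the resp.head() output, with only two columns kept.
toy_resp = pd.DataFrame({"caseid": [2298, 5012, 11586],
                         "age_r": [27, 42, 43]})

mask = toy_resp["caseid"] == 2298    # boolean Series, one entry per row
selected = toy_resp[mask]            # keeps rows where the mask is True
same_rows = toy_resp.loc[mask]       # equivalent, with explicit .loc

print(list(mask))
print(selected)
```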
And we can get the corresponding rows from preg like this:
preg[preg.caseid==2298]
| | caseid | pregordr | howpreg_n | howpreg_p | moscurrp | nowprgdk | pregend1 | pregend2 | nbrnaliv | multbrth | ... | religion_i | metro_i | basewgt | adj_mod_basewgt | finalwgt | secu_p | sest | cmintvw | totalwgt_lb | totalwgt_kg |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2610 | 2298 | 1 | NaN | NaN | NaN | NaN | 6.0 | NaN | 1.0 | NaN | ... | 0 | 0 | 3247.916977 | 5123.759559 | 5556.717241 | 2 | 18 | NaN | 6.8750 | 3.125000 |
2611 | 2298 | 2 | NaN | NaN | NaN | NaN | 6.0 | NaN | 1.0 | NaN | ... | 0 | 0 | 3247.916977 | 5123.759559 | 5556.717241 | 2 | 18 | NaN | 5.5000 | 2.500000 |
2612 | 2298 | 3 | NaN | NaN | NaN | NaN | 6.0 | NaN | 1.0 | NaN | ... | 0 | 0 | 3247.916977 | 5123.759559 | 5556.717241 | 2 | 18 | NaN | 4.1875 | 1.903409 |
2613 | 2298 | 4 | NaN | NaN | NaN | NaN | 6.0 | NaN | 1.0 | NaN | ... | 0 | 0 | 3247.916977 | 5123.759559 | 5556.717241 | 2 | 18 | NaN | 6.8750 | 3.125000 |
4 rows × 245 columns
How old is the respondent with caseid 1?
resp[resp.caseid==1].age_r
1069    44
Name: age_r, dtype: int64
What are the pregnancy lengths for the respondent with caseid 2298?
preg[preg.caseid==2298].prglngth
2610    40
2611    36
2612    30
2613    40
Name: prglngth, dtype: int64
What was the birthweight of the first baby born to the respondent with caseid 5012?
preg[preg.caseid==5012].totalwgt_lb
5515    6.0
Name: totalwgt_lb, dtype: float64
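The query above returns a single row because this respondent has only one pregnancy on record. For a respondent with several pregnancies, one option is to add a condition on birthord. The following is a sketch on made-up data, assuming birthord is 1 for a respondent's first live birth and NaN for pregnancies that did not end in a live birth; the caseids and weights are invented.

```python
import numpy as np
import pandas as pd

# Toy pregnancy table: respondent 7 has one pregnancy without a live birth
# followed by two live births; respondent 8 has a single live birth.
toy_preg = pd.DataFrame({
    "caseid": [7, 7, 7, 8],
    "birthord": [np.nan, 1.0, 2.0, 1.0],
    "totalwgt_lb": [np.nan, 7.5, 8.25, 6.0],
})

# Combine two boolean masks with &; each comparison needs its own parentheses.
first_birth = toy_preg[(toy_preg.caseid == 7) & (toy_preg.birthord == 1)]
first_weight = first_birth.totalwgt_lb.iloc[0]

print(first_weight)
```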