This notebook contains a "one-day paper", my attempt to pose a research question, answer it, and publish the results in one work day.
Copyright 2016 Allen B. Downey
MIT License: https://opensource.org/licenses/MIT
from __future__ import print_function, division
import thinkstats2
import thinkplot
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
%matplotlib inline
According to Wikipedia, the Trivers-Willard hypothesis:
"...suggests that female mammals are able to adjust offspring sex ratio in response to their maternal condition. For example, it may predict greater parental investment in males by parents in 'good conditions' and greater investment in females by parents in 'poor conditions' (relative to parents in good condition)."
For humans, the hypothesis suggests that people with relatively high social status might be more likely to have boys. Some studies have reported evidence for this hypothesis, but based on my very casual survey of the literature, the evidence is not persuasive.
To test whether the T-W hypothesis holds up in humans, I downloaded birth data for the nearly 4 million babies born in the U.S. in 2014.
I selected variables that seemed likely to be related to social status and used logistic regression to identify variables associated with sex ratio.
Summary of results
Running regressions with one variable at a time, I find that many of the variables have a statistically significant effect on sex ratio, with the sign of the effect generally in the direction predicted by T-W.
However, many of the variables are also correlated with race. If we control for either the mother's race or the father's race, or both, most other variables have no additional predictive power.
Contrary to other reports, the age of the parents seems to have no predictive power.
Strangely, the variable that shows the strongest and most consistent relationship with sex ratio is the number of prenatal visits. Although prenatal visits seem like an obvious proxy for quality of health care and general socioeconomic status, the sign of the effect is the opposite of what T-W predicts: more prenatal visits strongly predict a lower sex ratio (more girls).
Following convention, I report sex ratio in terms of boys per 100 girls. The overall sex ratio at birth is about 105; that is, 105 boys are born for every 100 girls.
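The conversion from counts to this ratio is simple arithmetic; using the 2014 totals that appear later in the notebook (2,045,902 boys and 1,952,273 girls):

```python
# sex ratio = boys per 100 girls, using the 2014 U.S. birth counts
boys, girls = 2045902, 1952273
ratio = round(100 * boys / girls)
assert ratio == 105
```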
Here's how I loaded the data:
names = ['year', 'mager9', 'mnativ', 'restatus', 'mbrace', 'mhisp_r',
'mar_p', 'dmar', 'meduc', 'fagerrec11', 'fbrace', 'fhisp_r', 'feduc',
'lbo_rec', 'previs_rec', 'wic', 'height', 'bmi_r', 'pay_rec', 'sex']
colspecs = [(9, 12),
(79, 79),
(84, 84),
(104, 104),
(110, 110),
(115, 115),
(119, 119),
(120, 120),
(124, 124),
(149, 150),
(156, 156),
(160, 160),
(163, 163),
(179, 179),
(242, 243),
(251, 251),
(280, 281),
(287, 287),
(436, 436),
(475, 475),
]
colspecs = [(start-1, end) for start, end in colspecs]
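The positions in the NCHS data dictionary are 1-based and inclusive, while `read_fwf` expects 0-based, half-open intervals; that is what the adjustment above does. A minimal illustration with a made-up fixed-width record:

```python
# 1-based inclusive columns (9, 12) become the 0-based half-open slice [8:12]
record = "XXXXXXXX2014XXXX"   # hypothetical record with the year in positions 9-12
start, end = 9, 12
assert record[start - 1:end] == "2014"
```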
df = None
filename = 'Nat2014PublicUS.c20150514.r20151022.txt.gz'
#df = pd.read_fwf(filename, compression='gzip', header=None, names=names, colspecs=colspecs)
#df.head()
# store the dataframe for faster loading
#store = pd.HDFStore('store.h5')
#store['births2014'] = df
#store.close()
# load the dataframe
store = pd.HDFStore('store.h5')
df = store['births2014']
store.close()
def series_to_ratio(series):
    """Takes a boolean series and computes sex ratio (boys per 100 girls)."""
    boys = np.mean(series)
    return np.round(100 * boys / (1 - boys)).astype(int)
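As a self-contained sanity check (re-stating the function so it runs on its own), a hypothetical series with 105 boys and 100 girls should yield a ratio of exactly 105:

```python
import numpy as np
import pandas as pd

def series_to_ratio(series):
    """Takes a boolean series and computes sex ratio (boys per 100 girls)."""
    boys = np.mean(series)
    return np.round(100 * boys / (1 - boys)).astype(int)

# made-up data: 105 boys, 100 girls
s = pd.Series([True] * 105 + [False] * 100)
assert series_to_ratio(s) == 105
```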
I have to recode sex as 0 or 1 to make logit happy.
df['boy'] = (df.sex=='M').astype(int)
df.boy.value_counts().sort_index()
0 1952273 1 2045902 Name: boy, dtype: int64
All births are from 2014.
df.year.value_counts().sort_index()
2014 3998175 Name: year, dtype: int64
Mother's age:
df.mager9.value_counts().sort_index()
1 2777 2 249581 3 884246 4 1148469 5 1084064 6 510214 7 110318 8 7750 9 756 Name: mager9, dtype: int64
var = 'mager9'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)
boy | |
---|---|
mager9 | |
1 | 109 |
2 | 105 |
3 | 105 |
4 | 105 |
5 | 105 |
6 | 105 |
7 | 104 |
8 | 104 |
9 | 102 |
df.mager9.isnull().mean()
0.0
df['youngm'] = df.mager9<=2
df['oldm'] = df.mager9>=7
df.youngm.mean(), df.oldm.mean()
(0.06311829772333627, 0.029719559549044251)
Mother's nativity (1 = born in the U.S.)
df.mnativ.replace([3], np.nan, inplace=True)
df.mnativ.value_counts().sort_index()
1 3106689 2 881662 Name: mnativ, dtype: int64
var = 'mnativ'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)
boy | |
---|---|
mnativ | |
1 | 105 |
2 | 105 |
Residence status (1=resident)
df.restatus.value_counts().sort_index()
1 2873404 2 1025766 3 88906 4 10099 Name: restatus, dtype: int64
var = 'restatus'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)
boy | |
---|---|
restatus | |
1 | 105 |
2 | 105 |
3 | 106 |
4 | 106 |
Mother's race (1=White, 2=Black, 3=American Indian or Alaskan Native, 4=Asian or Pacific Islander)
df.mbrace.value_counts().sort_index()
1 3029013 2 641089 3 44962 4 283111 Name: mbrace, dtype: int64
var = 'mbrace'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)
boy | |
---|---|
mbrace | |
1 | 105 |
2 | 103 |
3 | 103 |
4 | 106 |
Mother's Hispanic origin (0=Non-Hispanic)
df.mhisp_r.replace([9], np.nan, inplace=True)
df.mhisp_r.value_counts().sort_index()
0 3045419 1 553738 2 69894 3 20165 4 136785 5 141497 Name: mhisp_r, dtype: int64
def copy_null(df, oldvar, newvar):
    """Propagate missing values from oldvar into the derived column newvar."""
    df.loc[df[oldvar].isnull(), newvar] = np.nan
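This helper is needed because a comparison like `df.mhisp_r > 0` silently turns NaN into False; `copy_null` restores the missingness in the derived column. A small self-contained sketch with made-up data:

```python
import numpy as np
import pandas as pd

def copy_null(df, oldvar, newvar):
    """Propagate missing values from oldvar into the derived column newvar."""
    df.loc[df[oldvar].isnull(), newvar] = np.nan

df = pd.DataFrame({'mhisp_r': [0.0, 2.0, np.nan]})
df['mhisp'] = df.mhisp_r > 0     # NaN compares as False here
copy_null(df, 'mhisp_r', 'mhisp')

assert bool(df.mhisp.iloc[1])         # true value preserved
assert df.mhisp.isnull().iloc[2]      # missingness restored
```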
df['mhisp'] = df.mhisp_r > 0
copy_null(df, 'mhisp_r', 'mhisp')
df.mhisp.isnull().mean(), df.mhisp.mean()
(0.0076727506925034546, 0.23240818268843488)
var = 'mhisp'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)
boy | |
---|---|
mhisp | |
0 | 105 |
1 | 104 |
Marital status (1=Married)
df.dmar.value_counts().sort_index()
1 2390630 2 1607545 Name: dmar, dtype: int64
var = 'dmar'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)
boy | |
---|---|
dmar | |
1 | 105 |
2 | 104 |
Paternity acknowledged, if unmarried (Y=yes, N=no, X=not applicable, U=unknown).
I recode X (not applicable because married) as Y (paternity acknowledged).
df.mar_p.replace(['U'], np.nan, inplace=True)
df.mar_p.replace(['X'], 'Y', inplace=True)
df.mar_p.value_counts().sort_index()
N 462627 Y 3386542 Name: mar_p, dtype: int64
var = 'mar_p'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)
boy | |
---|---|
mar_p | |
N | 103 |
Y | 105 |
Mother's education level
df.meduc.replace([9], np.nan, inplace=True)
df.meduc.value_counts().sort_index()
1 138589 2 437081 3 957265 4 815688 5 308384 6 732661 7 326800 8 94057 Name: meduc, dtype: int64
var = 'meduc'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)
boy | |
---|---|
meduc | |
1 | 104 |
2 | 104 |
3 | 105 |
4 | 105 |
5 | 105 |
6 | 105 |
7 | 105 |
8 | 104 |
df['lowed'] = df.meduc <= 2
copy_null(df, 'meduc', 'lowed')
df.lowed.isnull().mean(), df.lowed.mean()
(0.046933913598079122, 0.15107367095085322)
Father's age, in 10 ranges
df.fagerrec11.replace([11], np.nan, inplace=True)
df.fagerrec11.value_counts().sort_index()
1 277 2 84852 3 498779 4 869280 5 1025631 6 631685 7 262169 8 87432 9 28465 10 12490 Name: fagerrec11, dtype: int64
var = 'fagerrec11'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)
boy | |
---|---|
fagerrec11 | |
1 | 102 |
2 | 106 |
3 | 106 |
4 | 105 |
5 | 105 |
6 | 105 |
7 | 105 |
8 | 105 |
9 | 104 |
10 | 109 |
df['youngf'] = df.fagerrec11<=2
copy_null(df, 'fagerrec11', 'youngf')
df.youngf.isnull().mean(), df.youngf.mean()
(0.12433547806186572, 0.024315207394332003)
df['oldf'] = df.fagerrec11>=8
copy_null(df, 'fagerrec11', 'oldf')
df.oldf.isnull().mean(), df.oldf.mean()
(0.12433547806186572, 0.036670893957829916)
Father's race
df.fbrace.replace([9], np.nan, inplace=True)
df.fbrace.value_counts().sort_index()
1 2497901 2 482433 3 35408 4 238394 Name: fbrace, dtype: int64
var = 'fbrace'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)
boy | |
---|---|
fbrace | |
1 | 105 |
2 | 103 |
3 | 103 |
4 | 107 |
Father's Hispanic origin (0=Non-Hispanic; other values indicate country or region of origin)
df.fhisp_r.replace([9], np.nan, inplace=True)
df.fhisp_r.value_counts().sort_index()
0 2649007 1 493497 2 59137 3 19128 4 108111 5 124172 Name: fhisp_r, dtype: int64
df['fhisp'] = df.fhisp_r > 0
copy_null(df, 'fhisp_r', 'fhisp')
df.fhisp.isnull().mean(), df.fhisp.mean()
(0.13634295647389122, 0.23285053338322156)
var = 'fhisp'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)
boy | |
---|---|
fhisp | |
0 | 105 |
1 | 104 |
Father's education level
df.feduc.replace([9], np.nan, inplace=True)
df.feduc.value_counts().sort_index()
1 141654 2 342061 3 951980 4 643118 5 232622 6 616187 7 242022 8 109482 Name: feduc, dtype: int64
var = 'feduc'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)
boy | |
---|---|
feduc | |
1 | 104 |
2 | 105 |
3 | 105 |
4 | 105 |
5 | 106 |
6 | 105 |
7 | 105 |
8 | 105 |
Live birth order.
df.lbo_rec.replace([9], np.nan, inplace=True)
df.lbo_rec.value_counts().sort_index()
1 1555006 2 1270496 3 669016 4 284435 5 110708 6 46093 7 20786 8 21610 Name: lbo_rec, dtype: int64
var = 'lbo_rec'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)
boy | |
---|---|
lbo_rec | |
1 | 105 |
2 | 105 |
3 | 105 |
4 | 105 |
5 | 104 |
6 | 104 |
7 | 104 |
8 | 102 |
df['highbo'] = df.lbo_rec >= 5
copy_null(df, 'lbo_rec', 'highbo')
df.highbo.isnull().mean(), df.highbo.mean()
(0.0050085351441595226, 0.050072772519889897)
Number of prenatal visits, in 11 ranges
df.previs_rec.replace([12], np.nan, inplace=True)
df.previs_rec.value_counts().sort_index()
1 59670 2 44923 3 98141 4 201032 5 366887 6 826908 7 998330 8 684997 9 379305 10 99067 11 128805 Name: previs_rec, dtype: int64
df.previs_rec.mean()
df['previs'] = df.previs_rec - 7  # center near the mean so the intercept is easier to interpret
var = 'previs'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)
boy | |
---|---|
previs | |
-6 | 105 |
-5 | 107 |
-4 | 107 |
-3 | 108 |
-2 | 107 |
-1 | 106 |
0 | 105 |
1 | 103 |
2 | 102 |
3 | 102 |
4 | 102 |
df['no_previs'] = df.previs_rec <= 1
copy_null(df, 'previs_rec', 'no_previs')
df.no_previs.isnull().mean(), df.no_previs.mean()
(0.027540065154726845, 0.015346965650008423)
Whether the mother received WIC benefits (the Special Supplemental Nutrition Program for Women, Infants, and Children)
df.wic.replace(['U'], np.nan, inplace=True)
df.wic.value_counts().sort_index()
N 2124143 Y 1634978 Name: wic, dtype: int64
var = 'wic'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)
boy | |
---|---|
wic | |
N | 105 |
Y | 104 |
Mother's height in inches
df.height.replace([99], np.nan, inplace=True)
df.height.value_counts().sort_index()
30 28 31 1 34 2 36 14 37 7 38 7 39 7 40 6 41 10 42 13 43 3 44 8 45 11 46 14 47 22 48 857 49 544 50 357 51 422 52 493 53 1503 54 1414 55 2762 56 6678 57 18359 58 21019 59 81588 60 209490 61 269142 62 474306 63 485840 64 559249 65 453503 66 429253 67 334485 68 189690 69 127789 70 62364 71 33428 72 15323 73 5200 74 2538 75 1019 76 590 77 593 78 941 Name: height, dtype: int64
df['mshort'] = df.height<60
copy_null(df, 'height', 'mshort')
df.mshort.isnull().mean(), df.mshort.mean()
(0.051844404009329256, 0.0359147662344377)
df['mtall'] = df.height>=70
copy_null(df, 'height', 'mtall')
df.mtall.isnull().mean(), df.mtall.mean()
(0.051844404009329256, 0.03218134412692316)
var = 'mshort'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)
boy | |
---|---|
mshort | |
0 | 105 |
1 | 104 |
var = 'mtall'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)
boy | |
---|---|
mtall | |
0 | 105 |
1 | 104 |
Mother's BMI in 6 ranges
df.bmi_r.replace([9], np.nan, inplace=True)
df.bmi_r.value_counts().sort_index()
1 140142 2 1702519 3 949075 4 506017 5 242957 6 168515 Name: bmi_r, dtype: int64
var = 'bmi_r'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)
boy | |
---|---|
bmi_r | |
1 | 105 |
2 | 105 |
3 | 105 |
4 | 104 |
5 | 104 |
6 | 104 |
df['obese'] = df.bmi_r >= 4
copy_null(df, 'bmi_r', 'obese')
df.obese.isnull().mean(), df.obese.mean()
(0.07227047340349034, 0.2473532880857861)
Payment method (1=Medicaid, 2=Private insurance, 3=Self pay, 4=Other)
df.pay_rec.replace([9], np.nan, inplace=True)
df.pay_rec.value_counts().sort_index()
1 1665161 2 1824151 3 162650 4 167806 Name: pay_rec, dtype: int64
var = 'pay_rec'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)
boy | |
---|---|
pay_rec | |
1 | 104 |
2 | 105 |
3 | 107 |
4 | 105 |
Sex of baby
df.sex.value_counts().sort_index()
F 1952273 M 2045902 Name: sex, dtype: int64
Here are some functions I'll use to interpret the results of logistic regression:
def logodds_to_ratio(logodds):
    """Convert log odds to sex ratio (boys per 100 girls)."""
    odds = np.exp(logodds)
    return 100 * odds

def summarize(results):
    """Summarize parameters in terms of sex ratio."""
    inter_lo = results.params['Intercept']
    inter_rat = logodds_to_ratio(inter_lo)
    for value, lor in results.params.items():
        if value == 'Intercept':
            continue
        rat = logodds_to_ratio(inter_lo + lor)
        code = '*' if results.pvalues[value] < 0.05 else ' '
        print('%-20s %0.1f %0.1f' % (value, inter_rat, rat), code)
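As a quick check of the conversion (re-stating the helper so it runs on its own): an intercept log-odds of 0.0496, the value reported for the first model below, corresponds to a ratio of about 105.1.

```python
import numpy as np

def logodds_to_ratio(logodds):
    """Convert log odds to sex ratio (boys per 100 girls)."""
    return 100 * np.exp(logodds)

ratio = logodds_to_ratio(0.0496)
assert abs(ratio - 105.1) < 0.05
```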
Now I'll run models with each variable, one at a time.
Mother's age seems to have no predictive value:
model = smf.logit('boy ~ mager9', data=df)
results = model.fit()
summarize(results)
results.summary()
Optimization terminated successfully. Current function value: 0.692873 Iterations 3 mager9 105.1 105.0
Dep. Variable: | boy | No. Observations: | 3998175 |
---|---|---|---|
Model: | Logit | Df Residuals: | 3998173 |
Method: | MLE | Df Model: | 1 |
Date: | Tue, 17 May 2016 | Pseudo R-squ.: | 1.129e-07 |
Time: | 14:18:28 | Log-Likelihood: | -2.7702e+06 |
converged: | True | LL-Null: | -2.7702e+06 |
LLR p-value: | 0.4290 |
coef | std err | z | P>|z| | [95.0% Conf. Int.] | |
---|---|---|---|---|---|
Intercept | 0.0496 | 0.004 | 13.550 | 0.000 | 0.042 0.057 |
mager9 | -0.0007 | 0.001 | -0.791 | 0.429 | -0.002 0.001 |
The estimated ratio for young mothers is higher, and the ratio for older mothers is lower, but neither effect is statistically significant.
model = smf.logit('boy ~ youngm + oldm', data=df)
results = model.fit()
summarize(results)
results.summary()
Optimization terminated successfully. Current function value: 0.692873 Iterations 3 youngm[T.True] 104.8 104.9 oldm[T.True] 104.8 103.9
Dep. Variable: | boy | No. Observations: | 3998175 |
---|---|---|---|
Model: | Logit | Df Residuals: | 3998172 |
Method: | MLE | Df Model: | 2 |
Date: | Tue, 17 May 2016 | Pseudo R-squ.: | 3.813e-07 |
Time: | 14:18:33 | Log-Likelihood: | -2.7702e+06 |
converged: | True | LL-Null: | -2.7702e+06 |
LLR p-value: | 0.3478 |
coef | std err | z | P>|z| | [95.0% Conf. Int.] | |
---|---|---|---|---|---|
Intercept | 0.0470 | 0.001 | 44.772 | 0.000 | 0.045 0.049 |
youngm[T.True] | 0.0010 | 0.004 | 0.240 | 0.810 | -0.007 0.009 |
oldm[T.True] | -0.0084 | 0.006 | -1.421 | 0.155 | -0.020 0.003 |
Whether the mother was born in the U.S. has no predictive value.
model = smf.logit('boy ~ C(mnativ)', data=df)
results = model.fit()
summarize(results)
results.summary()
Optimization terminated successfully. Current function value: 0.692873 Iterations 3 C(mnativ)[T.2.0] 104.8 104.9
Dep. Variable: | boy | No. Observations: | 3988351 |
---|---|---|---|
Model: | Logit | Df Residuals: | 3988349 |
Method: | MLE | Df Model: | 1 |
Date: | Tue, 17 May 2016 | Pseudo R-squ.: | 4.566e-08 |
Time: | 14:19:00 | Log-Likelihood: | -2.7634e+06 |
converged: | True | LL-Null: | -2.7634e+06 |
LLR p-value: | 0.6154 |
coef | std err | z | P>|z| | [95.0% Conf. Int.] | |
---|---|---|---|---|---|
Intercept | 0.0466 | 0.001 | 41.050 | 0.000 | 0.044 0.049 |
C(mnativ)[T.2.0] | 0.0012 | 0.002 | 0.502 | 0.615 | -0.004 0.006 |
Neither does residence status.
model = smf.logit('boy ~ C(restatus)', data=df)
results = model.fit()
summarize(results)
results.summary()
Optimization terminated successfully. Current function value: 0.692872 Iterations 3 C(restatus)[T.2] 104.8 104.7 C(restatus)[T.3] 104.8 106.0 C(restatus)[T.4] 104.8 106.2
Dep. Variable: | boy | No. Observations: | 3998175 |
---|---|---|---|
Model: | Logit | Df Residuals: | 3998171 |
Method: | MLE | Df Model: | 3 |
Date: | Tue, 17 May 2016 | Pseudo R-squ.: | 6.716e-07 |
Time: | 14:19:28 | Log-Likelihood: | -2.7702e+06 |
converged: | True | LL-Null: | -2.7702e+06 |
LLR p-value: | 0.2932 |
coef | std err | z | P>|z| | [95.0% Conf. Int.] | |
---|---|---|---|---|---|
Intercept | 0.0468 | 0.001 | 39.653 | 0.000 | 0.044 0.049 |
C(restatus)[T.2] | -0.0010 | 0.002 | -0.418 | 0.676 | -0.005 0.004 |
C(restatus)[T.3] | 0.0117 | 0.007 | 1.718 | 0.086 | -0.002 0.025 |
C(restatus)[T.4] | 0.0132 | 0.020 | 0.663 | 0.507 | -0.026 0.052 |
Mother's race seems to have predictive value. Relative to whites, black and Native American mothers have more girls; Asians have more boys.
model = smf.logit('boy ~ C(mbrace)', data=df)
results = model.fit()
summarize(results)
results.summary()
Optimization terminated successfully. Current function value: 0.692863 Iterations 3 C(mbrace)[T.2] 105.1 102.9 * C(mbrace)[T.3] 105.1 103.1 * C(mbrace)[T.4] 105.1 106.3 *
Dep. Variable: | boy | No. Observations: | 3998175 |
---|---|---|---|
Model: | Logit | Df Residuals: | 3998171 |
Method: | MLE | Df Model: | 3 |
Date: | Tue, 17 May 2016 | Pseudo R-squ.: | 1.401e-05 |
Time: | 14:19:55 | Log-Likelihood: | -2.7702e+06 |
converged: | True | LL-Null: | -2.7702e+06 |
LLR p-value: | 1.007e-16 |
coef | std err | z | P>|z| | [95.0% Conf. Int.] | |
---|---|---|---|---|---|
Intercept | 0.0497 | 0.001 | 43.250 | 0.000 | 0.047 0.052 |
C(mbrace)[T.2] | -0.0214 | 0.003 | -7.770 | 0.000 | -0.027 -0.016 |
C(mbrace)[T.3] | -0.0195 | 0.010 | -2.049 | 0.041 | -0.038 -0.001 |
C(mbrace)[T.4] | 0.0109 | 0.004 | 2.777 | 0.005 | 0.003 0.019 |
Hispanic mothers have more girls.
model = smf.logit('boy ~ mhisp', data=df)
results = model.fit()
summarize(results)
results.summary()
Optimization terminated successfully. Current function value: 0.692874 Iterations 3 mhisp 105.0 104.1 *
Dep. Variable: | boy | No. Observations: | 3967498 |
---|---|---|---|
Model: | Logit | Df Residuals: | 3967496 |
Method: | MLE | Df Model: | 1 |
Date: | Tue, 17 May 2016 | Pseudo R-squ.: | 1.998e-06 |
Time: | 14:19:59 | Log-Likelihood: | -2.7490e+06 |
converged: | True | LL-Null: | -2.7490e+06 |
LLR p-value: | 0.0009174 |
coef | std err | z | P>|z| | [95.0% Conf. Int.] | |
---|---|---|---|---|---|
Intercept | 0.0485 | 0.001 | 42.263 | 0.000 | 0.046 0.051 |
mhisp | -0.0079 | 0.002 | -3.315 | 0.001 | -0.013 -0.003 |
If the mother is married, or unmarried with paternity acknowledged, the sex ratio is higher (more boys).
model = smf.logit('boy ~ C(mar_p)', data=df)
results = model.fit()
summarize(results)
results.summary()
Optimization terminated successfully. Current function value: 0.692864 Iterations 3 C(mar_p)[T.Y] 102.8 105.1 *
Dep. Variable: | boy | No. Observations: | 3849169 |
---|---|---|---|
Model: | Logit | Df Residuals: | 3849167 |
Method: | MLE | Df Model: | 1 |
Date: | Tue, 17 May 2016 | Pseudo R-squ.: | 9.129e-06 |
Time: | 14:20:27 | Log-Likelihood: | -2.6670e+06 |
converged: | True | LL-Null: | -2.6670e+06 |
LLR p-value: | 2.990e-12 |
coef | std err | z | P>|z| | [95.0% Conf. Int.] | |
---|---|---|---|---|---|
Intercept | 0.0278 | 0.003 | 9.446 | 0.000 | 0.022 0.034 |
C(mar_p)[T.Y] | 0.0219 | 0.003 | 6.978 | 0.000 | 0.016 0.028 |
Being unmarried predicts more girls.
model = smf.logit('boy ~ C(dmar)', data=df)
results = model.fit()
summarize(results)
results.summary()
Optimization terminated successfully. Current function value: 0.692871 Iterations 3 C(dmar)[T.2] 105.1 104.3 *
Dep. Variable: | boy | No. Observations: | 3998175 |
---|---|---|---|
Model: | Logit | Df Residuals: | 3998173 |
Method: | MLE | Df Model: | 1 |
Date: | Tue, 17 May 2016 | Pseudo R-squ.: | 3.001e-06 |
Time: | 14:20:54 | Log-Likelihood: | -2.7702e+06 |
converged: | True | LL-Null: | -2.7702e+06 |
LLR p-value: | 4.555e-05 |
coef | std err | z | P>|z| | [95.0% Conf. Int.] | |
---|---|---|---|---|---|
Intercept | 0.0502 | 0.001 | 38.789 | 0.000 | 0.048 0.053 |
C(dmar)[T.2] | -0.0083 | 0.002 | -4.077 | 0.000 | -0.012 -0.004 |
Each level of mother's education predicts a small increase in the probability of a boy.
model = smf.logit('boy ~ meduc', data=df)
results = model.fit()
summarize(results)
results.summary()
Optimization terminated successfully. Current function value: 0.692874 Iterations 3 meduc 104.1 104.2 *
Dep. Variable: | boy | No. Observations: | 3810525 |
---|---|---|---|
Model: | Logit | Df Residuals: | 3810523 |
Method: | MLE | Df Model: | 1 |
Date: | Tue, 17 May 2016 | Pseudo R-squ.: | 1.416e-06 |
Time: | 14:20:59 | Log-Likelihood: | -2.6402e+06 |
converged: | True | LL-Null: | -2.6402e+06 |
LLR p-value: | 0.006248 |
coef | std err | z | P>|z| | [95.0% Conf. Int.] | |
---|---|---|---|---|---|
Intercept | 0.0398 | 0.003 | 14.711 | 0.000 | 0.034 0.045 |
meduc | 0.0016 | 0.001 | 2.734 | 0.006 | 0.000 0.003 |
model = smf.logit('boy ~ lowed', data=df)
results = model.fit()
summarize(results)
results.summary()
Optimization terminated successfully. Current function value: 0.692874 Iterations 3 lowed 104.9 104.1 *
Dep. Variable: | boy | No. Observations: | 3810525 |
---|---|---|---|
Model: | Logit | Df Residuals: | 3810523 |
Method: | MLE | Df Model: | 1 |
Date: | Tue, 17 May 2016 | Pseudo R-squ.: | 1.431e-06 |
Time: | 14:21:03 | Log-Likelihood: | -2.6402e+06 |
converged: | True | LL-Null: | -2.6402e+06 |
LLR p-value: | 0.005983 |
coef | std err | z | P>|z| | [95.0% Conf. Int.] | |
---|---|---|---|---|---|
Intercept | 0.0478 | 0.001 | 43.002 | 0.000 | 0.046 0.050 |
lowed | -0.0079 | 0.003 | -2.749 | 0.006 | -0.013 -0.002 |
Older fathers are slightly more likely to have girls (but this apparent effect could be due to chance).
model = smf.logit('boy ~ fagerrec11', data=df)
results = model.fit()
summarize(results)
results.summary()
Optimization terminated successfully. Current function value: 0.692840 Iterations 3 fagerrec11 105.9 105.7 *
Dep. Variable: | boy | No. Observations: | 3501060 |
---|---|---|---|
Model: | Logit | Df Residuals: | 3501058 |
Method: | MLE | Df Model: | 1 |
Date: | Tue, 17 May 2016 | Pseudo R-squ.: | 8.226e-07 |
Time: | 14:21:08 | Log-Likelihood: | -2.4257e+06 |
converged: | True | LL-Null: | -2.4257e+06 |
LLR p-value: | 0.04575 |
coef | std err | z | P>|z| | [95.0% Conf. Int.] | |
---|---|---|---|---|---|
Intercept | 0.0570 | 0.004 | 14.707 | 0.000 | 0.049 0.065 |
fagerrec11 | -0.0015 | 0.001 | -1.998 | 0.046 | -0.003 -2.9e-05 |
model = smf.logit('boy ~ youngf + oldf', data=df)
results = model.fit()
summarize(results)
results.summary()
Optimization terminated successfully. Current function value: 0.692840 Iterations 3 youngf 105.1 106.3 oldf 105.1 105.0
Dep. Variable: | boy | No. Observations: | 3501060 |
---|---|---|---|
Model: | Logit | Df Residuals: | 3501057 |
Method: | MLE | Df Model: | 2 |
Date: | Tue, 17 May 2016 | Pseudo R-squ.: | 5.807e-07 |
Time: | 14:21:12 | Log-Likelihood: | -2.4257e+06 |
converged: | True | LL-Null: | -2.4257e+06 |
LLR p-value: | 0.2445 |
coef | std err | z | P>|z| | [95.0% Conf. Int.] | |
---|---|---|---|---|---|
Intercept | 0.0493 | 0.001 | 44.656 | 0.000 | 0.047 0.051 |
youngf | 0.0116 | 0.007 | 1.673 | 0.094 | -0.002 0.025 |
oldf | -0.0005 | 0.006 | -0.086 | 0.932 | -0.012 0.011 |
Predictions based on father's race are similar to those based on mother's race: more girls for black and Native American fathers; more boys for Asian fathers.
model = smf.logit('boy ~ C(fbrace)', data=df)
results = model.fit()
summarize(results)
results.summary()
Optimization terminated successfully. Current function value: 0.692818 Iterations 3 C(fbrace)[T.2.0] 105.5 103.1 * C(fbrace)[T.3.0] 105.5 102.9 * C(fbrace)[T.4.0] 105.5 106.6 *
Dep. Variable: | boy | No. Observations: | 3254136 |
---|---|---|---|
Model: | Logit | Df Residuals: | 3254132 |
Method: | MLE | Df Model: | 3 |
Date: | Tue, 17 May 2016 | Pseudo R-squ.: | 1.504e-05 |
Time: | 14:21:38 | Log-Likelihood: | -2.2545e+06 |
converged: | True | LL-Null: | -2.2546e+06 |
LLR p-value: | 1.256e-14 |
coef | std err | z | P>|z| | [95.0% Conf. Int.] | |
---|---|---|---|---|---|
Intercept | 0.0533 | 0.001 | 42.144 | 0.000 | 0.051 0.056 |
C(fbrace)[T.2.0] | -0.0227 | 0.003 | -7.221 | 0.000 | -0.029 -0.017 |
C(fbrace)[T.3.0] | -0.0250 | 0.011 | -2.335 | 0.020 | -0.046 -0.004 |
C(fbrace)[T.4.0] | 0.0106 | 0.004 | 2.479 | 0.013 | 0.002 0.019 |
If the father is Hispanic, that predicts more girls.
model = smf.logit('boy ~ fhisp', data=df)
results = model.fit()
summarize(results)
results.summary()
Optimization terminated successfully. Current function value: 0.692839 Iterations 3 fhisp 105.4 104.0 *
Dep. Variable: | boy | No. Observations: | 3453052 |
---|---|---|---|
Model: | Logit | Df Residuals: | 3453050 |
Method: | MLE | Df Model: | 1 |
Date: | Tue, 17 May 2016 | Pseudo R-squ.: | 5.800e-06 |
Time: | 14:21:42 | Log-Likelihood: | -2.3924e+06 |
converged: | True | LL-Null: | -2.3924e+06 |
LLR p-value: | 1.378e-07 |
coef | std err | z | P>|z| | [95.0% Conf. Int.] | |
---|---|---|---|---|---|
Intercept | 0.0525 | 0.001 | 42.696 | 0.000 | 0.050 0.055 |
fhisp | -0.0134 | 0.003 | -5.268 | 0.000 | -0.018 -0.008 |
Father's education level might predict more boys, but the apparent effect could be due to chance.
model = smf.logit('boy ~ feduc', data=df)
results = model.fit()
summarize(results)
results.summary()
Optimization terminated successfully. Current function value: 0.692840 Iterations 3 feduc 104.6 104.7
Dep. Variable: | boy | No. Observations: | 3279126 |
---|---|---|---|
Model: | Logit | Df Residuals: | 3279124 |
Method: | MLE | Df Model: | 1 |
Date: | Tue, 17 May 2016 | Pseudo R-squ.: | 8.046e-07 |
Time: | 14:21:46 | Log-Likelihood: | -2.2719e+06 |
converged: | True | LL-Null: | -2.2719e+06 |
LLR p-value: | 0.05587 |
coef | std err | z | P>|z| | [95.0% Conf. Int.] | |
---|---|---|---|---|---|
Intercept | 0.0445 | 0.003 | 15.630 | 0.000 | 0.039 0.050 |
feduc | 0.0012 | 0.001 | 1.912 | 0.056 | -3.02e-05 0.002 |
Babies with high birth order are slightly more likely to be girls.
model = smf.logit('boy ~ lbo_rec', data=df)
results = model.fit()
summarize(results)
results.summary()
Optimization terminated successfully. Current function value: 0.692872 Iterations 3 lbo_rec 105.3 105.1 *
Dep. Variable: | boy | No. Observations: | 3978150 |
---|---|---|---|
Model: | Logit | Df Residuals: | 3978148 |
Method: | MLE | Df Model: | 1 |
Date: | Tue, 17 May 2016 | Pseudo R-squ.: | 1.576e-06 |
Time: | 14:21:51 | Log-Likelihood: | -2.7563e+06 |
converged: | True | LL-Null: | -2.7564e+06 |
LLR p-value: | 0.003206 |
coef | std err | z | P>|z| | [95.0% Conf. Int.] | |
---|---|---|---|---|---|
Intercept | 0.0518 | 0.002 | 26.529 | 0.000 | 0.048 0.056 |
lbo_rec | -0.0023 | 0.001 | -2.947 | 0.003 | -0.004 -0.001 |
model = smf.logit('boy ~ highbo', data=df)
results = model.fit()
summarize(results)
results.summary()
Optimization terminated successfully. Current function value: 0.692872 Iterations 3 highbo 104.9 103.4 *
Dep. Variable: | boy | No. Observations: | 3978150 |
---|---|---|---|
Model: | Logit | Df Residuals: | 3978148 |
Method: | MLE | Df Model: | 1 |
Date: | Tue, 17 May 2016 | Pseudo R-squ.: | 1.647e-06 |
Time: | 14:21:56 | Log-Likelihood: | -2.7563e+06 |
converged: | True | LL-Null: | -2.7564e+06 |
LLR p-value: | 0.002584 |
coef | std err | z | P>|z| | [95.0% Conf. Int.] | |
---|---|---|---|---|---|
Intercept | 0.0475 | 0.001 | 46.200 | 0.000 | 0.046 0.050 |
highbo | -0.0139 | 0.005 | -3.013 | 0.003 | -0.023 -0.005 |
Strangely, prenatal visits are associated with an increased probability of girls.
model = smf.logit('boy ~ previs', data=df)
results = model.fit()
summarize(results)
results.summary()
Optimization terminated successfully. Current function value: 0.692847 Iterations 3 previs 104.6 103.8 *
Dep. Variable: | boy | No. Observations: | 3888065 |
---|---|---|---|
Model: | Logit | Df Residuals: | 3888063 |
Method: | MLE | Df Model: | 1 |
Date: | Tue, 17 May 2016 | Pseudo R-squ.: | 3.975e-05 |
Time: | 14:22:01 | Log-Likelihood: | -2.6938e+06 |
converged: | True | LL-Null: | -2.6939e+06 |
LLR p-value: | 1.677e-48 |
coef | std err | z | P>|z| | [95.0% Conf. Int.] | |
---|---|---|---|---|---|
Intercept | 0.0449 | 0.001 | 43.933 | 0.000 | 0.043 0.047 |
previs | -0.0079 | 0.001 | -14.634 | 0.000 | -0.009 -0.007 |
The effect seems to be non-linear at zero, so I'm adding a boolean for no prenatal visits.
model = smf.logit('boy ~ no_previs + previs', data=df)
results = model.fit()
summarize(results)
results.summary()
Optimization terminated successfully. Current function value: 0.692842 Iterations 3 no_previs 104.6 98.9 * previs 104.6 103.7 *
Dep. Variable: | boy | No. Observations: | 3888065 |
---|---|---|---|
Model: | Logit | Df Residuals: | 3888062 |
Method: | MLE | Df Model: | 2 |
Date: | Tue, 17 May 2016 | Pseudo R-squ.: | 4.717e-05 |
Time: | 14:22:07 | Log-Likelihood: | -2.6938e+06 |
converged: | True | LL-Null: | -2.6939e+06 |
LLR p-value: | 6.538e-56 |
coef | std err | z | P>|z| | [95.0% Conf. Int.] | |
---|---|---|---|---|---|
Intercept | 0.0454 | 0.001 | 44.310 | 0.000 | 0.043 0.047 |
no_previs | -0.0564 | 0.009 | -6.322 | 0.000 | -0.074 -0.039 |
previs | -0.0093 | 0.001 | -15.938 | 0.000 | -0.010 -0.008 |
If the mother received WIC benefits, she is more likely to have a girl.
model = smf.logit('boy ~ wic', data=df)
results = model.fit()
summarize(results)
results.summary()
Optimization terminated successfully. Current function value: 0.692869 Iterations 3 wic[T.Y] 105.2 104.3 *
Dep. Variable: | boy | No. Observations: | 3759121 |
---|---|---|---|
Model: | Logit | Df Residuals: | 3759119 |
Method: | MLE | Df Model: | 1 |
Date: | Tue, 17 May 2016 | Pseudo R-squ.: | 3.051e-06 |
Time: | 14:22:35 | Log-Likelihood: | -2.6046e+06 |
converged: | True | LL-Null: | -2.6046e+06 |
LLR p-value: | 6.700e-05 |
coef | std err | z | P>|z| | [95.0% Conf. Int.] | |
---|---|---|---|---|---|
Intercept | 0.0506 | 0.001 | 36.886 | 0.000 | 0.048 0.053 |
wic[T.Y] | -0.0083 | 0.002 | -3.987 | 0.000 | -0.012 -0.004 |
Mother's height seems to have no predictive value.
model = smf.logit('boy ~ height', data=df)
results = model.fit()
summarize(results)
results.summary()
Optimization terminated successfully. Current function value: 0.692873 Iterations 3 height 102.4 102.5
Dep. Variable: | boy | No. Observations: | 3790892 |
---|---|---|---|
Model: | Logit | Df Residuals: | 3790890 |
Method: | MLE | Df Model: | 1 |
Date: | Tue, 17 May 2016 | Pseudo R-squ.: | 1.853e-07 |
Time: | 14:22:39 | Log-Likelihood: | -2.6266e+06 |
converged: | True | LL-Null: | -2.6266e+06 |
LLR p-value: | 0.3238 |
coef | std err | z | P>|z| | [95.0% Conf. Int.] | |
---|---|---|---|---|---|
Intercept | 0.0240 | 0.023 | 1.038 | 0.299 | -0.021 0.069 |
height | 0.0004 | 0.000 | 0.987 | 0.324 | -0.000 0.001 |
model = smf.logit('boy ~ mtall + mshort', data=df)
results = model.fit()
summarize(results)
results.summary()
Optimization terminated successfully. Current function value: 0.692872 Iterations 3 mtall 104.8 104.1 mshort 104.8 104.3
Dep. Variable: | boy | No. Observations: | 3790892 |
---|---|---|---|
Model: | Logit | Df Residuals: | 3790889 |
Method: | MLE | Df Model: | 2 |
Date: | Tue, 17 May 2016 | Pseudo R-squ.: | 4.560e-07 |
Time: | 14:22:43 | Log-Likelihood: | -2.6266e+06 |
converged: | True | LL-Null: | -2.6266e+06 |
LLR p-value: | 0.3019 |
coef | std err | z | P>|z| | [95.0% Conf. Int.] | |
---|---|---|---|---|---|
Intercept | 0.0473 | 0.001 | 44.433 | 0.000 | 0.045 0.049 |
mtall | -0.0071 | 0.006 | -1.212 | 0.226 | -0.018 0.004 |
mshort | -0.0056 | 0.006 | -1.005 | 0.315 | -0.016 0.005 |
Mothers with higher BMI are more likely to have girls.
model = smf.logit('boy ~ bmi_r', data=df)
results = model.fit()
summarize(results)
results.summary()
Optimization terminated successfully. Current function value: 0.692870 Iterations 3 bmi_r 105.7 105.4 *
Dep. Variable: | boy | No. Observations: | 3709225 |
---|---|---|---|
Model: | Logit | Df Residuals: | 3709223 |
Method: | MLE | Df Model: | 1 |
Date: | Tue, 17 May 2016 | Pseudo R-squ.: | 2.168e-06 |
Time: | 14:22:48 | Log-Likelihood: | -2.5700e+06 |
converged: | True | LL-Null: | -2.5700e+06 |
LLR p-value: | 0.0008442 |
 | coef | std err | z | P>\|z\| | [95.0% Conf. Int.]
---|---|---|---|---|---
Intercept | 0.0554 | 0.003 | 20.336 | 0.000 | 0.050 0.061 |
bmi_r | -0.0029 | 0.001 | -3.338 | 0.001 | -0.005 -0.001 |
model = smf.logit('boy ~ obese', data=df)
results = model.fit()
summarize(results)
results.summary()
Optimization terminated successfully.
Current function value: 0.692870
Iterations 3

obese 105.0 104.2 *
Dep. Variable: | boy | No. Observations: | 3709225 |
---|---|---|---|
Model: | Logit | Df Residuals: | 3709223 |
Method: | MLE | Df Model: | 1 |
Date: | Tue, 17 May 2016 | Pseudo R-squ.: | 2.347e-06 |
Time: | 14:22:53 | Log-Likelihood: | -2.5700e+06 |
converged: | True | LL-Null: | -2.5700e+06 |
LLR p-value: | 0.0005139 |
 | coef | std err | z | P>\|z\| | [95.0% Conf. Int.]
---|---|---|---|---|---
Intercept | 0.0491 | 0.001 | 40.976 | 0.000 | 0.047 0.051 |
obese | -0.0084 | 0.002 | -3.473 | 0.001 | -0.013 -0.004 |
If payment was made by Medicaid, the baby is more likely to be a girl. Private insurance, self-payment, and other payment methods are associated with more boys.
model = smf.logit('boy ~ C(pay_rec)', data=df)
results = model.fit()
summarize(results)
results.summary()
Optimization terminated successfully.
Current function value: 0.692869
Iterations 3

C(pay_rec)[T.2.0] 104.2 105.1 *
C(pay_rec)[T.3.0] 104.2 106.6 *
C(pay_rec)[T.4.0] 104.2 104.7
Dep. Variable: | boy | No. Observations: | 3819768 |
---|---|---|---|
Model: | Logit | Df Residuals: | 3819764 |
Method: | MLE | Df Model: | 3 |
Date: | Tue, 17 May 2016 | Pseudo R-squ.: | 5.306e-06 |
Time: | 14:23:19 | Log-Likelihood: | -2.6466e+06 |
converged: | True | LL-Null: | -2.6466e+06 |
LLR p-value: | 3.482e-06 |
 | coef | std err | z | P>\|z\| | [95.0% Conf. Int.]
---|---|---|---|---|---
Intercept | 0.0416 | 0.002 | 26.840 | 0.000 | 0.039 0.045 |
C(pay_rec)[T.2.0] | 0.0085 | 0.002 | 3.982 | 0.000 | 0.004 0.013 |
C(pay_rec)[T.3.0] | 0.0222 | 0.005 | 4.272 | 0.000 | 0.012 0.032 |
C(pay_rec)[T.4.0] | 0.0047 | 0.005 | 0.925 | 0.355 | -0.005 0.015 |
However, none of the previous results should be taken too seriously. Because we tested only one variable at a time, many of these apparent effects disappear when we add control variables.
In particular, if we control for father's race and Hispanic origin, the mother's race has no additional predictive value.
formula = ('boy ~ C(fbrace) + fhisp + C(mbrace) + mhisp')
model = smf.logit(formula, data=df)
results = model.fit()
summarize(results)
results.summary()
Optimization terminated successfully.
Current function value: 0.692816
Iterations 3

C(fbrace)[T.2.0] 105.8 103.1 *
C(fbrace)[T.3.0] 105.8 103.5
C(fbrace)[T.4.0] 105.8 106.9
C(mbrace)[T.2] 105.8 105.9
C(mbrace)[T.3] 105.8 104.5
C(mbrace)[T.4] 105.8 105.6
fhisp 105.8 104.2 *
mhisp 105.8 106.0
Dep. Variable: | boy | No. Observations: | 3231530 |
---|---|---|---|
Model: | Logit | Df Residuals: | 3231521 |
Method: | MLE | Df Model: | 8 |
Date: | Tue, 17 May 2016 | Pseudo R-squ.: | 2.087e-05 |
Time: | 14:24:08 | Log-Likelihood: | -2.2389e+06 |
converged: | True | LL-Null: | -2.2389e+06 |
LLR p-value: | 9.292e-17 |
 | coef | std err | z | P>\|z\| | [95.0% Conf. Int.]
---|---|---|---|---|---
Intercept | 0.0566 | 0.001 | 38.234 | 0.000 | 0.054 0.060 |
C(fbrace)[T.2.0] | -0.0260 | 0.006 | -4.668 | 0.000 | -0.037 -0.015 |
C(fbrace)[T.3.0] | -0.0221 | 0.012 | -1.793 | 0.073 | -0.046 0.002 |
C(fbrace)[T.4.0] | 0.0097 | 0.007 | 1.344 | 0.179 | -0.004 0.024 |
C(mbrace)[T.2] | 0.0004 | 0.006 | 0.075 | 0.940 | -0.011 0.012 |
C(mbrace)[T.3] | -0.0130 | 0.013 | -0.994 | 0.320 | -0.039 0.013 |
C(mbrace)[T.4] | -0.0026 | 0.007 | -0.375 | 0.708 | -0.016 0.011 |
fhisp | -0.0156 | 0.004 | -3.591 | 0.000 | -0.024 -0.007 |
mhisp | 0.0018 | 0.004 | 0.422 | 0.673 | -0.007 0.010 |
In fact, once we control for father's race and Hispanic origin, almost every other variable becomes statistically insignificant, including acknowledged paternity.
formula = ('boy ~ C(fbrace) + fhisp + mar_p')
model = smf.logit(formula, data=df)
results = model.fit()
summarize(results)
results.summary()
Optimization terminated successfully.
Current function value: 0.692814
Iterations 3

C(fbrace)[T.2.0] 108.2 105.5 *
C(fbrace)[T.3.0] 108.2 105.2 *
C(fbrace)[T.4.0] 108.2 109.1
mar_p[T.Y] 108.2 105.8
fhisp 108.2 106.7 *
Dep. Variable: | boy | No. Observations: | 3112362 |
---|---|---|---|
Model: | Logit | Df Residuals: | 3112356 |
Method: | MLE | Df Model: | 5 |
Date: | Tue, 17 May 2016 | Pseudo R-squ.: | 2.117e-05 |
Time: | 14:24:56 | Log-Likelihood: | -2.1563e+06 |
converged: | True | LL-Null: | -2.1563e+06 |
LLR p-value: | 3.558e-18 |
 | coef | std err | z | P>\|z\| | [95.0% Conf. Int.]
---|---|---|---|---|---
Intercept | 0.0792 | 0.015 | 5.155 | 0.000 | 0.049 0.109 |
C(fbrace)[T.2.0] | -0.0258 | 0.003 | -7.860 | 0.000 | -0.032 -0.019 |
C(fbrace)[T.3.0] | -0.0283 | 0.011 | -2.594 | 0.009 | -0.050 -0.007 |
C(fbrace)[T.4.0] | 0.0074 | 0.004 | 1.662 | 0.097 | -0.001 0.016 |
mar_p[T.Y] | -0.0225 | 0.015 | -1.464 | 0.143 | -0.053 0.008 |
fhisp | -0.0148 | 0.003 | -4.982 | 0.000 | -0.021 -0.009 |
Being married still predicts more boys.
formula = ('boy ~ C(fbrace) + fhisp + dmar')
model = smf.logit(formula, data=df)
results = model.fit()
summarize(results)
results.summary()
Optimization terminated successfully.
Current function value: 0.692814
Iterations 3

C(fbrace)[T.2.0] 105.0 102.2 *
C(fbrace)[T.3.0] 105.0 101.9 *
C(fbrace)[T.4.0] 105.0 105.9
fhisp 105.0 103.4 *
dmar 105.0 105.7 *
Dep. Variable: | boy | No. Observations: | 3235798 |
---|---|---|---|
Model: | Logit | Df Residuals: | 3235792 |
Method: | MLE | Df Model: | 5 |
Date: | Tue, 17 May 2016 | Pseudo R-squ.: | 2.183e-05 |
Time: | 14:25:22 | Log-Likelihood: | -2.2418e+06 |
converged: | True | LL-Null: | -2.2419e+06 |
LLR p-value: | 1.485e-19 |
 | coef | std err | z | P>\|z\| | [95.0% Conf. Int.]
---|---|---|---|---|---
Intercept | 0.0492 | 0.003 | 14.375 | 0.000 | 0.042 0.056 |
C(fbrace)[T.2.0] | -0.0278 | 0.003 | -8.324 | 0.000 | -0.034 -0.021 |
C(fbrace)[T.3.0] | -0.0301 | 0.011 | -2.778 | 0.005 | -0.051 -0.009 |
C(fbrace)[T.4.0] | 0.0081 | 0.004 | 1.871 | 0.061 | -0.000 0.017 |
fhisp | -0.0156 | 0.003 | -5.270 | 0.000 | -0.021 -0.010 |
dmar | 0.0062 | 0.003 | 2.416 | 0.016 | 0.001 0.011 |
The effect of education disappears.
formula = ('boy ~ C(fbrace) + fhisp + lowed')
model = smf.logit(formula, data=df)
results = model.fit()
summarize(results)
results.summary()
Optimization terminated successfully.
Current function value: 0.692816
Iterations 3

C(fbrace)[T.2.0] 105.8 103.1 *
C(fbrace)[T.3.0] 105.8 102.8 *
C(fbrace)[T.4.0] 105.8 106.5
fhisp 105.8 104.2 *
lowed 105.8 106.0
Dep. Variable: | boy | No. Observations: | 3091385 |
---|---|---|---|
Model: | Logit | Df Residuals: | 3091379 |
Method: | MLE | Df Model: | 5 |
Date: | Tue, 17 May 2016 | Pseudo R-squ.: | 2.076e-05 |
Time: | 14:25:47 | Log-Likelihood: | -2.1418e+06 |
converged: | True | LL-Null: | -2.1418e+06 |
LLR p-value: | 1.130e-17 |
 | coef | std err | z | P>\|z\| | [95.0% Conf. Int.]
---|---|---|---|---|---
Intercept | 0.0566 | 0.001 | 37.993 | 0.000 | 0.054 0.060 |
C(fbrace)[T.2.0] | -0.0259 | 0.003 | -7.838 | 0.000 | -0.032 -0.019 |
C(fbrace)[T.3.0] | -0.0287 | 0.011 | -2.624 | 0.009 | -0.050 -0.007 |
C(fbrace)[T.4.0] | 0.0067 | 0.004 | 1.487 | 0.137 | -0.002 0.015 |
fhisp | -0.0152 | 0.003 | -4.927 | 0.000 | -0.021 -0.009 |
lowed | 0.0017 | 0.004 | 0.462 | 0.644 | -0.006 0.009 |
The effect of birth order disappears.
formula = ('boy ~ C(fbrace) + fhisp + highbo')
model = smf.logit(formula, data=df)
results = model.fit()
summarize(results)
results.summary()
Optimization terminated successfully.
Current function value: 0.692816
Iterations 3

C(fbrace)[T.2.0] 105.8 103.2 *
C(fbrace)[T.3.0] 105.8 102.9 *
C(fbrace)[T.4.0] 105.8 106.6
fhisp 105.8 104.4 *
highbo 105.8 105.6
Dep. Variable: | boy | No. Observations: | 3221819 |
---|---|---|---|
Model: | Logit | Df Residuals: | 3221813 |
Method: | MLE | Df Model: | 5 |
Date: | Tue, 17 May 2016 | Pseudo R-squ.: | 2.029e-05 |
Time: | 14:26:13 | Log-Likelihood: | -2.2321e+06 |
converged: | True | LL-Null: | -2.2322e+06 |
LLR p-value: | 5.072e-18 |
 | coef | std err | z | P>\|z\| | [95.0% Conf. Int.]
---|---|---|---|---|---
Intercept | 0.0566 | 0.001 | 38.815 | 0.000 | 0.054 0.060 |
C(fbrace)[T.2.0] | -0.0253 | 0.003 | -7.841 | 0.000 | -0.032 -0.019 |
C(fbrace)[T.3.0] | -0.0284 | 0.011 | -2.616 | 0.009 | -0.050 -0.007 |
C(fbrace)[T.4.0] | 0.0077 | 0.004 | 1.758 | 0.079 | -0.001 0.016 |
fhisp | -0.0139 | 0.003 | -4.785 | 0.000 | -0.020 -0.008 |
highbo | -0.0026 | 0.005 | -0.483 | 0.629 | -0.013 0.008 |
WIC is no longer associated with more girls.
formula = ('boy ~ C(fbrace) + fhisp + wic')
model = smf.logit(formula, data=df)
results = model.fit()
summarize(results)
results.summary()
Optimization terminated successfully.
Current function value: 0.692813
Iterations 3

C(fbrace)[T.2.0] 105.8 103.0 *
C(fbrace)[T.3.0] 105.8 103.0 *
C(fbrace)[T.4.0] 105.8 106.6
wic[T.Y] 105.8 106.1
fhisp 105.8 104.1 *
Dep. Variable: | boy | No. Observations: | 3040527 |
---|---|---|---|
Model: | Logit | Df Residuals: | 3040521 |
Method: | MLE | Df Model: | 5 |
Date: | Tue, 17 May 2016 | Pseudo R-squ.: | 2.175e-05 |
Time: | 14:27:01 | Log-Likelihood: | -2.1065e+06 |
converged: | True | LL-Null: | -2.1066e+06 |
LLR p-value: | 3.031e-18 |
 | coef | std err | z | P>\|z\| | [95.0% Conf. Int.]
---|---|---|---|---|---
Intercept | 0.0564 | 0.002 | 34.772 | 0.000 | 0.053 0.060 |
C(fbrace)[T.2.0] | -0.0271 | 0.003 | -7.892 | 0.000 | -0.034 -0.020 |
C(fbrace)[T.3.0] | -0.0267 | 0.011 | -2.405 | 0.016 | -0.048 -0.005 |
C(fbrace)[T.4.0] | 0.0076 | 0.005 | 1.670 | 0.095 | -0.001 0.016 |
wic[T.Y] | 0.0025 | 0.003 | 0.975 | 0.330 | -0.002 0.007 |
fhisp | -0.0161 | 0.003 | -5.153 | 0.000 | -0.022 -0.010 |
The effect of obesity disappears.
formula = ('boy ~ C(fbrace) + fhisp + obese')
model = smf.logit(formula, data=df)
results = model.fit()
summarize(results)
results.summary()
Optimization terminated successfully.
Current function value: 0.692815
Iterations 3

C(fbrace)[T.2.0] 105.9 103.3 *
C(fbrace)[T.3.0] 105.9 103.1 *
C(fbrace)[T.4.0] 105.9 106.5
fhisp 105.9 104.3 *
obese 105.9 105.7
Dep. Variable: | boy | No. Observations: | 3005073 |
---|---|---|---|
Model: | Logit | Df Residuals: | 3005067 |
Method: | MLE | Df Model: | 5 |
Date: | Tue, 17 May 2016 | Pseudo R-squ.: | 1.947e-05 |
Time: | 14:27:26 | Log-Likelihood: | -2.0820e+06 |
converged: | True | LL-Null: | -2.0820e+06 |
LLR p-value: | 5.013e-16 |
 | coef | std err | z | P>\|z\| | [95.0% Conf. Int.]
---|---|---|---|---|---
Intercept | 0.0571 | 0.002 | 35.622 | 0.000 | 0.054 0.060 |
C(fbrace)[T.2.0] | -0.0247 | 0.003 | -7.305 | 0.000 | -0.031 -0.018 |
C(fbrace)[T.3.0] | -0.0266 | 0.011 | -2.410 | 0.016 | -0.048 -0.005 |
C(fbrace)[T.4.0] | 0.0056 | 0.005 | 1.217 | 0.224 | -0.003 0.015 |
fhisp | -0.0151 | 0.003 | -4.996 | 0.000 | -0.021 -0.009 |
obese | -0.0014 | 0.003 | -0.524 | 0.600 | -0.007 0.004 |
The effect of payment method is diminished, but self-payment is still associated with more boys.
formula = ('boy ~ C(fbrace) + fhisp + C(pay_rec)')
model = smf.logit(formula, data=df)
results = model.fit()
summarize(results)
results.summary()
Optimization terminated successfully.
Current function value: 0.692812
Iterations 3

C(fbrace)[T.2.0] 106.1 103.3 *
C(fbrace)[T.3.0] 106.1 103.0 *
C(fbrace)[T.4.0] 106.1 106.7
C(pay_rec)[T.2.0] 106.1 105.7
C(pay_rec)[T.3.0] 106.1 108.3 *
C(pay_rec)[T.4.0] 106.1 105.4
fhisp 106.1 104.4 *
Dep. Variable: | boy | No. Observations: | 3086812 |
---|---|---|---|
Model: | Logit | Df Residuals: | 3086804 |
Method: | MLE | Df Model: | 7 |
Date: | Tue, 17 May 2016 | Pseudo R-squ.: | 2.500e-05 |
Time: | 14:28:14 | Log-Likelihood: | -2.1386e+06 |
converged: | True | LL-Null: | -2.1386e+06 |
LLR p-value: | 3.965e-20 |
 | coef | std err | z | P>\|z\| | [95.0% Conf. Int.]
---|---|---|---|---|---
Intercept | 0.0593 | 0.002 | 25.249 | 0.000 | 0.055 0.064 |
C(fbrace)[T.2.0] | -0.0271 | 0.003 | -7.980 | 0.000 | -0.034 -0.020 |
C(fbrace)[T.3.0] | -0.0297 | 0.011 | -2.696 | 0.007 | -0.051 -0.008 |
C(fbrace)[T.4.0] | 0.0056 | 0.004 | 1.239 | 0.216 | -0.003 0.014 |
C(pay_rec)[T.2.0] | -0.0043 | 0.003 | -1.680 | 0.093 | -0.009 0.001 |
C(pay_rec)[T.3.0] | 0.0203 | 0.006 | 3.331 | 0.001 | 0.008 0.032 |
C(pay_rec)[T.4.0] | -0.0063 | 0.006 | -1.094 | 0.274 | -0.018 0.005 |
fhisp | -0.0167 | 0.003 | -5.378 | 0.000 | -0.023 -0.011 |
But the number of prenatal visits is still a strong predictor of more girls.
formula = ('boy ~ C(fbrace) + fhisp + previs')
model = smf.logit(formula, data=df)
results = model.fit()
summarize(results)
results.summary()
Optimization terminated successfully.
Current function value: 0.692778
Iterations 3

C(fbrace)[T.2.0] 105.8 102.8 *
C(fbrace)[T.3.0] 105.8 102.3 *
C(fbrace)[T.4.0] 105.8 106.4
fhisp 105.8 104.0 *
previs 105.8 104.8 *
Dep. Variable: | boy | No. Observations: | 3155440 |
---|---|---|---|
Model: | Logit | Df Residuals: | 3155434 |
Method: | MLE | Df Model: | 5 |
Date: | Tue, 17 May 2016 | Pseudo R-squ.: | 7.997e-05 |
Time: | 14:28:40 | Log-Likelihood: | -2.1860e+06 |
converged: | True | LL-Null: | -2.1862e+06 |
LLR p-value: | 2.081e-73 |
 | coef | std err | z | P>\|z\| | [95.0% Conf. Int.]
---|---|---|---|---|---
Intercept | 0.0567 | 0.001 | 38.800 | 0.000 | 0.054 0.060 |
C(fbrace)[T.2.0] | -0.0295 | 0.003 | -9.008 | 0.000 | -0.036 -0.023 |
C(fbrace)[T.3.0] | -0.0341 | 0.011 | -3.114 | 0.002 | -0.056 -0.013 |
C(fbrace)[T.4.0] | 0.0058 | 0.004 | 1.314 | 0.189 | -0.003 0.014 |
fhisp | -0.0172 | 0.003 | -5.862 | 0.000 | -0.023 -0.011 |
previs | -0.0102 | 0.001 | -16.235 | 0.000 | -0.011 -0.009 |
And the effect is even stronger if we add a boolean to capture the nonlinearity at 0 visits.
formula = ('boy ~ C(fbrace) + fhisp + previs + no_previs')
model = smf.logit(formula, data=df)
results = model.fit()
summarize(results)
results.summary()
Optimization terminated successfully.
Current function value: 0.692776
Iterations 3

C(fbrace)[T.2.0] 105.9 102.8 *
C(fbrace)[T.3.0] 105.9 102.3 *
C(fbrace)[T.4.0] 105.9 106.5
fhisp 105.9 104.1 *
previs 105.9 104.7 *
no_previs 105.9 101.0 *
Dep. Variable: | boy | No. Observations: | 3155440 |
---|---|---|---|
Model: | Logit | Df Residuals: | 3155433 |
Method: | MLE | Df Model: | 6 |
Date: | Tue, 17 May 2016 | Pseudo R-squ.: | 8.351e-05 |
Time: | 14:29:06 | Log-Likelihood: | -2.1860e+06 |
converged: | True | LL-Null: | -2.1862e+06 |
LLR p-value: | 8.674e-76 |
 | coef | std err | z | P>\|z\| | [95.0% Conf. Int.]
---|---|---|---|---|---
Intercept | 0.0570 | 0.001 | 38.973 | 0.000 | 0.054 0.060 |
C(fbrace)[T.2.0] | -0.0294 | 0.003 | -8.984 | 0.000 | -0.036 -0.023 |
C(fbrace)[T.3.0] | -0.0342 | 0.011 | -3.123 | 0.002 | -0.056 -0.013 |
C(fbrace)[T.4.0] | 0.0056 | 0.004 | 1.270 | 0.204 | -0.003 0.014 |
fhisp | -0.0171 | 0.003 | -5.817 | 0.000 | -0.023 -0.011 |
previs | -0.0111 | 0.001 | -16.625 | 0.000 | -0.012 -0.010 |
no_previs | -0.0469 | 0.012 | -3.936 | 0.000 | -0.070 -0.024 |
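The `previs` and `no_previs` variables are defined earlier in the notebook; the output above suggests `previs` is centered near zero and `no_previs` flags births with no recorded prenatal visits. A hypothetical reconstruction (the raw column name `previs_raw` and the mean-centering are assumptions, not the notebook's actual code):

```python
import pandas as pd

def add_previs_vars(df, raw_col='previs_raw'):
    """Center the visit count and flag zero visits (hypothetical names)."""
    df = df.copy()
    # Indicator for no prenatal care at all.
    df['no_previs'] = (df[raw_col] == 0).astype(int)
    # Centered visit count, so 0 represents an average number of visits.
    df['previs'] = df[raw_col] - df[raw_col].mean()
    return df
```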
Now if we control for father's race and Hispanic origin as well as the number of prenatal visits, the effect of marriage disappears.
formula = ('boy ~ C(fbrace) + fhisp + previs + dmar')
model = smf.logit(formula, data=df)
results = model.fit()
summarize(results)
results.summary()
Optimization terminated successfully.
Current function value: 0.692778
Iterations 3

C(fbrace)[T.2.0] 105.3 102.1 *
C(fbrace)[T.3.0] 105.3 101.7 *
C(fbrace)[T.4.0] 105.3 106.0
fhisp 105.3 103.5 *
previs 105.3 104.3 *
dmar 105.3 105.7
Dep. Variable: | boy | No. Observations: | 3155440 |
---|---|---|---|
Model: | Logit | Df Residuals: | 3155433 |
Method: | MLE | Df Model: | 6 |
Date: | Tue, 17 May 2016 | Pseudo R-squ.: | 8.045e-05 |
Time: | 14:29:32 | Log-Likelihood: | -2.1860e+06 |
converged: | True | LL-Null: | -2.1862e+06 |
LLR p-value: | 6.525e-73 |
 | coef | std err | z | P>\|z\| | [95.0% Conf. Int.]
---|---|---|---|---|---
Intercept | 0.0521 | 0.003 | 15.015 | 0.000 | 0.045 0.059 |
C(fbrace)[T.2.0] | -0.0309 | 0.003 | -9.058 | 0.000 | -0.038 -0.024 |
C(fbrace)[T.3.0] | -0.0353 | 0.011 | -3.210 | 0.001 | -0.057 -0.014 |
C(fbrace)[T.4.0] | 0.0062 | 0.004 | 1.394 | 0.163 | -0.002 0.015 |
fhisp | -0.0181 | 0.003 | -6.033 | 0.000 | -0.024 -0.012 |
previs | -0.0102 | 0.001 | -16.122 | 0.000 | -0.011 -0.009 |
dmar | 0.0037 | 0.003 | 1.446 | 0.148 | -0.001 0.009 |
The effect of payment method disappears.
formula = ('boy ~ C(fbrace) + fhisp + previs + C(pay_rec)')
model = smf.logit(formula, data=df)
results = model.fit()
summarize(results)
results.summary()
Optimization terminated successfully.
Current function value: 0.692777
Iterations 3

C(fbrace)[T.2.0] 105.8 102.8 *
C(fbrace)[T.3.0] 105.8 102.2 *
C(fbrace)[T.4.0] 105.8 106.3
C(pay_rec)[T.2.0] 105.8 105.9
C(pay_rec)[T.3.0] 105.8 106.9
C(pay_rec)[T.4.0] 105.8 105.0
fhisp 105.8 104.0 *
previs 105.8 104.8 *
Dep. Variable: | boy | No. Observations: | 3009712 |
---|---|---|---|
Model: | Logit | Df Residuals: | 3009703 |
Method: | MLE | Df Model: | 8 |
Date: | Tue, 17 May 2016 | Pseudo R-squ.: | 8.163e-05 |
Time: | 14:30:20 | Log-Likelihood: | -2.0851e+06 |
converged: | True | LL-Null: | -2.0852e+06 |
LLR p-value: | 1.004e-68 |
 | coef | std err | z | P>\|z\| | [95.0% Conf. Int.]
---|---|---|---|---|---
Intercept | 0.0566 | 0.002 | 23.765 | 0.000 | 0.052 0.061 |
C(fbrace)[T.2.0] | -0.0295 | 0.003 | -8.509 | 0.000 | -0.036 -0.023 |
C(fbrace)[T.3.0] | -0.0345 | 0.011 | -3.090 | 0.002 | -0.056 -0.013 |
C(fbrace)[T.4.0] | 0.0046 | 0.005 | 1.012 | 0.312 | -0.004 0.014 |
C(pay_rec)[T.2.0] | 0.0005 | 0.003 | 0.174 | 0.862 | -0.005 0.006 |
C(pay_rec)[T.3.0] | 0.0100 | 0.006 | 1.619 | 0.105 | -0.002 0.022 |
C(pay_rec)[T.4.0] | -0.0074 | 0.006 | -1.260 | 0.208 | -0.019 0.004 |
fhisp | -0.0178 | 0.003 | -5.687 | 0.000 | -0.024 -0.012 |
previs | -0.0101 | 0.001 | -15.540 | 0.000 | -0.011 -0.009 |
Here, again, is the version with the boolean for no prenatal visits; it serves as the baseline for the next models.
formula = ('boy ~ C(fbrace) + fhisp + previs + no_previs')
model = smf.logit(formula, data=df)
results = model.fit()
summarize(results)
results.summary()
Optimization terminated successfully.
Current function value: 0.692776
Iterations 3

C(fbrace)[T.2.0] 105.9 102.8 *
C(fbrace)[T.3.0] 105.9 102.3 *
C(fbrace)[T.4.0] 105.9 106.5
fhisp 105.9 104.1 *
previs 105.9 104.7 *
no_previs 105.9 101.0 *
Dep. Variable: | boy | No. Observations: | 3155440 |
---|---|---|---|
Model: | Logit | Df Residuals: | 3155433 |
Method: | MLE | Df Model: | 6 |
Date: | Tue, 17 May 2016 | Pseudo R-squ.: | 8.351e-05 |
Time: | 14:30:47 | Log-Likelihood: | -2.1860e+06 |
converged: | True | LL-Null: | -2.1862e+06 |
LLR p-value: | 8.674e-76 |
 | coef | std err | z | P>\|z\| | [95.0% Conf. Int.]
---|---|---|---|---|---
Intercept | 0.0570 | 0.001 | 38.973 | 0.000 | 0.054 0.060 |
C(fbrace)[T.2.0] | -0.0294 | 0.003 | -8.984 | 0.000 | -0.036 -0.023 |
C(fbrace)[T.3.0] | -0.0342 | 0.011 | -3.123 | 0.002 | -0.056 -0.013 |
C(fbrace)[T.4.0] | 0.0056 | 0.004 | 1.270 | 0.204 | -0.003 0.014 |
fhisp | -0.0171 | 0.003 | -5.817 | 0.000 | -0.023 -0.011 |
previs | -0.0111 | 0.001 | -16.625 | 0.000 | -0.012 -0.010 |
no_previs | -0.0469 | 0.012 | -3.936 | 0.000 | -0.070 -0.024 |
Now, surprisingly, the mother's age has a small effect.
formula = ('boy ~ C(fbrace) + fhisp + previs + no_previs + mager9')
model = smf.logit(formula, data=df)
results = model.fit()
summarize(results)
results.summary()
Optimization terminated successfully.
Current function value: 0.692775
Iterations 3

C(fbrace)[T.2.0] 106.8 103.6 *
C(fbrace)[T.3.0] 106.8 103.1 *
C(fbrace)[T.4.0] 106.8 107.4
fhisp 106.8 104.9 *
previs 106.8 105.6 *
no_previs 106.8 101.9 *
mager9 106.8 106.6 *
Dep. Variable: | boy | No. Observations: | 3155440 |
---|---|---|---|
Model: | Logit | Df Residuals: | 3155432 |
Method: | MLE | Df Model: | 7 |
Date: | Tue, 17 May 2016 | Pseudo R-squ.: | 8.440e-05 |
Time: | 14:31:14 | Log-Likelihood: | -2.1860e+06 |
converged: | True | LL-Null: | -2.1862e+06 |
LLR p-value: | 1.043e-75 |
 | coef | std err | z | P>\|z\| | [95.0% Conf. Int.]
---|---|---|---|---|---
Intercept | 0.0656 | 0.005 | 14.344 | 0.000 | 0.057 0.075 |
C(fbrace)[T.2.0] | -0.0300 | 0.003 | -9.123 | 0.000 | -0.036 -0.024 |
C(fbrace)[T.3.0] | -0.0351 | 0.011 | -3.200 | 0.001 | -0.057 -0.014 |
C(fbrace)[T.4.0] | 0.0062 | 0.004 | 1.413 | 0.158 | -0.002 0.015 |
fhisp | -0.0176 | 0.003 | -5.974 | 0.000 | -0.023 -0.012 |
previs | -0.0110 | 0.001 | -16.456 | 0.000 | -0.012 -0.010 |
no_previs | -0.0468 | 0.012 | -3.926 | 0.000 | -0.070 -0.023 |
mager9 | -0.0019 | 0.001 | -1.970 | 0.049 | -0.004 -9.69e-06 |
So does the father's age. But both age effects are small and borderline significant.
formula = ('boy ~ C(fbrace) + fhisp + previs + no_previs + fagerrec11')
model = smf.logit(formula, data=df)
results = model.fit()
summarize(results)
results.summary()
Optimization terminated successfully.
Current function value: 0.692775
Iterations 3

C(fbrace)[T.2.0] 106.9 103.7 *
C(fbrace)[T.3.0] 106.9 103.2 *
C(fbrace)[T.4.0] 106.9 107.6
fhisp 106.9 105.0 *
previs 106.9 105.7 *
no_previs 106.9 101.8 *
fagerrec11 106.9 106.7 *
Dep. Variable: | boy | No. Observations: | 3148537 |
---|---|---|---|
Model: | Logit | Df Residuals: | 3148529 |
Method: | MLE | Df Model: | 7 |
Date: | Tue, 17 May 2016 | Pseudo R-squ.: | 8.517e-05 |
Time: | 14:32:34 | Log-Likelihood: | -2.1812e+06 |
converged: | True | LL-Null: | -2.1814e+06 |
LLR p-value: | 2.924e-76 |
 | coef | std err | z | P>\|z\| | [95.0% Conf. Int.]
---|---|---|---|---|---
Intercept | 0.0663 | 0.004 | 15.399 | 0.000 | 0.058 0.075 |
C(fbrace)[T.2.0] | -0.0299 | 0.003 | -9.100 | 0.000 | -0.036 -0.023 |
C(fbrace)[T.3.0] | -0.0348 | 0.011 | -3.170 | 0.002 | -0.056 -0.013 |
C(fbrace)[T.4.0] | 0.0067 | 0.004 | 1.518 | 0.129 | -0.002 0.015 |
fhisp | -0.0176 | 0.003 | -5.974 | 0.000 | -0.023 -0.012 |
previs | -0.0110 | 0.001 | -16.545 | 0.000 | -0.012 -0.010 |
no_previs | -0.0483 | 0.012 | -4.039 | 0.000 | -0.072 -0.025 |
fagerrec11 | -0.0019 | 0.001 | -2.278 | 0.023 | -0.003 -0.000 |
The predictive power of prenatal visits is still surprising to me. To make sure we've controlled for race, I'll select cases where both parents are white:
white = df[(df.mbrace==1) & (df.fbrace==1)]
len(white)
2400787
And compute sex ratios for each level of previs:
var = 'previs'
white[[var, 'boy']].groupby(var).aggregate(series_to_ratio)
previs | boy
---|---
-6 | 107
-5 | 110
-4 | 108
-3 | 110
-2 | 108
-1 | 107
0 | 105
1 | 103
2 | 103
3 | 102
4 | 103
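`series_to_ratio` is defined earlier in the notebook; based on the table above, it evidently converts a series of boy/girl indicators into boys per 100 girls. A minimal reconstruction might look like:

```python
def series_to_ratio(series):
    """Sex ratio: boys per 100 girls, rounded to the nearest integer.

    `series` holds 0/1 indicators (1 = boy), like the `boy` column.
    """
    boys = sum(series)
    girls = len(series) - boys
    return round(100 * boys / girls)
```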
The effect holds up. People with a below-average number of prenatal visits are substantially more likely to have boys.
formula = ('boy ~ previs + no_previs')
model = smf.logit(formula, data=white)
results = model.fit()
summarize(results)
results.summary()
Optimization terminated successfully.
Current function value: 0.692749
Iterations 3

previs 105.5 104.3 *
no_previs 105.5 100.4 *
Dep. Variable: | boy | No. Observations: | 2346785 |
---|---|---|---|
Model: | Logit | Df Residuals: | 2346782 |
Method: | MLE | Df Model: | 2 |
Date: | Tue, 17 May 2016 | Pseudo R-squ.: | 6.418e-05 |
Time: | 14:40:39 | Log-Likelihood: | -1.6257e+06 |
converged: | True | LL-Null: | -1.6258e+06 |
LLR p-value: | 4.790e-46 |
 | coef | std err | z | P>\|z\| | [95.0% Conf. Int.]
---|---|---|---|---|---
Intercept | 0.0534 | 0.001 | 40.728 | 0.000 | 0.051 0.056 |
previs | -0.0113 | 0.001 | -14.378 | 0.000 | -0.013 -0.010 |
no_previs | -0.0490 | 0.015 | -3.352 | 0.001 | -0.078 -0.020 |
inter = results.params['Intercept']
slope = results.params['previs']
inter, slope
(0.053449172473506806, -0.011302385985286368)
previs = np.arange(-5, 5)
logodds = inter + slope * previs
odds = np.exp(logodds)
odds * 100
array([ 111.62346508, 110.36895641, 109.12854687, 107.90207798, 106.68939307, 105.49033723, 104.30475728, 103.13250177, 101.97342096, 100.82736677])
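Another way to read the slope: exponentiating it gives the multiplicative change in the odds of a boy per additional prenatal visit, roughly a 1.1% drop per visit.

```python
import numpy as np

slope = -0.0113              # logit coefficient for previs, from above
factor = np.exp(slope)       # odds multiplier per additional visit
pct_drop = (1 - factor) * 100
print(round(pct_drop, 1))    # about 1.1 percent lower odds per visit
```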
formula = ('boy ~ dmar')
model = smf.logit(formula, data=white)
results = model.fit()
summarize(results)
results.summary()
Optimization terminated successfully.
Current function value: 0.692788
Iterations 3

dmar 105.3 105.5
Dep. Variable: | boy | No. Observations: | 2400787 |
---|---|---|---|
Model: | Logit | Df Residuals: | 2400785 |
Method: | MLE | Df Model: | 1 |
Date: | Tue, 17 May 2016 | Pseudo R-squ.: | 7.406e-08 |
Time: | 15:27:21 | Log-Likelihood: | -1.6632e+06 |
converged: | True | LL-Null: | -1.6632e+06 |
LLR p-value: | 0.6196 |
 | coef | std err | z | P>\|z\| | [95.0% Conf. Int.]
---|---|---|---|---|---
Intercept | 0.0518 | 0.004 | 13.234 | 0.000 | 0.044 0.059 |
dmar | 0.0014 | 0.003 | 0.496 | 0.620 | -0.004 0.007 |
formula = ('boy ~ lowed')
model = smf.logit(formula, data=white)
results = model.fit()
summarize(results)
results.summary()
Optimization terminated successfully.
Current function value: 0.692788
Iterations 3

lowed 105.6 105.0
Dep. Variable: | boy | No. Observations: | 2301234 |
---|---|---|---|
Model: | Logit | Df Residuals: | 2301232 |
Method: | MLE | Df Model: | 1 |
Date: | Tue, 17 May 2016 | Pseudo R-squ.: | 4.759e-07 |
Time: | 15:28:01 | Log-Likelihood: | -1.5943e+06 |
converged: | True | LL-Null: | -1.5943e+06 |
LLR p-value: | 0.2180 |
 | coef | std err | z | P>\|z\| | [95.0% Conf. Int.]
---|---|---|---|---|---
Intercept | 0.0542 | 0.001 | 38.603 | 0.000 | 0.051 0.057 |
lowed | -0.0051 | 0.004 | -1.232 | 0.218 | -0.013 0.003 |
formula = ('boy ~ highbo')
model = smf.logit(formula, data=white)
results = model.fit()
summarize(results)
results.summary()
Optimization terminated successfully.
Current function value: 0.692788
Iterations 3

highbo 105.5 105.6
Dep. Variable: | boy | No. Observations: | 2391630 |
---|---|---|---|
Model: | Logit | Df Residuals: | 2391628 |
Method: | MLE | Df Model: | 1 |
Date: | Tue, 17 May 2016 | Pseudo R-squ.: | 4.564e-09 |
Time: | 15:28:25 | Log-Likelihood: | -1.6569e+06 |
converged: | True | LL-Null: | -1.6569e+06 |
LLR p-value: | 0.9021 |
 | coef | std err | z | P>\|z\| | [95.0% Conf. Int.]
---|---|---|---|---|---
Intercept | 0.0535 | 0.001 | 40.493 | 0.000 | 0.051 0.056 |
highbo | 0.0008 | 0.006 | 0.123 | 0.902 | -0.012 0.013 |
formula = ('boy ~ wic')
model = smf.logit(formula, data=white)
results = model.fit()
summarize(results)
results.summary()
Optimization terminated successfully.
Current function value: 0.692786
Iterations 3

wic[T.Y] 105.6 105.3
Dep. Variable: | boy | No. Observations: | 2266424 |
---|---|---|---|
Model: | Logit | Df Residuals: | 2266422 |
Method: | MLE | Df Model: | 1 |
Date: | Tue, 17 May 2016 | Pseudo R-squ.: | 3.840e-07 |
Time: | 15:28:57 | Log-Likelihood: | -1.5701e+06 |
converged: | True | LL-Null: | -1.5701e+06 |
LLR p-value: | 0.2721 |
 | coef | std err | z | P>\|z\| | [95.0% Conf. Int.]
---|---|---|---|---|---
Intercept | 0.0548 | 0.002 | 33.369 | 0.000 | 0.052 0.058 |
wic[T.Y] | -0.0031 | 0.003 | -1.098 | 0.272 | -0.009 0.002 |
formula = ('boy ~ obese')
model = smf.logit(formula, data=white)
results = model.fit()
summarize(results)
results.summary()
Optimization terminated successfully.
Current function value: 0.692788
Iterations 3

obese 105.6 105.3
Dep. Variable: | boy | No. Observations: | 2244349 |
---|---|---|---|
Model: | Logit | Df Residuals: | 2244347 |
Method: | MLE | Df Model: | 1 |
Date: | Tue, 17 May 2016 | Pseudo R-squ.: | 1.725e-07 |
Time: | 15:29:20 | Log-Likelihood: | -1.5549e+06 |
converged: | True | LL-Null: | -1.5549e+06 |
LLR p-value: | 0.4639 |
 | coef | std err | z | P>\|z\| | [95.0% Conf. Int.]
---|---|---|---|---|---
Intercept | 0.0542 | 0.002 | 35.607 | 0.000 | 0.051 0.057 |
obese | -0.0023 | 0.003 | -0.732 | 0.464 | -0.009 0.004 |
formula = ('boy ~ C(pay_rec)')
model = smf.logit(formula, data=white)
results = model.fit()
summarize(results)
results.summary()
Optimization terminated successfully.
Current function value: 0.692786
Iterations 3

C(pay_rec)[T.2.0] 105.4 105.5
C(pay_rec)[T.3.0] 105.4 107.1 *
C(pay_rec)[T.4.0] 105.4 105.3
Dep. Variable: | boy | No. Observations: | 2295681 |
---|---|---|---|
Model: | Logit | Df Residuals: | 2295677 |
Method: | MLE | Df Model: | 3 |
Date: | Tue, 17 May 2016 | Pseudo R-squ.: | 1.666e-06 |
Time: | 15:30:06 | Log-Likelihood: | -1.5904e+06 |
converged: | True | LL-Null: | -1.5904e+06 |
LLR p-value: | 0.1511 |
 | coef | std err | z | P>\|z\| | [95.0% Conf. Int.]
---|---|---|---|---|---
Intercept | 0.0529 | 0.002 | 23.356 | 0.000 | 0.048 0.057 |
C(pay_rec)[T.2.0] | 0.0004 | 0.003 | 0.147 | 0.883 | -0.005 0.006 |
C(pay_rec)[T.3.0] | 0.0159 | 0.007 | 2.235 | 0.025 | 0.002 0.030 |
C(pay_rec)[T.4.0] | -0.0013 | 0.007 | -0.197 | 0.844 | -0.015 0.012 |
formula = ('boy ~ mager9')
model = smf.logit(formula, data=white)
results = model.fit()
summarize(results)
results.summary()
Optimization terminated successfully.
Current function value: 0.692786
Iterations 3

mager9 107.0 106.7 *
Dep. Variable: | boy | No. Observations: | 2400787 |
---|---|---|---|
Model: | Logit | Df Residuals: | 2400785 |
Method: | MLE | Df Model: | 1 |
Date: | Tue, 17 May 2016 | Pseudo R-squ.: | 2.516e-06 |
Time: | 15:30:32 | Log-Likelihood: | -1.6632e+06 |
converged: | True | LL-Null: | -1.6632e+06 |
LLR p-value: | 0.003813 |
 | coef | std err | z | P>\|z\| | [95.0% Conf. Int.]
---|---|---|---|---|---
Intercept | 0.0677 | 0.005 | 13.452 | 0.000 | 0.058 0.078 |
mager9 | -0.0032 | 0.001 | -2.893 | 0.004 | -0.005 -0.001 |
formula = ('boy ~ youngm + oldm')
model = smf.logit(formula, data=white)
results = model.fit()
summarize(results)
results.summary()
Optimization terminated successfully.
Current function value: 0.692787
Iterations 3

youngm[T.True] 105.6 105.5
oldm[T.True] 105.6 103.8 *
Dep. Variable: | boy | No. Observations: | 2400787 |
---|---|---|---|
Model: | Logit | Df Residuals: | 2400784 |
Method: | MLE | Df Model: | 2 |
Date: | Tue, 17 May 2016 | Pseudo R-squ.: | 1.549e-06 |
Time: | 15:31:04 | Log-Likelihood: | -1.6632e+06 |
converged: | True | LL-Null: | -1.6632e+06 |
LLR p-value: | 0.07608 |
 | coef | std err | z | P>\|z\| | [95.0% Conf. Int.]
---|---|---|---|---|---
Intercept | 0.0542 | 0.001 | 40.370 | 0.000 | 0.052 0.057 |
youngm[T.True] | -0.0011 | 0.006 | -0.170 | 0.865 | -0.013 0.011 |
oldm[T.True] | -0.0173 | 0.008 | -2.268 | 0.023 | -0.032 -0.002 |
formula = ('boy ~ youngf + oldf')
model = smf.logit(formula, data=white)
results = model.fit()
summarize(results)
results.summary()
Optimization terminated successfully.
Current function value: 0.692787
Iterations 3

youngf 105.5 106.4
oldf 105.5 105.7
Dep. Variable: | boy | No. Observations: | 2396141 |
---|---|---|---|
Model: | Logit | Df Residuals: | 2396138 |
Method: | MLE | Df Model: | 2 |
Date: | Tue, 17 May 2016 | Pseudo R-squ.: | 2.717e-07 |
Time: | 15:31:50 | Log-Likelihood: | -1.6600e+06 |
converged: | True | LL-Null: | -1.6600e+06 |
LLR p-value: | 0.6370 |
 | coef | std err | z | P>\|z\| | [95.0% Conf. Int.]
---|---|---|---|---|---
Intercept | 0.0534 | 0.001 | 40.229 | 0.000 | 0.051 0.056 |
youngf | 0.0082 | 0.009 | 0.924 | 0.355 | -0.009 0.026 |
oldf | 0.0018 | 0.008 | 0.242 | 0.809 | -0.013 0.017 |