Validation of Articulation tests

Questions and Background

Students can take either the Arizona or the Goldman-Fristoe articulation test
Question: Can these two tests be combined?
Similar students should score the same on either test.

How should we define "similar students"?
For now I will consider the variables included in our population characteristics summary table

Covariate | REDCap Variable
Age (Arizona) | age_test_aaps
Age (Goldman-Fristoe) | age_test_gf2
Type of hearing loss | type_hl_as, type_hl_ad
Degree of hearing loss | degree_hl
Technology | tech_as, tech_ad
Gender | male
Race | race
Primary language spoken | prim_lang
Number of children in home | sibs
Age of hearing loss diagnosis | onset_1
Age first amplified | age_amp
Age of intervention | age_int
Age enrolled in OPTION | age
Length of time in OPTION | time
Known cause of hearing loss | med_cause=0
Known syndrome | synd_cause=0
Another diagnosed disability | etiology
Another disability impacts ability to learn | etiology_2
Other services outside of OPTION | otherserv

How will we assess this?

  • Fit a linear model with score as the outcome and test type as a predictor, controlling for the covariates noted above
  • If a student took both tests, we could assess how well one test score predicts the other

Statistical considerations:

  • Many covariates are listed above. Assess the sample size to determine the available degrees of freedom.
  • How to handle students with multiple observations?
    • For now will take last observation
  • How to handle students with both tests?
    • For now only looking at students who took one test
    • Could try to predict the Arizona score from the Goldman-Fristoe score and vice versa
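
A minimal sketch of the two handling rules above (drop visits with neither test, flag visits with both, keep the last observation per student) on an invented toy frame; column names mirror the REDCap export, but the records are made up:

```python
import pandas as pd
import numpy as np

# Invented records: two visits for student 1, one each for students 2 and 3
df = pd.DataFrame({
    'study_id': [1, 1, 2, 3],
    'aaps_ss':  [np.nan, 88.0, 95.0, np.nan],   # Arizona standard score
    'gf2_ss':   [90.0, np.nan, 101.0, np.nan],  # Goldman-Fristoe standard score
})

# Drop visits where neither test was taken
took_any = df.dropna(subset=['aaps_ss', 'gf2_ss'], how='all')

# Flag visits where both tests were taken
took_any = took_any.assign(
    both=(took_any.aaps_ss.notnull() & took_any.gf2_ss.notnull()).astype(int))

# Keep the last observation per student
# (note: GroupBy.last() takes the last non-null value per column)
last = took_any.groupby('study_id').last()
print(last)
```

Student 3, who took neither test, drops out entirely; students 1 and 2 are reduced to one row each.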

Data Prep

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

CHRIS = False
In [2]:
# %pwd
In [3]:
if CHRIS:
    print("change path to demographics file")
else:
    %run '/Users/alicetoll/Documents/OPTION/LSL-DR/Demographics.ipynb'
False    11601
True      2567
dtype: int64
There are 716 null values for non_english
_mother_ed:
6    5181
4    3022
3    2037
5    1628
2    1421
1     490
0     213
dtype: int64
mother_ed:
1    3458
2    3022
3    1628
0     703
dtype: int64

There are 6073 null values for mother_ed
There are 3494 null values for premature_weeks
oad:
0    4627
1       4
2       2
dtype: int64
There are 1715 null values for OAD

hearing_aid:
2    2162
0    1638
1     802
dtype: int64
There are 1768 null values for hearing_aid

cochlear:
0    3081
2     918
1     634
dtype: int64
There are 1715 null values for cochlear
14884
There are 0 null values for tech
_race:
0    7766
2    2528
1    1358
3    1042
6     721
8     527
7     239
4      65
5      33
dtype: int64
race:
0    7766
2    2528
1    1358
4    1346
3    1042
dtype: int64
There are 844 null values for race
/Users/alicetoll/anaconda/envs/py34/lib/python3.3/site-packages/pandas/io/parsers.py:1170: DtypeWarning: Columns (46) have mixed types. Specify dtype option on import or set low_memory=False.
  data = self._reader.read(nrows)
In [4]:
if CHRIS:
    demographic = pd.read_csv('demographic.csv')
else:
    demographic = pd.read_pickle('demographic.pkl')
In [5]:
covar_fields = ['study_id', 'redcap_event_name', 'male', 'race', 'prim_lang', 'sib', 'onset_1',
                'age_amp', 'age_int', 'age', 'time', 'med_cause', 'synd_cause', 'etiology', 'etiology_2',
                'otherserv', 'type_hl_ad', 'type_hl_as']

covar = demographic[covar_fields]

# Need to create vars and add: 
# - type of hearing loss
# - degree of hearing loss
# - technology
# - CI Manufacturer

# Do we want to collapse race?

# Age is in articulation_fields
# - age_test_aaps = Arizona
# - age_test_gf2 = Goldman-Fristoe
In [6]:
# Type of hearing loss
# - Bilateral SNHL
# - Unilateral SNHL
# - Bilateral Auditory Neuropathy
# - Unilateral Auditory Neuropathy

covar['bilateral_snhl'] = 0
covar.loc[(covar.type_hl_ad == 0) & (covar.type_hl_as == 0), 'bilateral_snhl'] = 1

covar['unilateral_snhl'] = 0
covar.loc[(covar.type_hl_ad == 0) & (covar.type_hl_as == 4), 'unilateral_snhl'] = 1
covar.loc[(covar.type_hl_ad == 4) & (covar.type_hl_as == 0), 'unilateral_snhl'] = 1

covar['bilateral_an'] = 0
covar.loc[(covar.type_hl_ad == 3) & (covar.type_hl_as == 3), 'bilateral_an'] = 1

covar['unilateral_an'] = 0
covar.loc[(covar.type_hl_ad == 0) & (covar.type_hl_as == 3), 'unilateral_an'] = 1
covar.loc[(covar.type_hl_ad == 3) & (covar.type_hl_as == 0), 'unilateral_an'] = 1
/Users/alicetoll/anaconda/envs/py34/lib/python3.3/site-packages/IPython/kernel/__main__.py:7: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
In [7]:
# - Bilateral SNHL
# - Unilateral SNHL
# - Bilateral Auditory Neuropathy
# - Unilateral Auditory Neuropathy
In [8]:
bi_snhl = covar.bilateral_snhl == 1
uni_snhl = covar.unilateral_snhl == 1
bi_an = covar.bilateral_an == 1
uni_an = covar.unilateral_an == 1

covar.loc[bi_snhl, 'type_hl'] = "bi_snhl"
covar.loc[uni_snhl, 'type_hl'] = "uni_snhl"
covar.loc[bi_an, 'type_hl'] = "bi_an"
covar.loc[uni_an, 'type_hl'] = "uni_an"
/Users/alicetoll/anaconda/envs/py34/lib/python3.3/site-packages/pandas/core/indexing.py:249: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[key] = _infer_fill_value(value)
In [9]:
articulation_fields = ['study_id','redcap_event_name', 'age_test_aaps','aaps_ss','age_test_gf2','gf2_ss']
articulation = lsl_dr_project.export_records(fields=articulation_fields, format='df', 
                                           df_kwargs={'index_col':None,
                                                      'na_values':[999, 9999]})
In [10]:
# articulation.head()
In [11]:
# create indicator variable if a student took either test
articulation.loc[articulation.aaps_ss.notnull() | articulation.gf2_ss.notnull(), 'test'] = 1
# drop observations when neither test was taken
temp = articulation.dropna(subset = ['test'])
# Can drop test variable
temp = temp.drop('test', 1)
# Create new variable if student took both tests in one observation
temp['both'] = 0
temp.loc[temp.aaps_ss.notnull() & temp.gf2_ss.notnull(), 'both'] = 1
print(temp.both.value_counts()) # 73 students took both tests, 5716 took only one
# temp.head()
0    5760
1      73
dtype: int64

Merge covariates and articulation data frames

In [12]:
# merge
temp2 = pd.merge(temp, covar, on=['study_id', 'redcap_event_name'])
In [13]:
AAPS = temp2.aaps_ss.notnull()
GF2 = temp2.gf2_ss.notnull()

temp2.loc[AAPS, "test_name"] = "AAPS"
temp2.loc[GF2, "test_name"] = "GF2"

Create datasets containing students who took either one test or both tests in a single year. If a student has multiple observations, take the last observation in each dataset.

In [14]:
# One test
single = temp2.loc[temp2.both==0,]
a = single.shape[0]
single = single.groupby('study_id').last()
b = single.shape[0]
print('We have', a, 'observations where a student took one test in a single year, but only', b, 'unique students')

# Both tests
both = temp2.loc[temp2.both==1,]
a = both.shape[0]
both = both.groupby('study_id').last()
b = both.shape[0]
print('We have', a, 'observations where a student took both tests in a single year, but only', b, 'unique students')
We have 5760 observations where a student took one test in a single year, but only 2730 unique students
We have 73 observations where a student took both tests in a single year, but only 55 unique students
In [15]:
del temp
del temp2

Students who only took one test (within a single year)

In [16]:
# Create variables to hold test score and age, and indicator for test name

AAPS = single.aaps_ss.notnull()
GF2 = single.gf2_ss.notnull()

single.loc[AAPS, "test_name"] = "AAPS"
single.loc[GF2, "test_name"] = "GF2"

single.loc[AAPS, "score"] = single.aaps_ss
single.loc[GF2, "score"] = single.gf2_ss

single.loc[AAPS, "age_test"] = single.age_test_aaps
single.loc[GF2, "age_test"] = single.age_test_gf2

# temp.head()

# single = single.drop(['age_test_aaps', 'aaps_ss', 'age_test_gf2', 'gf2_ss', 'both'], 1)
# artShort1.head()
In [17]:
# Number of observations
single.shape[0]
Out[17]:
2730
In [18]:
# Missingness
len(single.index)-single.count()
Out[18]:
redcap_event_name       0
age_test_aaps        2448
aaps_ss              2448
age_test_gf2          181
gf2_ss                180
both                    0
male                   22
race                   69
prim_lang              42
sib                   247
onset_1               913
age_amp               973
age_int              1063
age                    46
time                  170
med_cause              57
synd_cause             60
etiology               48
etiology_2            527
otherserv             974
type_hl_ad            429
type_hl_as            409
bilateral_snhl          0
unilateral_snhl         0
bilateral_an            0
unilateral_an           0
type_hl               780
test_name               0
score                   0
age_test                1
dtype: int64

Plots

In [19]:
%matplotlib inline
import seaborn as sns
# sns.set(style="white", color_codes=True)
In [20]:
width = 3
sns.kdeplot(np.array(single[single.test_name == 'GF2'].score), bw=width, label = "Goldman-Fristoe")
sns.kdeplot(np.array(single[single.test_name == 'AAPS'].score), bw=width, label = "Arizona")
Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x10dcc5850>
In [21]:
single.type_hl.value_counts()
Out[21]:
bi_snhl     1811
bi_an         73
uni_snhl      57
uni_an         9
dtype: int64

We may want to consider looking at just bilateral SNHL vs. other, based on sample size.
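
That collapse could be sketched as a single indicator; `bi_vs_other` is a hypothetical name (not a REDCap field), and the values here are invented:

```python
import pandas as pd

# Hypothetical type_hl values; collapse to bilateral SNHL vs. everything else
type_hl = pd.Series(['bi_snhl', 'uni_snhl', 'bi_an', 'bi_snhl'])
bi_vs_other = (type_hl == 'bi_snhl').astype(int)
print(bi_vs_other.tolist())  # → [1, 0, 0, 1]
```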

In [22]:
# Kernel density estimates broken out by type of hearing loss

plt.figure()

# Bilateral SNHL
plt.subplot(2,2,1)
sns.kdeplot(np.array(single.loc[(single.test_name == 'GF2') & (single.bilateral_snhl == 1), 'score']), 
                bw=width)
sns.kdeplot(np.array(single.loc[(single.test_name == 'AAPS') & (single.bilateral_snhl == 1), 'score']), 
                bw=width)
plt.title('Bilateral SNHL')
plt.ylabel('Density')
plt.xlabel('Standardized Score')

# Unilateral SNHL
plt.subplot(2,2,2)
sns.kdeplot(np.array(single.loc[(single.test_name == 'GF2') & (single.unilateral_snhl == 1), 'score']), 
                bw=width)
sns.kdeplot(np.array(single.loc[(single.test_name == 'AAPS') & (single.unilateral_snhl == 1), 'score']), 
                bw=width)
plt.title('Unilateral SNHL')
plt.ylabel('Density')
plt.xlabel('Standardized Score')

# Bilateral AN
plt.subplot(2,2,3)
sns.kdeplot(np.array(single.loc[(single.test_name == 'GF2') & (single.bilateral_an == 1), 'score']), 
                bw=width, label = "Goldman-Fristoe")
sns.kdeplot(np.array(single.loc[(single.test_name == 'AAPS') & (single.bilateral_an == 1), 'score']), 
                bw=width, label = "Arizona")
plt.title('Bilateral Auditory Neuropathy')
plt.ylabel('Density')
plt.xlabel('Standardized Score')

# Unilateral AN
plt.subplot(2,2,4)
sns.kdeplot(np.array(single.loc[(single.test_name == 'GF2') & (single.unilateral_an == 1), 'score']), 
                bw=width)
# no obs for Arizona with Unilateral AN
# sns.kdeplot(np.array(single.loc[(single.test_name == 'AAPS') & (single.unilateral_an == 1), 'score']), 
#                bw=width, label = "Arizona")
plt.title('Unilateral Auditory Neuropathy')
plt.ylabel('Density')
plt.xlabel('Standardized Score')


plt.tight_layout()

Want to make this a function:

def mykdeplot(df, var, width):
    sns.kdeplot(np.array(df.loc[(df.test_name == 'GF2') & (df[var] == 1), 'score']),
                bw=width, label="Goldman-Fristoe")
    sns.kdeplot(np.array(df.loc[(df.test_name == 'AAPS') & (df[var] == 1), 'score']),
                bw=width, label="Arizona")

mykdeplot(single, 'bilateral_an', 5)

Linear Model

# https://github.com/justmarkham/DAT4/blob/master/notebooks/08_linear_regression.ipynb
# this is the standard import if you're using "formula notation" (similar to R)
import statsmodels.formula.api as smf

# create a fitted model in one line
lm = smf.ols(formula='score ~ test_name + age_test + male', data=single).fit()

# print the coefficients
print(lm.params)
# print the confidence intervals for the model coefficients
print(lm.conf_int())
In [23]:
import statsmodels.formula.api as smf

Linear model predicting score from test name, age, gender, and type of hearing loss.

In [24]:
lm = smf.ols(formula="""score ~ test_name + age_test + male + 
                        bilateral_snhl + unilateral_snhl + bilateral_an + unilateral_an""", 
             data=single).fit()
lm.summary()
Out[24]:
OLS Regression Results
Dep. Variable: score R-squared: 0.019
Model: OLS Adj. R-squared: 0.016
Method: Least Squares F-statistic: 7.470
Date: Sun, 17 Apr 2016 Prob (F-statistic): 6.29e-09
Time: 18:46:52 Log-Likelihood: -12016.
No. Observations: 2707 AIC: 2.405e+04
Df Residuals: 2699 BIC: 2.409e+04
Df Model: 7
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept 87.8178 1.979 44.369 0.000 83.937 91.699
test_name[T.GF2] 1.4962 1.589 0.942 0.346 -1.619 4.612
age_test -0.0570 0.013 -4.545 0.000 -0.082 -0.032
male 0.1904 0.792 0.241 0.810 -1.362 1.742
bilateral_snhl -0.5525 0.880 -0.628 0.530 -2.279 1.174
unilateral_snhl 9.0884 2.868 3.169 0.002 3.465 14.711
bilateral_an -7.9264 2.547 -3.112 0.002 -12.921 -2.932
unilateral_an -17.2806 6.879 -2.512 0.012 -30.769 -3.792
Omnibus: 180.329 Durbin-Watson: 1.756
Prob(Omnibus): 0.000 Jarque-Bera (JB): 204.964
Skew: -0.653 Prob(JB): 3.11e-45
Kurtosis: 2.666 Cond. No. 1.38e+03

From the output above, we see that test_name is not significant (p = 0.346). This tells us that after accounting for the student's age at testing, gender, and type of hearing loss, knowing which test the student took does not help us predict their score. This suggests that similar students score about the same regardless of which test they took.

We need to determine which variables should be included in our model.

Students who took both tests (within a single year)

Introduction

We currently treat standardized Arizona and Goldman-Fristoe test scores the same. The goal of this analysis is to relate the two tests so that scores can be transformed for future modeling purposes.

We only have 55 observations (before accounting for missingness). That leaves us with only 3-6 degrees of freedom for modeling. A continuous variable uses one degree of freedom, and a categorical variable uses one less than its number of categories.

My initial intuition was to model the Arizona score as a function of the Goldman-Fristoe score, age, type of hearing loss, and whether the student receives other learning services. Unfortunately, with this small sample the observations for type of hearing loss and other services are rather homogeneous, so I do not believe they will be very helpful in predicting the Arizona score.

We may want to allow for nonlinearity by including a spline term for the Goldman-Fristoe score. A spline with 3 knots would use 3 degrees of freedom.
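
As a check on the degrees-of-freedom arithmetic, patsy (the formula engine statsmodels uses for terms like bs() and cr()) shows how many model columns each term actually generates; the scores here are invented:

```python
import numpy as np
from patsy import dmatrix

scores = np.linspace(50, 110, 30)  # invented Goldman-Fristoe standard scores

linear = dmatrix("x", {"x": scores})         # intercept + 1 column  -> 1 model df
spline = dmatrix("bs(x, 3)", {"x": scores})  # intercept + 3 columns -> 3 model df
print(linear.shape, spline.shape)
```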

The candidate models considered are summarized in the subsection titled Models.

Variables that were eliminated from consideration due to homogeneity

  • type_hl (47 out of 55 students have bilateral SNHL)
  • etiology (42 out of 55 have another diagnosed disability)
  • otherserv (35 out of 55 do not receive other educational services; 12 out of 55 are missing)

Response: Arizona scores (aaps_ss)

Considered Covariate | REDCap Variable | Degrees of Freedom
Goldman-Fristoe score | gf2_ss | 1
Goldman-Fristoe score, spline | gf2_ss | 3
Age (Arizona) | age_test_aaps | 1
Age (Goldman-Fristoe) | age_test_gf2 | 1
Gender | male | 1
Race (White/Non-White) | white | 1
Age of hearing loss diagnosis | onset_1 | 1
Length of time in OPTION | time | 1
Another disability impacts ability to learn | etiology_impact | 1

Potential variables for modeling

In [25]:
# With 55 observations (before missingness), we really only have 3-6 degrees of freedom to use in a model. 
both.shape
Out[25]:
(55, 28)
In [26]:
# Arizona is more conservative towards "normal" (85)
# May want to consider a spline for Goldman-Fristoe
width = 4
sns.kdeplot(np.array(both.gf2_ss), bw=width, label = "Goldman-Fristoe")
sns.kdeplot(np.array(both.aaps_ss), bw=width, label = "Arizona")
plt.ylabel('Density')
plt.xlabel('Standardized Score')
Out[26]:
<matplotlib.text.Text at 0x10c25ae10>
In [27]:
# Large age range (in months).  May want to use a spline for age
# Don't have enough degrees of freedom to spline both variables
sns.kdeplot(both.age_test_gf2, bw=5, label = "Goldman-Fristoe")
sns.kdeplot(both.age_test_aaps, bw=5, label = "Arizona")
plt.ylabel('Density')
plt.xlabel('Age (months)')
Out[27]:
<matplotlib.text.Text at 0x1110c81d0>
In [28]:
# Race has a lot of categories
# To preserve degrees of freedom we can reduce to white/non-white
both.race.value_counts(dropna=False, sort=False)
# 0 = Caucasian
# 1 = Black or African American
# 2 = Hispanic or Latino
# 3 = Asian
# 4 = Native Hawaiians/Pacific Islanders
# 5 = American Indian/Alaska Natives
# 6 = Multiracial
# 7 = Unknown
# 8 = Other
Out[28]:
NaN     2
 0     26
 1      6
 2      8
 3      4
 4      9
dtype: int64
In [29]:
both['white'] = np.nan
both.loc[(both.race == 0), 'white'] = 1
both.loc[(both.race !=0) & (pd.notnull(both.race)), 'white'] = 0
both.white.value_counts(dropna=False, sort=False)
Out[29]:
NaN     2
 0     27
 1     26
dtype: int64
In [30]:
# Etiology 2: impact of the other condition(s) that contribute to the child's overall learning abilities
# To preserve degrees of freedom we can reduce to impacts (1, 2, 3) and doesn't impact (0, 4)
both.etiology_2.value_counts(dropna=False, sort=False)
# 0 = None
# 1 = Mild
# 2 = Moderate
# 3 = Severe
# 4 = N/A (no additional disability suspected or identified)
Out[30]:
NaN     1
 0     11
 1      5
 2     12
 4     26
dtype: int64
In [31]:
both['etiology_impact'] = np.nan
both.loc[(both.etiology_2 == 0) | (both.etiology_2 == 4), 'etiology_impact'] = 0
both.loc[(both.etiology_2 == 1) | (both.etiology_2 == 2) | (both.etiology_2 == 3), 'etiology_impact'] = 1
both.etiology_impact.value_counts(dropna=False)
Out[31]:
 0     37
 1     17
NaN     1
dtype: int64
In [32]:
# Amount of time enrolled in OPTION: 2 missing obs
len(both.time.index)-both.time.count()
Out[32]:
2
In [33]:
# Age of onset: 10 missing obs
# This is a lot of observations to sacrifice
# Would be good if we could use multiple imputation
len(both.onset_1.index)-both.onset_1.count()
Out[33]:
10

Variables eliminated due to homogeneity

In [34]:
# Most of the students who took both tests have bilateral SNHL
# 4 observations are missing
# Because a large majority of the students only have bilateral SNHL, it will not be a useful predictor.
both.type_hl.value_counts(dropna=False)
Out[34]:
bi_snhl     47
NaN          4
bi_an        3
uni_snhl     1
dtype: int64
In [35]:
# 42 out of 55 do not have another diagnosed disability
# note: this doesn't agree with etiology_2 coding
both.etiology.value_counts(dropna=False, sort=False)
Out[35]:
0     9
1    42
2     1
3     3
dtype: int64
In [36]:
# Other services: 12 missing obs
# Most students do not receive other services
both.otherserv.value_counts(dropna=False, sort=False)
# 0 = Yes
# 1 = No
# 2 = Unknown
Out[36]:
NaN    12
 0      3
 1     35
 2      5
dtype: int64

Models

Model Name | Variables | n obs | df | R-squared
m1 | gf2_ss + age_test_gf2 + age_test_aaps + male + white + etiology_impact | 52 | 6 | 0.782
m2 | gf2_ss + age_test_gf2 + age_test_aaps | 55 | 3 | 0.771
m3 | bs(gf2_ss, 3) + age_test_gf2 + age_test_aaps | 55 | 5 | 0.781
m4 | cr(gf2_ss, df=3, constraints='center') + age_test_gf2 + age_test_aaps | 55 | 5 | 0.780
m5 | gf2_ss + age_test_gf2 + age_test_aaps + time + onset_1 | 45 | 5 | 0.877
m6 | gf2_ss + age_test_gf2 + age_test_aaps + time | 52 | 4 | 0.819
In [37]:
m1 = smf.ols(formula="""aaps_ss ~ gf2_ss + age_test_gf2 + age_test_aaps + 
                        male + white + etiology_impact""", 
             data=both).fit()
m1.summary()
Out[37]:
OLS Regression Results
Dep. Variable: aaps_ss R-squared: 0.782
Model: OLS Adj. R-squared: 0.753
Method: Least Squares F-statistic: 26.85
Date: Sun, 17 Apr 2016 Prob (F-statistic): 2.44e-13
Time: 18:46:54 Log-Likelihood: -158.30
No. Observations: 52 AIC: 330.6
Df Residuals: 45 BIC: 344.3
Df Model: 6
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept 53.1924 5.154 10.320 0.000 42.811 63.574
gf2_ss 0.4439 0.038 11.787 0.000 0.368 0.520
age_test_gf2 -0.3120 0.147 -2.125 0.039 -0.608 -0.016
age_test_aaps 0.2512 0.134 1.881 0.066 -0.018 0.520
male 0.5687 1.645 0.346 0.731 -2.745 3.882
white 1.7043 1.638 1.041 0.304 -1.594 5.003
etiology_impact -2.0195 1.702 -1.187 0.242 -5.447 1.408
Omnibus: 2.316 Durbin-Watson: 2.265
Prob(Omnibus): 0.314 Jarque-Bera (JB): 1.555
Skew: -0.403 Prob(JB): 0.460
Kurtosis: 3.263 Cond. No. 880.
In [38]:
m2 = smf.ols(formula="""aaps_ss ~ gf2_ss + age_test_gf2 + age_test_aaps""", 
             data=both).fit()
m2.summary()
Out[38]:
OLS Regression Results
Dep. Variable: aaps_ss R-squared: 0.771
Model: OLS Adj. R-squared: 0.757
Method: Least Squares F-statistic: 57.12
Date: Sun, 17 Apr 2016 Prob (F-statistic): 2.52e-16
Time: 18:46:54 Log-Likelihood: -168.24
No. Observations: 55 AIC: 344.5
Df Residuals: 51 BIC: 352.5
Df Model: 3
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept 54.8203 4.437 12.354 0.000 45.912 63.729
gf2_ss 0.4339 0.034 12.623 0.000 0.365 0.503
age_test_gf2 -0.3497 0.135 -2.588 0.013 -0.621 -0.078
age_test_aaps 0.2834 0.124 2.280 0.027 0.034 0.533
Omnibus: 2.865 Durbin-Watson: 2.272
Prob(Omnibus): 0.239 Jarque-Bera (JB): 2.027
Skew: -0.442 Prob(JB): 0.363
Kurtosis: 3.320 Cond. No. 798.
In [39]:
# Use to print anova tables (for when we use splines)
import statsmodels.api as sm
In [40]:
# bs() B-spline
m3 = smf.ols(formula="""aaps_ss ~ bs(gf2_ss, 3) + age_test_gf2 + age_test_aaps""", 
             data=both).fit()
m3.summary()
Out[40]:
OLS Regression Results
Dep. Variable: aaps_ss R-squared: 0.781
Model: OLS Adj. R-squared: 0.759
Method: Least Squares F-statistic: 35.01
Date: Sun, 17 Apr 2016 Prob (F-statistic): 4.62e-15
Time: 18:46:54 Log-Likelihood: -166.93
No. Observations: 55 AIC: 345.9
Df Residuals: 49 BIC: 357.9
Df Model: 5
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept 75.4249 4.181 18.040 0.000 67.023 83.827
bs(gf2_ss, 3)[0] -1.3268 7.914 -0.168 0.868 -17.230 14.576
bs(gf2_ss, 3)[1] 22.9797 5.155 4.458 0.000 12.620 33.339
bs(gf2_ss, 3)[2] 27.7088 4.170 6.645 0.000 19.330 36.088
age_test_gf2 -0.3629 0.136 -2.671 0.010 -0.636 -0.090
age_test_aaps 0.2951 0.126 2.342 0.023 0.042 0.548
Omnibus: 3.445 Durbin-Watson: 2.254
Prob(Omnibus): 0.179 Jarque-Bera (JB): 2.473
Skew: -0.464 Prob(JB): 0.290
Kurtosis: 3.465 Cond. No. 1.30e+03
In [41]:
sm.stats.anova_lm(m3, typ=2)
Out[41]:
sum_sq df F PR(>F)
bs(gf2_ss, 3) 4634.360674 3 54.312079 1.308633e-15
age_test_gf2 202.877916 1 7.132842 1.024204e-02
age_test_aaps 156.014750 1 5.485213 2.328615e-02
Residual 1393.696551 49 NaN NaN
In [42]:
# cr() natural cubic spline
m4 = smf.ols(formula="""aaps_ss ~ cr(gf2_ss, df=3, constraints='center') + age_test_gf2 + age_test_aaps""", 
             data=both).fit()
m4.summary()
Out[42]:
OLS Regression Results
Dep. Variable: aaps_ss R-squared: 0.780
Model: OLS Adj. R-squared: 0.757
Method: Least Squares F-statistic: 34.68
Date: Sun, 17 Apr 2016 Prob (F-statistic): 5.52e-15
Time: 18:46:54 Log-Likelihood: -167.14
No. Observations: 55 AIC: 346.3
Df Residuals: 49 BIC: 358.3
Df Model: 5
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept 90.0725 3.236 27.838 0.000 83.570 96.575
cr(gf2_ss, df=3, constraints='center')[0] -0.6435 1.915 -0.336 0.738 -4.491 3.204
cr(gf2_ss, df=3, constraints='center')[1] 14.6513 1.804 8.121 0.000 11.026 18.277
cr(gf2_ss, df=3, constraints='center')[2] 16.6059 2.660 6.243 0.000 11.261 21.951
age_test_gf2 -0.3580 0.137 -2.621 0.012 -0.633 -0.083
age_test_aaps 0.2940 0.127 2.321 0.025 0.039 0.549
Omnibus: 2.991 Durbin-Watson: 2.239
Prob(Omnibus): 0.224 Jarque-Bera (JB): 2.083
Skew: -0.432 Prob(JB): 0.353
Kurtosis: 3.404 Cond. No. 460.
In [43]:
sm.stats.anova_lm(m4, typ=2)
Out[43]:
sum_sq df F PR(>F)
cr(gf2_ss, df=3, constraints='center') 4624.005992 3 53.791079 1.567175e-15
age_test_gf2 196.774953 1 6.867251 1.165756e-02
age_test_aaps 154.307597 1 5.385183 2.451412e-02
Residual 1404.051233 49 NaN NaN
In [44]:
m5 = smf.ols(formula="""aaps_ss ~ gf2_ss + age_test_gf2 + age_test_aaps + 
                        time + onset_1""", 
             data=both).fit()
m5.summary()
Out[44]:
OLS Regression Results
Dep. Variable: aaps_ss R-squared: 0.877
Model: OLS Adj. R-squared: 0.862
Method: Least Squares F-statistic: 55.82
Date: Sun, 17 Apr 2016 Prob (F-statistic): 9.93e-17
Time: 18:46:54 Log-Likelihood: -122.00
No. Observations: 45 AIC: 256.0
Df Residuals: 39 BIC: 266.8
Df Model: 5
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept 63.0834 3.795 16.622 0.000 55.407 70.760
gf2_ss 0.3897 0.029 13.285 0.000 0.330 0.449
age_test_gf2 -0.5494 0.113 -4.871 0.000 -0.778 -0.321
age_test_aaps 0.3424 0.103 3.327 0.002 0.134 0.551
time 1.8769 0.523 3.590 0.001 0.819 2.934
onset_1 0.0459 0.049 0.941 0.353 -0.053 0.144
Omnibus: 0.832 Durbin-Watson: 2.356
Prob(Omnibus): 0.660 Jarque-Bera (JB): 0.209
Skew: -0.008 Prob(JB): 0.901
Kurtosis: 3.333 Cond. No. 847.
In [45]:
m6 = smf.ols(formula="""aaps_ss ~ gf2_ss + age_test_gf2 + age_test_aaps + time""", 
             data=both).fit()
m6.summary()
Out[45]:
OLS Regression Results
Dep. Variable: aaps_ss R-squared: 0.819
Model: OLS Adj. R-squared: 0.804
Method: Least Squares F-statistic: 54.32
Date: Sun, 17 Apr 2016 Prob (F-statistic): 3.13e-17
Time: 18:46:54 Log-Likelihood: -155.96
No. Observations: 53 AIC: 321.9
Df Residuals: 48 BIC: 331.8
Df Model: 4
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept 58.6568 4.169 14.069 0.000 50.274 67.040
gf2_ss 0.3976 0.033 12.071 0.000 0.331 0.464
age_test_gf2 -0.4447 0.127 -3.510 0.001 -0.699 -0.190
age_test_aaps 0.3094 0.114 2.719 0.009 0.081 0.538
time 1.5959 0.539 2.963 0.005 0.513 2.679
Omnibus: 5.319 Durbin-Watson: 2.411
Prob(Omnibus): 0.070 Jarque-Bera (JB): 4.242
Skew: -0.632 Prob(JB): 0.120
Kurtosis: 3.567 Cond. No. 819.

Plots

In [46]:
import scipy
# (correlation, p-value)
pc = scipy.stats.pearsonr(both.gf2_ss, both.aaps_ss)
print("Pearson correlation: " "%.3f" % pc[0])
sc = scipy.stats.spearmanr(both.gf2_ss, both.aaps_ss)
print("Spearman correlation: " "%.3f" % sc[0])
Pearson correlation: 0.860
Spearman correlation: 0.848
In [47]:
sns.lmplot("gf2_ss", "aaps_ss", data=both, fit_reg=False)
sns.regplot(both.gf2_ss, both.aaps_ss, scatter=False, lowess=True, label='LOWESS')
plt.plot([50, 110], [50, 110], 'red', label='Identity');
legend = plt.legend(loc='upper left')
In [48]:
# pearson correlation
g = sns.jointplot(x="gf2_ss", y="aaps_ss", data=both, marginal_kws=dict(bins=15, rug=True))
    # for spearman correlation, add 'stat_func=spearmanr' to jointplot()
x0, x1 = g.ax_joint.get_xlim()
y0, y1 = g.ax_joint.get_ylim()
lims = [max(x0, y0)-10, min(x1, y1)+10]
g.ax_joint.plot(lims, lims, ':k') 
Out[48]:
[<matplotlib.lines.Line2D at 0x10f7d9f10>]
In [49]:
# regression line with confidence bands
sns.jointplot(x="gf2_ss", y="aaps_ss", data=both, kind="reg", marginal_kws=dict(bins=15, rug=True));
In [50]:
fig = plt.figure(figsize=(12,15))
fig = sm.graphics.plot_regress_exog(m1, 'gf2_ss', fig=fig)
In [51]:
fig, ax = plt.subplots(figsize=(8,6))
ax.plot(np.array(both.aaps_ss), m3.fittedvalues, 'r.')
plt.plot([65, 105], [65, 105], 'blue', label='Identity')
legend = plt.legend(loc='upper left')
plt.xlabel('Observed Arizona Score')
plt.ylabel('Predicted Arizona Score')
plt.title('Model 3')
Out[51]:
<matplotlib.text.Text at 0x10ec3e6d0>
In [52]:
fig, ax = plt.subplots(figsize=(8,6))
ax.plot(np.array(both.gf2_ss), m3.fittedvalues, 'r.', label='Predicted')
ax.plot(np.array(both.gf2_ss), np.array(both.aaps_ss), 'g.', label='Observed')
plt.plot([35, 110], [35, 110], 'blue', label='Identity');
legend = plt.legend(loc='upper left')
plt.xlabel('Goldman-Fristoe Score')
plt.ylabel('Arizona Score')
plt.title('Model 3')
ax.set_xlim([30,120])
ax.set_ylim([60,110])
Out[52]:
(60, 110)
In [53]:
aaps_m6 = both[both['time'].notnull()].aaps_ss
gf2_m6 = both[both['time'].notnull()].gf2_ss
In [54]:
fig, ax = plt.subplots(figsize=(8,6))
ax.plot(aaps_m6, m6.fittedvalues, 'r.')
plt.plot([65, 105], [65, 105], 'blue', label='Identity')
legend = plt.legend(loc='upper left')
plt.xlabel('Observed Arizona Score')
plt.ylabel('Predicted Arizona Score')
plt.title('Model 6')
Out[54]:
<matplotlib.text.Text at 0x10f10d190>
In [55]:
fig, ax = plt.subplots(figsize=(8,6))
ax.plot(gf2_m6, m6.fittedvalues, 'r.', label='Predicted')
ax.plot(gf2_m6, aaps_m6, 'g.', label='Observed')
plt.plot([35, 110], [35, 110], 'blue', label='Identity');
legend = plt.legend(loc='upper left')
plt.xlabel('Goldman-Fristoe Score')
plt.ylabel('Arizona Score')
plt.title('Model 6')
ax.set_xlim([30,120])
ax.set_ylim([60,110])
Out[55]:
(60, 110)

Future Considerations

In an effort to increase the sample size, we could model all observations with both scores (73) and use cluster-robust standard errors, or build a repeated-measures model.

For robust standard errors see http://statsmodels.sourceforge.net/0.6.0/stats.html
sandwich_covariance.cov_cluster(results, group)
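
A minimal sketch of the cluster-robust idea on invented longitudinal data (students with repeated observations); newer statsmodels versions also expose this directly via .fit(cov_type='cluster'):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.stats.sandwich_covariance as sw

# Invented data: 30 students, 3 annual observations each
rng = np.random.RandomState(0)
study_id = np.repeat(np.arange(30), 3)
gf2_ss = rng.normal(90, 10, study_id.size)
aaps_ss = 55 + 0.4 * gf2_ss + rng.normal(0, 3, study_id.size)
df = pd.DataFrame({'aaps_ss': aaps_ss, 'gf2_ss': gf2_ss, 'study_id': study_id})

ols = smf.ols('aaps_ss ~ gf2_ss', data=df).fit()

# Sandwich covariance clustered on student, as suggested above
vc = sw.cov_cluster(ols, df.study_id.values)
robust_se = np.sqrt(np.diag(vc))
print(ols.bse.values)  # naive standard errors
print(robust_se)       # cluster-robust standard errors
```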