Students can take either the Arizona or the Goldman-Fristoe test
Question: Can these two tests be combined?
Similar students should score the same.
How should we define "similar students"?
For now I will consider the variables included in our population characteristics summary table
Covariate | Redcap Variable |
---|---|
Age (Arizona) | age_test_aaps |
Age (Goldman-Fristoe) | age_test_gf2 |
Type of hearing loss | type_hl_as , type_hl_ad |
Degree of hearing loss | degree_hl |
Technology | tech_as , tech_ad |
Gender | male |
Race | race |
Primary language spoken | prim_lang |
Number of children in home | sib |
Age of hearing loss diagnosis | onset_1 |
Age first amplified | age_amp |
Age of intervention | age_int |
Age enrolled in OPTION | age |
Length of time in OPTION | time |
Known cause of hearing loss | med_cause=0 |
Known syndrome | synd_cause=0 |
Another diagnosed disability | etiology |
Another disability impacts ability to learn | etiology_2 |
Other services outside of OPTION | otherserv |
How will we assess this?
Statistical considerations:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
CHRIS = False
# %pwd
if CHRIS:
print("change path to demographics file")
else:
%run '/Users/alicetoll/Documents/OPTION/LSL-DR/Demographics.ipynb'
if CHRIS:
demographic = pd.read_csv('demographic.csv')
else:
demographic = pd.read_pickle('demographic.pkl')
covar_fields = ['study_id', 'redcap_event_name', 'male', 'race', 'prim_lang', 'sib', 'onset_1',
'age_amp', 'age_int', 'age', 'time', 'med_cause', 'synd_cause', 'etiology', 'etiology_2',
'otherserv', 'type_hl_ad', 'type_hl_as']
covar = demographic[covar_fields].copy()  # copy to avoid SettingWithCopyWarning on later assignments
# Need to create vars and add:
# - type of hearing loss
# - degree of hearing loss
# - technology
# - CI Manufacturer
# Do we want to collapse race?
# Age is in articulation_fields
# - age_test_aaps = Arizona
# - age_test_gf2 = Goldman-Fristoe
# Type of hearing loss
# - Bilateral SNHL
# - Unilateral SNHL
# - Bilateral Auditory Neuropathy
# - Unilateral Auditory Neuropathy
covar['bilateral_snhl'] = 0
covar.loc[(covar.type_hl_ad == 0) & (covar.type_hl_as == 0), 'bilateral_snhl'] = 1
covar['unilateral_snhl'] = 0
covar.loc[(covar.type_hl_ad == 0) & (covar.type_hl_as == 4), 'unilateral_snhl'] = 1
covar.loc[(covar.type_hl_ad == 4) & (covar.type_hl_as == 0), 'unilateral_snhl'] = 1
covar['bilateral_an'] = 0
covar.loc[(covar.type_hl_ad == 3) & (covar.type_hl_as == 3), 'bilateral_an'] = 1
covar['unilateral_an'] = 0
covar.loc[(covar.type_hl_ad == 0) & (covar.type_hl_as == 3), 'unilateral_an'] = 1
covar.loc[(covar.type_hl_ad == 3) & (covar.type_hl_as == 0), 'unilateral_an'] = 1
# - Bilateral SNHL
# - Unilateral SNHL
# - Bilateral Auditory Neuropathy
# - Unilateral Auditory Neuropathy
bi_snhl = covar.bilateral_snhl == 1
uni_snhl = covar.unilateral_snhl == 1
bi_an = covar.bilateral_an == 1
uni_an = covar.unilateral_an == 1
covar.loc[bi_snhl, 'type_hl'] = "bi_snhl"
covar.loc[uni_snhl, 'type_hl'] = "uni_snhl"
covar.loc[bi_an, 'type_hl'] = "bi_an"
covar.loc[uni_an, 'type_hl'] = "uni_an"
articulation_fields = ['study_id','redcap_event_name', 'age_test_aaps','aaps_ss','age_test_gf2','gf2_ss']
articulation = lsl_dr_project.export_records(fields=articulation_fields, format='df',
df_kwargs={'index_col':None,
'na_values':[999, 9999]})
# articulation.head()
# create indicator variable if a student took either test
articulation.loc[articulation.aaps_ss.notnull() | articulation.gf2_ss.notnull(), 'test'] = 1
# drop observations when neither test was taken
temp = articulation.dropna(subset = ['test'])
# Can drop test variable
temp = temp.drop('test', axis=1)
# Create new variable if student took both tests in one observation
temp['both'] = 0
temp.loc[temp.aaps_ss.notnull() & temp.gf2_ss.notnull(), 'both'] = 1
print(temp.both.value_counts()) # 5760 observations with only one test, 73 with both
# temp.head()
0    5760
1      73
dtype: int64
Merge covariates and articulation data frames
# merge
temp2 = pd.merge(temp, covar, on=['study_id', 'redcap_event_name'])
AAPS = temp2.aaps_ss.notnull()
GF2 = temp2.gf2_ss.notnull()
temp2.loc[AAPS, "test_name"] = "AAPS"
temp2.loc[GF2, "test_name"] = "GF2"
Create datasets containing students who took either one test or both tests in a single year. If a student has multiple observations, keep the last observation in each dataset.
# One test
single = temp2.loc[temp2.both == 0]
a = single.shape[0]
single = single.groupby('study_id').last()
b = single.shape[0]
print('We have', a, 'observations where a student took one test in a single year, but only', b, 'unique students')
# Both tests
both = temp2.loc[temp2.both == 1]
a = both.shape[0]
both = both.groupby('study_id').last()
b = both.shape[0]
print('We have', a, 'observations where a student took both tests in a single year, but only', b, 'unique students')
We have 5760 observations where a student took one test in a single year, but only 2730 unique students
We have 73 observations where a student took both tests in a single year, but only 55 unique students
del temp
del temp2
# Create variables to hold test score and age, and indicator for test name
AAPS = single.aaps_ss.notnull()
GF2 = single.gf2_ss.notnull()
single.loc[AAPS, "test_name"] = "AAPS"
single.loc[GF2, "test_name"] = "GF2"
single.loc[AAPS, "score"] = single.aaps_ss
single.loc[GF2, "score"] = single.gf2_ss
single.loc[AAPS, "age_test"] = single.age_test_aaps
single.loc[GF2, "age_test"] = single.age_test_gf2
# temp.head()
# single = single.drop(['age_test_aaps', 'aaps_ss', 'age_test_gf2', 'gf2_ss', 'both'], 1)
# artShort1.head()
# Number of observations
single.shape[0]
2730
# Missingness
len(single.index)-single.count()
redcap_event_name       0
age_test_aaps        2448
aaps_ss              2448
age_test_gf2          181
gf2_ss                180
both                    0
male                   22
race                   69
prim_lang              42
sib                   247
onset_1               913
age_amp               973
age_int              1063
age                    46
time                  170
med_cause              57
synd_cause             60
etiology               48
etiology_2            527
otherserv             974
type_hl_ad            429
type_hl_as            409
bilateral_snhl          0
unilateral_snhl         0
bilateral_an            0
unilateral_an           0
type_hl               780
test_name               0
score                   0
age_test                1
dtype: int64
%matplotlib inline
import seaborn as sns
# sns.set(style="white", color_codes=True)
width = 3
sns.kdeplot(np.array(single[single.test_name == 'GF2'].score), bw=width, label = "Goldman-Fristoe")
sns.kdeplot(np.array(single[single.test_name == 'AAPS'].score), bw=width, label = "Arizona")
single.type_hl.value_counts()
bi_snhl     1811
bi_an         73
uni_snhl      57
uni_an         9
dtype: int64
Given these sample sizes, we may want to consider looking at just bilateral SNHL vs. other.
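A collapsed indicator along these lines could be built as below. This is a minimal sketch on a toy frame standing in for `single`; the column name `bi_snhl_vs_other` is hypothetical. Missing types stay missing rather than being coded as "other".

```python
import numpy as np
import pandas as pd

# Toy stand-in for `single`: collapse detailed hearing-loss type
# into bilateral SNHL vs. everything else, keeping missing as missing
df = pd.DataFrame({'type_hl': ['bi_snhl', 'uni_snhl', 'bi_an', None, 'bi_snhl']})
df['bi_snhl_vs_other'] = np.where(df.type_hl.isnull(), np.nan,
                                  (df.type_hl == 'bi_snhl').astype(float))
print(df.bi_snhl_vs_other.tolist())  # [1.0, 0.0, 0.0, nan, 1.0]
```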
# Kernel density estimates broken out by type of hearing loss
plt.figure()
# Bilateral SNHL
plt.subplot(2,2,1)
sns.kdeplot(np.array(single.loc[(single.test_name == 'GF2') & (single.bilateral_snhl == 1), 'score']),
bw=width)
sns.kdeplot(np.array(single.loc[(single.test_name == 'AAPS') & (single.bilateral_snhl == 1), 'score']),
bw=width)
plt.title('Bilateral SNHL')
plt.ylabel('Density')
plt.xlabel('Standardized Score')
# Unilateral SNHL
plt.subplot(2,2,2)
sns.kdeplot(np.array(single.loc[(single.test_name == 'GF2') & (single.unilateral_snhl == 1), 'score']),
bw=width)
sns.kdeplot(np.array(single.loc[(single.test_name == 'AAPS') & (single.unilateral_snhl == 1), 'score']),
bw=width)
plt.title('Unilateral SNHL')
plt.ylabel('Density')
plt.xlabel('Standardized Score')
# Bilateral AN
plt.subplot(2,2,3)
sns.kdeplot(np.array(single.loc[(single.test_name == 'GF2') & (single.bilateral_an == 1), 'score']),
bw=width, label = "Goldman-Fristoe")
sns.kdeplot(np.array(single.loc[(single.test_name == 'AAPS') & (single.bilateral_an == 1), 'score']),
bw=width, label = "Arizona")
plt.title('Bilateral Auditory Neuropathy')
plt.ylabel('Density')
plt.xlabel('Standardized Score')
# Unilateral AN
plt.subplot(2,2,4)
sns.kdeplot(np.array(single.loc[(single.test_name == 'GF2') & (single.unilateral_an == 1), 'score']),
bw=width)
# no obs for Arizona with Unilateral AN
# sns.kdeplot(np.array(single.loc[(single.test_name == 'AAPS') & (single.unilateral_an == 1), 'score']),
# bw=width, label = "Arizona")
plt.title('Unilateral Auditory Neuropathy')
plt.ylabel('Density')
plt.xlabel('Standardized Score')
plt.tight_layout()
import statsmodels.formula.api as smf
Linear model predicting score from test name, age, gender, and type of hearing loss.
lm = smf.ols(formula="""score ~ test_name + age_test + male +
bilateral_snhl + unilateral_snhl + bilateral_an + unilateral_an""",
data=single).fit()
lm.summary()
Dep. Variable: | score | R-squared: | 0.019 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.016 |
Method: | Least Squares | F-statistic: | 7.470 |
Date: | Sun, 17 Apr 2016 | Prob (F-statistic): | 6.29e-09 |
Time: | 18:46:52 | Log-Likelihood: | -12016. |
No. Observations: | 2707 | AIC: | 2.405e+04 |
Df Residuals: | 2699 | BIC: | 2.409e+04 |
Df Model: | 7 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [95.0% Conf. Int.] | |
---|---|---|---|---|---|
Intercept | 87.8178 | 1.979 | 44.369 | 0.000 | 83.937 91.699 |
test_name[T.GF2] | 1.4962 | 1.589 | 0.942 | 0.346 | -1.619 4.612 |
age_test | -0.0570 | 0.013 | -4.545 | 0.000 | -0.082 -0.032 |
male | 0.1904 | 0.792 | 0.241 | 0.810 | -1.362 1.742 |
bilateral_snhl | -0.5525 | 0.880 | -0.628 | 0.530 | -2.279 1.174 |
unilateral_snhl | 9.0884 | 2.868 | 3.169 | 0.002 | 3.465 14.711 |
bilateral_an | -7.9264 | 2.547 | -3.112 | 0.002 | -12.921 -2.932 |
unilateral_an | -17.2806 | 6.879 | -2.512 | 0.012 | -30.769 -3.792 |
Omnibus: | 180.329 | Durbin-Watson: | 1.756 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 204.964 |
Skew: | -0.653 | Prob(JB): | 3.11e-45 |
Kurtosis: | 2.666 | Cond. No. | 1.38e+03 |
From the output above, we see that test_name is not significant (p = 0.346). This tells us that after accounting for the student's age when taking the test, gender, and type of hearing loss, knowing which test the student took does not help us predict their score. This suggests that similar students score the same regardless of which test they took.
We need to determine which variables should be included in our model.
We currently treat standardized Arizona and Goldman-Fristoe test scores the same. The goal of this analysis is to relate the two tests so that scores can be transformed for future modeling purposes.
We only have 55 observations (before accounting for missingness), which leaves us with only 3-6 degrees of freedom for modeling. A continuous variable uses one degree of freedom, and a categorical variable uses one fewer degree of freedom than its number of categories.
My initial intuition was to model Arizona score as a function of Goldman-Fristoe score, age, type of hearing loss, and whether or not the student receives other learning services. Unfortunately, in this small sample the observations for type of hearing loss and other services are rather homogeneous, so I do not believe they will be very helpful in predicting Arizona score.
We may want to allow for nonlinearity by including a spline term for the Goldman-Fristoe score. A spline with 3 knots would use 3 degrees of freedom.
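This degrees-of-freedom accounting can be checked directly with patsy, the formula engine that statsmodels uses. A sketch on hypothetical data: a continuous term adds one column beyond the intercept, a k-level categorical adds k-1 dummy columns, and `bs(x, 3)` adds 3 spline-basis columns.

```python
import numpy as np
import patsy

# Hypothetical data: a continuous score and a 3-level categorical
x = np.linspace(50, 110, 60)
g = np.repeat(['a', 'b', 'c'], 20)

# Columns: intercept (1) + C(g) dummies (2) + x (1) + bs(x, 3) basis (3)
dm = patsy.dmatrix("x + C(g) + bs(x, 3)", {"x": x, "g": g})
print(dm.shape)  # (60, 7)
```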
The candidate models considered are summarized in the subsection titled Models.
Variables that were eliminated from consideration due to homogeneity:

  * type_hl (47 out of 55 students have bilateral SNHL)
  * etiology (42 out of 55 have another diagnosed disability)
  * otherserv (35 out of 55 do not receive other educational services; 12 out of 55 are missing)

Response: Arizona scores (aaps_ss)
Considered Covariate | Redcap Variable | df
---|---|---
Goldman-Fristoe Score | gf2_ss | 1
Goldman-Fristoe Score, spline | gf2_ss | 3
Age (Arizona) | age_test_aaps | 1
Age (Goldman-Fristoe) | age_test_gf2 | 1
Gender | male | 1
Race (White/Non-White) | white | 1
Age of hearing loss diagnosis | onset_1 | 1
Length of time in OPTION | time | 1
Another disability impacts ability to learn | etiology_impact | 1
# With 55 observations (before missingness), we really only have 3-6 degrees of freedom to use in a model.
both.shape
(55, 28)
# Arizona is more conservative towards "normal" (85)
# May want to consider a spline for Goldman-Fristoe
width = 4
sns.kdeplot(np.array(both.gf2_ss), bw=width, label = "Goldman-Fristoe")
sns.kdeplot(np.array(both.aaps_ss), bw=width, label = "Arizona")
plt.ylabel('Density')
plt.xlabel('Standardized Score')
# Large age range (in months). May want to use a spline for age
# Don't have enough degrees of freedom to spline both variables
sns.kdeplot(both.age_test_gf2, bw=5, label = "Goldman-Fristoe")
sns.kdeplot(both.age_test_aaps, bw=5, label = "Arizona")
plt.ylabel('Density')
plt.xlabel('Age (months)')
# Race has a lot of categories
# To preserve degrees of freedom we can reduce to white/non-white
both.race.value_counts(dropna=False, sort=False)
# 0 = Caucasian
# 1 = Black or African American
# 2 = Hispanic or Latino
# 3 = Asian
# 4 = Native Hawaiians/Pacific Islanders
# 5 = American Indian/Alaska Natives
# 6 = Multiracial
# 7 = Unknown
# 8 = Other
NaN     2
0      26
1       6
2       8
3       4
4       9
dtype: int64
both['white'] = np.nan
both.loc[(both.race == 0), 'white'] = 1
both.loc[(both.race !=0) & (pd.notnull(both.race)), 'white'] = 0
both.white.value_counts(dropna=False, sort=False)
NaN     2
0      27
1      26
dtype: int64
# Etiology 2: impact of the other condition(s) that contribute to the child's overall learning abilities
# To preserve degrees of freedom we can reduce to impacts (1, 2, 3) vs. does not impact (0, 4)
both.etiology_2.value_counts(dropna=False, sort=False)
# 0 = None
# 1 = Mild
# 2 = Moderate
# 3 = Severe
# 4 = N/A (no additional disability suspected or identified)
NaN     1
0      11
1       5
2      12
4      26
dtype: int64
both['etiology_impact'] = np.nan
both.loc[(both.etiology_2 == 0) | (both.etiology_2 == 4), 'etiology_impact'] = 0
both.loc[(both.etiology_2 == 1) | (both.etiology_2 == 2) | (both.etiology_2 == 3), 'etiology_impact'] = 1
both.etiology_impact.value_counts(dropna=False)
0      37
1      17
NaN     1
dtype: int64
# Amount of time enrolled in OPTION: 2 missing obs
len(both.time.index)-both.time.count()
2
# Age of onset: 10 missing obs
# This is a lot of observations to sacrifice
# Would be good if we could use multiple imputation
len(both.onset_1.index)-both.onset_1.count()
10
# Most of the students who took both tests have bilateral SNHL
# 4 observations are missing
# Because a large majority of the students only have bilateral SNHL, it will not be a useful predictor.
both.type_hl.value_counts(dropna=False)
bi_snhl     47
NaN          4
bi_an        3
uni_snhl     1
dtype: int64
# 42 out of 55 do not have another diagnosed disability
# note: this doesn't agree with etiology_2 coding
both.etiology.value_counts(dropna=False, sort=False)
0     9
1    42
2     1
3     3
dtype: int64
# Other services: 12 missing obs
# Most students do not receive other services
both.otherserv.value_counts(dropna=False, sort=False)
# 0 = Yes
# 1 = No
# 2 = Unknown
NaN    12
0       3
1      35
2       5
dtype: int64
Model Name | Variables | n obs | df | R-squared |
---|---|---|---|---|
m1 | gf2_ss + age_test_gf2 + age_test_aaps + male + white + etiology_impact | 52 | 6 | 0.782 |
m2 | gf2_ss + age_test_gf2 + age_test_aaps | 55 | 3 | 0.771 |
m3 | bs(gf2_ss, 3) + age_test_gf2 + age_test_aaps | 55 | 5 | 0.781 |
m4 | cr(gf2_ss, df=3, constraints='center') + age_test_gf2 + age_test_aaps | 55 | 5 | 0.780 |
m5 | gf2_ss + age_test_gf2 + age_test_aaps + time + onset_1 | 45 | 5 | 0.877 |
m6 | gf2_ss + age_test_gf2 + age_test_aaps + time | 52 | 4 | 0.819 |
m1 = smf.ols(formula="""aaps_ss ~ gf2_ss + age_test_gf2 + age_test_aaps +
male + white + etiology_impact""",
data=both).fit()
m1.summary()
Dep. Variable: | aaps_ss | R-squared: | 0.782 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.753 |
Method: | Least Squares | F-statistic: | 26.85 |
Date: | Sun, 17 Apr 2016 | Prob (F-statistic): | 2.44e-13 |
Time: | 18:46:54 | Log-Likelihood: | -158.30 |
No. Observations: | 52 | AIC: | 330.6 |
Df Residuals: | 45 | BIC: | 344.3 |
Df Model: | 6 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [95.0% Conf. Int.] | |
---|---|---|---|---|---|
Intercept | 53.1924 | 5.154 | 10.320 | 0.000 | 42.811 63.574 |
gf2_ss | 0.4439 | 0.038 | 11.787 | 0.000 | 0.368 0.520 |
age_test_gf2 | -0.3120 | 0.147 | -2.125 | 0.039 | -0.608 -0.016 |
age_test_aaps | 0.2512 | 0.134 | 1.881 | 0.066 | -0.018 0.520 |
male | 0.5687 | 1.645 | 0.346 | 0.731 | -2.745 3.882 |
white | 1.7043 | 1.638 | 1.041 | 0.304 | -1.594 5.003 |
etiology_impact | -2.0195 | 1.702 | -1.187 | 0.242 | -5.447 1.408 |
Omnibus: | 2.316 | Durbin-Watson: | 2.265 |
---|---|---|---|
Prob(Omnibus): | 0.314 | Jarque-Bera (JB): | 1.555 |
Skew: | -0.403 | Prob(JB): | 0.460 |
Kurtosis: | 3.263 | Cond. No. | 880. |
m2 = smf.ols(formula="""aaps_ss ~ gf2_ss + age_test_gf2 + age_test_aaps""",
data=both).fit()
m2.summary()
Dep. Variable: | aaps_ss | R-squared: | 0.771 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.757 |
Method: | Least Squares | F-statistic: | 57.12 |
Date: | Sun, 17 Apr 2016 | Prob (F-statistic): | 2.52e-16 |
Time: | 18:46:54 | Log-Likelihood: | -168.24 |
No. Observations: | 55 | AIC: | 344.5 |
Df Residuals: | 51 | BIC: | 352.5 |
Df Model: | 3 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [95.0% Conf. Int.] | |
---|---|---|---|---|---|
Intercept | 54.8203 | 4.437 | 12.354 | 0.000 | 45.912 63.729 |
gf2_ss | 0.4339 | 0.034 | 12.623 | 0.000 | 0.365 0.503 |
age_test_gf2 | -0.3497 | 0.135 | -2.588 | 0.013 | -0.621 -0.078 |
age_test_aaps | 0.2834 | 0.124 | 2.280 | 0.027 | 0.034 0.533 |
Omnibus: | 2.865 | Durbin-Watson: | 2.272 |
---|---|---|---|
Prob(Omnibus): | 0.239 | Jarque-Bera (JB): | 2.027 |
Skew: | -0.442 | Prob(JB): | 0.363 |
Kurtosis: | 3.320 | Cond. No. | 798. |
# Use to print anova tables (for when we use splines)
import statsmodels.api as sm
# bs() B-spline
m3 = smf.ols(formula="""aaps_ss ~ bs(gf2_ss, 3) + age_test_gf2 + age_test_aaps""",
data=both).fit()
m3.summary()
Dep. Variable: | aaps_ss | R-squared: | 0.781 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.759 |
Method: | Least Squares | F-statistic: | 35.01 |
Date: | Sun, 17 Apr 2016 | Prob (F-statistic): | 4.62e-15 |
Time: | 18:46:54 | Log-Likelihood: | -166.93 |
No. Observations: | 55 | AIC: | 345.9 |
Df Residuals: | 49 | BIC: | 357.9 |
Df Model: | 5 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [95.0% Conf. Int.] | |
---|---|---|---|---|---|
Intercept | 75.4249 | 4.181 | 18.040 | 0.000 | 67.023 83.827 |
bs(gf2_ss, 3)[0] | -1.3268 | 7.914 | -0.168 | 0.868 | -17.230 14.576 |
bs(gf2_ss, 3)[1] | 22.9797 | 5.155 | 4.458 | 0.000 | 12.620 33.339 |
bs(gf2_ss, 3)[2] | 27.7088 | 4.170 | 6.645 | 0.000 | 19.330 36.088 |
age_test_gf2 | -0.3629 | 0.136 | -2.671 | 0.010 | -0.636 -0.090 |
age_test_aaps | 0.2951 | 0.126 | 2.342 | 0.023 | 0.042 0.548 |
Omnibus: | 3.445 | Durbin-Watson: | 2.254 |
---|---|---|---|
Prob(Omnibus): | 0.179 | Jarque-Bera (JB): | 2.473 |
Skew: | -0.464 | Prob(JB): | 0.290 |
Kurtosis: | 3.465 | Cond. No. | 1.30e+03 |
sm.stats.anova_lm(m3, typ=2)
sum_sq | df | F | PR(>F) | |
---|---|---|---|---|
bs(gf2_ss, 3) | 4634.360674 | 3 | 54.312079 | 1.308633e-15 |
age_test_gf2 | 202.877916 | 1 | 7.132842 | 1.024204e-02 |
age_test_aaps | 156.014750 | 1 | 5.485213 | 2.328615e-02 |
Residual | 1393.696551 | 49 | NaN | NaN |
# cr() natural cubic spline
m4 = smf.ols(formula="""aaps_ss ~ cr(gf2_ss, df=3, constraints='center') + age_test_gf2 + age_test_aaps""",
data=both).fit()
m4.summary()
Dep. Variable: | aaps_ss | R-squared: | 0.780 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.757 |
Method: | Least Squares | F-statistic: | 34.68 |
Date: | Sun, 17 Apr 2016 | Prob (F-statistic): | 5.52e-15 |
Time: | 18:46:54 | Log-Likelihood: | -167.14 |
No. Observations: | 55 | AIC: | 346.3 |
Df Residuals: | 49 | BIC: | 358.3 |
Df Model: | 5 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [95.0% Conf. Int.] | |
---|---|---|---|---|---|
Intercept | 90.0725 | 3.236 | 27.838 | 0.000 | 83.570 96.575 |
cr(gf2_ss, df=3, constraints='center')[0] | -0.6435 | 1.915 | -0.336 | 0.738 | -4.491 3.204 |
cr(gf2_ss, df=3, constraints='center')[1] | 14.6513 | 1.804 | 8.121 | 0.000 | 11.026 18.277 |
cr(gf2_ss, df=3, constraints='center')[2] | 16.6059 | 2.660 | 6.243 | 0.000 | 11.261 21.951 |
age_test_gf2 | -0.3580 | 0.137 | -2.621 | 0.012 | -0.633 -0.083 |
age_test_aaps | 0.2940 | 0.127 | 2.321 | 0.025 | 0.039 0.549 |
Omnibus: | 2.991 | Durbin-Watson: | 2.239 |
---|---|---|---|
Prob(Omnibus): | 0.224 | Jarque-Bera (JB): | 2.083 |
Skew: | -0.432 | Prob(JB): | 0.353 |
Kurtosis: | 3.404 | Cond. No. | 460. |
sm.stats.anova_lm(m4, typ=2)
sum_sq | df | F | PR(>F) | |
---|---|---|---|---|
cr(gf2_ss, df=3, constraints='center') | 4624.005992 | 3 | 53.791079 | 1.567175e-15 |
age_test_gf2 | 196.774953 | 1 | 6.867251 | 1.165756e-02 |
age_test_aaps | 154.307597 | 1 | 5.385183 | 2.451412e-02 |
Residual | 1404.051233 | 49 | NaN | NaN |
m5 = smf.ols(formula="""aaps_ss ~ gf2_ss + age_test_gf2 + age_test_aaps +
time + onset_1""",
data=both).fit()
m5.summary()
Dep. Variable: | aaps_ss | R-squared: | 0.877 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.862 |
Method: | Least Squares | F-statistic: | 55.82 |
Date: | Sun, 17 Apr 2016 | Prob (F-statistic): | 9.93e-17 |
Time: | 18:46:54 | Log-Likelihood: | -122.00 |
No. Observations: | 45 | AIC: | 256.0 |
Df Residuals: | 39 | BIC: | 266.8 |
Df Model: | 5 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [95.0% Conf. Int.] | |
---|---|---|---|---|---|
Intercept | 63.0834 | 3.795 | 16.622 | 0.000 | 55.407 70.760 |
gf2_ss | 0.3897 | 0.029 | 13.285 | 0.000 | 0.330 0.449 |
age_test_gf2 | -0.5494 | 0.113 | -4.871 | 0.000 | -0.778 -0.321 |
age_test_aaps | 0.3424 | 0.103 | 3.327 | 0.002 | 0.134 0.551 |
time | 1.8769 | 0.523 | 3.590 | 0.001 | 0.819 2.934 |
onset_1 | 0.0459 | 0.049 | 0.941 | 0.353 | -0.053 0.144 |
Omnibus: | 0.832 | Durbin-Watson: | 2.356 |
---|---|---|---|
Prob(Omnibus): | 0.660 | Jarque-Bera (JB): | 0.209 |
Skew: | -0.008 | Prob(JB): | 0.901 |
Kurtosis: | 3.333 | Cond. No. | 847. |
m6 = smf.ols(formula="""aaps_ss ~ gf2_ss + age_test_gf2 + age_test_aaps + time""",
data=both).fit()
m6.summary()
Dep. Variable: | aaps_ss | R-squared: | 0.819 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.804 |
Method: | Least Squares | F-statistic: | 54.32 |
Date: | Sun, 17 Apr 2016 | Prob (F-statistic): | 3.13e-17 |
Time: | 18:46:54 | Log-Likelihood: | -155.96 |
No. Observations: | 53 | AIC: | 321.9 |
Df Residuals: | 48 | BIC: | 331.8 |
Df Model: | 4 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [95.0% Conf. Int.] | |
---|---|---|---|---|---|
Intercept | 58.6568 | 4.169 | 14.069 | 0.000 | 50.274 67.040 |
gf2_ss | 0.3976 | 0.033 | 12.071 | 0.000 | 0.331 0.464 |
age_test_gf2 | -0.4447 | 0.127 | -3.510 | 0.001 | -0.699 -0.190 |
age_test_aaps | 0.3094 | 0.114 | 2.719 | 0.009 | 0.081 0.538 |
time | 1.5959 | 0.539 | 2.963 | 0.005 | 0.513 2.679 |
Omnibus: | 5.319 | Durbin-Watson: | 2.411 |
---|---|---|---|
Prob(Omnibus): | 0.070 | Jarque-Bera (JB): | 4.242 |
Skew: | -0.632 | Prob(JB): | 0.120 |
Kurtosis: | 3.567 | Cond. No. | 819. |
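Whichever model is ultimately chosen, transforming a Goldman-Fristoe score onto the Arizona scale is just `.predict()` on a new row. A sketch on synthetic stand-in data (column names as in the notebook, but the generating coefficients and values are hypothetical, loosely echoing m2):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the dual-test observations
rng = np.random.RandomState(1)
n = 55
sim = pd.DataFrame({'gf2_ss': rng.uniform(40, 110, n),
                    'age_test_gf2': rng.uniform(36, 96, n)})
sim['age_test_aaps'] = sim.age_test_gf2 + rng.uniform(-3, 3, n)
sim['aaps_ss'] = (55 + 0.43 * sim.gf2_ss - 0.35 * sim.age_test_gf2
                  + 0.28 * sim.age_test_aaps + rng.normal(0, 4, n))

fit = smf.ols('aaps_ss ~ gf2_ss + age_test_gf2 + age_test_aaps', data=sim).fit()

# Transform one new Goldman-Fristoe observation onto the Arizona scale
new = pd.DataFrame({'gf2_ss': [85], 'age_test_gf2': [60], 'age_test_aaps': [60]})
pred = fit.predict(new)
print(pred)
```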
import scipy
# (correlation, p-value)
pc = scipy.stats.pearsonr(both.gf2_ss, both.aaps_ss)
print("Pearson correlation: %.3f" % pc[0])
sc = scipy.stats.spearmanr(both.gf2_ss, both.aaps_ss)
print("Spearman correlation: %.3f" % sc[0])
Pearson correlation: 0.860
Spearman correlation: 0.848
sns.lmplot("gf2_ss", "aaps_ss", data=both, fit_reg=False)
sns.regplot(both.gf2_ss, both.aaps_ss, scatter=False, lowess=True, label='LOWESS')
plt.plot([50, 110], [50, 110], 'red', label='Identity');
legend = plt.legend(loc='upper left')
# pearson correlation
g = sns.jointplot(x="gf2_ss", y="aaps_ss", data=both, marginal_kws=dict(bins=15, rug=True))
# for Spearman correlation, add stat_func=spearmanr to jointplot()
x0, x1 = g.ax_joint.get_xlim()
y0, y1 = g.ax_joint.get_ylim()
lims = [max(x0, y0)-10, min(x1, y1)+10]
g.ax_joint.plot(lims, lims, ':k')
# regression line with confidence bands
sns.jointplot(x="gf2_ss", y="aaps_ss", data=both, kind="reg", marginal_kws=dict(bins=15, rug=True));
fig = plt.figure(figsize=(12,15))
fig = sm.graphics.plot_regress_exog(m1, 'gf2_ss', fig=fig)
fig, ax = plt.subplots(figsize=(8,6))
ax.plot(np.array(both.aaps_ss), m3.fittedvalues, 'r.')
plt.plot([65, 105], [65, 105], 'blue', label='Identity')
legend = plt.legend(loc='upper left')
plt.xlabel('Observed Arizona Score')
plt.ylabel('Predicted Arizona Score')
plt.title('Model 3')
fig, ax = plt.subplots(figsize=(8,6))
ax.plot(np.array(both.gf2_ss), m3.fittedvalues, 'r.', label='Predicted')
ax.plot(np.array(both.gf2_ss), np.array(both.aaps_ss), 'g.', label='Observed')
plt.plot([35, 110], [35, 110], 'blue', label='Identity');
legend = plt.legend(loc='upper left')
plt.xlabel('Goldman-Fristoe Score')
plt.ylabel('Arizona Score')
plt.title('Model 3')
ax.set_xlim([30,120])
ax.set_ylim([60,110])
aaps_m6 = both[both['time'].notnull()].aaps_ss
gf2_m6 = both[both['time'].notnull()].gf2_ss
fig, ax = plt.subplots(figsize=(8,6))
ax.plot(aaps_m6, m6.fittedvalues, 'r.')
plt.plot([65, 105], [65, 105], 'blue', label='Identity')
legend = plt.legend(loc='upper left')
plt.xlabel('Observed Arizona Score')
plt.ylabel('Predicted Arizona Score')
plt.title('Model 6')
fig, ax = plt.subplots(figsize=(8,6))
ax.plot(gf2_m6, m6.fittedvalues, 'r.', label='Predicted')
ax.plot(gf2_m6, aaps_m6, 'g.', label='Observed')
plt.plot([35, 110], [35, 110], 'blue', label='Identity');
legend = plt.legend(loc='upper left')
plt.xlabel('Goldman-Fristoe Score')
plt.ylabel('Arizona Score')
plt.title('Model 6')
ax.set_xlim([30,120])
ax.set_ylim([60,110])
In an effort to increase sample size, we could model all 73 observations with both scores and use cluster-robust standard errors, or build a repeated-measures model.
For robust standard errors see http://statsmodels.sourceforge.net/0.6.0/stats.html
sandwich_covariance.cov_cluster(results, group)
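In current statsmodels the same sandwich estimator is available directly through `fit(cov_type='cluster', ...)`, without calling `sandwich_covariance` by hand. A sketch on synthetic stand-in data (73 rows from 55 students, with repeated `study_id` values; names hypothetical):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in: 73 observations from 55 students (some ids repeated)
rng = np.random.RandomState(2)
ids = np.concatenate([np.arange(55), rng.choice(55, 18)])
df = pd.DataFrame({'study_id': ids, 'gf2_ss': rng.uniform(40, 110, 73)})
df['aaps_ss'] = 55 + 0.43 * df.gf2_ss + rng.normal(0, 5, 73)

# Sandwich (cluster-robust) standard errors, clustered on student
fit = smf.ols('aaps_ss ~ gf2_ss', data=df).fit(
    cov_type='cluster', cov_kwds={'groups': df['study_id']})
print(fit.bse)
```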