import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.stats.multicomp import (pairwise_tukeyhsd,MultiComparison)
df = pd.read_csv("https://raw.githubusercontent.com/weilixiang/sta2453_project1/master/merged_data.csv")
import statsmodels.api as sm
import statsmodels.formula.api as smf
full_md = smf.mixedlm("PerformanceScore ~ SessionType + Duration + RPE + SessionLoad + \
DailyLoad + AcuteLoad + ChronicLoad + AcuteChronicRatio + Load_Status + \
Outcome + PointsDiff + Pain + Illness + Menstruation + \
Nutrition + NutritionAdjustment + EWMScore", df,groups = df["PlayerID"])
mdf = full_md.fit()
print(mdf.summary())
Mixed Linear Model Regression Results ============================================================================= Model: MixedLM Dependent Variable: PerformanceScore No. Observations: 562 Method: REML No. Groups: 17 Scale: 58.6021 Min. group size: 3 Likelihood: -1956.4791 Max. group size: 80 Converged: Yes Mean group size: 33.1 ----------------------------------------------------------------------------- Coef. Std.Err. z P>|z| [0.025 0.975] ----------------------------------------------------------------------------- Intercept 50.766 8.679 5.849 0.000 33.755 67.777 SessionType[T.Game] -0.501 2.385 -0.210 0.834 -5.175 4.173 SessionType[T.Mobility/Recovery] 0.331 2.774 0.119 0.905 -5.106 5.769 SessionType[T.Skills] 1.829 2.469 0.741 0.459 -3.010 6.668 SessionType[T.Speed] 2.498 3.187 0.784 0.433 -3.748 8.744 SessionType[T.Strength] 5.037 3.126 1.611 0.107 -1.090 11.163 Load_Status[T.normal] -6.476 2.475 -2.617 0.009 -11.327 -1.626 Load_Status[T.recovering] -4.472 3.045 -1.468 0.142 -10.440 1.497 Outcome[T.W] -2.591 1.320 -1.963 0.050 -5.178 -0.004 Pain[T.Yes] 3.424 1.951 1.755 0.079 -0.399 7.248 Illness[T.Slightly Off] 4.863 1.910 2.547 0.011 1.120 8.606 Illness[T.Yes] -10.834 5.806 -1.866 0.062 -22.215 0.546 Menstruation[T.Yes] -0.170 1.288 -0.132 0.895 -2.695 2.355 Nutrition[T.Okay] 4.690 1.452 3.230 0.001 1.844 7.537 NutritionAdjustment[T.No] 6.456 4.876 1.324 0.185 -3.100 16.013 NutritionAdjustment[T.Yes] 8.217 3.508 2.342 0.019 1.341 15.092 Duration 0.026 0.042 0.626 0.531 -0.056 0.109 RPE 0.419 0.345 1.215 0.224 -0.257 1.095 SessionLoad -0.014 0.007 -1.926 0.054 -0.028 0.000 DailyLoad 0.009 0.002 3.813 0.000 0.004 0.013 AcuteLoad -0.034 0.016 -2.043 0.041 -0.066 -0.001 ChronicLoad 0.000 0.013 0.010 0.992 -0.025 0.026 AcuteChronicRatio 6.841 5.802 1.179 0.238 -4.532 18.213 PointsDiff 0.125 0.039 3.182 0.001 0.048 0.202 EWMScore -0.412 0.244 -1.685 0.092 -0.891 0.067 Group Var 79.990 4.328 =============================================================================
As we can see, ChronicLoad has p-value 0.992, which is significantly larger than 0.1, we will drop it for next model. Keep in mind, the REML is about 58.6.
md1 = smf.mixedlm("PerformanceScore ~ SessionType + Duration + RPE + SessionLoad + \
DailyLoad + AcuteLoad + AcuteChronicRatio + Load_Status + \
Outcome + PointsDiff + Pain + Illness + Menstruation + \
Nutrition + NutritionAdjustment + EWMScore", df,groups = df["PlayerID"])
mdf = md1.fit()
print(mdf.summary())
Mixed Linear Model Regression Results ============================================================================= Model: MixedLM Dependent Variable: PerformanceScore No. Observations: 562 Method: REML No. Groups: 17 Scale: 58.4976 Min. group size: 3 Likelihood: -1953.0431 Max. group size: 80 Converged: Yes Mean group size: 33.1 ----------------------------------------------------------------------------- Coef. Std.Err. z P>|z| [0.025 0.975] ----------------------------------------------------------------------------- Intercept 50.784 8.618 5.893 0.000 33.893 67.674 SessionType[T.Game] -0.500 2.380 -0.210 0.834 -5.165 4.165 SessionType[T.Mobility/Recovery] 0.334 2.765 0.121 0.904 -5.086 5.754 SessionType[T.Skills] 1.832 2.462 0.744 0.457 -2.995 6.658 SessionType[T.Speed] 2.501 3.169 0.789 0.430 -3.710 8.711 SessionType[T.Strength] 5.038 3.119 1.615 0.106 -1.076 11.152 Load_Status[T.normal] -6.476 2.471 -2.621 0.009 -11.320 -1.633 Load_Status[T.recovering] -4.471 3.039 -1.471 0.141 -10.428 1.485 Outcome[T.W] -2.591 1.318 -1.966 0.049 -5.174 -0.008 Pain[T.Yes] 3.424 1.942 1.763 0.078 -0.382 7.229 Illness[T.Slightly Off] 4.862 1.892 2.570 0.010 1.154 8.569 Illness[T.Yes] -10.838 5.797 -1.870 0.062 -22.200 0.524 Menstruation[T.Yes] -0.169 1.286 -0.132 0.895 -2.690 2.351 Nutrition[T.Okay] 4.690 1.449 3.237 0.001 1.850 7.529 NutritionAdjustment[T.No] 6.454 4.870 1.325 0.185 -3.090 15.999 NutritionAdjustment[T.Yes] 8.216 3.504 2.345 0.019 1.348 15.084 Duration 0.026 0.042 0.629 0.529 -0.056 0.108 RPE 0.419 0.344 1.218 0.223 -0.256 1.094 SessionLoad -0.014 0.007 -1.928 0.054 -0.028 0.000 DailyLoad 0.009 0.002 3.820 0.000 0.004 0.013 AcuteLoad -0.034 0.008 -4.386 0.000 -0.049 -0.019 AcuteChronicRatio 6.795 3.263 2.082 0.037 0.399 13.190 PointsDiff 0.125 0.039 3.186 0.001 0.048 0.202 EWMScore -0.411 0.223 -1.844 0.065 -0.848 0.026 Group Var 79.628 4.269 =============================================================================
As we can see, SessionType has p-value 0.904, which is significantly larger than 0.1, also all levels are insignificant, we will drop it for next model. Keep in mind, the REML is about 58.5.
md2 = smf.mixedlm("PerformanceScore ~ Duration + RPE + SessionLoad + \
DailyLoad + AcuteLoad + AcuteChronicRatio + Load_Status + \
Outcome + PointsDiff + Pain + Illness + Menstruation + \
Nutrition + NutritionAdjustment + EWMScore", df,groups = df["PlayerID"])
mdf = md2.fit()
print(mdf.summary())
Mixed Linear Model Regression Results ======================================================================= Model: MixedLM Dependent Variable: PerformanceScore No. Observations: 562 Method: REML No. Groups: 17 Scale: 58.6317 Min. group size: 3 Likelihood: -1963.8942 Max. group size: 80 Converged: Yes Mean group size: 33.1 ----------------------------------------------------------------------- Coef. Std.Err. z P>|z| [0.025 0.975] ----------------------------------------------------------------------- Intercept 52.713 8.434 6.250 0.000 36.182 69.244 Load_Status[T.normal] -6.701 2.449 -2.737 0.006 -11.500 -1.902 Load_Status[T.recovering] -4.881 2.982 -1.637 0.102 -10.726 0.963 Outcome[T.W] -2.659 1.315 -2.022 0.043 -5.236 -0.082 Pain[T.Yes] 3.811 1.871 2.037 0.042 0.144 7.478 Illness[T.Slightly Off] 5.203 1.886 2.759 0.006 1.507 8.900 Illness[T.Yes] -10.215 5.790 -1.764 0.078 -21.564 1.134 Menstruation[T.Yes] -0.157 1.284 -0.122 0.903 -2.673 2.359 Nutrition[T.Okay] 5.078 1.417 3.584 0.000 2.301 7.855 NutritionAdjustment[T.No] 6.560 4.851 1.352 0.176 -2.948 16.068 NutritionAdjustment[T.Yes] 8.765 3.495 2.508 0.012 1.914 15.615 Duration 0.058 0.037 1.544 0.123 -0.016 0.131 RPE 0.328 0.255 1.288 0.198 -0.171 0.828 SessionLoad -0.014 0.007 -1.908 0.056 -0.028 0.000 DailyLoad 0.008 0.002 3.532 0.000 0.003 0.012 AcuteLoad -0.031 0.007 -4.137 0.000 -0.046 -0.016 AcuteChronicRatio 5.729 3.176 1.803 0.071 -0.497 11.954 PointsDiff 0.126 0.039 3.219 0.001 0.049 0.203 EWMScore -0.471 0.219 -2.151 0.031 -0.901 -0.042 Group Var 77.071 4.148 =======================================================================
As we can see, Menstruation has p-value 0.903, which is significantly larger than 0.1, also all levels are insignificant, we will drop it for next model. The REML is about 58.6.
md3 = smf.mixedlm("PerformanceScore ~ Duration + RPE + SessionLoad + \
DailyLoad + AcuteLoad + AcuteChronicRatio + Load_Status + \
Outcome + PointsDiff + Pain + Illness +\
Nutrition + NutritionAdjustment + EWMScore", df,groups = df["PlayerID"])
mdf = md3.fit()
print(mdf.summary())
Mixed Linear Model Regression Results ======================================================================= Model: MixedLM Dependent Variable: PerformanceScore No. Observations: 562 Method: REML No. Groups: 17 Scale: 58.5248 Min. group size: 3 Likelihood: -1965.0699 Max. group size: 80 Converged: Yes Mean group size: 33.1 ----------------------------------------------------------------------- Coef. Std.Err. z P>|z| [0.025 0.975] ----------------------------------------------------------------------- Intercept 52.513 8.271 6.349 0.000 36.301 68.724 Load_Status[T.normal] -6.693 2.445 -2.737 0.006 -11.486 -1.900 Load_Status[T.recovering] -4.864 2.976 -1.635 0.102 -10.696 0.968 Outcome[T.W] -2.674 1.308 -2.045 0.041 -5.237 -0.111 Pain[T.Yes] 3.800 1.867 2.035 0.042 0.140 7.459 Illness[T.Slightly Off] 5.272 1.798 2.933 0.003 1.749 8.796 Illness[T.Yes] -10.191 5.782 -1.763 0.078 -21.524 1.141 Nutrition[T.Okay] 5.061 1.409 3.593 0.000 2.300 7.822 NutritionAdjustment[T.No] 6.614 4.827 1.370 0.171 -2.846 16.075 NutritionAdjustment[T.Yes] 8.793 3.485 2.524 0.012 1.964 15.623 Duration 0.058 0.037 1.548 0.122 -0.015 0.131 RPE 0.330 0.254 1.297 0.195 -0.169 0.829 SessionLoad -0.014 0.007 -1.913 0.056 -0.028 0.000 DailyLoad 0.008 0.002 3.539 0.000 0.003 0.012 AcuteLoad -0.031 0.007 -4.234 0.000 -0.045 -0.016 AcuteChronicRatio 5.686 3.154 1.803 0.071 -0.497 11.868 PointsDiff 0.127 0.039 3.258 0.001 0.051 0.203 EWMScore -0.467 0.216 -2.161 0.031 -0.891 -0.043 Group Var 76.964 4.146 =======================================================================
As we can see, RPE has p-value 0.195, which is larger than 0.1, we will drop it for next model. The REML is about 58.5.
md4 = smf.mixedlm("PerformanceScore ~ Duration + SessionLoad +DailyLoad + AcuteLoad + \
AcuteChronicRatio + Load_Status + Outcome + PointsDiff + Pain + Illness +\
Nutrition + NutritionAdjustment + EWMScore", df,groups = df["PlayerID"])
mdf = md4.fit()
print(mdf.summary())
Mixed Linear Model Regression Results ======================================================================= Model: MixedLM Dependent Variable: PerformanceScore No. Observations: 562 Method: REML No. Groups: 17 Scale: 58.6464 Min. group size: 3 Likelihood: -1965.4590 Max. group size: 80 Converged: Yes Mean group size: 33.1 ----------------------------------------------------------------------- Coef. Std.Err. z P>|z| [0.025 0.975] ----------------------------------------------------------------------- Intercept 53.926 8.193 6.582 0.000 37.869 69.983 Load_Status[T.normal] -6.552 2.445 -2.680 0.007 -11.344 -1.760 Load_Status[T.recovering] -4.635 2.973 -1.559 0.119 -10.461 1.191 Outcome[T.W] -2.662 1.309 -2.034 0.042 -5.228 -0.097 Pain[T.Yes] 3.725 1.868 1.994 0.046 0.063 7.386 Illness[T.Slightly Off] 5.373 1.798 2.989 0.003 1.850 8.896 Illness[T.Yes] -10.085 5.787 -1.743 0.081 -21.427 1.257 Nutrition[T.Okay] 4.882 1.401 3.483 0.000 2.135 7.629 NutritionAdjustment[T.No] 6.457 4.825 1.338 0.181 -2.999 15.914 NutritionAdjustment[T.Yes] 8.640 3.486 2.479 0.013 1.808 15.471 Duration 0.023 0.026 0.880 0.379 -0.028 0.073 SessionLoad -0.006 0.004 -1.464 0.143 -0.014 0.002 DailyLoad 0.007 0.002 3.342 0.001 0.003 0.011 AcuteLoad -0.029 0.007 -4.046 0.000 -0.043 -0.015 AcuteChronicRatio 5.601 3.157 1.774 0.076 -0.586 11.787 PointsDiff 0.127 0.039 3.262 0.001 0.051 0.204 EWMScore -0.466 0.216 -2.154 0.031 -0.889 -0.042 Group Var 74.825 4.029 =======================================================================
As we can see, Duration has p-value 0.379, which is larger than 0.1, we will drop it for next model. The REML is about 58.6.
md5 = smf.mixedlm("PerformanceScore ~ SessionLoad +DailyLoad + AcuteLoad + \
AcuteChronicRatio + Load_Status + Outcome + PointsDiff + Pain + Illness +\
Nutrition + NutritionAdjustment + EWMScore", df,groups = df["PlayerID"])
mdf = md5.fit()
print(mdf.summary())
Mixed Linear Model Regression Results ====================================================================== Model: MixedLM Dependent Variable: PerformanceScore No. Observations: 562 Method: REML No. Groups: 17 Scale: 58.6255 Min. group size: 3 Likelihood: -1963.0989 Max. group size: 80 Converged: Yes Mean group size: 33.1 ---------------------------------------------------------------------- Coef. Std.Err. z P>|z| [0.025 0.975] ---------------------------------------------------------------------- Intercept 55.145 8.072 6.832 0.000 39.324 70.966 Load_Status[T.normal] -6.703 2.438 -2.749 0.006 -11.482 -1.924 Load_Status[T.recovering] -4.845 2.963 -1.635 0.102 -10.652 0.961 Outcome[T.W] -2.631 1.308 -2.011 0.044 -5.195 -0.066 Pain[T.Yes] 3.778 1.867 2.023 0.043 0.119 7.436 Illness[T.Slightly Off] 5.426 1.796 3.021 0.003 1.905 8.946 Illness[T.Yes] -9.987 5.785 -1.726 0.084 -21.325 1.351 Nutrition[T.Okay] 4.922 1.401 3.514 0.000 2.177 7.668 NutritionAdjustment[T.No] 6.252 4.817 1.298 0.194 -3.190 15.694 NutritionAdjustment[T.Yes] 8.622 3.485 2.474 0.013 1.792 15.453 SessionLoad -0.003 0.002 -1.322 0.186 -0.008 0.001 DailyLoad 0.007 0.002 3.240 0.001 0.003 0.011 AcuteLoad -0.028 0.007 -3.983 0.000 -0.042 -0.014 AcuteChronicRatio 5.308 3.138 1.691 0.091 -0.843 11.459 PointsDiff 0.129 0.039 3.298 0.001 0.052 0.205 EWMScore -0.483 0.215 -2.243 0.025 -0.904 -0.061 Group Var 74.643 4.026 ======================================================================
As we can see, Sessionload has p-value 0.186, which is larger than 0.1, we will drop it for next model. The REML is about 58.6.
md6 = smf.mixedlm("PerformanceScore ~ DailyLoad + AcuteLoad + \
AcuteChronicRatio + Load_Status + Outcome + PointsDiff + Pain + Illness +\
Nutrition + NutritionAdjustment + EWMScore", df,groups = df["PlayerID"])
mdf = md6.fit()
print(mdf.summary())
Mixed Linear Model Regression Results ======================================================================= Model: MixedLM Dependent Variable: PerformanceScore No. Observations: 562 Method: REML No. Groups: 17 Scale: 58.7169 Min. group size: 3 Likelihood: -1958.8323 Max. group size: 80 Converged: Yes Mean group size: 33.1 ----------------------------------------------------------------------- Coef. Std.Err. z P>|z| [0.025 0.975] ----------------------------------------------------------------------- Intercept 54.331 8.054 6.746 0.000 38.546 70.116 Load_Status[T.normal] -6.640 2.440 -2.722 0.006 -11.422 -1.859 Load_Status[T.recovering] -4.826 2.965 -1.628 0.104 -10.636 0.984 Outcome[T.W] -2.610 1.309 -1.994 0.046 -5.176 -0.044 Pain[T.Yes] 3.772 1.868 2.019 0.044 0.110 7.434 Illness[T.Slightly Off] 5.420 1.797 3.016 0.003 1.898 8.943 Illness[T.Yes] -10.046 5.789 -1.735 0.083 -21.392 1.301 Nutrition[T.Okay] 4.800 1.398 3.433 0.001 2.060 7.541 NutritionAdjustment[T.No] 5.764 4.806 1.199 0.230 -3.655 15.183 NutritionAdjustment[T.Yes] 8.274 3.478 2.379 0.017 1.458 15.090 DailyLoad 0.006 0.002 3.060 0.002 0.002 0.010 AcuteLoad -0.028 0.007 -3.953 0.000 -0.042 -0.014 AcuteChronicRatio 5.410 3.140 1.723 0.085 -0.743 11.564 PointsDiff 0.125 0.039 3.211 0.001 0.049 0.201 EWMScore -0.459 0.215 -2.139 0.032 -0.880 -0.039 Group Var 74.236 4.005 =======================================================================
Up until here, all p-values are less than 0.1, or at least some levels in categorical variable are significant. We will stop here, and use it as our final model.