The original athlete wellness dataset consists of 19 columns. Our aim for this dataset is to derive one or more quantitative summary measures. For a complete description of each variable, please refer to the CodeBook.
We'll first examine the dataset in depth, looking to merge highly correlated measures or derive new measures by transforming existing ones. A summary of the wellness dataset follows:
import numpy as np
import pandas as pd
from factor_analyzer import FactorAnalyzer
import matplotlib.pyplot as plt
import seaborn as sns
dfWell = pd.read_csv('https://www.dropbox.com/s/170bc3dimgn8ru8/wellness.csv?dl=1')
dfWell.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5011 entries, 0 to 5010
Data columns (total 19 columns):
Date                   5011 non-null object
PlayerID               5011 non-null int64
Fatigue                5011 non-null int64
Soreness               5011 non-null int64
Desire                 5011 non-null int64
Irritability           5011 non-null int64
BedTime                5011 non-null object
WakeTime               5011 non-null object
SleepHours             5011 non-null float64
SleepQuality           5011 non-null int64
MonitoringScore        5011 non-null int64
Pain                   5011 non-null object
Illness                5011 non-null object
Menstruation           4995 non-null object
Nutrition              4174 non-null object
NutritionAdjustment    4266 non-null object
USGMeasurement         4843 non-null object
USG                    629 non-null float64
TrainingReadiness      5011 non-null object
dtypes: float64(2), int64(7), object(10)
memory usage: 743.9+ KB
As we can see, there are some null values in our dataset. After close inspection, we decided to fill the null values of Menstruation, Nutrition and NutritionAdjustment with their modes, and to drop the USG column, which is roughly 87% null, along with its companion indicator USGMeasurement. The resulting dataset looks as follows:
# As Menstruation, Nutrition and NutritionAdjustment are categorical, fill NA with their modes.
# Assigning back to the column avoids the chained-assignment pitfall of fillna(inplace=True)
dfWell['Menstruation'] = dfWell['Menstruation'].fillna(dfWell['Menstruation'].mode()[0])
dfWell['Nutrition'] = dfWell['Nutrition'].fillna(dfWell['Nutrition'].mode()[0])
dfWell['NutritionAdjustment'] = dfWell['NutritionAdjustment'].fillna(dfWell['NutritionAdjustment'].mode()[0])
dfWell.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5011 entries, 0 to 5010
Data columns (total 19 columns):
Date                   5011 non-null object
PlayerID               5011 non-null int64
Fatigue                5011 non-null int64
Soreness               5011 non-null int64
Desire                 5011 non-null int64
Irritability           5011 non-null int64
BedTime                5011 non-null object
WakeTime               5011 non-null object
SleepHours             5011 non-null float64
SleepQuality           5011 non-null int64
MonitoringScore        5011 non-null int64
Pain                   5011 non-null object
Illness                5011 non-null object
Menstruation           5011 non-null object
Nutrition              5011 non-null object
NutritionAdjustment    5011 non-null object
USGMeasurement         4843 non-null object
USG                    629 non-null float64
TrainingReadiness      5011 non-null object
dtypes: float64(2), int64(7), object(10)
memory usage: 743.9+ KB
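Mode imputation can be sanity-checked on a toy Series (the values here are illustrative, not drawn from the athletes' data):

```python
import pandas as pd

# "Yes" appears most often, so it is the mode and fills the missing entry
s = pd.Series(["Yes", "No", None, "Yes"])
filled = s.fillna(s.mode()[0])
print(filled.tolist())  # ['Yes', 'No', 'Yes', 'Yes']
```

Note that `Series.mode()` returns a Series (there can be ties), which is why we take element `[0]`.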
# As USG is almost 87% NA, it doesn't make sense to fill it with random or mean values.
# Hence, drop the USG-related columns
dfWell = dfWell.drop(['USGMeasurement', 'USG'], axis=1)
dfWell.head()
| | Date | PlayerID | Fatigue | Soreness | Desire | Irritability | BedTime | WakeTime | SleepHours | SleepQuality | MonitoringScore | Pain | Illness | Menstruation | Nutrition | NutritionAdjustment | TrainingReadiness |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2018-07-21 | 1 | 3 | 3 | 2 | 3 | 23:00:00 | 07:00:00 | 8.00 | 2 | 13 | No | No | Yes | Excellent | Yes | 0% |
| 1 | 2018-07-21 | 2 | 4 | 3 | 4 | 4 | 23:00:00 | 07:00:00 | 8.00 | 4 | 19 | Yes | No | Yes | Excellent | Yes | 0% |
| 2 | 2018-07-21 | 3 | 3 | 3 | 5 | 4 | 22:30:00 | 06:30:00 | 8.00 | 4 | 19 | No | No | No | Excellent | Yes | 100% |
| 3 | 2018-07-21 | 4 | 2 | 3 | 5 | 4 | 00:30:00 | 07:00:00 | 6.50 | 1 | 15 | No | No | Yes | Excellent | Yes | 95% |
| 4 | 2018-07-21 | 5 | 5 | 3 | 4 | 4 | 23:45:00 | 07:00:00 | 7.25 | 4 | 20 | No | No | No | Okay | Yes | 100% |
Next, we drop BedTime and WakeTime, as this information is already captured by SleepHours. We also convert TrainingReadiness and SleepHours to a 1-to-7 scale to match the other scales that make up the MonitoringScore.
# We do not require Bed Time and Wake Time as well. We'll be using SleepHours instead
dfWell = dfWell.drop(['BedTime', 'WakeTime'], axis=1)
# Convert TrainingReadiness from a percentage string to the 1-to-7 scale
dfWell['TrainingReadiness'] = 1 + (dfWell['TrainingReadiness'].str.rstrip('%').astype('float') / 100.0 * 6)
# Use MinMaxScaler to rescale SleepHours to a number from 1 to 7
from sklearn import preprocessing
min_max_scaler = preprocessing.MinMaxScaler()
dfWell["SleepHoursScaled"] = 1 + (min_max_scaler.fit_transform(dfWell[["SleepHours"]]) * 6)
dfWell
| | Date | PlayerID | Fatigue | Soreness | Desire | Irritability | SleepHours | SleepQuality | MonitoringScore | Pain | Illness | Menstruation | Nutrition | NutritionAdjustment | TrainingReadiness | SleepHoursScaled |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2018-07-21 | 1 | 3 | 3 | 2 | 3 | 8.00 | 2 | 13 | No | No | Yes | Excellent | Yes | 1.0 | 3.000000 |
| 1 | 2018-07-21 | 2 | 4 | 3 | 4 | 4 | 8.00 | 4 | 19 | Yes | No | Yes | Excellent | Yes | 1.0 | 3.000000 |
| 2 | 2018-07-21 | 3 | 3 | 3 | 5 | 4 | 8.00 | 4 | 19 | No | No | No | Excellent | Yes | 7.0 | 3.000000 |
| 3 | 2018-07-21 | 4 | 2 | 3 | 5 | 4 | 6.50 | 1 | 15 | No | No | Yes | Excellent | Yes | 6.7 | 2.586207 |
| 4 | 2018-07-21 | 5 | 5 | 3 | 4 | 4 | 7.25 | 4 | 20 | No | No | No | Okay | Yes | 7.0 | 2.793103 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5006 | 2017-08-01 | 8 | 6 | 7 | 7 | 7 | 6.50 | 6 | 33 | No | No | No | Poor | No | 4.0 | 2.586207 |
| 5007 | 2017-08-01 | 10 | 3 | 4 | 4 | 4 | 6.75 | 3 | 18 | No | No | No | Excellent | I Don't Know | 6.7 | 2.655172 |
| 5008 | 2017-08-01 | 12 | 5 | 4 | 5 | 4 | 7.50 | 3 | 21 | No | No | Yes | Okay | I Don't Know | 6.4 | 2.862069 |
| 5009 | 2017-08-01 | 13 | 4 | 4 | 4 | 4 | 7.50 | 4 | 20 | No | No | No | Okay | Yes | 7.0 | 2.862069 |
| 5010 | 2017-08-01 | 14 | 4 | 4 | 5 | 5 | 8.00 | 6 | 24 | No | No | No | Okay | Yes | 7.0 | 3.000000 |

5011 rows × 16 columns
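The two rescalings can be checked quickly on toy values (these are illustrative, not the real column values): TrainingReadiness maps 0% to 1 and 100% to 7, and the min-max scaling of SleepHours maps the observed minimum to 1 and the observed maximum to 7.

```python
import pandas as pd

# TrainingReadiness: strip the "%" and map the 0-1 fraction onto the 1-7 scale
readiness = pd.Series(["0%", "50%", "100%"])
readiness_scaled = 1 + readiness.str.rstrip('%').astype(float) / 100.0 * 6
print(readiness_scaled.tolist())  # [1.0, 4.0, 7.0]

# SleepHours: min-max scaling to 1-7 is equivalent to
# 1 + 6 * (x - x.min()) / (x.max() - x.min())
hours = pd.Series([4.0, 6.5, 8.0])
hours_scaled = 1 + 6 * (hours - hours.min()) / (hours.max() - hours.min())
print(hours_scaled.tolist())  # [1.0, 4.75, 7.0]
```

One caveat of the min-max approach: SleepHoursScaled depends on the sample's extremes, so the same 8.00 hours would scale differently if the dataset's minimum or maximum sleep changed.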
We merge the MonitoringScore with TrainingReadiness and SleepHoursScaled into a single new measure called the WellnessScore. Our dataset now looks much cleaner and more concise:
# Create a WellnessScore which is the sum of all the quantitative values
dfWell["WellnessScore"] = dfWell["MonitoringScore"] + dfWell["TrainingReadiness"] + dfWell["SleepHoursScaled"]
# Drop all the quantitative values as we'll be using only the WellnessScore from now on
dfWell = dfWell.drop(['Fatigue','Soreness','Desire','Irritability','SleepHours','SleepQuality','MonitoringScore',\
'TrainingReadiness','SleepHoursScaled'],axis=1)
dfWell.head()
| | Date | PlayerID | Pain | Illness | Menstruation | Nutrition | NutritionAdjustment | WellnessScore |
|---|---|---|---|---|---|---|---|---|
| 0 | 2018-07-21 | 1 | No | No | Yes | Excellent | Yes | 17.000000 |
| 1 | 2018-07-21 | 2 | Yes | No | Yes | Excellent | Yes | 23.000000 |
| 2 | 2018-07-21 | 3 | No | No | No | Excellent | Yes | 29.000000 |
| 3 | 2018-07-21 | 4 | No | No | Yes | Excellent | Yes | 24.286207 |
| 4 | 2018-07-21 | 5 | No | No | No | Okay | Yes | 29.793103 |
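As a sanity check, the first row's WellnessScore can be reproduced by hand from the tables above: a MonitoringScore of 13, a TrainingReadiness of 0% (which maps to 1.0 on the 1-to-7 scale), and 8.00 hours of sleep (which scaled to 3.0).

```python
# Row 0 values, taken from the tables above
monitoring_score = 13
training_readiness = 1.0    # "0%" mapped onto the 1-7 scale
sleep_hours_scaled = 3.0    # 8.00 hours after min-max scaling
wellness_score = monitoring_score + training_readiness + sleep_hours_scaled
print(wellness_score)  # 17.0
```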
But the data we have here isn't completely independent. Intuitively, doesn't our wellness today depend on how we felt yesterday, or even a week ago? So we define a new score, EWMScore, by applying an exponential moving average to the WellnessScore. Each player's EWMScore depends on roughly a week of WellnessScores, with exponentially decreasing weights as we move further back in time. A sample of the resulting rows looks like this:
# Create an exponential moving average with a 7-day span for each player
dfWell = dfWell.sort_values(by='Date')
dfWell['EWMScore'] = dfWell.groupby('PlayerID')['WellnessScore'].transform(lambda x: x.ewm(span=7 , min_periods=1).mean())
dfWell.reset_index(drop=True, inplace=True)
dfWell
| | Date | PlayerID | Pain | Illness | Menstruation | Nutrition | NutritionAdjustment | WellnessScore | EWMScore |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2017-08-01 | 14 | No | No | No | Okay | Yes | 34.000000 | 34.000000 |
| 1 | 2017-08-01 | 2 | No | No | No | Excellent | Yes | 35.824138 | 35.824138 |
| 2 | 2017-08-01 | 3 | No | No | No | Excellent | Yes | 30.000000 | 30.000000 |
| 3 | 2017-08-01 | 5 | No | No | No | Excellent | Yes | 31.596552 | 31.596552 |
| 4 | 2017-08-01 | 13 | No | No | No | Okay | Yes | 29.862069 | 29.862069 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5006 | 2018-07-21 | 14 | No | No | No | Excellent | Yes | 28.206897 | 28.557073 |
| 5007 | 2018-07-21 | 15 | No | No | No | Excellent | Yes | 34.837931 | 35.176056 |
| 5008 | 2018-07-21 | 16 | No | No | No | Okay | Yes | 25.000000 | 26.956113 |
| 5009 | 2018-07-21 | 10 | No | Slightly Off | No | Excellent | Yes | 22.993103 | 24.044602 |
| 5010 | 2018-07-21 | 1 | No | No | Yes | Excellent | Yes | 17.000000 | 20.872639 |

5011 rows × 9 columns
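The smoothing behaviour of `ewm(span=7)` can be illustrated on a short toy series (the scores below are made up): pandas derives `alpha = 2 / (span + 1) = 0.25` from the span, and with the default `adjust=True` an observation `k` steps back carries weight `(1 - alpha) ** k`, normalized over the weights seen so far.

```python
import pandas as pd

scores = pd.Series([34.0, 30.0, 28.0])
smoothed = scores.ewm(span=7, min_periods=1).mean()

# Second value: (0.75 * 34 + 1.0 * 30) / (0.75 + 1.0) ≈ 31.71
print(smoothed.round(2).tolist())  # [34.0, 31.71, 30.11]
```

Note that `ewm` here treats each player's rows as evenly spaced observations, so the "7-day span" is really a 7-observation span; gaps between recorded dates are not accounted for.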
We now have the final cleaned and summarized wellness dataset which we'll be using for our model.
dfWell.to_csv("Clean Data/WellnessMeasure.csv", index = None, header=True)