The original athlete wellness dataset consists of 19 columns. Our aim for this dataset is to derive one or more quantitative summary measures. For a complete description of each variable, please refer to the CodeBook.
We'll first examine the dataset in depth, looking to merge highly correlated measures or derive new measures by transforming existing ones. A summary of the wellness dataset follows:
import numpy as np
import pandas as pd
from factor_analyzer import FactorAnalyzer
import matplotlib.pyplot as plt
import seaborn as sns
dfWell = pd.read_csv('https://www.dropbox.com/s/170bc3dimgn8ru8/wellness.csv?dl=1')
dfWell.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5011 entries, 0 to 5010
Data columns (total 19 columns):
Date                   5011 non-null object
PlayerID               5011 non-null int64
Fatigue                5011 non-null int64
Soreness               5011 non-null int64
Desire                 5011 non-null int64
Irritability           5011 non-null int64
BedTime                5011 non-null object
WakeTime               5011 non-null object
SleepHours             5011 non-null float64
SleepQuality           5011 non-null int64
MonitoringScore        5011 non-null int64
Pain                   5011 non-null object
Illness                5011 non-null object
Menstruation           4995 non-null object
Nutrition              4174 non-null object
NutritionAdjustment    4266 non-null object
USGMeasurement         4843 non-null object
USG                    629 non-null float64
TrainingReadiness      5011 non-null object
dtypes: float64(2), int64(7), object(10)
memory usage: 743.9+ KB
As we can see, there are some null values in our dataset. After close inspection, we decided to fill the null values of Menstruation, Nutrition and NutritionAdjustment with their modes, and to drop the USG column, which is roughly 87% null, along with its companion indicator USGMeasurement. The resulting dataset looks as follows:
# As Menstruation, Nutrition and NutritionAdjustment are categorical, fill NA with their modes.
# Assigning back to the column avoids the chained-assignment pitfall of fillna(inplace=True)
dfWell['Menstruation'] = dfWell['Menstruation'].fillna(dfWell['Menstruation'].mode()[0])
dfWell['Nutrition'] = dfWell['Nutrition'].fillna(dfWell['Nutrition'].mode()[0])
dfWell['NutritionAdjustment'] = dfWell['NutritionAdjustment'].fillna(dfWell['NutritionAdjustment'].mode()[0])
dfWell.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5011 entries, 0 to 5010
Data columns (total 19 columns):
Date                   5011 non-null object
PlayerID               5011 non-null int64
Fatigue                5011 non-null int64
Soreness               5011 non-null int64
Desire                 5011 non-null int64
Irritability           5011 non-null int64
BedTime                5011 non-null object
WakeTime               5011 non-null object
SleepHours             5011 non-null float64
SleepQuality           5011 non-null int64
MonitoringScore        5011 non-null int64
Pain                   5011 non-null object
Illness                5011 non-null object
Menstruation           5011 non-null object
Nutrition              5011 non-null object
NutritionAdjustment    5011 non-null object
USGMeasurement         4843 non-null object
USG                    629 non-null float64
TrainingReadiness      5011 non-null object
dtypes: float64(2), int64(7), object(10)
memory usage: 743.9+ KB
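Mode imputation can be sanity-checked on a toy Series (the values here are illustrative, not drawn from the athletes' data):

```python
import pandas as pd

# "Yes" appears most often, so it is the mode and fills the missing entry
s = pd.Series(["Yes", "No", None, "Yes"])
filled = s.fillna(s.mode()[0])
print(filled.tolist())  # ['Yes', 'No', 'Yes', 'Yes']
```

Note that `Series.mode()` returns a Series (there can be ties), which is why we take element `[0]`.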
# As USG is almost 87% NA, it doesn't make sense to fill it with random or mean values.
# Hence, drop the USG-related columns
dfWell = dfWell.drop(['USGMeasurement', 'USG'], axis=1)
dfWell.head()
| | Date | PlayerID | Fatigue | Soreness | Desire | Irritability | BedTime | WakeTime | SleepHours | SleepQuality | MonitoringScore | Pain | Illness | Menstruation | Nutrition | NutritionAdjustment | TrainingReadiness |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2018-07-21 | 1 | 3 | 3 | 2 | 3 | 23:00:00 | 07:00:00 | 8.00 | 2 | 13 | No | No | Yes | Excellent | Yes | 0% |
| 1 | 2018-07-21 | 2 | 4 | 3 | 4 | 4 | 23:00:00 | 07:00:00 | 8.00 | 4 | 19 | Yes | No | Yes | Excellent | Yes | 0% |
| 2 | 2018-07-21 | 3 | 3 | 3 | 5 | 4 | 22:30:00 | 06:30:00 | 8.00 | 4 | 19 | No | No | No | Excellent | Yes | 100% |
| 3 | 2018-07-21 | 4 | 2 | 3 | 5 | 4 | 00:30:00 | 07:00:00 | 6.50 | 1 | 15 | No | No | Yes | Excellent | Yes | 95% |
| 4 | 2018-07-21 | 5 | 5 | 3 | 4 | 4 | 23:45:00 | 07:00:00 | 7.25 | 4 | 20 | No | No | No | Okay | Yes | 100% |
Next, we drop BedTime and WakeTime, as this information is already captured by SleepHours. We also convert TrainingReadiness and SleepHours to a 1-to-7 scale to match the other scales that make up the MonitoringScore.
# We do not require Bed Time and Wake Time as well. We'll be using SleepHours instead
dfWell = dfWell.drop(['BedTime', 'WakeTime'], axis=1)
# Convert TrainingReadiness from a percentage string to the 1-to-7 scale
dfWell['TrainingReadiness'] = 1 + (dfWell['TrainingReadiness'].str.rstrip('%').astype('float') / 100.0 * 6)
# Use MinMaxScaler to rescale SleepHours to a number from 1 to 7
from sklearn import preprocessing
min_max_scaler = preprocessing.MinMaxScaler()
dfWell["SleepHoursScaled"] = 1 + (min_max_scaler.fit_transform(dfWell[["SleepHours"]]) * 6)
dfWell
| | Date | PlayerID | Fatigue | Soreness | Desire | Irritability | SleepHours | SleepQuality | MonitoringScore | Pain | Illness | Menstruation | Nutrition | NutritionAdjustment | TrainingReadiness | SleepHoursScaled |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2018-07-21 | 1 | 3 | 3 | 2 | 3 | 8.00 | 2 | 13 | No | No | Yes | Excellent | Yes | 1.0 | 3.000000 |
| 1 | 2018-07-21 | 2 | 4 | 3 | 4 | 4 | 8.00 | 4 | 19 | Yes | No | Yes | Excellent | Yes | 1.0 | 3.000000 |
| 2 | 2018-07-21 | 3 | 3 | 3 | 5 | 4 | 8.00 | 4 | 19 | No | No | No | Excellent | Yes | 7.0 | 3.000000 |
| 3 | 2018-07-21 | 4 | 2 | 3 | 5 | 4 | 6.50 | 1 | 15 | No | No | Yes | Excellent | Yes | 6.7 | 2.586207 |
| 4 | 2018-07-21 | 5 | 5 | 3 | 4 | 4 | 7.25 | 4 | 20 | No | No | No | Okay | Yes | 7.0 | 2.793103 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5006 | 2017-08-01 | 8 | 6 | 7 | 7 | 7 | 6.50 | 6 | 33 | No | No | No | Poor | No | 4.0 | 2.586207 |
| 5007 | 2017-08-01 | 10 | 3 | 4 | 4 | 4 | 6.75 | 3 | 18 | No | No | No | Excellent | I Don't Know | 6.7 | 2.655172 |
| 5008 | 2017-08-01 | 12 | 5 | 4 | 5 | 4 | 7.50 | 3 | 21 | No | No | Yes | Okay | I Don't Know | 6.4 | 2.862069 |
| 5009 | 2017-08-01 | 13 | 4 | 4 | 4 | 4 | 7.50 | 4 | 20 | No | No | No | Okay | Yes | 7.0 | 2.862069 |
| 5010 | 2017-08-01 | 14 | 4 | 4 | 5 | 5 | 8.00 | 6 | 24 | No | No | No | Okay | Yes | 7.0 | 3.000000 |

5011 rows × 16 columns
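The two rescalings can be checked quickly on toy values (these are illustrative, not the real column values): TrainingReadiness maps 0% to 1 and 100% to 7, and the min-max scaling of SleepHours maps the observed minimum to 1 and the observed maximum to 7.

```python
import pandas as pd

# TrainingReadiness: strip the "%" and map the 0-1 fraction onto the 1-7 scale
readiness = pd.Series(["0%", "50%", "100%"])
readiness_scaled = 1 + readiness.str.rstrip('%').astype(float) / 100.0 * 6
print(readiness_scaled.tolist())  # [1.0, 4.0, 7.0]

# SleepHours: min-max scaling to 1-7 is equivalent to
# 1 + 6 * (x - x.min()) / (x.max() - x.min())
hours = pd.Series([4.0, 6.5, 8.0])
hours_scaled = 1 + 6 * (hours - hours.min()) / (hours.max() - hours.min())
print(hours_scaled.tolist())  # [1.0, 4.75, 7.0]
```

One caveat of the min-max approach: SleepHoursScaled depends on the sample's extremes, so the same 8.00 hours would scale differently if the dataset's minimum or maximum sleep changed.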
We merge the MonitoringScore with TrainingReadiness and SleepHoursScaled into a single new measure called the WellnessScore. Our dataset now looks much cleaner and more concise:
# Create a WellnessScore which is the sum of all the quantitative values
dfWell["WellnessScore"] = dfWell["MonitoringScore"] + dfWell["TrainingReadiness"] + dfWell["SleepHoursScaled"]
# Drop all the quantitative values as we'll be using only the WellnessScore from now on
dfWell = dfWell.drop(['Fatigue','Soreness','Desire','Irritability','SleepHours','SleepQuality','MonitoringScore',\
'TrainingReadiness','SleepHoursScaled'],axis=1)
dfWell.head()
| | Date | PlayerID | Pain | Illness | Menstruation | Nutrition | NutritionAdjustment | WellnessScore |
|---|---|---|---|---|---|---|---|---|
| 0 | 2018-07-21 | 1 | No | No | Yes | Excellent | Yes | 17.000000 |
| 1 | 2018-07-21 | 2 | Yes | No | Yes | Excellent | Yes | 23.000000 |
| 2 | 2018-07-21 | 3 | No | No | No | Excellent | Yes | 29.000000 |
| 3 | 2018-07-21 | 4 | No | No | Yes | Excellent | Yes | 24.286207 |
| 4 | 2018-07-21 | 5 | No | No | No | Okay | Yes | 29.793103 |
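As a sanity check, the first row's WellnessScore can be reproduced by hand from the tables above: a MonitoringScore of 13, a TrainingReadiness of 0% (which maps to 1.0 on the 1-to-7 scale), and 8.00 hours of sleep (which scaled to 3.0).

```python
# Row 0 values, taken from the tables above
monitoring_score = 13
training_readiness = 1.0    # "0%" mapped onto the 1-7 scale
sleep_hours_scaled = 3.0    # 8.00 hours after min-max scaling
wellness_score = monitoring_score + training_readiness + sleep_hours_scaled
print(wellness_score)  # 17.0
```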
But the data we have here isn't completely independent. Intuitively, doesn't our wellness today depend on how we felt yesterday, or even a week ago? So we define a new score, EWMScore, by applying an exponential moving average to the WellnessScore. Each player's EWMScore depends on roughly a week of WellnessScores, with exponentially decreasing weights as we move further back in time. A sample of the resulting rows looks like this:
# Create an exponential moving average with a 7-day span for each player
dfWell = dfWell.sort_values(by='Date')
dfWell['EWMScore'] = dfWell.groupby('PlayerID')['WellnessScore'].transform(lambda x: x.ewm(span=7 , min_periods=1).mean())
dfWell.reset_index(drop=True, inplace=True)
dfWell
| | Date | PlayerID | Pain | Illness | Menstruation | Nutrition | NutritionAdjustment | WellnessScore | EWMScore |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2017-08-01 | 14 | No | No | No | Okay | Yes | 34.000000 | 34.000000 |
| 1 | 2017-08-01 | 2 | No | No | No | Excellent | Yes | 35.824138 | 35.824138 |
| 2 | 2017-08-01 | 3 | No | No | No | Excellent | Yes | 30.000000 | 30.000000 |
| 3 | 2017-08-01 | 5 | No | No | No | Excellent | Yes | 31.596552 | 31.596552 |
| 4 | 2017-08-01 | 13 | No | No | No | Okay | Yes | 29.862069 | 29.862069 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5006 | 2018-07-21 | 14 | No | No | No | Excellent | Yes | 28.206897 | 28.557073 |
| 5007 | 2018-07-21 | 15 | No | No | No | Excellent | Yes | 34.837931 | 35.176056 |
| 5008 | 2018-07-21 | 16 | No | No | No | Okay | Yes | 25.000000 | 26.956113 |
| 5009 | 2018-07-21 | 10 | No | Slightly Off | No | Excellent | Yes | 22.993103 | 24.044602 |
| 5010 | 2018-07-21 | 1 | No | No | Yes | Excellent | Yes | 17.000000 | 20.872639 |

5011 rows × 9 columns
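The smoothing behaviour of `ewm(span=7)` can be illustrated on a short toy series (the scores below are made up): pandas derives `alpha = 2 / (span + 1) = 0.25` from the span, and with the default `adjust=True` an observation `k` steps back carries weight `(1 - alpha) ** k`, normalized over the weights seen so far.

```python
import pandas as pd

scores = pd.Series([34.0, 30.0, 28.0])
smoothed = scores.ewm(span=7, min_periods=1).mean()

# Second value: (0.75 * 34 + 1.0 * 30) / (0.75 + 1.0) ≈ 31.71
print(smoothed.round(2).tolist())  # [34.0, 31.71, 30.11]
```

Note that `ewm` here treats each player's rows as evenly spaced observations, so the "7-day span" is really a 7-observation span; gaps between recorded dates are not accounted for.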
We now have the final cleaned and summarized wellness dataset which we'll be using for our model.
dfWell.to_csv("Clean Data/WellnessMeasure.csv", index = None, header=True)