import pandas as pd
df = pd.DataFrame({'ascites': [0, 1, 0, 1],
                   'edema': [0.5, 0, 1, 0.5],
                   'stage': [3, 4, 3, 4],
                   'cholesterol': [200.5, 180.2, 190.5, 210.3]
                   })
df
| | ascites | edema | stage | cholesterol |
|---|---|---|---|---|
| 0 | 0 | 0.5 | 3 | 200.5 |
| 1 | 1 | 0.0 | 4 | 180.2 |
| 2 | 0 | 1.0 | 3 | 190.5 |
| 3 | 1 | 0.5 | 4 | 210.3 |
In this small sample dataset, 'ascites', 'edema', and 'stage' are categorical variables, while 'cholesterol' is a continuous variable, since it can be any decimal value greater than zero.
Of the categorical variables, which one should be one-hot encoded (turned into dummy variables)?
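One way to inform that choice is to count the distinct values in each categorical column and to check which columns are already coded as 0/1. The sketch below does this on the sample data; the names `categorical_cols`, `n_levels`, and `already_binary` are illustrative and not part of the original notebook.

```python
import pandas as pd

df = pd.DataFrame({'ascites': [0, 1, 0, 1],
                   'edema': [0.5, 0, 1, 0.5],
                   'stage': [3, 4, 3, 4],
                   'cholesterol': [200.5, 180.2, 190.5, 210.3]})

# Columns treated as categorical in the text above
categorical_cols = ['ascites', 'edema', 'stage']

# Number of distinct values per categorical column
n_levels = df[categorical_cols].nunique()

# Columns whose values are already only 0 and 1 (indicator form)
already_binary = [c for c in categorical_cols
                  if set(df[c].unique()) <= {0, 1}]

print(n_levels)
print(already_binary)
```

A column that is already a 0/1 indicator gains nothing from one-hot encoding, whereas a column with other codes (such as 3 and 4, or 0, 0.5, and 1) is a candidate for it.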
Let's see what happens when we one-hot encode the 'stage' feature.
We'll use pandas.get_dummies:
df_stage = pd.get_dummies(data=df,
                          columns=['stage']
                          )
df_stage[['stage_3','stage_4']]
| | stage_3 | stage_4 |
|---|---|---|
| 0 | 1 | 0 |
| 1 | 0 | 1 |
| 2 | 1 | 0 |
| 3 | 0 | 1 |
What do you notice about the 'stage_3' and 'stage_4' features?
Given that stage 3 and stage 4 are the only possible values for stage, if you know that patient 0 (row 0) has stage_3 set to 1, what can you say about that same patient's value for the stage_4 feature?
This means that one of the feature columns is actually redundant. We should drop one of these features to avoid multicollinearity (where one feature can predict another feature).
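The redundancy can be checked numerically. This is a minimal sketch that rebuilds only the 'stage' column from the sample data; it passes `dtype=int` so the dummies are plain integers regardless of pandas version, and the names `row_sums` and `reconstructed` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'stage': [3, 4, 3, 4]})

# Integer dummies so that arithmetic on them is unambiguous
dummies = pd.get_dummies(df, columns=['stage'], dtype=int)

# Each patient is in exactly one stage, so the two columns sum to 1
row_sums = dummies['stage_3'] + dummies['stage_4']

# stage_4 can be recovered exactly from stage_3
reconstructed = 1 - dummies['stage_3']

print((row_sums == 1).all())
print((reconstructed == dummies['stage_4']).all())
```

Because one column is an exact linear function of the other, keeping both adds no information to the model.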
df_stage
| | ascites | edema | cholesterol | stage_3 | stage_4 |
|---|---|---|---|---|---|
| 0 | 0 | 0.5 | 200.5 | 1 | 0 |
| 1 | 1 | 0.0 | 180.2 | 0 | 1 |
| 2 | 0 | 1.0 | 190.5 | 1 | 0 |
| 3 | 1 | 0.5 | 210.3 | 0 | 1 |
df_stage_drop_first = df_stage.drop(columns='stage_3')
df_stage_drop_first
| | ascites | edema | cholesterol | stage_4 |
|---|---|---|---|---|
| 0 | 0 | 0.5 | 200.5 | 0 |
| 1 | 1 | 0.0 | 180.2 | 1 |
| 2 | 0 | 1.0 | 190.5 | 0 |
| 3 | 1 | 0.5 | 210.3 | 1 |
Note, there's actually a parameter of pandas.get_dummies() that lets you drop the first one-hot encoded column. You'll practice doing this in this week's assignment!
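For reference, the parameter in question is `drop_first`. The sketch below applies it to the same sample data; the variable name `df_drop_first` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'ascites': [0, 1, 0, 1],
                   'edema': [0.5, 0, 1, 0.5],
                   'stage': [3, 4, 3, 4],
                   'cholesterol': [200.5, 180.2, 190.5, 210.3]})

# drop_first=True removes the dummy column for the first category (stage 3)
df_drop_first = pd.get_dummies(data=df,
                               columns=['stage'],
                               drop_first=True)

print(df_drop_first.columns.tolist())
```

This should leave the same columns as dropping 'stage_3' by hand, without having to name the column explicitly.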
We can cast the one-hot encoded values as floats by setting the data type to numpy.float64.
import numpy as np
df_stage = pd.get_dummies(data=df,
                          columns=['stage'],
                          )
df_stage[['stage_4']]
| | stage_4 |
|---|---|
| 0 | 0 |
| 1 | 1 |
| 2 | 0 |
| 3 | 1 |
df_stage_float64 = pd.get_dummies(data=df,
                                  columns=['stage'],
                                  dtype=np.float64
                                  )
df_stage_float64[['stage_4']]
| | stage_4 |
|---|---|
| 0 | 0.0 |
| 1 | 1.0 |
| 2 | 0.0 |
| 3 | 1.0 |
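To confirm the effect of `dtype`, it can help to inspect the column types directly. Note that the default dummy dtype depends on the installed pandas version (boolean in pandas 2.x, unsigned 8-bit integers in earlier releases), so the default output may display as True/False rather than 1/0. The sketch below uses only the 'stage' column, and the names `default_dummies` and `float_dummies` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'stage': [3, 4, 3, 4]})

# Default dtype: version dependent (bool or uint8)
default_dummies = pd.get_dummies(df, columns=['stage'])

# Explicit dtype: always float64
float_dummies = pd.get_dummies(df, columns=['stage'], dtype=np.float64)

print(default_dummies.dtypes)
print(float_dummies.dtypes)
```

Setting `dtype` explicitly makes the result independent of the pandas version in use.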
Let's say we fit the hazard function
$$ \lambda(t, x) = \lambda_0(t)e^{\theta^T X_i} $$
so that we have the coefficients $\theta$ for the features in $X_i$.
If you have a new patient, let's predict their hazard $\lambda(t,x)$.
import numpy as np
import pandas as pd
lambda_0 = 1
coef = np.array([0.5,2.])
coef
array([0.5, 2. ])
X = pd.DataFrame({'age': [20, 30, 40],
                  'cholesterol': [180, 220, 170]
                  })
X
| | age | cholesterol |
|---|---|---|
| 0 | 20 | 180 |
| 1 | 30 | 220 |
| 2 | 40 | 170 |
coef.shape
(2,)
X.shape
(3, 2)
It looks like the coefficient is a 1D array, so transposing it won't do anything.
So the formula looks more like this (transposing $X_i$ instead of $\theta$):
$$ \lambda(t, x) = \lambda_0(t)e^{\theta X_i^T} $$
np.dot(coef,X.T)
array([370., 455., 360.])
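As a cross-check, the same linear predictor can be written as a matrix-vector product with the patients kept as rows, which avoids the transpose altogether. This is a small sketch using the same `coef` and `X`; the names `via_transpose` and `via_matmul` are illustrative.

```python
import numpy as np
import pandas as pd

coef = np.array([0.5, 2.])
X = pd.DataFrame({'age': [20, 30, 40],
                  'cholesterol': [180, 220, 170]})

# Form used above: coefficients (2,) times transposed features (2, 3)
via_transpose = np.dot(coef, X.T.to_numpy())

# Equivalent form: features (3, 2) times coefficients (2,)
via_matmul = X.to_numpy() @ coef

print(via_transpose)
print(via_matmul)
```

Both forms give one value per patient, for example 0.5 * 20 + 2 * 180 = 370 for the first row.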
Calculate the hazard for the three patients (there are 3 rows in X):
lambdas = lambda_0 * np.exp(np.dot(coef,X.T))
patients_df = X.copy()
patients_df['hazards'] = lambdas
patients_df
| | age | cholesterol | hazards |
|---|---|---|---|
| 0 | 20 | 180 | 4.886054e+160 |
| 1 | 30 | 220 | 4.017809e+197 |
| 2 | 40 | 170 | 2.218265e+156 |
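These hazards are very large because the toy coefficients are applied to unscaled features. One way to compare two patients without computing such large numbers is to take the ratio of their hazards, in which $\lambda_0(t)$ cancels and only the difference in features matters. The sketch below compares patient 1 with patient 0; the names `x0`, `x1`, `log_hr`, and `hazard_ratio` are illustrative.

```python
import numpy as np

coef = np.array([0.5, 2.])
x0 = np.array([20, 180])   # patient 0: age, cholesterol
x1 = np.array([30, 220])   # patient 1: age, cholesterol

# log of the hazard ratio: theta . (x1 - x0) = 0.5*10 + 2*40 = 85
log_hr = np.dot(coef, x1 - x0)

# Hazard ratio of patient 1 relative to patient 0 (baseline cancels)
hazard_ratio = np.exp(log_hr)

print(log_hr)
print(hazard_ratio)
```

The result matches dividing the two hazards in the table above, but works with the much smaller exponent 85 instead of 455 and 370.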
import pandas as pd
df = pd.DataFrame({'time': [2, 4, 2, 4, 2, 4, 2, 4],
                   'event': [1, 1, 1, 1, 0, 1, 1, 0],
                   'risk_score': [20, 40, 40, 20, 20, 40, 40, 20]
                   })
df
| | time | event | risk_score |
|---|---|---|---|
| 0 | 2 | 1 | 20 |
| 1 | 4 | 1 | 40 |
| 2 | 2 | 1 | 40 |
| 3 | 4 | 1 | 20 |
| 4 | 2 | 0 | 20 |
| 5 | 4 | 1 | 40 |
| 6 | 2 | 1 | 40 |
| 7 | 4 | 0 | 20 |
We made this data sample so that you can compare pairs of patients visually.
pd.concat([df.iloc[0:1],df.iloc[1:2]],axis=0)
| | time | event | risk_score |
|---|---|---|---|
| 0 | 2 | 1 | 20 |
| 1 | 4 | 1 | 40 |
if df['event'][0] == 1 or df['event'][1] == 1:
    print(f"May be a permissible pair: 0 and 1")
else:
    print(f"Definitely not permissible pair: 0 and 1")
May be a permissible pair: 0 and 1
pd.concat([df.iloc[4:5],df.iloc[7:8]],axis=0)
| | time | event | risk_score |
|---|---|---|---|
| 4 | 2 | 0 | 20 |
| 7 | 4 | 0 | 20 |
if df['event'][4] == 1 or df['event'][7] == 1:
    print(f"May be a permissible pair: 4 and 7")
else:
    print(f"Definitely not permissible pair: 4 and 7")
Definitely not permissible pair: 4 and 7
pd.concat([df.iloc[0:1],df.iloc[1:2]],axis=0)
| | time | event | risk_score |
|---|---|---|---|
| 0 | 2 | 1 | 20 |
| 1 | 4 | 1 | 40 |
if df['event'][0] == 1 and df['event'][1] == 1:
    print(f"Definitely a permissible pair: 0 and 1")
else:
    print(f"May be a permissible pair: 0 and 1")
Definitely a permissible pair: 0 and 1
pd.concat([df.iloc[6:7],df.iloc[7:8]],axis=0)
| | time | event | risk_score |
|---|---|---|---|
| 6 | 2 | 1 | 40 |
| 7 | 4 | 0 | 20 |
if df['time'][7] >= df['time'][6]:
    print(f"Permissible pair: Censored patient 7 lasted at least as long as uncensored patient 6")
else:
    print("Not a permissible pair")
Permissible pair: Censored patient 7 lasted at least as long as uncensored patient 6
pd.concat([df.iloc[4:5],df.iloc[5:6]],axis=0)
| | time | event | risk_score |
|---|---|---|---|
| 4 | 2 | 0 | 20 |
| 5 | 4 | 1 | 40 |
if df['time'][4] >= df['time'][5]:
    print(f"Permissible pair")
else:
    print("Not a permissible pair: censored patient 4 was censored before patient 5 had their event")
Not a permissible pair: censored patient 4 was censored before patient 5 had their event
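The checks above can be gathered into a single helper. This is a sketch that encodes only the rules illustrated in this section (both censored: not permissible; both with events: permissible; one censored: permissible only if the censored time is at least the event time). The function name `is_permissible` is hypothetical and not part of the original notebook.

```python
import pandas as pd

df = pd.DataFrame({'time': [2, 4, 2, 4, 2, 4, 2, 4],
                   'event': [1, 1, 1, 1, 0, 1, 1, 0],
                   'risk_score': [20, 40, 40, 20, 20, 40, 40, 20]})

def is_permissible(df, i, j):
    """Return True if patients i and j form a permissible pair (hypothetical helper)."""
    event_i, event_j = df['event'][i], df['event'][j]
    time_i, time_j = df['time'][i], df['time'][j]
    # Both censored: neither outcome is known, so the pair cannot be compared
    if event_i == 0 and event_j == 0:
        return False
    # Both had the event: the pair can be compared
    if event_i == 1 and event_j == 1:
        return True
    # Exactly one is censored: the censored patient must have been
    # followed at least as long as the other patient's event time
    if event_i == 0:
        return bool(time_i >= time_j)
    return bool(time_j >= time_i)

for i, j in [(0, 1), (4, 7), (6, 7), (4, 5)]:
    print(i, j, is_permissible(df, i, j))
```

The four pairs in the loop are the same ones examined by hand above, so the printed results can be compared against those cells.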