import numpy as np
import pandas as pd
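We'll work with data where:
- `Time` is the time at which the patient either died or was censored (their follow-up ended without a death being observed).
- `Event` is 1 if the patient died and 0 if the patient was censored.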
Notice that these are the same numbers that you see in the lecture video about estimating survival.
df = pd.DataFrame({'Time': [10,8,60,20,12,30,15],
'Event': [1,0,1,1,0,1,0]
})
df
| | Time | Event |
|---|---|---|
| 0 | 10 | 1 |
| 1 | 8 | 0 |
| 2 | 60 | 1 |
| 3 | 20 | 1 |
| 4 | 12 | 0 |
| 5 | 30 | 1 |
| 6 | 15 | 0 |
df['Event'] == 0
0    False
1     True
2    False
3    False
4     True
5    False
6     True
Name: Event, dtype: bool
Patients 1, 4, and 6 were censored.
When we sum a series of booleans, `True` is treated as 1 and `False` is treated as 0.
sum(df['Event'] == 0)
3
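As a small sketch of the counting idiom above, the same mask can also give the index labels of the censored patients. The variable names `is_censored`, `n_censored`, and `censored_ids` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Same data as above.
df = pd.DataFrame({'Time': [10, 8, 60, 20, 12, 30, 15],
                   'Event': [1, 0, 1, 1, 0, 1, 0]})

# Boolean mask: True where the record is censored (Event is 0).
is_censored = df['Event'] == 0

# True counts as 1 and False as 0, so summing the mask counts censored records.
n_censored = int(is_censored.sum())

# Index labels of the censored patients.
censored_ids = df.index[is_censored].tolist()

print(n_censored, censored_ids)  # 3 [1, 4, 6]
```

The pandas method `Series.sum()` and the built-in `sum()` give the same count here; the method form is usually faster on large data.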
Next, count the patients who are known to have survived past time `t`. This count assumes that any patient who was censored died at the time of being censored (died immediately).
If a patient survived past time `t`:
- Their `Time` of event should be greater than `t`.
- They can have an `Event` of either 1 or 0. What matters is their `Time` value.
t = 25
df['Time'] > t
0    False
1    False
2     True
3    False
4    False
5     True
6    False
Name: Time, dtype: bool
sum(df['Time'] > t)
2
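Now count the patients who may have survived past time `t`. This assumes that censored patients never die. A patient is counted if either:
- The patient was censored (`Event` is 0) at any time, and we assume they are still alive after `t`.
- The patient died (`Event` is 1) but after time `t`.
t = 25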
(df['Time'] > t) | (df['Event'] == 0)
0    False
1     True
2     True
3    False
4     True
5     True
6     True
dtype: bool
sum( (df['Time'] > t) | (df['Event'] == 0) )
5
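The two counts above bracket the true number of survivors: assuming censored patients die immediately gives the smallest possible count, and assuming they never die gives the largest. A minimal sketch of that comparison follows; the helper name `count_survivors` and its `censored_survive` flag are hypothetical and only for illustration.

```python
import pandas as pd

# Same data as above.
df = pd.DataFrame({'Time': [10, 8, 60, 20, 12, 30, 15],
                   'Event': [1, 0, 1, 1, 0, 1, 0]})

def count_survivors(df, t, censored_survive):
    """Count patients surviving past t under one assumption about censoring.

    Hypothetical helper, not part of the original notebook.
    """
    # Anyone whose recorded Time exceeds t is known to be alive past t.
    survived = df['Time'] > t
    if censored_survive:
        # Optimistic assumption: every censored patient outlives t.
        survived = survived | (df['Event'] == 0)
    return int(survived.sum())

t = 25
lower = count_survivors(df, t, censored_survive=False)
upper = count_survivors(df, t, censored_survive=True)
print(lower, upper)  # 2 5
```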
If a patient was not censored before time `t`:
- They either had an event (death, so `Event` is 1) before `t`, at `t`, or after `t` (any time),
- or their `Time` occurs after time `t` (they may have either died or been censored at a later time after `t`).
t = 25
(df['Event'] == 1) | (df['Time'] > t)
0     True
1    False
2     True
3     True
4    False
5     True
6    False
dtype: bool
sum( (df['Event'] == 1) | (df['Time'] > t) )
4
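One way to sanity-check the mask above is to build its complement directly: the patients who were censored with a `Time` no later than `t`. By De Morgan's law, negating that complement should reproduce the same mask. This is a sketch, and the names `censored_by_t` and `not_censored` are illustrative.

```python
import pandas as pd

# Same data as above.
df = pd.DataFrame({'Time': [10, 8, 60, 20, 12, 30, 15],
                   'Event': [1, 0, 1, 1, 0, 1, 0]})
t = 25

# Complement: no death observed AND the recorded Time does not exceed t.
censored_by_t = (df['Event'] == 0) & (df['Time'] <= t)

# The mask used in the text: died at any time OR Time is after t.
not_censored = (df['Event'] == 1) | (df['Time'] > t)

# NOT (A AND B) == (NOT A) OR (NOT B), so the two masks agree row by row.
masks_agree = (~censored_by_t).equals(not_censored)
n_not_censored = int(not_censored.sum())

print(masks_agree, n_not_censored)  # True 4
```

Note that with this mask a patient censored exactly at `t` falls into the complement, since `Time > t` is a strict comparison.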
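The Kaplan-Meier estimate of survival probability is:
$$ S(t) = \prod_{t_i \leq t} \left(1 - \frac{d_i}{n_i}\right) $$
- $t_i$ are the event times observed in the dataset.
- $d_i$ is the number of deaths at time $t_i$.
- $n_i$ is the number of patients known to have survived up to time $t_i$.
import numpy as np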
import pandas as pd
df = pd.DataFrame({'Time': [3,3,2,2],
'Event': [0,1,0,1]
})
df
| | Time | Event |
|---|---|---|
| 0 | 3 | 0 |
| 1 | 3 | 1 |
| 2 | 2 | 0 |
| 3 | 2 | 1 |
If a patient survived up to time $t_i$:
- Their `Time` is either greater than $t_i$,
- or their `Time` is equal to $t_i$.
t_i = 2
df['Time'] >= t_i
0    True
1    True
2    True
3    True
Name: Time, dtype: bool
You can use this to help you calculate $n_i$.
If a patient died at time $t_i$:
- Their `Event` value is 1.
- Their `Time` should be equal to $t_i$.
t_i = 2
(df['Event'] == 1) & (df['Time'] == t_i)
You can use this to help you calculate $d_i$.
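To connect the two masks to the formula, here is a minimal sketch that computes $n_i$, $d_i$, and the single factor $(1 - d_i / n_i)$ for one time $t_i$. The helper name `km_term` is hypothetical, and this covers only one term of the product, not the full estimator.

```python
import pandas as pd

# Same small dataset as above.
df = pd.DataFrame({'Time': [3, 3, 2, 2],
                   'Event': [0, 1, 0, 1]})

def km_term(df, t_i):
    """Return (n_i, d_i, 1 - d_i / n_i) for a single time t_i.

    Hypothetical helper for illustration; assumes n_i > 0.
    """
    # n_i: patients whose Time is at least t_i (survived up to t_i).
    n_i = int((df['Time'] >= t_i).sum())
    # d_i: patients who died (Event is 1) exactly at t_i.
    d_i = int(((df['Event'] == 1) & (df['Time'] == t_i)).sum())
    return n_i, d_i, 1 - d_i / n_i

n_i, d_i, factor = km_term(df, 2)
print(n_i, d_i, factor)  # 4 1 0.75
```

For $t_i = 3$ the same helper gives $n_i = 2$, $d_i = 1$, and a factor of 0.5.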
You'll implement Kaplan-Meier in this week's assignment!