You may recognize the following example from FIFE's notebook found here. Like that example notebook, we use the July 2020 edition of the Rulers, Elections, and Irregular Governance (REIGN) dataset, a monthly panel of national leaders and political conditions since January 1950. We load the REIGN data directly from its online archive.
First, we import the necessary packages. In this case, we import SparkFiles, which is required to read the data from the URL, along with several fifeforspark modules.
import pyspark
from pyspark import SparkFiles
import fifeforspark
from fifeforspark.utils import create_example_data2
from fifeforspark.processors import PanelDataProcessor
from fifeforspark.lgb_modelers import LGBSurvivalModeler
Now that we have the necessary packages loaded, we read in the data from a URL:
url = "https://www.dl.dropboxusercontent.com/s/3tdswu2jfgwp4xw/REIGN_2020_7.csv?dl=0"
spark.sparkContext.addFile(url)
df = spark.read.csv("file://" + SparkFiles.get("REIGN_2020_7.csv"), header=True, inferSchema=True)
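The string passed to SparkFiles.get has to match the name under which Spark stored the downloaded file. As a rough sketch, and assuming Spark names the file after the last path component of the URL (ignoring the query string), that name can be derived with the standard library instead of being typed by hand. The helper `filename_from_url` is hypothetical; it is not part of Spark or fifeforspark.

```python
import posixpath
from urllib.parse import urlparse


def filename_from_url(url: str) -> str:
    """Return the last path component of a URL, ignoring query and fragment."""
    return posixpath.basename(urlparse(url).path)


url = "https://www.dl.dropboxusercontent.com/s/3tdswu2jfgwp4xw/REIGN_2020_7.csv?dl=0"
local_name = filename_from_url(url)  # hypothetical helper, see above
print(local_name)
```

The resulting `local_name` could then be passed to `SparkFiles.get` in place of the hard-coded string.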
The data is stored in a Spark DataFrame, which may differ from what you expect if you are more familiar with FIFE, where the data is held in a pandas DataFrame. Let's examine our data a bit more.
df.show(10)
This isn't very pleasant to look at. The advantage of using a Spark DataFrame here, even though this dataset could fit on one node, is that the data is distributed across the cluster. Fortunately, we can use the notebook's display() function to output a cleaner table.
display(df.limit(7))
ccode | country | leader | year | month | elected | age | male | militarycareer | tenure_months | government | anticipation | ref_ant | leg_ant | exec_ant | irreg_lead_ant | election_now | election_recent | leg_recent | exec_recent | lead_recent | ref_recent | direct_recent | indirect_recent | victory_recent | defeat_recent | change_recent | nochange_recent | delayed | lastelection | loss | irregular | prev_conflict | pt_suc | pt_attempt | precip | couprisk | pctile_risk |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2.0 | USA | Truman | 1950.0 | 1.0 | 1.0 | 66.0 | 1 | 0.0 | 58.0 | Presidential Democracy | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.6390574 | 5.327876 | 7.565793 | 0.0 | 0.0 | 0.0 | -0.0690575837052633 | null | null |
2.0 | USA | Truman | 1950.0 | 2.0 | 1.0 | 66.0 | 1 | 0.0 | 59.0 | Presidential Democracy | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.7080503 | 5.332719 | 7.5663114 | 0.0 | 0.0 | 0.0 | -0.11372068300939 | null | null |
2.0 | USA | Truman | 1950.0 | 3.0 | 1.0 | 66.0 | 1 | 0.0 | 60.0 | Presidential Democracy | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.7725887 | 5.3375382 | 7.5668287 | 0.0 | 0.0 | 0.0 | -0.108042069627093 | null | null |
2.0 | USA | Truman | 1950.0 | 4.0 | 1.0 | 66.0 | 1 | 0.0 | 61.0 | Presidential Democracy | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.8332133 | 5.3423343 | 7.5673456 | 0.0 | 0.0 | 0.0 | -0.0416001452793281 | null | null |
2.0 | USA | Truman | 1950.0 | 5.0 | 1.0 | 66.0 | 1 | 0.0 | 62.0 | Presidential Democracy | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.8903718 | 5.3471074 | 7.5678625 | 0.0 | 0.0 | 0.0 | -0.129702783937251 | null | null |
2.0 | USA | Truman | 1950.0 | 6.0 | 1.0 | 66.0 | 1 | 0.0 | 63.0 | Presidential Democracy | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.944439 | 5.351858 | 7.5683794 | 0.0 | 0.0 | 0.0 | -0.178496151195764 | null | null |
2.0 | USA | Truman | 1950.0 | 7.0 | 1.0 | 66.0 | 1 | 0.0 | 64.0 | Presidential Democracy | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.9957323 | 5.3565865 | 7.568896 | 0.0 | 0.0 | 0.0 | -0.042660054596682 | null | null |
Much better! Let's see how many partitions we have.
df.rdd.getNumPartitions()
Wow! The data is split across 8 partitions! We can always change this number by repartitioning the DataFrame, but for this example we will leave it at 8.
Next, we make some changes to the data to prepare it for analysis:
from pyspark.sql.functions import lit, lpad, col, concat, date_format
from pyspark.sql.types import DateType
df = df.withColumn('country-leader', concat(col('country'),lit(":"),col('leader')))
df = df.withColumn('year-month', concat(col('year').cast('integer').cast('string'),lit("-"), lpad(col('month').cast('integer').cast('string'), 2, "0"), lit("-"),lit("01")))
df = df.withColumn('year-month', df['year-month'].cast(DateType()))
cols = ['country-leader', 'year-month'] + [x for x in df.columns if x not in ["ccode", "country-leader", "leader", "year-month"]]
df = df.select(cols)
total_obs = df.count()
df = df.drop_duplicates(subset = ["country-leader", "year-month"])
n_duplicates = total_obs - df.count()
print(f"{n_duplicates} observations with a duplicated identifier pair deleted.")
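To make the two preparation steps above easier to follow, here is a plain-Python sketch of the same logic on a handful of made-up rows: the time identifier is built as the first day of each month, and rows that share an identifier pair are then dropped. The helpers `first_of_month` and `drop_duplicate_keys` are hypothetical illustrations, not fifeforspark functions. Note that Spark's drop_duplicates keeps an arbitrary row per key, whereas this sketch keeps the first one it sees.

```python
from datetime import date


def first_of_month(year: float, month: float) -> date:
    """Mirror the Spark expression: cast to int, zero-pad the month, append day 01."""
    text = f"{int(year)}-{str(int(month)).zfill(2)}-01"
    return date.fromisoformat(text)


def drop_duplicate_keys(rows):
    """Keep the first row seen for each (country-leader, year-month) pair."""
    seen = set()
    unique = []
    for row in rows:
        key = (row["country-leader"], row["year-month"])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Made-up rows for illustration; the second row repeats the first identifier pair.
rows = [
    {"country-leader": "USA:Truman", "year-month": first_of_month(1950.0, 1.0)},
    {"country-leader": "USA:Truman", "year-month": first_of_month(1950.0, 1.0)},
    {"country-leader": "USA:Truman", "year-month": first_of_month(1950.0, 2.0)},
]
unique_rows = drop_duplicate_keys(rows)
n_duplicates = len(rows) - len(unique_rows)
print(f"{n_duplicates} observations with a duplicated identifier pair deleted.")
```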
Now that we have created unique identifiers for the individual and the time period, we pass the data through the PanelDataProcessor, setting 'TEST_INTERVALS' to 4 because we want to hold out the last 4 periods for testing. For the time being, we transform the time_id back to a numeric feature because of constraints on datetime functionality.
test_intervals = 4
processor = PanelDataProcessor(data=df, config={'TEST_INTERVALS': test_intervals}, shuffle_parts=20)
processor.build_processed_data()
display(processor.data.limit(7))
country-leader | year-month | country | year | month | elected | age | male | militarycareer | tenure_months | government | anticipation | ref_ant | leg_ant | exec_ant | irreg_lead_ant | election_now | election_recent | leg_recent | exec_recent | lead_recent | ref_recent | direct_recent | indirect_recent | victory_recent | defeat_recent | change_recent | nochange_recent | delayed | lastelection | loss | irregular | prev_conflict | pt_suc | pt_attempt | precip | couprisk | pctile_risk | _period | _predict_obs | _test | _validation | _maximum_lead | _spell | _duration | _event_observed |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Afghanistan:Abdallah Yakta | 1967-10-01 | Afghanistan | 1967.0 | 10.0 | 0.0 | 53.0 | 1 | 0.0 | 1.0 | Monarchy | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6.1246834 | 6.1246834 | 6.1246834 | 0.0 | 0.0 | 0.0 | 0.0187039816454769 | null | null | 213 | false | false | true | 629 | 0 | 1 | true |
Afghanistan:Abdallah Yakta | 1967-11-01 | Afghanistan | 1967.0 | 11.0 | 0.0 | 53.0 | 1 | 0.0 | 2.0 | Monarchy | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6.126869 | 6.126869 | 6.126869 | 0.0 | 0.0 | 0.0 | 0.17923993006129 | null | null | 214 | false | false | true | 628 | 0 | 0 | true |
Afghanistan:Abdul Zahir | 1971-06-01 | Afghanistan | 1971.0 | 6.0 | 0.0 | 61.0 | 1 | 0.0 | 1.0 | Monarchy | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6.216606 | 6.216606 | 6.216606 | 0.0 | 0.0 | 0.0 | -1.80655395371697 | null | null | 257 | false | false | false | 585 | 0 | 18 | true |
Afghanistan:Abdul Zahir | 1971-07-01 | Afghanistan | 1971.0 | 7.0 | 0.0 | 61.0 | 1 | 0.0 | 2.0 | Monarchy | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6.2186003 | 6.2186003 | 6.2186003 | 0.0 | 0.0 | 0.0 | -1.79781203785734 | null | null | 258 | false | false | false | 584 | 0 | 17 | true |
Afghanistan:Abdul Zahir | 1971-08-01 | Afghanistan | 1971.0 | 8.0 | 0.0 | 61.0 | 1 | 0.0 | 3.0 | Monarchy | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6.22059 | 6.22059 | 6.22059 | 0.0 | 0.0 | 0.0 | -1.81513438987383 | null | null | 259 | false | false | false | 583 | 0 | 16 | true |
Afghanistan:Abdul Zahir | 1971-09-01 | Afghanistan | 1971.0 | 9.0 | 0.0 | 61.0 | 1 | 0.0 | 4.0 | Monarchy | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6.222576 | 6.222576 | 6.222576 | 0.0 | 0.0 | 0.0 | -1.84107666146101 | null | null | 260 | false | false | false | 582 | 0 | 15 | true |
Afghanistan:Abdul Zahir | 1971-10-01 | Afghanistan | 1971.0 | 10.0 | 0.0 | 61.0 | 1 | 0.0 | 5.0 | Monarchy | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6.2245584 | 6.2245584 | 6.2245584 | 0.0 | 0.0 | 0.0 | -1.90343827157157 | null | null | 261 | false | false | false | 581 | 0 | 14 | true |
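The `_period` values in the table above are consistent with a zero-based count of months since January 1950, the start of the REIGN panel. The sketch below only reproduces the displayed values; it is an assumption about what the numbers mean, not a description of how PanelDataProcessor computes them internally (it may, for instance, rank the distinct time values instead). The function `months_since_start` is hypothetical.

```python
from datetime import date

PANEL_START = date(1950, 1, 1)  # assumption: first month in the REIGN panel


def months_since_start(when: date, start: date = PANEL_START) -> int:
    """Zero-based number of whole months between start and when."""
    return (when.year - start.year) * 12 + (when.month - start.month)


# (year-month, _period) pairs copied from the processed data shown above
displayed = [
    (date(1967, 10, 1), 213),
    (date(1967, 11, 1), 214),
    (date(1971, 6, 1), 257),
    (date(1971, 10, 1), 261),
]
for when, period in displayed:
    print(when, months_since_start(when), period)
```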
Now, we build the model. Any additional parameters you pass to the modeler are forwarded to LightGBM.
modeler = LGBSurvivalModeler(data=processor.data)
modeler.build_model(n_intervals=test_intervals)
Now we want to see how well our model performs on the test data and take a look at the predictions:
metrics = modeler.evaluate()
metrics
Lead Length | AUROC | Actual Share | Predicted Share | True Positives | False Negatives | False Positives | True Negatives | Other Metrics: |
---|---|---|---|---|---|---|---|---|
1 | 0.883505 | 0.974874 | 0.988642 | 194 | 0 | 4 | 1 | |
2 | 0.932292 | 0.964824 | 0.969939 | 190 | 2 | 5 | 2 | |
3 | 0.888304 | 0.954774 | 0.958899 | 188 | 2 | 7 | 2 | |
4 | 0.836642 | 0.934673 | 0.938473 | 183 | 3 | 10 | 3 | |
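As a quick consistency check on the metrics table, the Actual Share column can be recovered from the confusion counts: it matches the share of evaluated observations that actually survived, (True Positives + False Negatives) divided by the total. The sketch below recomputes it from the numbers displayed above. The function `actual_share` is a hypothetical helper and is not taken from the fifeforspark source.

```python
def actual_share(tp: int, fn: int, fp: int, tn: int) -> float:
    """Share of evaluated observations that actually survived the lead length."""
    return (tp + fn) / (tp + fn + fp + tn)


# (lead length, TP, FN, FP, TN, reported Actual Share) copied from the table
rows = [
    (1, 194, 0, 4, 1, 0.974874),
    (2, 190, 2, 5, 2, 0.964824),
    (3, 188, 2, 7, 2, 0.954774),
    (4, 183, 3, 10, 3, 0.934673),
]
for lead, tp, fn, fp, tn, reported in rows:
    computed = actual_share(tp, fn, fp, tn)
    print(lead, round(computed, 6), reported)
```

Each lead length is evaluated on the same 199 observations, so the falling Actual Share simply reflects that fewer leaders remain in power over longer horizons.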
And finally, we want to forecast future survival probabilities for country-leaders in the last period:
forecasts = modeler.forecast()
forecasts.head(10)
Index | 1-period Survival Probability | 2-period Survival Probability | 3-period Survival Probability | 4-period Survival Probability |
---|---|---|---|---|
0 | 0.997037 | 0.996715 | 0.995111 | 0.992604 |
1 | 0.994377 | 0.990283 | 0.986637 | 0.977831 |
2 | 0.983984 | 0.978464 | 0.971796 | 0.954448 |
3 | 0.993038 | 0.977267 | 0.966908 | 0.964620 |
4 | 0.998790 | 0.997496 | 0.995501 | 0.993531 |
5 | 0.989612 | 0.987722 | 0.984450 | 0.968102 |
6 | 0.994747 | 0.977737 | 0.976569 | 0.974783 |
7 | 0.981061 | 0.980291 | 0.976630 | 0.968736 |
8 | 0.989349 | 0.986557 | 0.979686 | 0.976169 |
9 | 0.986199 | 0.452783 | 0.447913 | 0.440717 |
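The forecast columns never increase from left to right, which is what we would expect if each k-period value is a cumulative probability of surviving all k periods. Under that assumption, dividing each value by the one before it gives the probability of surviving a given period conditional on having survived the previous ones. That makes it easier to see where the risk is concentrated, as in row 9. The function `conditional_survival` is a hypothetical helper, not part of fifeforspark.

```python
def conditional_survival(cumulative):
    """Convert cumulative survival probabilities S_1..S_k into per-period
    probabilities P(survive period k | survived period k-1) = S_k / S_{k-1}."""
    conditional = []
    previous = 1.0  # the individual is observed (alive) at the forecast origin
    for s in cumulative:
        # Guard against division by zero once survival probability hits zero.
        conditional.append(s / previous if previous > 0 else float("nan"))
        previous = s
    return conditional


# Row 9 of the forecast table above
row_9 = [0.986199, 0.452783, 0.447913, 0.440717]
per_period = conditional_survival(row_9)
print([round(p, 3) for p in per_period])
```

For row 9, almost all of the forecast risk falls in the second period; conditional on surviving it, the later periods look similar to those of the other country-leaders.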