Welcome to the example notebook for FIFEforSpark!

You may recognize the following example from FIFE's notebook found here. Like that example notebook, we use the July 2020 edition of the Rulers, Elections, and Irregular Governance (REIGN) dataset, a monthly panel of national leaders and political conditions since January 1950. We load the REIGN data directly from its online archive.

First, we import the necessary packages. In this case, we import SparkFiles, which is required to read the data in from the URL, along with several fifeforspark modules.

In [0]:

import pyspark
from pyspark import SparkFiles
import fifeforspark
from fifeforspark.utils import create_example_data2
from fifeforspark.processors import PanelDataProcessor
from fifeforspark.lgb_modelers import LGBSurvivalModeler

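Now that we have the necessary packages loaded, we read in the data from a URL. The spark variable is assumed to be an existing SparkSession, as notebook environments such as Databricks provide one: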

In [0]:

url = "https://www.dl.dropboxusercontent.com/s/3tdswu2jfgwp4xw/REIGN_2020_7.csv?dl=0"
spark.sparkContext.addFile(url)

df = spark.read.csv("file://"+SparkFiles.get("REIGN_2020_7.csv"), header=True, inferSchema= True)
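Note that SparkFiles.get is called with the bare file name rather than the full URL. The cell above hard-codes that name. As a minimal sketch, assuming the file is stored under the last path segment of the URL, the name can be derived with the standard library. The helper filename_from_url is hypothetical and is not part of pyspark or fifeforspark.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def filename_from_url(url: str) -> str:
    """Return the last path segment of a URL, ignoring any query string."""
    return PurePosixPath(urlparse(url).path).name


url = "https://www.dl.dropboxusercontent.com/s/3tdswu2jfgwp4xw/REIGN_2020_7.csv?dl=0"
print(filename_from_url(url))  # REIGN_2020_7.csv
```

The result could then be passed to SparkFiles.get in place of the literal string.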

The data is stored in a Spark DataFrame, which may differ from what you expect if you are more familiar with FIFE, where data is held in a pandas DataFrame. Let's examine our data a bit more.

In [0]:

df.show(10)

+-----+-------+------+------+-----+-------+----+----+--------------+-------------+--------------------+------------+-------+-------+--------+--------------+------------+---------------+----------+-----------+-----------+----------+-------------+---------------+--------------+-------------+-------------+---------------+-------+------------+---------+---------+-------------+------+----------+-------------------+--------+-----------+
| ccode|country|leader| year|month|elected| age|male|militarycareer|tenure_months| government|anticipation|ref_ant|leg_ant|exec_ant|irreg_lead_ant|election_now|election_recent|leg_recent|exec_recent|lead_recent|ref_recent|direct_recent|indirect_recent|victory_recent|defeat_recent|change_recent|nochange_recent|delayed|lastelection| loss|irregular|prev_conflict|pt_suc|pt_attempt| precip|couprisk|pctile_risk|
+-----+-------+------+------+-----+-------+----+----+--------------+-------------+--------------------+------------+-------+-------+--------+--------------+------------+---------------+----------+-----------+-----------+----------+-------------+---------------+--------------+-------------+-------------+---------------+-------+------------+---------+---------+-------------+------+----------+-------------------+--------+-----------+
| 2.0| USA|Truman|1950.0| 1.0| 1.0|66.0| 1| 0.0| 58.0|Presidential Demo...| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 2.6390574| 5.327876| 7.565793| 0.0| 0.0| 0.0|-0.0690575837052633| null| null|
| 2.0| USA|Truman|1950.0| 2.0| 1.0|66.0| 1| 0.0| 59.0|Presidential Demo...| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 2.7080503| 5.332719|7.5663114| 0.0| 0.0| 0.0| -0.11372068300939| null| null|
| 2.0| USA|Truman|1950.0| 3.0| 1.0|66.0| 1| 0.0| 60.0|Presidential Demo...| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 2.7725887|5.3375382|7.5668287| 0.0| 0.0| 0.0| -0.108042069627093| null| null|
| 2.0| USA|Truman|1950.0| 4.0| 1.0|66.0| 1| 0.0| 61.0|Presidential Demo...| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 2.8332133|5.3423343|7.5673456| 0.0| 0.0| 0.0|-0.0416001452793281| null| null|
| 2.0| USA|Truman|1950.0| 5.0| 1.0|66.0| 1| 0.0| 62.0|Presidential Demo...| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 2.8903718|5.3471074|7.5678625| 0.0| 0.0| 0.0| -0.129702783937251| null| null|
| 2.0| USA|Truman|1950.0| 6.0| 1.0|66.0| 1| 0.0| 63.0|Presidential Demo...| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 2.944439| 5.351858|7.5683794| 0.0| 0.0| 0.0| -0.178496151195764| null| null|
| 2.0| USA|Truman|1950.0| 7.0| 1.0|66.0| 1| 0.0| 64.0|Presidential Demo...| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 2.9957323|5.3565865| 7.568896| 0.0| 0.0| 0.0| -0.042660054596682| null| null|
| 2.0| USA|Truman|1950.0| 8.0| 1.0|66.0| 1| 0.0| 65.0|Presidential Demo...| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 3.0445225|5.3612924|7.5694118| 0.0| 0.0| 0.0| -0.070590356102934| null| null|
| 2.0| USA|Truman|1950.0| 9.0| 1.0|66.0| 1| 0.0| 66.0|Presidential Demo...| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 3.0910425| 5.365976|7.5699277| 0.0| 0.0| 0.0| 0.0355567077070064| null| null|
| 2.0| USA|Truman|1950.0| 10.0| 1.0|66.0| 1| 0.0| 67.0|Presidential Demo...| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 3.1354942| 5.370638| 7.570443| 0.0| 0.0| 0.0| -0.138817795625302| null| null|
+-----+-------+------+------+-----+-------+----+----+--------------+-------------+--------------------+------------+-------+-------+--------+--------------+------------+---------------+----------+-----------+-----------+----------+-------------+---------------+--------------+-------------+-------------+---------------+-------+------------+---------+---------+-------------+------+----------+-------------------+--------+-----------+
only showing top 10 rows

This isn't very pleasant to look at. The advantage of using a Spark DataFrame here, however, is that it is distributed, even though this particular dataset could fit on one node. Fortunately, we can use the display() function (available in Databricks notebooks) to output a cleaner view of the DataFrame.

In [0]:

display(df.limit(7))

ccode	country	leader	year	month	elected	age	male	tenure_months	government	lastelection	loss	irregular	precip	couprisk	pctile_risk
2.0	USA	Truman	1950.0	1.0	1.0	66.0	1	58.0	Presidential Democracy	2.6390574	5.327876	7.565793	-0.0690575837052633	null	null
2.0	USA	Truman	1950.0	2.0	1.0	66.0	1	59.0	Presidential Democracy	2.7080503	5.332719	7.5663114	-0.11372068300939	null	null
2.0	USA	Truman	1950.0	3.0	1.0	66.0	1	60.0	Presidential Democracy	2.7725887	5.3375382	7.5668287	-0.108042069627093	null	null
2.0	USA	Truman	1950.0	4.0	1.0	66.0	1	61.0	Presidential Democracy	2.8332133	5.3423343	7.5673456	-0.0416001452793281	null	null
2.0	USA	Truman	1950.0	5.0	1.0	66.0	1	62.0	Presidential Democracy	2.8903718	5.3471074	7.5678625	-0.129702783937251	null	null
2.0	USA	Truman	1950.0	6.0	1.0	66.0	1	63.0	Presidential Democracy	2.944439	5.351858	7.5683794	-0.178496151195764	null	null
2.0	USA	Truman	1950.0	7.0	1.0	66.0	1	64.0	Presidential Democracy	2.9957323	5.3565865	7.568896	-0.042660054596682	null	null

Much better! Let's see how many partitions we have.

In [0]:

df.rdd.getNumPartitions()

Out[26]: 8

Wow! The data is split across 8 partitions! We can always change this number, but for this example we will leave it as 8.

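Next, we prepare the data for analysis: we build an individual identifier (country-leader) and a time identifier (year-month), move them to the two leftmost columns, and drop any rows that duplicate an identifier pair.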

In [0]:

from pyspark.sql.functions import lit, lpad, col, concat
from pyspark.sql.types import DateType

# Individual identifier: "<country>:<leader>"
df = df.withColumn('country-leader', concat(col('country'), lit(":"), col('leader')))
# Time identifier: "<year>-<zero-padded month>-01" as a string, then cast to a date
df = df.withColumn('year-month', concat(col('year').cast('integer').cast('string'), lit("-"), lpad(col('month').cast('integer').cast('string'), 2, "0"), lit("-"), lit("01")))
df = df.withColumn('year-month', df['year-month'].cast(DateType()))

# Put the two identifiers first and drop ccode and leader
cols = ['country-leader', 'year-month'] + [x for x in df.columns if x not in ["ccode", "country-leader", "leader", "year-month"]]
df = df.select(cols)
# Remove rows that share an identifier pair and report how many were dropped
total_obs = df.count()
df = df.drop_duplicates(subset=["country-leader", "year-month"])
n_duplicates = total_obs - df.count()
print(f"{n_duplicates} observations with a duplicated identifier pair deleted.")

7 observations with a duplicated identifier pair deleted.
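To make the year-month construction concrete, here is a minimal plain-Python sketch of what the Spark expression computes for a single row. The function name year_month_to_date is hypothetical and used only for illustration. It runs without Spark.

```python
from datetime import date


def year_month_to_date(year: float, month: float) -> date:
    """Mirror the Spark expression: cast to int, zero-pad the month, append day 01."""
    text = f"{int(year)}-{str(int(month)).zfill(2)}-01"
    return date.fromisoformat(text)


print(year_month_to_date(1950.0, 1.0))  # 1950-01-01
```

Zero-padding the month matters because a string such as "1950-1-01" is not in the standard yyyy-MM-dd form that the cast to a date expects.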

Now that we have created unique identifiers for the individual and the time period, we pass the data through the PanelDataProcessor, specifying a value of 4 for 'TEST_INTERVALS' because we want to test on the last 4 periods. For the time being, we transform the time_id back to a numeric feature given constraints regarding datetime functionality.

In [0]:

test_intervals = 4
processor = PanelDataProcessor(data=df, config = {'TEST_INTERVALS': test_intervals}, shuffle_parts = 20)
processor.build_processed_data()
display(processor.data.limit(7))

Time identifier column name not given; assumed to be second-leftmost column (year-month)
Individual identifier column name not given; assumed to be leftmost column (country-leader)

country-leader	year-month	country	year	month	age	male	tenure_months	government	lastelection	loss	irregular	precip	couprisk	pctile_risk	_period	_predict_obs	_test	_validation	_maximum_lead	_duration	_event_observed
Afghanistan:Abdallah Yakta	1967-10-01	Afghanistan	1967.0	10.0	53.0	1	1.0	Monarchy	6.1246834	6.1246834	6.1246834	0.0187039816454769	null	null	213	false	false	true	629	1	true
Afghanistan:Abdallah Yakta	1967-11-01	Afghanistan	1967.0	11.0	53.0	1	2.0	Monarchy	6.126869	6.126869	6.126869	0.17923993006129	null	null	214	false	false	true	628	0	true
Afghanistan:Abdul Zahir	1971-06-01	Afghanistan	1971.0	6.0	61.0	1	1.0	Monarchy	6.216606	6.216606	6.216606	-1.80655395371697	null	null	257	false	false	false	585	18	true
Afghanistan:Abdul Zahir	1971-07-01	Afghanistan	1971.0	7.0	61.0	1	2.0	Monarchy	6.2186003	6.2186003	6.2186003	-1.79781203785734	null	null	258	false	false	false	584	17	true
Afghanistan:Abdul Zahir	1971-08-01	Afghanistan	1971.0	8.0	61.0	1	3.0	Monarchy	6.22059	6.22059	6.22059	-1.81513438987383	null	null	259	false	false	false	583	16	true
Afghanistan:Abdul Zahir	1971-09-01	Afghanistan	1971.0	9.0	61.0	1	4.0	Monarchy	6.222576	6.222576	6.222576	-1.84107666146101	null	null	260	false	false	false	582	15	true
Afghanistan:Abdul Zahir	1971-10-01	Afghanistan	1971.0	10.0	61.0	1	5.0	Monarchy	6.2245584	6.2245584	6.2245584	-1.90343827157157	null	null	261	false	false	false	581	14	true

Now, we build the model. You can also pass parameters to the modeler that will be passed on to LightGBM.

In [0]:

modeler = LGBSurvivalModeler(data=processor.data)
modeler.build_model(n_intervals=test_intervals)

Now we want to see how well our model performs on the test data and take a look at the predictions.

In [0]:

metrics = modeler.evaluate()

Now evaluating lead length: 1 of 4
2.2871768474578857 70.40783166885376 1.4782764911651611 0.20274138450622559 3.9578638076782227
Now evaluating lead length: 2 of 4
1.112673044204712 88.06950092315674 3.842557191848755 0.30141425132751465 9.198489665985107
Now evaluating lead length: 3 of 4
1.9427433013916016 95.69927430152893 2.6116440296173096 0.323838472366333 6.349407434463501
Now evaluating lead length: 4 of 4
1.8285760879516602 91.48771524429321 3.2123091220855713 0.21567392349243164 7.031679391860962

In [0]:

metrics

Out[31]:

	AUROC	Actual Share	Predicted Share	True Positives	False Negatives	False Positives	True Negatives
Lead Length
1	0.883505	0.974874	0.988642	194	0	4	1
2	0.932292	0.964824	0.969939	190	2	5	2
3	0.888304	0.954774	0.958899	188	2	7	2
4	0.836642	0.934673	0.938473	183	3	10	3
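As a quick check on how to read this table, the Actual Share column can be reproduced from the confusion-matrix counts: it equals (True Positives + False Negatives) divided by the total number of test observations. The sketch below assumes that "positive" means the leader survived through the lead length. The function name actual_share is hypothetical and is not part of fifeforspark.

```python
def actual_share(tp: int, fn: int, fp: int, tn: int) -> float:
    """Share of observations whose actual outcome is positive."""
    return (tp + fn) / (tp + fn + fp + tn)


# Counts copied from the metrics table above, keyed by lead length:
# (True Positives, False Negatives, False Positives, True Negatives)
counts = {
    1: (194, 0, 4, 1),
    2: (190, 2, 5, 2),
    3: (188, 2, 7, 2),
    4: (183, 3, 10, 3),
}

for lead, row in counts.items():
    print(lead, round(actual_share(*row), 6))
```

Each row sums to 199 test observations. At lead length 1 only 5 of them are actual negatives (False Positives plus True Negatives), so the AUROC values here rest on very few negative cases.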

And finally, we want to forecast future survival probabilities for country-leaders in the last period.

In [0]:

forecasts = modeler.forecast()
forecasts.head(10)

Out[32]:

	1-period Survival Probability	2-period Survival Probability	3-period Survival Probability	4-period Survival Probability
0	0.997037	0.996715	0.995111	0.992604
1	0.994377	0.990283	0.986637	0.977831
2	0.983984	0.978464	0.971796	0.954448
3	0.993038	0.977267	0.966908	0.964620
4	0.998790	0.997496	0.995501	0.993531
5	0.989612	0.987722	0.984450	0.968102
6	0.994747	0.977737	0.976569	0.974783
7	0.981061	0.980291	0.976630	0.968736
8	0.989349	0.986557	0.979686	0.976169
9	0.986199	0.452783	0.447913	0.440717
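Each row's probabilities never increase from left to right, which is consistent with reading the columns as cumulative survival probabilities. Under that assumption, dividing each value by the one before it gives the probability of surviving a given period conditional on having survived the previous ones. The helper conditional_survival below is a hypothetical illustration, not a fifeforspark function.

```python
from typing import List


def conditional_survival(cumulative: List[float]) -> List[float]:
    """Convert cumulative survival probabilities into per-period conditional ones."""
    conditional = []
    previous = 1.0  # survival to the forecast origin is certain
    for prob in cumulative:
        conditional.append(prob / previous)
        previous = prob
    return conditional


# Row 9 of the forecast table above
row_9 = [0.986199, 0.452783, 0.447913, 0.440717]
print([round(p, 4) for p in conditional_survival(row_9)])
```

For row 9, nearly all of the predicted risk falls in the second period: the conditional survival probability there is roughly 0.46, while the other periods are all above 0.98.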
