Make sure the file_date
parameter below is set to whatever value you would like appended to file names.
Change the most_recent_860_year
parameter below to match the most up-to-date EIA-860 annual data file. As of March 2018 this is 2016.
EIA-860 (annual) excel files will need to be downloaded and unzipped to the EIA downloads
folder. Make sure that all years from 2012 through the most recent data year are available. Also download the most recent EIA-860m to EIA downloads
.
The most recent annual 860 file available (as of March 2018) represents 2016 data. When newer EIA-860 annual files are added the dictionary with pandas read_excel
parameters will need to be updated. Note that EIA puts out an Early Release version of 860 with extra header rows and columns, so be sure to appropriately adjust the skiprows
and usecols
parameters if using an Early Release file.
The entire notebook can be run at once using Run All Cells
%matplotlib inline
import matplotlib.pyplot as plt
import os
from os.path import join
import pandas as pd
from sklearn import neighbors, metrics
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split, GridSearchCV
from collections import Counter
from copy import deepcopy
cwd = os.getcwd()
data_path = join(cwd, '..', 'Data storage')
This will be inserted into all filenames (reading and writing)
file_date = '2018-03-06'
most_recent_860_year = 2016
This loads facility data that has been assembled from the EIA bulk data file, and EIA-860 excel files. The EIA-860 excel files need to be downloaded manually.
Only need to keep the plant id, year (as a check that plants don't move between years), and lat/lon
path = os.path.join(data_path, 'Derived data',
'Facility gen fuels and CO2 {}.csv'.format(file_date))
facility_df = pd.read_csv(path)
facility_df['state'] = facility_df['geography'].str[-2:]
plants = facility_df.loc[:, ['plant id', 'year', 'lat', 'lon', 'state']]
plants.drop_duplicates(inplace=True)
Because the 2017 facility dataframe only includes annually reporting facilities I'm going to duplicate the plant id, lat/lon, and state information from 2016.
# make a copy of 2016 (or most recent annual data year) and change the year to
plants_2017 = plants.loc[plants['year'] == most_recent_860_year, :].copy()
plants_2017.loc[:, 'year'] += 1
plants = pd.concat([plants.loc[plants['year']<=most_recent_860_year, :], plants_2017])
(set(plants.loc[plants.year==2016, 'plant id']) - set(plants.loc[plants.year==2017, 'plant id']))
set()
Current NERCS go back to 2012. Use all annual 860 files from 2012 through the most recent available. Extend the dictionary of dictionaries below with any files available after 2016. io
, skiprows
, and usecols
are all input parameters for the Pandas read_excel
function.
eia_base_path = join(data_path, 'EIA downloads')
file_860_info = {
# 2011: {'io': join(eia_base_path, 'eia8602011', 'Plant.xlsx'),
# 'skiprows': 0,
# 'parse_cols': 'B,J'},
2012: {'io': join(eia_base_path, 'eia8602012', 'PlantY2012.xlsx'),
'skiprows': 0,
'usecols': 'B,J'},
2013: {'io': join(eia_base_path, 'eia8602013', '2___Plant_Y2013.xlsx'),
'skiprows': 0,
'usecols': 'C,L'},
2014: {'io': join(eia_base_path, 'eia8602014', '2___Plant_Y2014.xlsx'),
'skiprows': 0,
'usecols': 'C,L'},
2015: {'io': join(eia_base_path, 'eia8602015', '2___Plant_Y2015.xlsx'),
'skiprows': 0,
'usecols': 'C,L'},
2016: {'io': join(eia_base_path, 'eia8602016', '2___Plant_Y2016.xlsx'),
'skiprows': 0,
'usecols': 'C,L'}
}
eia_nercs = {}
for key, args in file_860_info.items():
eia_nercs[key] = pd.read_excel(**args)
eia_nercs[key].columns = ['plant id', 'nerc']
eia_nercs[key]['year'] = key
I want to assign NERC regions for every year. We have data for 2012 onward from the EIA-860 files. For the purpose of this analysis I'll assume that all years from 2001-2011 are the same NERC as 2012.
Also assume that values in 2017 are the same as in 2016. I'll fill in nerc values for plants that were built in 2017 below.
for year in range(2001, 2012):
# the pandas .copy() method is deep by default but I'm not sure in this case
df = deepcopy(eia_nercs[2012])
df['year'] = year
eia_nercs[year] = df
df = deepcopy(eia_nercs[2016])
df['year'] = 2017
eia_nercs[2017] = df
eia_nercs.keys()
dict_keys([2012, 2013, 2014, 2015, 2016, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2017])
eia_nercs[2001].head()
plant id | nerc | year | |
---|---|---|---|
0 | 10867 | SERC | 2001 |
1 | 50903 | RFC | 2001 |
2 | 10671 | SPP | 2001 |
3 | 2527 | NPCC | 2001 |
4 | 3305 | SERC | 2001 |
nercs = pd.concat(eia_nercs.values())
nercs.sort_values('year', inplace=True)
nercs.head()
plant id | nerc | year | |
---|---|---|---|
5465 | 56512 | RFC | 2001 |
4866 | 1481 | NPCC | 2001 |
4865 | 1480 | NPCC | 2001 |
4864 | 805 | NPCC | 2001 |
4863 | 54451 | WECC | 2001 |
(set(nercs.loc[(nercs.nerc == 'MRO') &
(nercs.year == 2016), 'plant id'])
- set(nercs.loc[(nercs.nerc == 'MRO') &
(nercs.year == 2017), 'plant id']))
set()
nercs.year.unique()
array([2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017])
** This may not matter anymore if I use NERC labels for each year **
There are 30 plants duplicated. Five of them don't have a NERC label in one of the years. The largest move is from MRO to other regions (12), with most of those to SPP (7). After that, moves from RFC (5) to MRO (3) and SERC (2). There are also some moves from WECC and FRCC to HICC/ASCC - these might be diesel generators that get moved.
The plants that have duplicate NERC region labels represent a small fraction of national generation, but one that is growing over time. By 2016 they consist of 0.15% of national generation.
for df_ in list(eia_nercs.values()) + [nercs]:
print('{} total records'.format(len(df_)))
print('{} unique plants'.format(len(df_['plant id'].unique())))
7289 total records 7289 unique plants 8060 total records 8060 unique plants 8520 total records 8520 unique plants 8928 total records 8928 unique plants 9711 total records 9711 unique plants 7289 total records 7289 unique plants 7289 total records 7289 unique plants 7289 total records 7289 unique plants 7289 total records 7289 unique plants 7289 total records 7289 unique plants 7289 total records 7289 unique plants 7289 total records 7289 unique plants 7289 total records 7289 unique plants 7289 total records 7289 unique plants 7289 total records 7289 unique plants 7289 total records 7289 unique plants 9711 total records 9711 unique plants 132398 total records 10068 unique plants
dup_plants = nercs.loc[nercs['plant id'].duplicated(keep=False), 'plant id'].unique()
dup_plants
array([56512, 1481, 1480, ..., 61001, 61003, 55160])
region_list = []
for plant in dup_plants:
regions = nercs.loc[nercs['plant id'] == plant, 'nerc'].unique()
# regions = regions.tolist()
region_list.append(regions)
Counter(tuple(x) for x in region_list)
Counter({('ASCC', nan): 2, ('ASCC',): 130, ('FRCC', 'HICC'): 1, ('FRCC',): 205, ('HICC',): 52, ('MRO', 'RFC'): 2, ('MRO', 'SERC'): 1, ('MRO', 'SPP'): 7, ('MRO', 'WECC'): 3, ('MRO',): 1051, ('NPCC',): 1257, ('RFC', 'MRO'): 3, ('RFC', 'SERC'): 2, ('RFC',): 1626, ('SERC', 'SPP'): 1, ('SERC',): 1832, ('SPP', 'SERC'): 2, ('SPP', 'TRE'): 1, ('SPP',): 470, ('TRE',): 461, ('WECC', 'ASCC'): 2, ('WECC', 'HICC'): 1, ('WECC',): 2892, (nan, 'WECC', 'ASCC'): 3, (nan, 'WECC', 'HICC'): 1, (nan,): 35})
(facility_df.loc[facility_df['plant id'].isin(dup_plants), :]
.groupby('year')['generation (MWh)'].sum()
/ facility_df.loc[:, :]
.groupby('year')['generation (MWh)'].sum())
year 2001 0.993784 2002 0.994703 2003 0.993925 2004 0.996498 2005 0.997682 2006 0.998378 2007 0.998707 2008 0.998873 2009 0.999267 2010 0.999767 2011 0.999990 2012 1.000000 2013 1.000000 2014 1.000000 2015 1.000000 2016 1.000000 2017 0.999998 Name: generation (MWh), dtype: float64
This is my training data. All of these plants should still be in my plants
dataframe.
nan_plants = {}
all_nan = []
years = nercs.year.unique()
for year in years:
nan_plants[year] = nercs.loc[(nercs.year == year) &
(nercs.isnull().any(axis=1)), 'plant id'].tolist()
all_nan.extend(nan_plants[year])
# number of plants that don't have a nerc in at least one year
len(all_nan)
232
# drop all the rows without a nerc value
nercs.dropna(inplace=True)
nan_plants[2017]
[58651, 58656, 58659, 58662, 58639, 58640, 58549, 58684, 58277, 58380, 58425, 58405, 60563, 60588, 60260, 60250, 60243, 60244, 60245, 60328, 61166, 61172, 61364, 60814, 61099, 61101, 61068, 58989, 58982, 58977, 58837, 59035, 60125, 60024, 66, 70]
The EIA-860m (monthly) data file has an up-to-date list of all operating power plants and their associated balancing authority. It does not list the NERC region, so it can't be used to assign NERC labels for all plants. But in some cases knowing the state and balancing authority is enough to make a good guess about which NERC a plant is in.
Assigning NERC region labels has the lowest accuracy for plants in SPP and TRE. To compensate, I'm going to assume that anything in TX or OK and SWPP balancing authority is in SPP. On the flip side, if it's in TX and ERCOT I'll assign it to TRE.
Only do this for plants that come online since the most recent 860 annual data.
NOTE Because I'm looking at units that came online in 2017 some of the plant ids will already exist
path = join(data_path, 'EIA downloads', 'december_generator2017.xlsx')
# Check the excel file columns if there is a read error. They should match
# the plant id, plant state, operating year, and balancing authority code.
_m860 = pd.read_excel(path, sheet_name='Operating',skip_footer=1,
usecols='C,F,P,AE', skiprows=0)
_m860.columns = _m860.columns.str.lower()
# most_recent_860_year is defined at the top of this notebook
# The goal here is to only look at plants that started operating after
# the most recent annual data. So only include units starting after
# the last annual data and that don't have plant ids in the nercs
# dataframe
m860 = _m860.loc[(_m860['operating year'] > most_recent_860_year)].copy() #&
# (~_m860['plant id'].isin(nercs['plant id'].unique()))].copy()
m860.tail()
plant id | plant state | operating year | balancing authority code | |
---|---|---|---|---|
21373 | 61632 | IA | 2017 | MISO |
21374 | 61633 | MA | 2017 | ISNE |
21375 | 61634 | MA | 2017 | ISNE |
21376 | 61635 | MA | 2017 | ISNE |
21377 | 61636 | MA | 2017 | ISNE |
m860.loc[(m860['plant state'].isin(['TX', 'OK'])) &
(m860['balancing authority code'] == 'SWPP'), 'nerc'] = 'SPP'
m860.loc[(m860['plant state'].isin(['TX'])) &
(m860['balancing authority code'] == 'ERCO'), 'nerc'] = 'TRE'
# Drop all rows except the ones I've labeled as TRE or SPP
m860.dropna(inplace=True)
m860.head()
plant id | plant state | operating year | balancing authority code | nerc | |
---|---|---|---|---|---|
343 | 165 | OK | 2017 | SWPP | SPP |
344 | 165 | OK | 2017 | SWPP | SPP |
4836 | 2953 | OK | 2017 | SWPP | SPP |
4837 | 2953 | OK | 2017 | SWPP | SPP |
4838 | 2953 | OK | 2017 | SWPP | SPP |
Make lists of plant codes for SPP and TRE facilities
nercs.head()
plant id | nerc | year | |
---|---|---|---|
5465 | 56512 | RFC | 2001 |
4866 | 1481 | NPCC | 2001 |
4865 | 1480 | NPCC | 2001 |
4864 | 805 | NPCC | 2001 |
4863 | 54451 | WECC | 2001 |
# Create additional dataframes with 2017 SPP and TRE plants.
# Use these to fill in values for 2017 plants
m860_spp_plants = (m860.loc[m860['nerc'] == 'SPP', 'plant id']
.drop_duplicates()
.reset_index(drop=True))
additional_spp = pd.DataFrame(m860_spp_plants.copy())
# additional_spp['plant id'] = m860_spp_plants
additional_spp['nerc'] = 'SPP'
additional_spp['year'] = 2017
m860_tre_plants = (m860.loc[m860['nerc'] == 'TRE', 'plant id']
.drop_duplicates()
.reset_index(drop=True))
additional_tre = pd.DataFrame(m860_tre_plants)
# additional_tre['plant id'] = m860_tre_plants
additional_tre['nerc'] = 'TRE'
additional_tre['year'] = 2017
additional_spp
plant id | nerc | year | |
---|---|---|---|
0 | 165 | SPP | 2017 |
1 | 2953 | SPP | 2017 |
2 | 60414 | SPP | 2017 |
3 | 61221 | SPP | 2017 |
4 | 61261 | SPP | 2017 |
5 | 61614 | SPP | 2017 |
6 | 61615 | SPP | 2017 |
7 | 61616 | SPP | 2017 |
8 | 61617 | SPP | 2017 |
9 | 61618 | SPP | 2017 |
additional_tre
plant id | nerc | year | |
---|---|---|---|
0 | 56984 | TRE | 2017 |
1 | 59066 | TRE | 2017 |
2 | 59193 | TRE | 2017 |
3 | 59206 | TRE | 2017 |
4 | 59245 | TRE | 2017 |
5 | 59712 | TRE | 2017 |
6 | 59812 | TRE | 2017 |
7 | 60122 | TRE | 2017 |
8 | 60210 | TRE | 2017 |
9 | 60217 | TRE | 2017 |
10 | 60366 | TRE | 2017 |
11 | 60372 | TRE | 2017 |
12 | 60436 | TRE | 2017 |
13 | 60459 | TRE | 2017 |
14 | 60460 | TRE | 2017 |
15 | 60581 | TRE | 2017 |
16 | 60682 | TRE | 2017 |
17 | 60690 | TRE | 2017 |
18 | 60774 | TRE | 2017 |
19 | 60901 | TRE | 2017 |
20 | 60902 | TRE | 2017 |
21 | 60983 | TRE | 2017 |
22 | 60989 | TRE | 2017 |
23 | 61205 | TRE | 2017 |
24 | 61309 | TRE | 2017 |
25 | 61362 | TRE | 2017 |
26 | 61409 | TRE | 2017 |
27 | 61410 | TRE | 2017 |
28 | 61411 | TRE | 2017 |
nercs = pd.concat([nercs, additional_spp, additional_tre])
plants.head()
plant id | year | lat | lon | state | |
---|---|---|---|---|---|
0 | 1001 | 2017 | 39.9242 | -87.4244 | IN |
12 | 1001 | 2016 | 39.9242 | -87.4244 | IN |
24 | 1001 | 2015 | 39.9242 | -87.4244 | IN |
36 | 1001 | 2014 | 39.9242 | -87.4244 | IN |
48 | 1001 | 2013 | 39.9242 | -87.4244 | IN |
nercs.tail()
plant id | nerc | year | |
---|---|---|---|
24 | 61309 | TRE | 2017 |
25 | 61362 | TRE | 2017 |
26 | 61409 | TRE | 2017 |
27 | 61410 | TRE | 2017 |
28 | 61411 | TRE | 2017 |
Checked to make sure the type of merge doesn't matter once rows without nerc values are dropped
df = pd.merge(plants, nercs, on=['plant id', 'year'], how='left')
omitted = set(df['plant id'].unique()) - set(nercs['plant id'].unique())
df.head()
plant id | year | lat | lon | state | nerc | |
---|---|---|---|---|---|---|
0 | 1001 | 2016 | 39.9242 | -87.4244 | IN | RFC |
1 | 1001 | 2015 | 39.9242 | -87.4244 | IN | RFC |
2 | 1001 | 2014 | 39.9242 | -87.4244 | IN | RFC |
3 | 1001 | 2013 | 39.9242 | -87.4244 | IN | RFC |
4 | 1001 | 2012 | 39.9242 | -87.4244 | IN | RFC |
df.tail()
plant id | year | lat | lon | state | nerc | |
---|---|---|---|---|---|---|
100851 | 59957 | 2017 | 41.226900 | -88.683600 | IL | RFC |
100852 | 59257 | 2017 | 37.305833 | -121.755000 | CA | WECC |
100853 | 59256 | 2017 | 38.393333 | -121.927778 | CA | WECC |
100854 | 10475 | 2017 | 41.683600 | -87.423300 | IN | RFC |
100855 | 55370 | 2017 | 39.851389 | -79.070556 | PA | RFC |
df.columns
Index(['plant id', 'year', 'lat', 'lon', 'state', 'nerc'], dtype='object')
Drop plants that don't have lat/lon data (using just lon to check), and then drop duplicates. If any plants have kept the same plant id but moved over time (maybe a diesel generator?) or switched NERC they will show up twice.
cols = ['plant id', 'lat', 'lon', 'nerc', 'state', 'year']
df_slim = (df.loc[:, cols].dropna(subset=['lon'])
.drop_duplicates(subset=['plant id', 'year', 'nerc']))
df_slim.tail()
plant id | lat | lon | nerc | state | year | |
---|---|---|---|---|---|---|
100851 | 59957 | 41.226900 | -88.683600 | RFC | IL | 2017 |
100852 | 59257 | 37.305833 | -121.755000 | WECC | CA | 2017 |
100853 | 59256 | 38.393333 | -121.927778 | WECC | CA | 2017 |
100854 | 10475 | 41.683600 | -87.423300 | RFC | IN | 2017 |
100855 | 55370 | 39.851389 | -79.070556 | RFC | PA | 2017 |
Separate out the list of plants where we don't have NERC labels from EIA-860.
unknown = df_slim.loc[df_slim.nerc.isnull()].copy()
print("{} plants/years don't have NERC labels\n".format(len(unknown)))
print(unknown.head())
1579 plants/years don't have NERC labels plant id lat lon nerc state year 28205 3823 38.27 -78.035 NaN VA 2009 28206 3823 38.27 -78.035 NaN VA 2005 28207 3823 38.27 -78.035 NaN VA 2004 28208 3823 38.27 -78.035 NaN VA 2003 28209 3823 38.27 -78.035 NaN VA 2002
unknown.tail()
plant id | lat | lon | nerc | state | year | |
---|---|---|---|---|---|---|
100814 | 50712 | 37.734100 | -121.6516 | NaN | CA | 2017 |
100817 | 55162 | 29.096900 | -81.0689 | NaN | FL | 2017 |
100818 | 56171 | 34.037200 | -117.5569 | NaN | CA | 2017 |
100822 | 7939 | 37.771400 | -75.6342 | NaN | VA | 2017 |
100825 | 58916 | 34.248056 | -116.0600 | NaN | CA | 2017 |
X is lat/lon and year
y is the NERC label
For both, I'm only using plants where we have all data (no NaN
s). Not doing any transformation of the lat/lon at this time.
X = df_slim.loc[df_slim.notnull().all(axis=1), ['lat', 'lon', 'year']]
y = df_slim.loc[df_slim.notnull().all(axis=1), 'nerc']
len(X)
99127
# Make sure that unknown and X include all records from df_slim
len(X) + len(unknown) - len(df_slim)
0
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.33, random_state=42)
I previously used k-nearest neighbors with just lat/lon as input features. The problem is that some facilities don't have lat/lon data. They do usually have a state geography label though. Categorical labels don't work well in KNN, but the combination of lat/lon and a state label will work well in a tree model. RandomForest is usually a quite effective tree model and my results are more accurate with this than they were with KNN.
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
params = dict(
n_estimators = [5, 10, 25, 50],
min_samples_split = [2, 5, 10],
min_samples_leaf = [1, 3, 5],
)
clf_rf = GridSearchCV(rf, params, n_jobs=-1, iid=False, verbose=1)
clf_rf.fit(X_train, y_train)
Fitting 3 folds for each of 36 candidates, totalling 108 fits
[Parallel(n_jobs=-1)]: Done 42 tasks | elapsed: 32.3s [Parallel(n_jobs=-1)]: Done 108 out of 108 | elapsed: 1.3min finished
GridSearchCV(cv=None, error_score='raise', estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini', max_depth=None, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1, oob_score=False, random_state=None, verbose=0, warm_start=False), fit_params=None, iid=False, n_jobs=-1, param_grid={'n_estimators': [5, 10, 25, 50], 'min_samples_split': [2, 5, 10], 'min_samples_leaf': [1, 3, 5]}, pre_dispatch='2*n_jobs', refit=True, return_train_score='warn', scoring=None, verbose=1)
clf_rf.best_estimator_, clf_rf.best_score_
(RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini', max_depth=None, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=50, n_jobs=1, oob_score=False, random_state=None, verbose=0, warm_start=False), 0.9797785431782634)
clf_rf.score(X_test, y_test)
0.9822695035460993
nerc_labels = nercs.nerc.dropna().unique()
Accuracy score by region
for region in nerc_labels:
mask = y_test == region
X_masked = X_test[mask]
y_hat_masked = clf_rf.predict(X_masked)
y_test_masked = y_test[mask]
accuracy = metrics.accuracy_score(y_test_masked, y_hat_masked)
print('{} : {}'.format(region, accuracy))
RFC : 0.9707198806415517 NPCC : 0.9907823209643111 WECC : 0.9987963672174198 SERC : 0.9690892364305428 SPP : 0.9328663164039697 TRE : 0.9903057419835943 MRO : 0.9853683148335015 FRCC : 0.9768707482993197 ASCC : 0.9965986394557823 HICC : 1.0
F1 score by region
y_hat = clf_rf.predict(X_test)
for region in nerc_labels:
f1 = metrics.f1_score(y_test, y_hat, labels=[region], average='macro')
print('{} : {}'.format(region, f1))
RFC : 0.9682820202771835 NPCC : 0.9916026020106445 WECC : 0.9985778361229624 SERC : 0.9707861026633492 SPP : 0.9461219656601539 TRE : 0.986627043090639 MRO : 0.9798068481123793 FRCC : 0.9835616438356164 ASCC : 0.9982964224872232 HICC : 1.0
Use just the state for plants that don't have lat/lon info. Less accurate, especially where NERC regions cross state lines, but better than nothing.
Need to start with the lon
column so I can filter to only unknown facilities that also don't have lon
cols = ['plant id', 'nerc', 'state', 'year', 'lon']
df_state_slim = (df.loc[:, cols].dropna(subset=['state']).copy())
df_state_slim.head()
plant id | nerc | state | year | lon | |
---|---|---|---|---|---|
0 | 1001 | RFC | IN | 2016 | -87.4244 |
1 | 1001 | RFC | IN | 2015 | -87.4244 |
2 | 1001 | RFC | IN | 2014 | -87.4244 |
3 | 1001 | RFC | IN | 2013 | -87.4244 |
4 | 1001 | RFC | IN | 2012 | -87.4244 |
len(df_state_slim)
100831
le = LabelEncoder()
df_state_slim.loc[:, 'enc state'] = le.fit_transform(df_state_slim.loc[:, 'state'].tolist())
len(df_state_slim)
100831
unknown_state = df_state_slim.loc[(df_state_slim.nerc.isnull()) &
(df_state_slim.lon.isnull())].copy()
len(unknown_state), len(unknown)
(116, 1579)
X_state = df_state_slim.loc[df_state_slim.notnull().all(axis=1), ['enc state', 'year']].copy()
y_state = df_state_slim.loc[df_state_slim.notnull().all(axis=1), 'nerc'].copy()
X_state_train, X_state_test, y_state_train, y_state_test = train_test_split(
X_state, y_state, test_size=0.33, random_state=42)
rf = RandomForestClassifier()
params = dict(
n_estimators = [5, 10, 25, 50],
min_samples_split = [2, 5, 10],
min_samples_leaf = [1, 3, 5],
)
clf_rf_state = GridSearchCV(rf, params, n_jobs=-1, iid=False, verbose=1)
clf_rf_state.fit(X_state_train, y_state_train)
Fitting 3 folds for each of 36 candidates, totalling 108 fits
[Parallel(n_jobs=-1)]: Done 42 tasks | elapsed: 17.1s [Parallel(n_jobs=-1)]: Done 108 out of 108 | elapsed: 45.4s finished
GridSearchCV(cv=None, error_score='raise', estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini', max_depth=None, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1, oob_score=False, random_state=None, verbose=0, warm_start=False), fit_params=None, iid=False, n_jobs=-1, param_grid={'n_estimators': [5, 10, 25, 50], 'min_samples_split': [2, 5, 10], 'min_samples_leaf': [1, 3, 5]}, pre_dispatch='2*n_jobs', refit=True, return_train_score='warn', scoring=None, verbose=1)
clf_rf_state.best_estimator_, clf_rf_state.best_score_
(RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini', max_depth=None, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=10, min_weight_fraction_leaf=0.0, n_estimators=50, n_jobs=1, oob_score=False, random_state=None, verbose=0, warm_start=False), 0.9402882327958322)
clf_rf_state.score(X_state_test, y_state_test)
0.9387418230726906
Accuracy score by region
nerc_labels = nercs.nerc.dropna().unique()
for region in nerc_labels:
mask = y_state_test == region
X_state_masked = X_state_test[mask]
y_state_hat_masked = clf_rf_state.predict(X_state_masked)
y_state_test_masked = y_state_test[mask]
accuracy = metrics.accuracy_score(y_state_test_masked, y_state_hat_masked)
print('{} : {}'.format(region, accuracy))
RFC : 0.9243697478991597 NPCC : 0.9938519744620478 WECC : 0.9943912900032993 SERC : 0.8831431726168568 SPP : 0.5765453495089543 TRE : 1.0 MRO : 0.9614311088556204 FRCC : 0.9973992197659298 ASCC : 1.0 HICC : 1.0
F1 score by region
y_state_hat = clf_rf_state.predict(X_state_test)
for region in nerc_labels:
f1 = metrics.f1_score(y_state_test, y_state_hat, labels=[region], average='macro')
print('{} : {}'.format(region, f1))
RFC : 0.9088405397961994 NPCC : 0.9958535718516763 WECC : 0.9933534743202416 SERC : 0.8819259395387301 SPP : 0.7300658376005852 TRE : 0.8763183125599233 MRO : 0.9512929952297263 FRCC : 0.962962962962963 ASCC : 0.9991474850809888 HICC : 0.9975669099756691
unknown.loc[:, 'nerc'] = clf_rf.predict(unknown.loc[:, ['lat', 'lon', 'year']])
unknown_state.loc[:, 'nerc'] = clf_rf_state.predict(unknown_state.loc[:, ['enc state', 'year']])
Ensuring that no plants in Alaska or Hawaii are assigned to continental NERCs, or the other way around.
print(unknown.loc[unknown.state.isin(['AK', 'HI']), 'nerc'].unique())
print(unknown.loc[unknown.nerc.isin(['HICC', 'ASCC']), 'state'].unique())
['ASCC' 'HICC' 'WECC'] ['AK' 'HI']
Counter(unknown['nerc'])
Counter({'ASCC': 44, 'FRCC': 6, 'HICC': 39, 'MRO': 67, 'NPCC': 264, 'RFC': 202, 'SERC': 349, 'SPP': 99, 'TRE': 145, 'WECC': 364})
unknown.head()
plant id | lat | lon | nerc | state | year | |
---|---|---|---|---|---|---|
28205 | 3823 | 38.27 | -78.035 | SERC | VA | 2009 |
28206 | 3823 | 38.27 | -78.035 | SERC | VA | 2005 |
28207 | 3823 | 38.27 | -78.035 | SERC | VA | 2004 |
28208 | 3823 | 38.27 | -78.035 | SERC | VA | 2003 |
28209 | 3823 | 38.27 | -78.035 | SERC | VA | 2002 |
unknown_state.head()
plant id | nerc | state | year | lon | enc state | |
---|---|---|---|---|---|---|
84867 | 10851 | RFC | NJ | 2006 | NaN | 31 |
84868 | 10851 | RFC | NJ | 2005 | NaN | 31 |
84869 | 10851 | RFC | NJ | 2004 | NaN | 31 |
84870 | 10851 | RFC | NJ | 2002 | NaN | 31 |
84871 | 10851 | RFC | NJ | 2001 | NaN | 31 |
nercs.tail()
plant id | nerc | year | |
---|---|---|---|
24 | 61309 | TRE | 2017 |
25 | 61362 | TRE | 2017 |
26 | 61409 | TRE | 2017 |
27 | 61410 | TRE | 2017 |
28 | 61411 | TRE | 2017 |
unknown.head()
plant id | lat | lon | nerc | state | year | |
---|---|---|---|---|---|---|
28205 | 3823 | 38.27 | -78.035 | SERC | VA | 2009 |
28206 | 3823 | 38.27 | -78.035 | SERC | VA | 2005 |
28207 | 3823 | 38.27 | -78.035 | SERC | VA | 2004 |
28208 | 3823 | 38.27 | -78.035 | SERC | VA | 2003 |
28209 | 3823 | 38.27 | -78.035 | SERC | VA | 2002 |
unknown_state.tail()
plant id | nerc | state | year | lon | enc state | |
---|---|---|---|---|---|---|
92615 | 10257 | WECC | CA | 2004 | NaN | 4 |
92616 | 10257 | WECC | CA | 2003 | NaN | 4 |
92617 | 10257 | WECC | CA | 2002 | NaN | 4 |
92618 | 10257 | WECC | CA | 2001 | NaN | 4 |
92782 | 56508 | WECC | CA | 2007 | NaN | 4 |
len(unknown_state['plant id'].unique())
31
df_slim.head()
plant id | lat | lon | nerc | state | year | |
---|---|---|---|---|---|---|
0 | 1001 | 39.9242 | -87.4244 | RFC | IN | 2016 |
1 | 1001 | 39.9242 | -87.4244 | RFC | IN | 2015 |
2 | 1001 | 39.9242 | -87.4244 | RFC | IN | 2014 |
3 | 1001 | 39.9242 | -87.4244 | RFC | IN | 2013 |
4 | 1001 | 39.9242 | -87.4244 | RFC | IN | 2012 |
labeled = pd.concat([df_slim.loc[df_slim.notnull().all(axis=1)],
unknown,
unknown_state.loc[:, ['plant id', 'nerc', 'state', 'year']]])
labeled.tail()
lat | lon | nerc | plant id | state | year | |
---|---|---|---|---|---|---|
92615 | NaN | NaN | WECC | 10257 | CA | 2004 |
92616 | NaN | NaN | WECC | 10257 | CA | 2003 |
92617 | NaN | NaN | WECC | 10257 | CA | 2002 |
92618 | NaN | NaN | WECC | 10257 | CA | 2001 |
92782 | NaN | NaN | WECC | 56508 | CA | 2007 |
labeled.loc[labeled.nerc.isnull()]
lat | lon | nerc | plant id | state | year |
---|
There are 7 facilities that don't show up in my labeled data.
facility_df.loc[~facility_df['plant id'].isin(labeled['plant id']), 'plant id'].unique()
array([57116, 57794, 58690, 58236, 58098, 57913, 57400, 57628, 61084, 60539, 60540, 60991, 61079, 60383, 61172, 61020, 61221, 61021, 61022, 60682, 60688, 61407, 60689, 61330, 61357, 60414, 60366, 60983, 60989, 60372, 60658, 60581, 60583, 60901, 60902, 60905, 60883, 60885, 60856, 60552, 60467, 60145, 60152, 59308, 59220, 59309, 59315, 60237, 60306, 60303, 60340, 59665, 59666, 59684, 59827, 59061, 60033, 59066, 60210, 60261, 61039, 61040, 61048, 60258, 61050, 60346, 59689, 59690, 59691, 59764, 61261, 61268, 61303, 59812, 59875, 60043, 59888, 61561, 60122, 59245, 59193, 59004, 60655, 60987, 61512, 59206, 60436, 60217, 59712, 59940, 60785, 61222, 61422, 55314, 55952, 7704, 6339, 55975, 55073, 50728, 60569, 60570, 61197, 60690, 60506])
len(labeled), len(nercs)
(100822, 132244)
nerc_labels
array(['RFC', 'NPCC', 'WECC', 'SERC', 'SPP', 'TRE', 'MRO', 'FRCC', 'ASCC', 'HICC'], dtype=object)
mro_2016 = set(labeled.loc[(labeled.nerc == 'MRO') &
(labeled.year == 2016), 'plant id'])
mro_2017 = set(labeled.loc[(labeled.nerc == 'MRO') &
(labeled.year == 2017), 'plant id'])
(set(nercs.loc[(nercs.nerc=='MRO') &
(nercs.year==2017),'plant id'])
- mro_2017)
{1052, 1058, 1175, 1189, 1218, 1771, 1889, 1914, 1918, 1932, 1960, 1995, 2008, 2217, 2791, 2821, 2822, 3334, 3343, 3347, 3348, 3996, 4048, 4054, 4062, 4121, 4140, 7376, 7377, 7602, 7706, 7882, 7956, 8014, 8025, 8057, 10476, 54713, 54930, 55315, 55638, 55764, 55834, 56072, 56183, 56366, 57048, 57116, 57255, 57659, 58236, 58434, 58903, 59053, 59197, 59223, 59224, 59225, 59226, 59227, 59228, 59230, 59231, 59307, 59684, 59875, 59902, 59903, 60066, 60203, 60254, 60503, 60504, 60505, 60519, 60520, 60521, 60522, 60523, 60524, 60525, 60526, 60527, 60528, 60529, 60530, 60531, 60532, 60533, 60534, 60564, 60595, 60631, 60632, 60647, 60674, 60694, 60695, 60711, 60712, 60713, 60714, 60715, 60716, 60717, 60795, 60823, 60830, 60832, 60833, 60834, 60835, 60836, 60837, 60838, 60873, 60887, 60888, 60889, 60890, 60891, 60892, 60893, 60894, 60895, 60905, 60934, 60935, 60936, 60937, 60938, 60939, 60940, 60941, 60942, 60943, 60944, 60951, 60955, 60957, 60958, 60962, 60966, 60971, 60977, 61046, 61047, 61056, 61057, 61058, 61059, 61060, 61070, 61071, 61072, 61077, 61078, 61079, 61138, 61139, 61140, 61141, 61142, 61144, 61174, 61175, 61176, 61177, 61178, 61179, 61180, 61181, 61182, 61183, 61203, 61328, 61329, 61357, 61363, 61379, 61380, 61381, 61382, 61383, 61384, 61426, 61427, 61428}
for nerc in nerc_labels:
l = len((set(labeled.loc[(labeled.nerc == nerc) &
(labeled.year == 2016), 'plant id'])
- set(labeled.loc[(labeled.nerc == nerc) &
(labeled.year == 2017), 'plant id'])))
print('{} plants dropped in {}'.format(l, nerc))
0 plants dropped in RFC 0 plants dropped in NPCC 0 plants dropped in WECC 0 plants dropped in SERC 0 plants dropped in SPP 0 plants dropped in TRE 0 plants dropped in MRO 0 plants dropped in FRCC 0 plants dropped in ASCC 0 plants dropped in HICC
(set(labeled.loc[(labeled.nerc == 'MRO') &
(labeled.year == 2016), 'plant id'])
- set(labeled.loc[(labeled.nerc == 'MRO') &
(labeled.year == 2017), 'plant id']))
set()
path = join(data_path, 'Facility labels', 'Facility locations_RF.csv')
labeled.to_csv(path, index=False)