This notebook provides an example of training an SDG text classifier using scikit-learn
and the OSDG Community Dataset.
# standard library
from typing import List
# data wrangling
import numpy as np
import pandas as pd
# visualisation
import plotly.express as px
import plotly.io as pio
# nlp
import spacy
# data modelling
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, top_k_accuracy_score, f1_score
# utils
from tqdm import tqdm
# local packages
from helpers import plot_confusion_matrix, get_top_features, fix_sdg_name
print('Loaded!')
Loaded!
# other settings
pio.templates.default = 'plotly_white'
spacy.prefer_gpu()
nlp = spacy.load('en_core_web_sm', disable = ['ner'])
print('Disabled spaCy components:', nlp.disabled)
print('SpaCy version:', spacy.__version__)
Disabled spaCy components: ['senter', 'ner']
SpaCy version: 3.1.3
In this section, we will explore the data and select texts for training.
df_osdg = pd.read_csv('https://zenodo.org/record/5550238/files/osdg-community-dataset-v21-09-30.csv?download=1', sep='\t')
print('Shape:', df_osdg.shape)
display(df_osdg.head())
Shape: (32121, 7)
 | doi | text_id | text | sdg | labels_negative | labels_positive | agreement |
---|---|---|---|---|---|---|---|
0 | 10.6027/9789289342698-7-en | 00021941702cd84171ff33962197ca1f | From a gender perspective, Paulgaard points ou... | 5 | 1 | 7 | 0.750000 |
1 | 10.18356/eca72908-en | 00028349a7f9b2485ff344ae44ccfd6b | Labour legislation regulates maximum working h... | 11 | 2 | 1 | 0.333333 |
2 | 10.1787/9789264289062-4-en | 0004eb64f96e1620cd852603d9cbe4d4 | The average figure also masks large difference... | 3 | 1 | 6 | 0.714286 |
3 | 10.1787/5k9b7bn5qzvd-en | 0006a887475ccfa5a7f5f51d4ac83d02 | The extent to which they are akin to corruptio... | 3 | 1 | 2 | 0.333333 |
4 | 10.1787/9789264258211-6-en | 0006d6e7593776abbdf4a6f985ea6d95 | A region reporting a higher rate will not earn... | 3 | 2 | 2 | 0.000000 |
The agreement is calculated using the following formula:
$$ agreement = \frac{|labels_{pos} - labels_{neg}|}{labels_{pos} + labels_{neg}} $$

The resulting values range between 0 (equally split votes) and 1 (unanimous votes, whether all positive or all negative).
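As a quick sanity check, the formula can be reproduced directly from the vote counts (a small sketch with made-up counts, not part of the original notebook):

```python
import pandas as pd

# hypothetical vote counts mirroring the dataset's columns
votes = pd.DataFrame({
    'labels_positive': [7, 1, 3],
    'labels_negative': [1, 2, 3],
})

# |pos - neg| / (pos + neg), as in the formula above
votes['agreement'] = (
    (votes['labels_positive'] - votes['labels_negative']).abs()
    / (votes['labels_positive'] + votes['labels_negative'])
)
print(votes['agreement'].round(6).tolist())  # [0.75, 0.333333, 0.0]
```

The first row reproduces the agreement of 0.75 seen in the dataset preview above (7 positive votes, 1 negative).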
# calculating cumulative probability over agreement scores
df_lambda = df_osdg['agreement'].value_counts(normalize = True).sort_index().cumsum().to_frame(name = 'p_sum')
df_lambda.reset_index(inplace = True)
df_lambda.rename({'index': 'agreement'}, axis = 1, inplace = True)
print('Shape:', df_lambda.shape)
display(df_lambda.head())
Shape: (36, 2)
 | agreement | p_sum |
---|---|---|
0 | 0.000000 | 0.033716 |
1 | 0.062432 | 0.033747 |
2 | 0.067982 | 0.033779 |
3 | 0.090909 | 0.033872 |
4 | 0.111111 | 0.050185 |
fig = px.line(
data_frame = df_lambda,
x = 'agreement',
y = 'p_sum',
markers = True,
labels = {
'agreement': 'Agreement Score',
'p_sum': 'Cumulative Probability'
},
color_discrete_sequence = ['#1f77b4'],
title = 'Figure 1. Cumulative Distribution Function of the Agreement Score'
)
fig.update_traces(hovertemplate = 'Agreement score: %{x:.2f}<br>Cumulative probability: %{y:.2f}')
fig.update_layout(
xaxis = {'dtick': 0.1},
yaxis = {'dtick': 0.25}
)
fig.show()
# keeping only the texts whose suggested sdg label is accepted and the agreement score is at least .6
print('Shape before:', df_osdg.shape)
df_osdg = df_osdg.query('agreement >= .6 and labels_positive > labels_negative').copy()
print('Shape after :', df_osdg.shape)
display(df_osdg.head())
Shape before: (32121, 7)
Shape after : (17233, 7)
 | doi | text_id | text | sdg | labels_negative | labels_positive | agreement |
---|---|---|---|---|---|---|---|
0 | 10.6027/9789289342698-7-en | 00021941702cd84171ff33962197ca1f | From a gender perspective, Paulgaard points ou... | 5 | 1 | 7 | 0.750000 |
2 | 10.1787/9789264289062-4-en | 0004eb64f96e1620cd852603d9cbe4d4 | The average figure also masks large difference... | 3 | 1 | 6 | 0.714286 |
7 | 10.1787/9789264117563-8-en | 000bfb17e9f3a00d4515ab59c5c487e7 | The Israel Oceanographic and Limnological Rese... | 6 | 0 | 3 | 1.000000 |
8 | 10.18356/805b1ae4-en | 001180f5dd9a821e651ed51e30d0cf8c | Previous chapters have discussed ways to make ... | 2 | 0 | 3 | 1.000000 |
11 | 10.1787/9789264310278-en | 001f1aee4013cb098da17a979c38bc57 | Prescription rates appear to be higher where l... | 8 | 0 | 3 | 1.000000 |
df_lambda = df_osdg.groupby('sdg', as_index = False).agg(count = ('text_id', 'count'))
df_lambda['share'] = df_lambda['count'].divide(df_lambda['count'].sum()).multiply(100)
print('Shape:', df_lambda.shape)
display(df_lambda.head())
Shape: (15, 3)
 | sdg | count | share |
---|---|---|---|
0 | 1 | 1146 | 6.650032 |
1 | 2 | 827 | 4.798932 |
2 | 3 | 1854 | 10.758429 |
3 | 4 | 2324 | 13.485754 |
4 | 5 | 2286 | 13.265247 |
fig = px.bar(
data_frame = df_lambda,
x = 'sdg',
y = 'count',
custom_data = ['share'],
labels = {
'sdg': 'SDG',
'count': 'Count'
},
color_discrete_sequence = ['#1f77b4'],
title = 'Figure 2. Distribution of Texts (Agreement ≥ .6) over SDGs'
)
fig.update_traces(hovertemplate = 'SDG %{x}<br>Count: %{y}<br>Share: %{customdata:.2f}%')
fig.update_layout(xaxis = {'type': 'category'})
fig.show()
Using the preselected data, we can easily train a performant binary classifier to distinguish texts relevant to a specific SDG. For a binary classification problem, we select all instances of a certain SDG and randomly undersample the remaining instances to balance things out.
Highlights:

  * Only the 1000 best features are retained, which reduces overfitting.
  * The resulting classifier scores well above .9 on the held-out test set.

sdg = 5 # selecting an sdg of interest
# undersampling the rest of sdgs to balance out the instances
mask = df_osdg['sdg'].eq(sdg).values
df_train = df_osdg.groupby(mask).sample(mask.sum(), random_state = 42)
print('Shape:', df_train.shape)
display(df_train.head())
Shape: (4572, 7)
 | doi | text_id | text | sdg | labels_negative | labels_positive | agreement |
---|---|---|---|---|---|---|---|
22034 | 10.1787/9789264096356-en | af2366805b3dcdd487b0e2d11575f372 | That said nuclear plants just as coal-fired po... | 7 | 1 | 6 | 0.714286 |
30743 | 10.6027/f960d1a8-en | f50faa4a6d449a8b1573bb6a842c9533 | The OECD average has also declined over this p... | 4 | 0 | 3 | 1.000000 |
27230 | 10.1787/9789264279551-4-en | d88c937f2f1aa2800659ece81c68fd86 | It then applies this approach at a global leve... | 6 | 0 | 5 | 1.000000 |
7419 | 10.18356/6af97a78-en | 3c68358cda382c326c8b19ae13ca80de | Conflict between Pakistan and India over distr... | 6 | 0 | 3 | 1.000000 |
5350 | 10.1787/agr/pol-2011-7-en | 2c0bdec7b8283396a8b9b44a868e1e40 | Both rural incomes and food supplies would imp... | 2 | 0 | 3 | 1.000000 |
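The `groupby(mask).sample(...)` idiom above can look opaque: grouping by a boolean array splits the frame into the positive class (`True`) and everything else (`False`), and drawing `mask.sum()` rows from each group yields a perfectly balanced frame. A minimal sketch on a toy frame:

```python
import pandas as pd

# toy frame: 2 rows of the target sdg (5) and 6 rows of other sdgs
df = pd.DataFrame({'sdg': [5, 5, 1, 2, 3, 4, 6, 7]})
mask = df['sdg'].eq(5).values

# grouping by the boolean mask splits the frame into {False: rest, True: sdg 5};
# sampling mask.sum() rows from each group balances the two classes
balanced = df.groupby(mask).sample(mask.sum(), random_state=42)
print(len(balanced), balanced['sdg'].eq(5).sum())  # 4 2
```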
# now the classes are balanced
df_train['target'] = df_train['sdg'].eq(sdg)
df_train['target'].value_counts()
False    2286
True     2286
Name: target, dtype: int64
X_train, X_test, y_train, y_test = train_test_split(
df_train['text'].values,
df_train['target'].values,
test_size = .3,
random_state = 42
)
print('Shape train:', X_train.shape)
print('Shape test:', X_test.shape)
Shape train: (3200,)
Shape test: (1372,)
pipe = Pipeline([
('vectoriser', CountVectorizer(
ngram_range = (1, 2),
stop_words = 'english',
max_features = 100_000,
binary = True
)),
('selector', SelectKBest(chi2, k = 1_000)),
('clf', LogisticRegression(penalty = 'l2', C = .3, random_state = 42))
])
pipe.fit(X_train, y_train)
Pipeline(steps=[('vectoriser', CountVectorizer(binary=True, max_features=100000, ngram_range=(1, 2), stop_words='english')), ('selector', SelectKBest(k=1000, score_func=<function chi2 at 0x7f85a1a2d0d0>)), ('clf', LogisticRegression(C=0.3, random_state=42))])
y_hat = pipe.predict(X_test)
plot_confusion_matrix(y_test, y_hat, (12, 6))
print(classification_report(y_test, y_hat))
              precision    recall  f1-score   support

       False       0.98      0.97      0.98       711
        True       0.97      0.98      0.98       661

    accuracy                           0.98      1372
   macro avg       0.98      0.98      0.98      1372
weighted avg       0.98      0.98      0.98      1372
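Note that `f1_score` is imported above but never called; the macro-averaged F1 shown in the report could also be computed directly. A sketch on dummy labels (in the notebook, `y_test` and `y_hat` would be used instead):

```python
import numpy as np
from sklearn.metrics import f1_score

# dummy stand-ins for y_test / y_hat from the pipeline above
y_true = np.array([True, True, False, False, True, False])
y_pred = np.array([True, False, False, False, True, False])

# macro F1 is the unweighted mean of the per-class F1 scores,
# i.e. the 'macro avg' row of classification_report
print(f'Macro F1: {f1_score(y_true, y_pred, average="macro"):.2f}')  # Macro F1: 0.83
```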
# top predictors of the positive class
get_top_features(pipe['vectoriser'], pipe['clf'], pipe['selector'], how = 'long')
 | sdg | feature | coef |
---|---|---|---|
0 | True | women | 3.311545 |
1 | True | gender | 3.008412 |
2 | True | girls | 1.857899 |
3 | True | female | 1.527694 |
4 | True | marriage | 1.135949 |
5 | True | femmes | 0.955650 |
6 | True | woman | 0.939377 |
7 | True | mothers | 0.910892 |
8 | True | sexual | 0.884498 |
9 | True | violence | 0.878740 |
10 | True | male | 0.757047 |
11 | True | rights | 0.729981 |
12 | True | family | 0.719205 |
13 | True | laws | 0.679507 |
14 | True | girl | 0.675500 |
15 | True | equality | 0.656043 |
16 | True | husband | 0.628937 |
17 | True | men | 0.613630 |
18 | True | sexes | 0.598106 |
19 | True | leave | 0.588988 |
20 | True | females | 0.563598 |
21 | True | fathers | 0.562589 |
22 | True | gender equality | 0.561477 |
23 | True | caring | 0.560895 |
24 | True | women entrepreneurs | 0.553723 |
# selecting a text that the model has not seen
sample = df_osdg.query('text_id not in @df_train.text_id').sample().to_dict(orient = 'records')[0]
print(
'- SDG: {}\n- Labels positive:{}\n- Labels Negative:{}\n\n{}'\
.format(sample['sdg'], sample['labels_positive'], sample['labels_negative'], sample['text'])
)
- SDG: 14
- Labels positive:4
- Labels Negative:0

In the Pacific bluefin fishery, the fishing area has tended to shift northward year by year. They concluded that water temperature in recent years seemed to be higher than normal, but that these changes were not necessarily related to global warming because other fluctuations over periods lasting from ten years to several decades were dominating (Yamada et al., More recently, Seo (2010) reported that the fast growth of Hokkaido chum salmon at the age of one year, which was related to global warming, would positively affect the survival rate and in turn would affect the population density-dependent growth and maturing at age two to four due to the limited carrying capacity of the Bering Sea. These are statistical downscaling, dynamic downscaling on regional scales, and dynamic global models.
pipe.predict([sample['text']]).item()
False
pipe.predict_proba([sample['text']])
array([[0.98602019, 0.01397981]])
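Since `predict_proba` exposes the class probabilities (here only ~1.4% for the positive class), the default .5 decision threshold can be adjusted when a different precision/recall trade-off is needed. A hypothetical helper, not part of the notebook:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def predict_with_threshold(clf, X, threshold=0.5):
    """Return True where P(positive class) is at least `threshold`.

    `clf` is any fitted binary classifier exposing predict_proba;
    column 1 holds the positive-class probability.
    """
    return clf.predict_proba(X)[:, 1] >= threshold

# toy demo: a 1-d problem where the class switches around x = 1.5
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([False, False, True, True])
clf = LogisticRegression().fit(X, y)

# the default threshold reproduces clf.predict; raising it flags fewer positives
print(predict_with_threshold(clf, X, threshold=0.5).tolist())  # [False, False, True, True]
```

Raising the threshold trades recall for precision: fewer texts are flagged as belonging to the SDG, but those that are flagged are more certain.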
In this section, we extend the model to the multiclass case by training a multinomial logistic regression capable of assigning a text to one of 15 different SDGs.
Highlights:

  * Texts are preprocessed with spaCy.
  * Features are derived from a TfIdf matrix.
  * Only the 5000 best features are retained.
  * Accuracy is .86, while top-2 and top-3 accuracy reach .94 and .97 respectively.

def preprocess_spacy(alpha: List[str]) -> List[str]:
"""
Preprocess text input using spaCy.
Parameters
----------
alpha: List[str]
a text corpus.
Returns
-------
doc: List[str]
a cleaned version of the original text corpus.
"""
docs = list()
for doc in tqdm(nlp.pipe(alpha, batch_size = 128)):
tokens = list()
for token in doc:
if token.pos_ in ['NOUN', 'VERB', 'ADJ']:
tokens.append(token.lemma_)
docs.append(' '.join(tokens))
return docs
df_osdg['docs'] = preprocess_spacy(df_osdg['text'].values)
print('Shape:', df_osdg.shape)
display(df_osdg.head())
17233it [01:47, 160.54it/s]
Shape: (17233, 8)
 | doi | text_id | text | sdg | labels_negative | labels_positive | agreement | docs |
---|---|---|---|---|---|---|---|---|
0 | 10.6027/9789289342698-7-en | 00021941702cd84171ff33962197ca1f | From a gender perspective, Paulgaard points ou... | 5 | 1 | 7 | 0.750000 | gender perspective point labour market fishing... |
2 | 10.1787/9789264289062-4-en | 0004eb64f96e1620cd852603d9cbe4d4 | The average figure also masks large difference... | 3 | 1 | 6 | 0.714286 | average figure mask large difference region nu... |
7 | 10.1787/9789264117563-8-en | 000bfb17e9f3a00d4515ab59c5c487e7 | The Israel Oceanographic and Limnological Rese... | 6 | 0 | 3 | 1.000000 | station monitor quantity quality water coastli... |
8 | 10.18356/805b1ae4-en | 001180f5dd9a821e651ed51e30d0cf8c | Previous chapters have discussed ways to make ... | 2 | 0 | 3 | 1.000000 | previous chapter discuss way make food system ... |
11 | 10.1787/9789264310278-en | 001f1aee4013cb098da17a979c38bc57 | Prescription rates appear to be higher where l... | 8 | 0 | 3 | 1.000000 | prescription rate appear be high labour force ... |
X_train, X_test, y_train, y_test = train_test_split(
df_osdg['docs'].values,
df_osdg['sdg'].values,
test_size = .3,
random_state = 42
)
print('Shape train:', X_train.shape)
print('Shape test:', X_test.shape)
Shape train: (12063,)
Shape test: (5170,)
pipe = Pipeline([
('vectoriser', TfidfVectorizer(
ngram_range = (1, 2),
max_df = 0.75,
min_df = 2,
max_features = 100_000
)),
('selector', SelectKBest(f_classif, k = 5_000)),
('clf', LogisticRegression(
penalty = 'l2',
C = .9,
multi_class = 'multinomial',
class_weight = 'balanced',
random_state = 42,
solver = 'newton-cg',
max_iter = 100
))
])
pipe.fit(X_train, y_train)
Pipeline(steps=[('vectoriser', TfidfVectorizer(max_df=0.75, max_features=100000, min_df=2, ngram_range=(1, 2))), ('selector', SelectKBest(k=5000)), ('clf', LogisticRegression(C=0.9, class_weight='balanced', multi_class='multinomial', random_state=42, solver='newton-cg'))])
y_hat = pipe.predict(X_test)
plot_confusion_matrix(y_test, y_hat)
print(classification_report(y_test, y_hat, zero_division = 0))
              precision    recall  f1-score   support

           1       0.86      0.75      0.80       337
           2       0.82      0.86      0.84       256
           3       0.93      0.89      0.91       572
           4       0.94      0.91      0.92       701
           5       0.93      0.91      0.92       697
           6       0.92      0.89      0.91       388
           7       0.93      0.87      0.90       476
           8       0.62      0.69      0.65       245
           9       0.63      0.79      0.70       211
          10       0.48      0.63      0.55       126
          11       0.85      0.83      0.84       379
          12       0.59      0.78      0.67        77
          13       0.84      0.85      0.85       310
          14       0.96      0.93      0.95       220
          15       0.86      0.83      0.85       175

    accuracy                           0.86      5170
   macro avg       0.81      0.83      0.82      5170
weighted avg       0.86      0.86      0.86      5170
for k in [2, 3]:
k_acc = top_k_accuracy_score(y_test, pipe.predict_proba(X_test), k = k, normalize = True)
print(f'Top {k} accuracy: {k_acc:.2f}')
Top 2 accuracy: 0.94
Top 3 accuracy: 0.97
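`top_k_accuracy_score` counts a prediction as correct whenever the true label is among the `k` classes with the highest predicted probability. The same figure can be recovered by hand from the probability matrix (a toy sketch; in the notebook the matrix would come from `pipe.predict_proba(X_test)`, with columns ordered as `pipe.classes_`):

```python
import numpy as np

# toy probability matrix for 3 samples over the classes [1, 2, 3]
classes = np.array([1, 2, 3])
proba = np.array([
    [0.5, 0.3, 0.2],  # true label 2 is the 2nd most likely -> top-2 hit
    [0.1, 0.2, 0.7],  # true label 1 is the least likely    -> top-2 miss
    [0.4, 0.4, 0.2],  # true label 2 ties for most likely   -> top-2 hit
])
y_true = np.array([2, 1, 2])

k = 2
# indices of the k highest-probability classes in each row
top_k = np.argsort(proba, axis=1)[:, -k:]
hits = [y in classes[idx] for y, idx in zip(y_true, top_k)]
print(f'Top {k} accuracy: {sum(hits) / len(hits):.2f}')  # Top 2 accuracy: 0.67
```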
df_lambda = get_top_features(pipe['vectoriser'], pipe['clf'], pipe['selector'], top_n = 15)
print('Shape:', df_lambda.shape)
display(df_lambda.head())
Shape: (225, 3)
 | sdg | feature | coef |
---|---|---|---|
0 | 1 | poverty | 12.287556 |
1 | 1 | poor | 7.407398 |
2 | 1 | child | 5.055925 |
3 | 1 | income | 4.612163 |
4 | 1 | deprivation | 4.513972 |
df_lambda.sort_values(['sdg', 'coef'], ignore_index = True, inplace = True)
colors = px.colors.qualitative.Dark24[:15]
template = 'SDG: %{customdata}<br>Feature: %{y}<br>Coefficient: %{x:.2f}'
fig = px.bar(
data_frame = df_lambda,
x = 'coef',
y = 'feature',
custom_data = ['sdg'],
facet_col = 'sdg',
facet_col_wrap = 3,
facet_col_spacing = .15,
height = 1200,
labels = {
'coef': 'Coefficient',
'feature': ''
},
title = 'Figure 3. Top 15 Strongest Predictors by SDG'
)
fig.for_each_trace(lambda x: x.update(hovertemplate = template))
fig.for_each_trace(lambda x: x.update(marker_color = colors.pop(0)))
fig.for_each_annotation(lambda x: x.update(text = fix_sdg_name(x.text.split("=")[-1])))
fig.update_yaxes(matches = None, showticklabels = True)
fig.show()
sample = df_osdg.query('text_id not in @df_train.text_id').sample().to_dict(orient = 'records')[0]
print(
'- SDG: {}\n- Labels positive:{}\n- Labels Negative:{}\n\n{}'\
.format(sample['sdg'], sample['labels_positive'], sample['labels_negative'], sample['text'])
)
- SDG: 13
- Labels positive:9
- Labels Negative:0

The project was implemented over the course of 15 years using a community forestry approach both to generate income and to stabilize slopes that had become exposed as a result of environmental degradation and were consequently at risk of landslides. The assessment of the project was conducted in close consultation with communities and the results encompassed a greater diversification of livelihoods and improved watersheds, together with a decrease in the risks from landslides. This highlights the importance of management of ecosystems and livelihoods as the basis for an integrated strategy for climate change adaptation and development (Renaud, Sudmeier-Rieux and Estrella, eds.,
y_proba = pipe.predict_proba([sample['text']]).flatten()
y_hat = pipe.predict([sample['text']]).item()
y_hat
13
fig = px.bar(
x = range(1, 16),
y = y_proba,
color = y_proba == y_proba.max(),
color_discrete_map = {False: 'grey', True: '#1f77b4'},
labels = {
'x': 'SDG',
'y': 'Probability'
},
title = 'Figure 4. Predicted Probabilities of SDG Class for a Sample Text\n'
)
fig.update_traces(hovertemplate = 'SDG: %{x}<br>Predicted probability: %{y:.2f}')
fig.update_layout(
xaxis = {
'type': 'category',
'categoryorder': 'array',
'categoryarray': list(range(1, 16))
},
yaxis = {'range': [0, 1]},
showlegend = False
)
fig.show()