This notebook provides an example of training an SDG text classifier using scikit-learn
and the OSDG Community Dataset.
# standard library
from typing import List
# data wrangling
import numpy as np
import pandas as pd
# visualisation
import plotly.express as px
import plotly.io as pio
# nlp
import spacy
# data modelling
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, top_k_accuracy_score, f1_score
# utils
from tqdm import tqdm
# local packages
from helpers import plot_confusion_matrix, get_top_features, fix_sdg_name
print('Loaded!')
Loaded!
# other settings
pio.templates.default = 'plotly_white'
spacy.prefer_gpu()
nlp = spacy.load('en_core_web_sm', disable = ['ner'])
print('Disabled spaCy components:', nlp.disabled)
print('SpaCy version:', spacy.__version__)
Disabled spaCy components: ['senter', 'ner']
SpaCy version: 3.1.3
In this section, we will explore the data and select texts for training.
df_osdg = pd.read_csv('https://zenodo.org/record/5550238/files/osdg-community-dataset-v21-09-30.csv?download=1', sep='\t')
print('Shape:', df_osdg.shape)
display(df_osdg.head())
Shape: (32121, 7)
 | doi | text_id | text | sdg | labels_negative | labels_positive | agreement |
---|---|---|---|---|---|---|---|
0 | 10.6027/9789289342698-7-en | 00021941702cd84171ff33962197ca1f | From a gender perspective, Paulgaard points ou... | 5 | 1 | 7 | 0.750000 |
1 | 10.18356/eca72908-en | 00028349a7f9b2485ff344ae44ccfd6b | Labour legislation regulates maximum working h... | 11 | 2 | 1 | 0.333333 |
2 | 10.1787/9789264289062-4-en | 0004eb64f96e1620cd852603d9cbe4d4 | The average figure also masks large difference... | 3 | 1 | 6 | 0.714286 |
3 | 10.1787/5k9b7bn5qzvd-en | 0006a887475ccfa5a7f5f51d4ac83d02 | The extent to which they are akin to corruptio... | 3 | 1 | 2 | 0.333333 |
4 | 10.1787/9789264258211-6-en | 0006d6e7593776abbdf4a6f985ea6d95 | A region reporting a higher rate will not earn... | 3 | 2 | 2 | 0.000000 |
The agreement is calculated using the following formula:
$$ agreement = \frac{|labels_{pos} - labels_{neg}|}{labels_{pos} + labels_{neg}} $$

The resulting values range between 0 (equally split votes) and 1 (unanimous votes, whether all positive or all negative).
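As a quick sanity check, the formula can be reproduced directly from the vote counts (a small sketch with made-up counts, not part of the original notebook):

```python
import pandas as pd

# hypothetical vote counts mirroring the dataset's columns
votes = pd.DataFrame({
    'labels_positive': [7, 1, 3],
    'labels_negative': [1, 2, 3],
})

# |pos - neg| / (pos + neg), as in the formula above
votes['agreement'] = (
    (votes['labels_positive'] - votes['labels_negative']).abs()
    / (votes['labels_positive'] + votes['labels_negative'])
)
print(votes['agreement'].round(6).tolist())  # [0.75, 0.333333, 0.0]
```

The first row reproduces the agreement of 0.75 seen in the dataset preview above (7 positive votes, 1 negative).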
# calculating cumulative probability over agreement scores
df_lambda = df_osdg['agreement'].value_counts(normalize = True).sort_index().cumsum().to_frame(name = 'p_sum')
df_lambda.reset_index(inplace = True)
df_lambda.rename({'index': 'agreement'}, axis = 1, inplace = True)
print('Shape:', df_lambda.shape)
display(df_lambda.head())
Shape: (36, 2)
 | agreement | p_sum |
---|---|---|
0 | 0.000000 | 0.033716 |
1 | 0.062432 | 0.033747 |
2 | 0.067982 | 0.033779 |
3 | 0.090909 | 0.033872 |
4 | 0.111111 | 0.050185 |
fig = px.line(
data_frame = df_lambda,
x = 'agreement',
y = 'p_sum',
markers = True,
labels = {
'agreement': 'Agreement Score',
'p_sum': 'Cumulative Probability'
},
color_discrete_sequence = ['#1f77b4'],
title = 'Figure 1. Cumulative Distribution Function of the Agreement Score'
)
fig.update_traces(hovertemplate = 'Agreement score: %{x:.2f}<br>Cumulative probability: %{y:.2f}')
fig.update_layout(
xaxis = {'dtick': 0.1},
yaxis = {'dtick': 0.25}
)
fig.show()
# keeping only the texts whose suggested sdg label is accepted and the agreement score is at least .6
print('Shape before:', df_osdg.shape)
df_osdg = df_osdg.query('agreement >= .6 and labels_positive > labels_negative').copy()
print('Shape after :', df_osdg.shape)
display(df_osdg.head())
Shape before: (32121, 7)
Shape after : (17233, 7)
 | doi | text_id | text | sdg | labels_negative | labels_positive | agreement |
---|---|---|---|---|---|---|---|
0 | 10.6027/9789289342698-7-en | 00021941702cd84171ff33962197ca1f | From a gender perspective, Paulgaard points ou... | 5 | 1 | 7 | 0.750000 |
2 | 10.1787/9789264289062-4-en | 0004eb64f96e1620cd852603d9cbe4d4 | The average figure also masks large difference... | 3 | 1 | 6 | 0.714286 |
7 | 10.1787/9789264117563-8-en | 000bfb17e9f3a00d4515ab59c5c487e7 | The Israel Oceanographic and Limnological Rese... | 6 | 0 | 3 | 1.000000 |
8 | 10.18356/805b1ae4-en | 001180f5dd9a821e651ed51e30d0cf8c | Previous chapters have discussed ways to make ... | 2 | 0 | 3 | 1.000000 |
11 | 10.1787/9789264310278-en | 001f1aee4013cb098da17a979c38bc57 | Prescription rates appear to be higher where l... | 8 | 0 | 3 | 1.000000 |
df_lambda = df_osdg.groupby('sdg', as_index = False).agg(count = ('text_id', 'count'))
df_lambda['share'] = df_lambda['count'].divide(df_lambda['count'].sum()).multiply(100)
print('Shape:', df_lambda.shape)
display(df_lambda.head())
Shape: (15, 3)
 | sdg | count | share |
---|---|---|---|
0 | 1 | 1146 | 6.650032 |
1 | 2 | 827 | 4.798932 |
2 | 3 | 1854 | 10.758429 |
3 | 4 | 2324 | 13.485754 |
4 | 5 | 2286 | 13.265247 |
fig = px.bar(
data_frame = df_lambda,
x = 'sdg',
y = 'count',
custom_data = ['share'],
labels = {
'sdg': 'SDG',
'count': 'Count'
},
color_discrete_sequence = ['#1f77b4'],
title = 'Figure 2. Distribution of Texts (Agreement ≥ .6) over SDGs'
)
fig.update_traces(hovertemplate = 'SDG %{x}<br>Count: %{y}<br>Share: %{customdata:.2f}%')
fig.update_layout(xaxis = {'type': 'category'})
fig.show()
Using the preselected data, we can easily train a performant binary classifier to distinguish texts relevant to a specific SDG. For a binary classification problem, we select all instances of a certain SDG and randomly undersample the remaining instances to balance things out.
Highlights:

  * Only the 1000 best features are retained, which reduces overfitting.
  * The resulting classifier scores well above .9 on the held-out test set.

sdg = 5 # selecting an sdg of interest
# undersampling the rest of sdgs to balance out the instances
mask = df_osdg['sdg'].eq(sdg).values
df_train = df_osdg.groupby(mask).sample(mask.sum(), random_state = 42)
print('Shape:', df_train.shape)
display(df_train.head())
Shape: (4572, 7)
 | doi | text_id | text | sdg | labels_negative | labels_positive | agreement |
---|---|---|---|---|---|---|---|
22034 | 10.1787/9789264096356-en | af2366805b3dcdd487b0e2d11575f372 | That said nuclear plants just as coal-fired po... | 7 | 1 | 6 | 0.714286 |
30743 | 10.6027/f960d1a8-en | f50faa4a6d449a8b1573bb6a842c9533 | The OECD average has also declined over this p... | 4 | 0 | 3 | 1.000000 |
27230 | 10.1787/9789264279551-4-en | d88c937f2f1aa2800659ece81c68fd86 | It then applies this approach at a global leve... | 6 | 0 | 5 | 1.000000 |
7419 | 10.18356/6af97a78-en | 3c68358cda382c326c8b19ae13ca80de | Conflict between Pakistan and India over distr... | 6 | 0 | 3 | 1.000000 |
5350 | 10.1787/agr/pol-2011-7-en | 2c0bdec7b8283396a8b9b44a868e1e40 | Both rural incomes and food supplies would imp... | 2 | 0 | 3 | 1.000000 |
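The `groupby(mask).sample(...)` idiom above can look opaque: grouping by a boolean array splits the frame into the positive class (`True`) and everything else (`False`), and drawing `mask.sum()` rows from each group yields a perfectly balanced frame. A minimal sketch on a toy frame:

```python
import pandas as pd

# toy frame: 2 rows of the target sdg (5) and 6 rows of other sdgs
df = pd.DataFrame({'sdg': [5, 5, 1, 2, 3, 4, 6, 7]})
mask = df['sdg'].eq(5).values

# grouping by the boolean mask splits the frame into {False: rest, True: sdg 5};
# sampling mask.sum() rows from each group balances the two classes
balanced = df.groupby(mask).sample(mask.sum(), random_state=42)
print(len(balanced), balanced['sdg'].eq(5).sum())  # 4 2
```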
# now the classes are balanced
df_train['target'] = df_train['sdg'].eq(sdg)
df_train['target'].value_counts()
False    2286
True     2286
Name: target, dtype: int64
X_train, X_test, y_train, y_test = train_test_split(
df_train['text'].values,
df_train['target'].values,
test_size = .3,
random_state = 42
)
print('Shape train:', X_train.shape)
print('Shape test:', X_test.shape)
Shape train: (3200,)
Shape test: (1372,)
pipe = Pipeline([
('vectoriser', CountVectorizer(
ngram_range = (1, 2),
stop_words = 'english',
max_features = 100_000,
binary = True
)),
('selector', SelectKBest(chi2, k = 1_000)),
('clf', LogisticRegression(penalty = 'l2', C = .3, random_state = 42))
])
pipe.fit(X_train, y_train)
Pipeline(steps=[('vectoriser', CountVectorizer(binary=True, max_features=100000, ngram_range=(1, 2), stop_words='english')), ('selector', SelectKBest(k=1000, score_func=<function chi2 at 0x7f85a1a2d0d0>)), ('clf', LogisticRegression(C=0.3, random_state=42))])
y_hat = pipe.predict(X_test)
plot_confusion_matrix(y_test, y_hat, (12, 6))
print(classification_report(y_test, y_hat))
              precision    recall  f1-score   support

       False       0.98      0.97      0.98       711
        True       0.97      0.98      0.98       661

    accuracy                           0.98      1372
   macro avg       0.98      0.98      0.98      1372
weighted avg       0.98      0.98      0.98      1372
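Note that `f1_score` is imported above but never called; the macro-averaged F1 shown in the report could also be computed directly. A sketch on dummy labels (in the notebook, `y_test` and `y_hat` would be used instead):

```python
import numpy as np
from sklearn.metrics import f1_score

# dummy stand-ins for y_test / y_hat from the pipeline above
y_true = np.array([True, True, False, False, True, False])
y_pred = np.array([True, False, False, False, True, False])

# macro F1 is the unweighted mean of the per-class F1 scores,
# i.e. the 'macro avg' row of classification_report
print(f'Macro F1: {f1_score(y_true, y_pred, average="macro"):.2f}')  # Macro F1: 0.83
```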
# top predictors of the positive class
get_top_features(pipe['vectoriser'], pipe['clf'], pipe['selector'], how = 'long')
 | sdg | feature | coef |
---|---|---|---|
0 | True | women | 3.311545 |
1 | True | gender | 3.008412 |
2 | True | girls | 1.857899 |
3 | True | female | 1.527694 |
4 | True | marriage | 1.135949 |
5 | True | femmes | 0.955650 |
6 | True | woman | 0.939377 |
7 | True | mothers | 0.910892 |
8 | True | sexual | 0.884498 |
9 | True | violence | 0.878740 |
10 | True | male | 0.757047 |
11 | True | rights | 0.729981 |
12 | True | family | 0.719205 |
13 | True | laws | 0.679507 |
14 | True | girl | 0.675500 |
15 | True | equality | 0.656043 |
16 | True | husband | 0.628937 |
17 | True | men | 0.613630 |
18 | True | sexes | 0.598106 |
19 | True | leave | 0.588988 |
20 | True | females | 0.563598 |
21 | True | fathers | 0.562589 |
22 | True | gender equality | 0.561477 |
23 | True | caring | 0.560895 |
24 | True | women entrepreneurs | 0.553723 |
# selecting a text that the model has not seen
sample = df_osdg.query('text_id not in @df_train.text_id').sample().to_dict(orient = 'records')[0]
print(
'- SDG: {}\n- Labels positive:{}\n- Labels Negative:{}\n\n{}'\
.format(sample['sdg'], sample['labels_positive'], sample['labels_negative'], sample['text'])
)
- SDG: 14
- Labels positive:4
- Labels Negative:0

In the Pacific bluefin fishery, the fishing area has tended to shift northward year by year. They concluded that water temperature in recent years seemed to be higher than normal, but that these changes were not necessarily related to global warming because other fluctuations over periods lasting from ten years to several decades were dominating (Yamada et al., More recently, Seo (2010) reported that the fast growth of Hokkaido chum salmon at the age of one year, which was related to global warming, would positively affect the survival rate and in turn would affect the population density-dependent growth and maturing at age two to four due to the limited carrying capacity of the Bering Sea. These are statistical downscaling, dynamic downscaling on regional scales, and dynamic global models.
pipe.predict([sample['text']]).item()
False
pipe.predict_proba([sample['text']])
array([[0.98602019, 0.01397981]])
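Since `predict_proba` exposes the class probabilities (here only ~1.4% for the positive class), the default .5 decision threshold can be adjusted when a different precision/recall trade-off is needed. A hypothetical helper, not part of the notebook:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def predict_with_threshold(clf, X, threshold=0.5):
    """Return True where P(positive class) is at least `threshold`.

    `clf` is any fitted binary classifier exposing predict_proba;
    column 1 holds the positive-class probability.
    """
    return clf.predict_proba(X)[:, 1] >= threshold

# toy demo: a 1-d problem where the class switches around x = 1.5
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([False, False, True, True])
clf = LogisticRegression().fit(X, y)

# the default threshold reproduces clf.predict; raising it flags fewer positives
print(predict_with_threshold(clf, X, threshold=0.5).tolist())  # [False, False, True, True]
```

Raising the threshold trades recall for precision: fewer texts are flagged as belonging to the SDG, but those that are flagged are more certain.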
In this section, we extend the model to the multiclass case by training a multinomial logistic regression capable of assigning a text to one of 15 different SDGs.
Highlights:

  * Texts are preprocessed with spaCy.
  * Features are derived from a TfIdf matrix.
  * Only the 5000 best features are retained.
  * Accuracy is .86, while top-2 and top-3 accuracy reach .94 and .97 respectively.

def preprocess_spacy(alpha: List[str]) -> List[str]:
"""
Preprocess text input using spaCy.
Parameters
----------
alpha: List[str]
a text corpus.
Returns
-------
doc: List[str]
a cleaned version of the original text corpus.
"""
docs = list()
for doc in tqdm(nlp.pipe(alpha, batch_size = 128)):
tokens = list()
for token in doc:
if token.pos_ in ['NOUN', 'VERB', 'ADJ']:
tokens.append(token.lemma_)
docs.append(' '.join(tokens))
return docs
df_osdg['docs'] = preprocess_spacy(df_osdg['text'].values)
print('Shape:', df_osdg.shape)
display(df_osdg.head())
17233it [01:47, 160.54it/s]
Shape: (17233, 8)
 | doi | text_id | text | sdg | labels_negative | labels_positive | agreement | docs |
---|---|---|---|---|---|---|---|---|
0 | 10.6027/9789289342698-7-en | 00021941702cd84171ff33962197ca1f | From a gender perspective, Paulgaard points ou... | 5 | 1 | 7 | 0.750000 | gender perspective point labour market fishing... |
2 | 10.1787/9789264289062-4-en | 0004eb64f96e1620cd852603d9cbe4d4 | The average figure also masks large difference... | 3 | 1 | 6 | 0.714286 | average figure mask large difference region nu... |
7 | 10.1787/9789264117563-8-en | 000bfb17e9f3a00d4515ab59c5c487e7 | The Israel Oceanographic and Limnological Rese... | 6 | 0 | 3 | 1.000000 | station monitor quantity quality water coastli... |
8 | 10.18356/805b1ae4-en | 001180f5dd9a821e651ed51e30d0cf8c | Previous chapters have discussed ways to make ... | 2 | 0 | 3 | 1.000000 | previous chapter discuss way make food system ... |
11 | 10.1787/9789264310278-en | 001f1aee4013cb098da17a979c38bc57 | Prescription rates appear to be higher where l... | 8 | 0 | 3 | 1.000000 | prescription rate appear be high labour force ... |
X_train, X_test, y_train, y_test = train_test_split(
df_osdg['docs'].values,
df_osdg['sdg'].values,
test_size = .3,
random_state = 42
)
print('Shape train:', X_train.shape)
print('Shape test:', X_test.shape)
Shape train: (12063,)
Shape test: (5170,)
pipe = Pipeline([
('vectoriser', TfidfVectorizer(
ngram_range = (1, 2),
max_df = 0.75,
min_df = 2,
max_features = 100_000
)),
('selector', SelectKBest(f_classif, k = 5_000)),
('clf', LogisticRegression(
penalty = 'l2',
C = .9,
multi_class = 'multinomial',
class_weight = 'balanced',
random_state = 42,
solver = 'newton-cg',
max_iter = 100
))
])
pipe.fit(X_train, y_train)
Pipeline(steps=[('vectoriser', TfidfVectorizer(max_df=0.75, max_features=100000, min_df=2, ngram_range=(1, 2))), ('selector', SelectKBest(k=5000)), ('clf', LogisticRegression(C=0.9, class_weight='balanced', multi_class='multinomial', random_state=42, solver='newton-cg'))])
y_hat = pipe.predict(X_test)
plot_confusion_matrix(y_test, y_hat)
print(classification_report(y_test, y_hat, zero_division = 0))
              precision    recall  f1-score   support

           1       0.86      0.75      0.80       337
           2       0.82      0.86      0.84       256
           3       0.93      0.89      0.91       572
           4       0.94      0.91      0.92       701
           5       0.93      0.91      0.92       697
           6       0.92      0.89      0.91       388
           7       0.93      0.87      0.90       476
           8       0.62      0.69      0.65       245
           9       0.63      0.79      0.70       211
          10       0.48      0.63      0.55       126
          11       0.85      0.83      0.84       379
          12       0.59      0.78      0.67        77
          13       0.84      0.85      0.85       310
          14       0.96      0.93      0.95       220
          15       0.86      0.83      0.85       175

    accuracy                           0.86      5170
   macro avg       0.81      0.83      0.82      5170
weighted avg       0.86      0.86      0.86      5170
for k in [2, 3]:
k_acc = top_k_accuracy_score(y_test, pipe.predict_proba(X_test), k = k, normalize = True)
print(f'Top {k} accuracy: {k_acc:.2f}')
Top 2 accuracy: 0.94
Top 3 accuracy: 0.97
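`top_k_accuracy_score` counts a prediction as correct whenever the true label is among the `k` classes with the highest predicted probability. The same figure can be recovered by hand from the probability matrix (a toy sketch; in the notebook the matrix would come from `pipe.predict_proba(X_test)`, with columns ordered as `pipe.classes_`):

```python
import numpy as np

# toy probability matrix for 3 samples over the classes [1, 2, 3]
classes = np.array([1, 2, 3])
proba = np.array([
    [0.5, 0.3, 0.2],  # true label 2 is the 2nd most likely -> top-2 hit
    [0.1, 0.2, 0.7],  # true label 1 is the least likely    -> top-2 miss
    [0.4, 0.4, 0.2],  # true label 2 ties for most likely   -> top-2 hit
])
y_true = np.array([2, 1, 2])

k = 2
# indices of the k highest-probability classes in each row
top_k = np.argsort(proba, axis=1)[:, -k:]
hits = [y in classes[idx] for y, idx in zip(y_true, top_k)]
print(f'Top {k} accuracy: {sum(hits) / len(hits):.2f}')  # Top 2 accuracy: 0.67
```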
df_lambda = get_top_features(pipe['vectoriser'], pipe['clf'], pipe['selector'], top_n = 15)
print('Shape:', df_lambda.shape)
display(df_lambda.head())
Shape: (225, 3)
 | sdg | feature | coef |
---|---|---|---|
0 | 1 | poverty | 12.287556 |
1 | 1 | poor | 7.407398 |
2 | 1 | child | 5.055925 |
3 | 1 | income | 4.612163 |
4 | 1 | deprivation | 4.513972 |
df_lambda.sort_values(['sdg', 'coef'], ignore_index = True, inplace = True)
colors = px.colors.qualitative.Dark24[:15]
template = 'SDG: %{customdata}<br>Feature: %{y}<br>Coefficient: %{x:.2f}'
fig = px.bar(
data_frame = df_lambda,
x = 'coef',
y = 'feature',
custom_data = ['sdg'],
facet_col = 'sdg',
facet_col_wrap = 3,
facet_col_spacing = .15,
height = 1200,
labels = {
'coef': 'Coefficient',
'feature': ''
},
title = 'Figure 3. Top 15 Strongest Predictors by SDG'
)
fig.for_each_trace(lambda x: x.update(hovertemplate = template))
fig.for_each_trace(lambda x: x.update(marker_color = colors.pop(0)))
fig.for_each_annotation(lambda x: x.update(text = fix_sdg_name(x.text.split("=")[-1])))
fig.update_yaxes(matches = None, showticklabels = True)
fig.show()
sample = df_osdg.query('text_id not in @df_train.text_id').sample().to_dict(orient = 'records')[0]
print(
'- SDG: {}\n- Labels positive:{}\n- Labels Negative:{}\n\n{}'\
.format(sample['sdg'], sample['labels_positive'], sample['labels_negative'], sample['text'])
)
- SDG: 13
- Labels positive:9
- Labels Negative:0

The project was implemented over the course of 15 years using a community forestry approach both to generate income and to stabilize slopes that had become exposed as a result of environmental degradation and were consequently at risk of landslides. The assessment of the project was conducted in close consultation with communities and the results encompassed a greater diversification of livelihoods and improved watersheds, together with a decrease in the risks from landslides. This highlights the importance of management of ecosystems and livelihoods as the basis for an integrated strategy for climate change adaptation and development (Renaud, Sudmeier-Rieux and Estrella, eds.,
y_proba = pipe.predict_proba([sample['text']]).flatten()
y_hat = pipe.predict([sample['text']]).item()
y_hat
13
fig = px.bar(
x = range(1, 16),
y = y_proba,
color = y_proba == y_proba.max(),
color_discrete_map = {False: 'grey', True: '#1f77b4'},
labels = {
'x': 'SDG',
'y': 'Probability'
},
title = 'Figure 4. Predicted Probabilities of SDG Class for a Sample Text\n'
)
fig.update_traces(hovertemplate = 'SDG: %{x}<br>Predicted probability: %{y:.2f}')
fig.update_layout(
xaxis = {
'type': 'category',
'categoryorder': 'array',
'categoryarray': list(range(1, 16))
},
yaxis = {'range': [0, 1]},
showlegend = False
)
fig.show()