This notebook accompanies
Hall, M (2017). Three data analytics party tricks. The Leading Edge 36 (3).
Inspired by and partly based on Science Concierge and Chris Clark's repo on content-based recommendation.
A version of this code is running at georx.geosci.ai where you can try it out.
This dataset is 1000 random articles from the journal Geophysics from 1936 to 2016. It represents about 10% of the total number of articles published in that time. It was collected from seg.org with permission, and processed into a CSV file of titles, abstracts, and DOIs.
import pandas as pd
df = pd.read_csv('data/title_abstract_doi.csv')
df.head()
| | title | abstract | doi |
|---|---|---|---|
| 0 | Magnetic And Gravity Anomaly Patterns Related ... | A study of the features of gravity and magneti... | 10.1190/1.1444192 |
| 1 | Inversion For Permeability Distribution From T... | Understanding reservoir properties plays a key... | 10.1190/geo2014-0203.1 |
| 2 | Quantifying Background Magnetic-Field Inhomoge... | Nuclear magnetic resonance measurements provid... | 10.1190/geo2012-0488.1 |
| 3 | Families Of Salt Domes In The Gulf Coastal Pro... | If two fluids of different densities are super... | 10.1190/1.1439806 |
| 4 | Attribute-Guided Well-Log Interpolation Applie... | Several approaches exist to use trends in 3D s... | 10.1190/1.2996302 |
len(df)
1000
This is a simple class (to learn more about classes, read up on object-oriented programming). It has 4 methods (functions) that implement the workflow:

- `__init__()`: instantiates the class with some 'hyperparameters'.
- `_preprocess()`: does some basic preprocessing on the abstracts.
- `fit()`: constructs the model, which consists of two main pieces: an LSA pipeline over the abstracts, and a distance tree for fast nearest-neighbour lookup.
- `recommend()`: takes a list of 'liked' articles, finds their midpoint in the semantic space, and looks up the closest articles to that midpoint.

import numpy as np
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import RegexpTokenizer
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.neighbors import KDTree
STEMMER = PorterStemmer()
TOKENIZER = RegexpTokenizer(r'\w+')
class ContentRx(object):
    """
    A simple class to implement a scikit-learn-like API,
    and to hold the data.
    """
    def __init__(self,
                 components=100,
                 return_scores=True,
                 metric='euclidean',
                 centroid='median',
                 ngram_range=(1, 2),   # Can be very slow above (1, 2).
                 ignore_fewer_than=0,  # Ignore terms in fewer docs than this.
                 ):
        self.components = components
        self.return_scores = return_scores
        self.centroid = centroid
        self.metric = metric
        self.ngram_range = ngram_range
        self.ignore_fewer_than = ignore_fewer_than

    def _preprocess(self, text):
        """
        Stem and tokenize a piece of text (e.g. an abstract).
        """
        out = [STEMMER.stem(token) for token in TOKENIZER.tokenize(text)]
        return ' '.join(out)

    def fit(self, data):
        """
        Fit a latent semantic analysis (LSA) model:

        * Compute a tf-idf matrix (here, of unigrams and bigrams) for the docs.
        * Reduce it to `components` dimensions with truncated SVD, then normalize.
        * Build a distance tree on the result for fast nearest-neighbour lookup.
        """
        data = [self._preprocess(item) for item in data]

        # Build LSA pipeline: tf-idf, then normalized SVD reduction.
        tfidf = TfidfVectorizer(ngram_range=self.ngram_range,
                                min_df=self.ignore_fewer_than,
                                stop_words='english',
                                )
        svd = TruncatedSVD(n_components=self.components)
        normalize = Normalizer(copy=False)
        lsa = make_pipeline(tfidf, svd, normalize)
        self.X = lsa.fit_transform(data)

        # Build and store the distance tree.
        # For valid metrics, see KDTree.valid_metrics.
        self.tree = KDTree(self.X, metric=self.metric)
        return

    def recommend(self, likes, n_recommend=10):
        """
        Makes a recommendation.
        """
        # Make the query from the input document idxs.
        # Science Concierge uses the Rocchio algorithm,
        # but I don't think I care about 'dislikes'.
        vecs = np.array([self.X[idx] for idx in likes])

        # Collapse the liked vectors to a single query point.
        # (With two likes, the median and the mean coincide.)
        if self.centroid == 'median':
            q = np.median(vecs, axis=0).reshape(1, -1)
        else:
            q = np.mean(vecs, axis=0).reshape(1, -1)

        # Get the matches and their distances.
        dist, idx = self.tree.query(q, k=n_recommend + len(likes))

        # Get rid of the original likes, which may or may not be in the result.
        ind, dist = zip(*[(i, d)
                          for d, i in zip(np.squeeze(dist), np.squeeze(idx))
                          if i not in likes])

        # If the likes weren't in the result, trim the most distant results.
        if self.return_scores:
            return list(ind)[:n_recommend], list(1 - np.array(dist))[:n_recommend]
        return list(ind)[:n_recommend]
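As an aside, the tokenization half of `_preprocess()` is just a regular expression. This stdlib-only sketch shows an equivalent pattern; the stemming half is omitted here, since NLTK's `PorterStemmer` has no short standard-library counterpart:

```python
import re

def tokenize(text):
    r"""Equivalent of NLTK's RegexpTokenizer(r'\w+'):
    return every run of word characters as a token."""
    return re.findall(r'\w+', text)

print(tokenize("Spectral decomposition of a seismogram."))
# → ['Spectral', 'decomposition', 'of', 'a', 'seismogram']
```

Case is left alone on purpose: `TfidfVectorizer` lowercases its input by default in the next step.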
Instantiate the model:
crx = ContentRx(ngram_range=(1,2))
Train the model by fitting to our dataset:
crx.fit(df.abstract)
The model is trained!
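Under the hood, `fit()` starts by weighting terms with tf-idf before the SVD. The sketch below computes the textbook weighting, raw term frequency times log(N/df), on a hypothetical three-document corpus; note that scikit-learn's `TfidfVectorizer` uses a smoothed idf and l2 normalization, so its exact numbers differ.

```python
import math
from collections import Counter

# Hypothetical toy corpus standing in for the abstracts.
docs = [
    "seismic spectral decomposition".split(),
    "seismic noise attenuation".split(),
    "gravity anomaly inversion".split(),
]

def tfidf(docs):
    """Textbook tf-idf: term frequency times log(N / document frequency)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [{t: tf * math.log(n / df[t]) for t, tf in Counter(doc).items()}
            for doc in docs]

weights = tfidf(docs)
# 'seismic' appears in two of the three docs, so it is down-weighted
# relative to 'spectral', which appears in only one.
print(weights[0]['spectral'] > weights[0]['seismic'])  # → True
```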
First, let's find some papers we like. (Remember this is only a subset of 1000 papers.)
s = [i for i, t in enumerate(df.title) if 'spectral decomp' in t.lower()]
s
[79, 127]
df.title[79], df.title[127]
('Seismic Spectral Decomposition Using Deconvolutive Short-Time Fourier Transform Spectrogram', 'Maximum Entropy Spectral Decomposition Of A Seismogram Into Its Minimum Entropy Component Plus Noise')
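`recommend()` will collapse the liked vectors to a single query point and rank everything else by distance to it. Here is a numpy-only sketch of that logic, using toy 2-D vectors in place of the real 100-D LSA space (with two likes, the median and mean centroids coincide):

```python
import numpy as np

# Toy 2-D 'semantic space' with five documents.
X = np.array([[1.0, 0.0],
              [0.9, 0.1],
              [0.0, 1.0],
              [0.8, 0.2],
              [0.1, 0.9]])
likes = [0, 1]

# Centroid of the liked vectors: the query point.
q = X[likes].mean(axis=0)

# Euclidean distance from the query to every document,
# ranked nearest-first, with the likes themselves dropped.
dist = np.linalg.norm(X - q, axis=1)
order = [int(i) for i in np.argsort(dist) if i not in likes]
print(order[:2])  # → [3, 4]
```

This mirrors the post-filtering in `recommend()`, which queries the tree for `n_recommend + len(likes)` neighbours and then discards the likes.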
Now we can get our recommendations:
idx, scores = crx.recommend(likes=s, n_recommend=10)
idx
[737, 257, 718, 164, 863, 252, 721, 642, 766, 355]
df.iloc[idx]
| | title | abstract | doi |
|---|---|---|---|
| 737 | Empirical Mode Decomposition For Seismic Time-... | Time-frequency analysis plays a significant ro... | 10.1190/geo2012-0199.1 |
| 257 | Choice Of Operator Length For Maximum Entropy ... | Empirical evidence based on maximum entropy sp... | 10.1190/1.1440902 |
| 718 | Reservoir Characterization Based On Seismic Sp... | The seismic frequency spectrum provides a usef... | 10.1190/geo2011-0323.1 |
| 164 | Seismic Sequence Analysis And Attribute Extrac... | The variation of frequency content of a seismi... | 10.1190/1.1487136 |
| 863 | Ergodicity Of Stationary White Gaussian Processes | Stationary time series is an important concept... | 10.1190/1.1444502 |
| 252 | Sparse Time-Frequency Representation For Seism... | Attenuation of random noise is a major concern... | 10.1190/geo2015-0341.1 |
| 721 | Maximum‐Entropy Spatial Processing Of Array Data | The procedure of maximum‐entropy spectral anal... | 10.1190/1.1440471 |
| 642 | Predictive Deconvolution And The Zero‐Phase So... | Predictive deconvolution is commonly applied t... | 10.1190/1.1441674 |
| 766 | Spectrum Of The Potential Field Due To Randoml... | Covariance and spectral density functions of a... | 10.1190/1.1439933 |
| 355 | Theory Of Nonstationary Linear Filtering In Th... | A general linear theory describes the extensio... | 10.1190/1.1444318 |
We have scores (computed as 1 minus the distance) for each recommendation too:
for i, s in zip(idx, scores):
print('{:.1f}'.format(100*s).rjust(5), df.title[i])
 22.6 Empirical Mode Decomposition For Seismic Time-Frequency Analysis
 10.7 Choice Of Operator Length For Maximum Entropy Spectral Analysis
  9.8 Reservoir Characterization Based On Seismic Spectral Variations
  7.5 Seismic Sequence Analysis And Attribute Extraction Using Quadratic Time‐Frequency Representations
  7.2 Ergodicity Of Stationary White Gaussian Processes
  6.9 Sparse Time-Frequency Representation For Seismic Noise Reduction Using Low-Rank And Sparse Decomposition
  3.9 Maximum‐Entropy Spatial Processing Of Array Data
  1.6 Predictive Deconvolution And The Zero‐Phase Source
  1.2 Spectrum Of The Potential Field Due To Randomly Distributed Sources
  1.0 Theory Of Nonstationary Linear Filtering In The Fourier Domain With Application To Time‐Variant Filtering
© Agile Geoscience 2017 — licensed under Apache 2.0