This notebook accompanies
Hall, M (2017). Three data analytics party tricks. The Leading Edge 36 (3).
Inspired by and partly based on Science Concierge and Chris Clark's repo on content-based recommendation.
A version of this code is running at georx.geosci.ai where you can try it out.
This dataset is 1000 random articles from the journal Geophysics from 1936 to 2016. It represents about 10% of the total number of articles published in that time. It was collected from seg.org with permission, and processed into a CSV file of titles, abstracts, and DOIs.
import pandas as pd
df = pd.read_csv('data/title_abstract_doi.csv')
df.head()
| | title | abstract | doi |
|---|---|---|---|
| 0 | Magnetic And Gravity Anomaly Patterns Related ... | A study of the features of gravity and magneti... | 10.1190/1.1444192 |
| 1 | Inversion For Permeability Distribution From T... | Understanding reservoir properties plays a key... | 10.1190/geo2014-0203.1 |
| 2 | Quantifying Background Magnetic-Field Inhomoge... | Nuclear magnetic resonance measurements provid... | 10.1190/geo2012-0488.1 |
| 3 | Families Of Salt Domes In The Gulf Coastal Pro... | If two fluids of different densities are super... | 10.1190/1.1439806 |
| 4 | Attribute-Guided Well-Log Interpolation Applie... | Several approaches exist to use trends in 3D s... | 10.1190/1.2996302 |
len(df)
1000
This is a simple class (to learn more about classes, read up on object-oriented programming). It has 4 methods (functions) that implement the workflow:

- `__init__()`: instantiates the class with some 'hyperparameters'.
- `_preprocess()`: does some basic preprocessing on the abstracts.
- `fit()`: constructs the model, which consists of two main pieces: an LSA pipeline over the abstracts, and a distance tree for fast nearest-neighbour lookup.
- `recommend()`: takes a list of 'liked' articles, finds their midpoint in the semantic space, and looks up the closest articles to that midpoint.

import numpy as np
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import RegexpTokenizer
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.neighbors import KDTree
STEMMER = PorterStemmer()
TOKENIZER = RegexpTokenizer(r'\w+')
class ContentRx(object):
    """
    A simple class to implement a scikit-learn-like API,
    and to hold the data.
    """
    def __init__(self,
                 components=100,
                 return_scores=True,
                 metric='euclidean',
                 centroid='median',
                 ngram_range=(1, 2),   # Can be very slow above (1, 2).
                 ignore_fewer_than=0,  # Ignore terms in fewer docs than this.
                 ):
        self.components = components
        self.return_scores = return_scores
        self.centroid = centroid
        self.metric = metric
        self.ngram_range = ngram_range
        self.ignore_fewer_than = ignore_fewer_than

    def _preprocess(self, text):
        """
        Stem and tokenize a piece of text (e.g. an abstract).
        """
        out = [STEMMER.stem(token) for token in TOKENIZER.tokenize(text)]
        return ' '.join(out)

    def fit(self, data):
        """
        Fit a latent semantic analysis (LSA) model:

        * Compute a tf-idf matrix (here, of unigrams and bigrams) for the docs.
        * Reduce it to `components` dimensions with truncated SVD, then normalize.
        * Build a distance tree on the result for fast nearest-neighbour lookup.
        """
        data = [self._preprocess(item) for item in data]

        # Build LSA pipeline: tf-idf, then normalized SVD reduction.
        tfidf = TfidfVectorizer(ngram_range=self.ngram_range,
                                min_df=self.ignore_fewer_than,
                                stop_words='english',
                                )
        svd = TruncatedSVD(n_components=self.components)
        normalize = Normalizer(copy=False)
        lsa = make_pipeline(tfidf, svd, normalize)
        self.X = lsa.fit_transform(data)

        # Build and store the distance tree.
        # For valid metrics, see KDTree.valid_metrics.
        self.tree = KDTree(self.X, metric=self.metric)
        return

    def recommend(self, likes, n_recommend=10):
        """
        Makes a recommendation.
        """
        # Make the query from the input document idxs.
        # Science Concierge uses the Rocchio algorithm,
        # but I don't think I care about 'dislikes'.
        vecs = np.array([self.X[idx] for idx in likes])

        # Collapse the liked vectors to a single query point.
        # (With two likes, the median and the mean coincide.)
        if self.centroid == 'median':
            q = np.median(vecs, axis=0).reshape(1, -1)
        else:
            q = np.mean(vecs, axis=0).reshape(1, -1)

        # Get the matches and their distances.
        dist, idx = self.tree.query(q, k=n_recommend + len(likes))

        # Get rid of the original likes, which may or may not be in the result.
        ind, dist = zip(*[(i, d)
                          for d, i in zip(np.squeeze(dist), np.squeeze(idx))
                          if i not in likes])

        # If the likes weren't in the result, trim the most distant results.
        if self.return_scores:
            return list(ind)[:n_recommend], list(1 - np.array(dist))[:n_recommend]
        return list(ind)[:n_recommend]
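As an aside, the tokenization half of `_preprocess()` is just a regular expression. This stdlib-only sketch shows an equivalent pattern; the stemming half is omitted here, since NLTK's `PorterStemmer` has no short standard-library counterpart:

```python
import re

def tokenize(text):
    r"""Equivalent of NLTK's RegexpTokenizer(r'\w+'):
    return every run of word characters as a token."""
    return re.findall(r'\w+', text)

print(tokenize("Spectral decomposition of a seismogram."))
# → ['Spectral', 'decomposition', 'of', 'a', 'seismogram']
```

Case is left alone on purpose: `TfidfVectorizer` lowercases its input by default in the next step.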
Instantiate the model:
crx = ContentRx(ngram_range=(1,2))
Train the model by fitting to our dataset:
crx.fit(df.abstract)
The model is trained!
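Under the hood, `fit()` starts by weighting terms with tf-idf before the SVD. The sketch below computes the textbook weighting, raw term frequency times log(N/df), on a hypothetical three-document corpus; note that scikit-learn's `TfidfVectorizer` uses a smoothed idf and l2 normalization, so its exact numbers differ.

```python
import math
from collections import Counter

# Hypothetical toy corpus standing in for the abstracts.
docs = [
    "seismic spectral decomposition".split(),
    "seismic noise attenuation".split(),
    "gravity anomaly inversion".split(),
]

def tfidf(docs):
    """Textbook tf-idf: term frequency times log(N / document frequency)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [{t: tf * math.log(n / df[t]) for t, tf in Counter(doc).items()}
            for doc in docs]

weights = tfidf(docs)
# 'seismic' appears in two of the three docs, so it is down-weighted
# relative to 'spectral', which appears in only one.
print(weights[0]['spectral'] > weights[0]['seismic'])  # → True
```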
First, let's find some papers we like. (Remember this is only a subset of 1000 papers.)
s = [i for i, t in enumerate(df.title) if 'spectral decomp' in t.lower()]
s
[79, 127]
df.title[79], df.title[127]
('Seismic Spectral Decomposition Using Deconvolutive Short-Time Fourier Transform Spectrogram', 'Maximum Entropy Spectral Decomposition Of A Seismogram Into Its Minimum Entropy Component Plus Noise')
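`recommend()` will collapse the liked vectors to a single query point and rank everything else by distance to it. Here is a numpy-only sketch of that logic, using toy 2-D vectors in place of the real 100-D LSA space (with two likes, the median and mean centroids coincide):

```python
import numpy as np

# Toy 2-D 'semantic space' with five documents.
X = np.array([[1.0, 0.0],
              [0.9, 0.1],
              [0.0, 1.0],
              [0.8, 0.2],
              [0.1, 0.9]])
likes = [0, 1]

# Centroid of the liked vectors: the query point.
q = X[likes].mean(axis=0)

# Euclidean distance from the query to every document,
# ranked nearest-first, with the likes themselves dropped.
dist = np.linalg.norm(X - q, axis=1)
order = [int(i) for i in np.argsort(dist) if i not in likes]
print(order[:2])  # → [3, 4]
```

This mirrors the post-filtering in `recommend()`, which queries the tree for `n_recommend + len(likes)` neighbours and then discards the likes.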
Now we can get our recommendations:
idx, scores = crx.recommend(likes=s, n_recommend=10)
idx
[737, 257, 718, 164, 863, 252, 721, 642, 766, 355]
df.iloc[idx]
| | title | abstract | doi |
|---|---|---|---|
| 737 | Empirical Mode Decomposition For Seismic Time-... | Time-frequency analysis plays a significant ro... | 10.1190/geo2012-0199.1 |
| 257 | Choice Of Operator Length For Maximum Entropy ... | Empirical evidence based on maximum entropy sp... | 10.1190/1.1440902 |
| 718 | Reservoir Characterization Based On Seismic Sp... | The seismic frequency spectrum provides a usef... | 10.1190/geo2011-0323.1 |
| 164 | Seismic Sequence Analysis And Attribute Extrac... | The variation of frequency content of a seismi... | 10.1190/1.1487136 |
| 863 | Ergodicity Of Stationary White Gaussian Processes | Stationary time series is an important concept... | 10.1190/1.1444502 |
| 252 | Sparse Time-Frequency Representation For Seism... | Attenuation of random noise is a major concern... | 10.1190/geo2015-0341.1 |
| 721 | Maximum‐Entropy Spatial Processing Of Array Data | The procedure of maximum‐entropy spectral anal... | 10.1190/1.1440471 |
| 642 | Predictive Deconvolution And The Zero‐Phase So... | Predictive deconvolution is commonly applied t... | 10.1190/1.1441674 |
| 766 | Spectrum Of The Potential Field Due To Randoml... | Covariance and spectral density functions of a... | 10.1190/1.1439933 |
| 355 | Theory Of Nonstationary Linear Filtering In Th... | A general linear theory describes the extensio... | 10.1190/1.1444318 |
We have scores (computed as 1 minus the distance) for each recommendation too:
for i, s in zip(idx, scores):
print('{:.1f}'.format(100*s).rjust(5), df.title[i])
 22.6 Empirical Mode Decomposition For Seismic Time-Frequency Analysis
 10.7 Choice Of Operator Length For Maximum Entropy Spectral Analysis
  9.8 Reservoir Characterization Based On Seismic Spectral Variations
  7.5 Seismic Sequence Analysis And Attribute Extraction Using Quadratic Time‐Frequency Representations
  7.2 Ergodicity Of Stationary White Gaussian Processes
  6.9 Sparse Time-Frequency Representation For Seismic Noise Reduction Using Low-Rank And Sparse Decomposition
  3.9 Maximum‐Entropy Spatial Processing Of Array Data
  1.6 Predictive Deconvolution And The Zero‐Phase Source
  1.2 Spectrum Of The Potential Field Due To Randomly Distributed Sources
  1.0 Theory Of Nonstationary Linear Filtering In The Fourier Domain With Application To Time‐Variant Filtering
© Agile Geoscience 2017 — licensed under Apache 2.0