Record Linkage


This notebook will provide you with an instruction into Record Linkage using Python. Upon completion of this notebook you will be able to apply record linkage techniques using the recordlinkage package to combine data from different sources in Python. It will lead you through all the steps necessary for a sucessful record linkage starting with data preparation including pre-processing, cleaning and standardization of data.

The Principles of Record Linkage

The goal of record linkage is to determine if pairs of records describe the same entity. For instance, this is important for removing duplicates from a data source or joining two separate data sources together. Record linkage also goes by the terms data matching, merge/purge, duplication detection, de-duping, reference matching, entity resolution, disambiguation, co-reference/anaphora in various fields.

There are several approaches to record linkage that include

- exact matching, 
- rule-based linking and 
- probabilistic linking. 
  • An example of exact matching is joining records based on a direct identifier. This is what we have already done in SQL by joining tables.
  • Rule-based matching involves applying a cascading set of rules that reflect the domain knowledge of the records being linked.
  • In probabilistic record linkages, linkage weights are estimated to calculate the probability of a certain match.

In practical applications you will need record linkage techniques to combine information addressing the same entity that is stored in different data sources. Record linkage will also help you to address the quality of different data sources. For example, if one of your databases has missing values you might be able to fill those by finding an identical pair in a different data source. Overall, the main applications of record linkage are

1. Merging two or more data files. 
2. Identifying the intersection of the two data sets. 
3. Updating data files (with the data row of the other data files) and imputing missing data.
4. Entity disambiguation and de-duplication.

Motivation: Linking Patents to University IPEDS code

In this notebook we show an example of linking a subset of patent data to universities. In both datasets we have university name and location (city, state) that we can use for the linkage. The data that we will use are stored in the university.db database.

Getting Started with Record Linkage

Import of Packages

Python provides us with some tools we can use for record linkages so we don't have to start from scratch and code our own linkage algorithms. So before we start we need to load the package recordlinkage. To fully function, this package uses other packages which also need to be imported. Thus we are adding more packages to the ones you are already familiar with.

In [ ]:
# data manipulation and machine learning
import pandas as pd
import scipy
import sklearn
from sqlite3 import connect

# record linkage package
import recordlinkage as rl
from recordlinkage.preprocessing import clean, phonenumbers, phonetic
In [ ]:
# to create a connection to the database, 
# we need to pass the name of the database 

DB = 'university.db'

conn = connect(DB)

Getting Patents and IPEDS Data

Before we get started on linking two datasets, we need to first bring in our datasets. We'll be linking data from two sources: uspto_org_location and ipeds_location. We'll do this by bringing in the appropriate tables from the database.

In [ ]:
# Specify the PatentsView data table 'uspto_org_location'

query = '''
SELECT *
FROM uspto
'''
# Read it into a pandas dataframe

uspto_org = pd.read_sql(query,conn)
uspto_org = uspto_org[['assignee_id','assignee_organization','assignee_city','assignee_state']]
In [ ]:
# View the table
uspto_org.head()
In [ ]:
# Load the IPEDS university data

query = '''
SELECT *
FROM ipeds
'''
# Read it into a pandas dataframe

ipeds = pd.read_sql(query,conn)
ipeds = ipeds[['unitid','instnm','city','stabbr']]
In [ ]:
# View the table
ipeds.head()

The Importance of Pre-Processing

Data pre-processing is an important step in a data analysis project in general, in record linkage applications in particular. The goal of pre-processing is to transform messy data into a dataset that can be used in a project workflow.

Linking records from different data sources comes with different challenges that need to be addressed by the analyst. The analyst must determine whether or not two entities (individuals, businesses, geographical units) on two different files are the same. This determination is not always easy. In most of the cases there is no common uniquely identifing characteristic for an entity. For example, is Bob Miller from New York the same person as Bob Miller from Chicago in a given dataset? This determination has to be executed carefully because consequences of wrong linkages may be substantial (is person X the same person as the person X on the list of identified terrorists). Pre-processing can help to make better informed decisions.

Pre-processing can be difficult because there are a lot of things to keep in mind. For example, data input errors, such as typos, misspellings, truncation, abbreviations, and missing values need to be corrected. The most common reason why matching projects fail is lack of time and resources for data cleaning.

In the following section we will walk you through some pre-processing steps, these include but are not limited to removing spaces, parsing fields, and standardizing strings.

Clean Patent Data

We will start by cleaning and preprocessing the patent data. We need to remove whitespaces, make sure that everything is in lower case, and harmonize all the other information we need for the linkage.

The record linkage package comes with a built-in cleaning function we can use. The clean() function removes any characters such as '-', '.', '/', '\', ':', brackets of all types, and also lowercases by default.

In [ ]:
# Cleaning names (using the record linkage package tool, see imports)

uspto_org['assignee_organization'] = clean(uspto_org['assignee_organization'])
In [ ]:
uspto_org.head()

By using .str.replace(), we can replace all instances of a white space in a name.

In [ ]:
# Concatenate strings by removing white space
uspto_org['assignee_organization'] = uspto_org['assignee_organization'].str.replace(' ','')

Let's view the finalized names in the patent data.

In [ ]:
uspto_org.head()

Now we are done with the inital data prep work for the patent file. Please keep in mind that we just provided some examples for you to demonstrate the process. You can add as many further steps to it as necessary.

Phonetic Processing

Sometimes, words or names are recorded differently because they are written down as they sound. This can result in failed matches, because the same institution or individual will technically have different written names, even though the names would sound identically when pronounced out loud. To avoid these issues, we will add one more thing: a soundex (a phonetic algorithm for indexing names by sound, as pronounced in English).

The phonetic() function is used to convert strings into their corresponding phonetic codes. This is particularly useful when comparing names where different possible spellings make it difficult to find exact matches (e.g. Jillian and Gillian).

Let's add a column called phonetic_name to our existing table, which will contain the result of applying a phonetic function to the assignee organization name (the phonetic transcription of the name). We are using a method called NYSIIS - the New York State Identification and Intelligence System phonetic code.

In [ ]:
uspto_org["phonetic_name"] = phonetic(uspto_org["assignee_organization"], method="nysiis")
In [ ]:
uspto_org.head()

Checkpoint 1: Pre-process IPEDS data

Let's do the same pre-processing steps for the IPEDS data.

In [ ]:
ipeds.head()

Use function clean() from above on the column with the university name in the IPEDS file.

In [ ]:
ipeds['instnm'] = clean(ipeds['instnm'])

Use function.str.replace() from above to replace all instances of white space.

In [ ]:
ipeds['instnm'] = ipeds['instnm'].str.replace(' ','')

Compare the results with the organization names in the patent file.

In [ ]:
ipeds.head()
In [ ]:
uspto_org.head()

Find phonetic transcriptions of university names in the IPEDS table.

In [ ]:
ipeds["phonetic_name"] = phonetic(ipeds['instnm'], method='nysiis')
In [ ]:
ipeds.head()

Record Linkage

We've done some basic pre-processing of the data, using some of the very useful functions in recordlinkage.preprocessing. Now, let's move on to the actual record linkage portion. Though we can dive right in with comparing two names and checking if they match, this process can actually have a lot of nuance to it. For example, you should consider how long this process will take if you have extremely large datasets, with millions and millions of rows to check against millions and millions of rows. In addition, you should consider how strict you want your matching to be. For example, you want to make sure you catch any typos or common misspellings, but want to avoid relaxing the match condition to the point that anything will match with anything.

Indexing

Indexing allows you to create candidate links, which basically means identifying pairs of data rows which might refer to the same real world entity. This is also called the comparison space (matrix). There are different ways to index data. The easiest is to create a full index and consider every pair a match. This is also the least efficient method, because we will be comparing every row of one dataset with every row of the other dataset.

If we had 10,000 records in data frame A and 100,000 records in data frame B, we would have 1,000,000,000 candidate links. You can see that comparing over a full index is getting inefficient when working with big data.

We can do better if we actually include our knowledge about the data to eliminate bad link from the start. This can be done through blocking. The recordlinkage package gives you multiple options for this. For example, you can block by using variables, which means that only links exactly equal on specified values will be kept.

Here we will block on city and state, to narrow down the number of candidate links.

We need to make sure that the column names that we want to block on are the same in both datasets.

Which columns do we need to rename in both datasets, if we want to link on columns city and state?

In [ ]:
uspto_org.head()
In [ ]:
# Rename the patent dataset columns
uspto_org = uspto_org.rename(columns={'assignee_city':'city'})
uspto_org = uspto_org.rename(columns={'assignee_state':'state'})

In the IPEDS data, the city column has already the target name. Rename the column stabbr to the state column.

In [ ]:
ipeds.head()
In [ ]:
ipeds = ipeds.rename(columns={'stabbr':'state'})

Now that are datasets have the same column names, we can block on them.

In [ ]:
uspto_org.head()
In [ ]:
ipeds.head()
In [ ]:
indexerBL = rl.BlockIndex(on=['city', 'state'])
candidate_links = indexerBL.index(ipeds, uspto_org)
In [ ]:
# Returns pairs of candidate records by their index number in the respective tables
candidate_links

Let's check the first pair of candidate links blocked on city and state: (1, 264)

In [ ]:
ipeds.iloc[1]
In [ ]:
uspto_org.iloc[264]

Record Comparison

After you have created a set of candidate links, you’re ready to begin comparing the records associated with each candidate link. In recordlinkage you must initiate a Compare object prior to performing any comparison functionality between records. The code block below initializes the comparison object.

In [ ]:
# Initiate compare object 
compare_cl = rl.Compare()

Compare.string() method generates a score based on well-known string-comparison algorithms. For this example, Jaro-Winkler distance is used (specifically developed with record linkage applications in mind) - words with more characters in common have a higher Jaro-Winkler value than those with fewer characters in common. The output value is normalized to fall between 0 (complete dissimilar strings) and 1 (exact match on strings). (Information about other string-comparison methods is included in the References section below).

As you remember, we already did an exact matching on city and state, when we did the blocking above and created the candidate links.

We will use the string method to compare the university names and their phonetic transcriptions.

We need to specify the respective columns with organization names in both datasets, the method, and the threshold. In this case, for all strings that have more than 85% in similarity, according to the Jaro-Winkler distance, a 1 will be returned, and otherwise 0.

In [ ]:
# Initiate compare object 
compare_cl = rl.Compare()

compare_cl.string('instnm','assignee_organization', method='jarowinkler', threshold=0.85,label='name')
compare_cl.string('phonetic_name','phonetic_name', method='jarowinkler', threshold=0.85,label='phonetic_name')

The comparing of record pairs starts when the compute method is called.

In [ ]:
## All attribute comparisons are stored in a DataFrame with horizontally the features and vertically the record pairs.

features = compare_cl.compute(candidate_links, ipeds, uspto_org)
In [ ]:
features.head()
In [ ]:
features[features['name'] == 1]

Classification

Let's check how many records we get where one or both of comparison attributes match.

In [ ]:
## Simple Classification: Check for how many attributes records are identical by summing the comparison results.
features.sum(axis=1).value_counts().sort_index(ascending=False)

We can make a decision now, and consider matches all those records which matched on both attributes in our case.

In [ ]:
matches = features[features.sum(axis=1) > 1]  # Filter by those cases which matched on more than 1 attribute
print(len(matches))

There are 280 records, which had an exact match on city and state, and more than 85% in similarity based on university name and the phonetic transcription of the name.

In [ ]:
matches.head()

Now let's merge these matches back to original dataframes.

Our matches dataframe has MultiIndex - two indices to the left which correspond to the ipeds table and uspto_org table respectively.

We can access each matching pair individually, for example, the first one:

In [ ]:
matches.index[0]

We can also do the following: first, pull all the indices for the ipeds table.

In [ ]:
matches.index[0][0]

We will pull all corresponding rows from the ipeds table.

In [ ]:
ipeds_results = []  # Create an empty list

for match in matches.index:  # For every pair in matches (index)
    df = pd.DataFrame(ipeds.loc[[match[0]]])  # Get the location in the original table, convert to dataframe
    ipeds_results.append(df)  # Append to a list
In [ ]:
ipeds_results[0]

Now we concatenate the list of dataframes into one dataframe.

In [ ]:
ipeds_concat = pd.concat(ipeds_results)  
In [ ]:
ipeds_concat.head()

We do the same for the uspto table.

In [ ]:
uspto_results = []  # Create an empty list

for i in matches.index:  # For every pair in matches (index)
    df = pd.DataFrame(uspto_org.loc[[i[1]]])  # Get the location in the original table, convert to dataframe
    uspto_results.append(df)  # Append to a list

uspto_concat = pd.concat(uspto_results)  # Concatenate into one dataframe
In [ ]:
uspto_concat.head()

Now we need to combine two tables on the index - notice that our tables right now have indices from the original tables. We can reset the index using .reset_index().

In [ ]:
ipeds_concat = ipeds_concat.reset_index()
In [ ]:
uspto_concat = uspto_concat.reset_index()

Now our tables have the same index on which we can combine two tables.

In [ ]:
ipeds_concat.head()
In [ ]:
uspto_concat.head()

Now we concatenate these two tables using .concat().

In [ ]:
matched = pd.concat([ipeds_concat,uspto_concat],axis=1)  # Specify axis=1 to concatenate horizontally
In [ ]:
matched.head()

Now that have merged our matches together, examine them. Remember that we matched our strings on 85% similarity and we blocked on city and state - that's why in our table we can see a match in row 3 between universityofmobile and university of south alabama, as they are from the same city and state, and the names have 85% in similarity, according to the Jaro-Winkler distance. Try using a different threshold. You can also use a different string-matching algorithm (please see below in the References).

Checkpoint 2: Record Linkage Decisions

What are some decisions we had to make as we went through the record linkage process above? What if we had made different choices instead? Try doing the record linkage with a few different options and see how many matches you get as you vary the approach.

For example, you can try Levenshtein distance in the string-matching part of the notebook. To see all available methods, search for the recordlinkage package in Python documentation and navigate to the section called Comparison, which lists available methods in string matching.

Fellegi Sunter

Now let's do this with a machine learning classifier. Supervised learning algorithms need training data. Training data is data for which the true match status is known for each comparison vector.

In the example in this section, we will consider true matches those where we block (find an exact match) on the university name, city, and state.

We will need to rename the columns with organization names, so they match on both datasets:

In [ ]:
uspto_org = uspto_org.rename(columns={'assignee_organization':'organization'})
ipeds = ipeds.rename(columns={'instnm':'organization'})
In [ ]:
# Let's consider these the true matches
indexerBL = rl.BlockIndex(on=['organization','city', 'state'])
true_matches = indexerBL.index(ipeds, uspto_org)
In [ ]:
# Let's see how many true matches we get
len(true_matches)

Let's use the features of the first 100,000 features from above.

In [ ]:
## Generate Training Data and index
ml_pairs = features[0:100000]
ml_matches_index = ml_pairs.index & true_matches
In [ ]:
len(ml_matches_index)

The Naive Bayes classifier is a probabilistic classifier. The probabilistic record linkage framework by Fellegi and Sunter (1969) is the most well-known probabilistic classification method for record linkage. Later, it was proved that the Fellegi and Sunter method is mathematically equivalent to the Naive Bayes method in case of assuming independence between comparison variables.

In [ ]:
## Train the classifier
nb = rl.NaiveBayesClassifier()
nb.learn(ml_pairs, ml_matches_index)

## Predict the match status for all record pairs
result_nb = nb.predict(features)
In [ ]:
# Let's see how many matches were predicted by a classifier
len(result_nb)

Evaluation

The last step is to evaluate the results of the record linkage. We will cover this in more detail in the machine learning session. This is just for completeness.

In [ ]:
## Confusion matrix - we include the total number of true matches, the predicted matches, and the total number of records to predict on
conf_nb = rl.confusion_matrix(true_matches, result_nb, len(features))
conf_nb
In [ ]:
## Precision and Accuracy
precision = rl.precision(conf_nb)
accuracy = rl.accuracy(conf_nb)
In [ ]:
## Precision and Accuracy
print(precision)
print(accuracy)
In [ ]:
## The F-score for this classification is
rl.fscore(conf_nb)

Optional

Regular Expressions - regex

We can extract information from strings by using regex search commands.

When defining a regular expression search pattern, it is a good idea to start out by writing down, explicitly, in plain English, what you are trying to search for and exactly how you identify when you've found a match. For example, if we look at an author field formatted as " , ", in plain English, this is how I would explain where to find the last name: "starting from the beginning of the line, take all the characters until you see a comma."

We can build a regular expression that captures this idea from the following components:

  • ^ Matches beginning of the line
  • . Matches any character
  • .+ A modifier that means "match one or more of the preceding expression"

In a regular expression, there are special reserved characters and character classes like those in the list above. Anything that is not a special character or class is just looked for explicitly (for example, a comma is not a special character in regular expressions, so if it is in a regular expression pattern, the regular expression processor will just be looking for a comma in the string, at that point in the pattern).

Note: if you want to actually look for one of these reserved characters, it must be escaped, so that, for example, the expression looks for a literal period, rather than the special regular expression meaning of a period. To escape a reserved character in a regular expression, precede it with a back slash ( "." ). This results in the regular expression: ^.+,

We start at the beginning of the line ( "^" ), matching any characters ( ".+" ) until we come to the literal character of a comma ( "," ).

In python, to use a regular expression like this to search for matches in a given string, we use the built-in "re" package ( https://docs.python.org/2/library/re.html ), specifically the "re.search()" method. To use "re.search()", pass it first the regular expression you want to use to search, enclosed in quotation marks, and then the string you want to search within.

REGEX CHEATSHEET

- abc...     Letters
- 123...     Digits
- \d         Any Digit
- \D         Any non-Digit Character
- .          Any Character
- \.         Period
- [a,b,c]    Only a, b or c
- [^a,b,c]   Not a,b, or c
- [a-z]      Characters a to z
- [0-9]      Numbers 0 to 9
- \w any     Alphanumeric chracter
- \W         any non-Alphanumeric character
- {m}        m Repetitions
- {m,n}      m to n repetitions
- *          Zero or more repetitions
- +          One or more repetitions
- ?          Optional Character
- \s         any Whitespace
- \S         any non-Whitespace character
- ^...$      Starts & Ends
- (...)      Capture Group
- (a(bc))    Capture sub-Group
- (.*)       Capture All
- (abc|def)  Capture abc or def

EXAMPLES

- (\d\d|\D) will match 22X, 23G, 56H, etc...
- \w will match any characters between 0-9 or a-z
- \w{1-3} will match any alphanumeric character of a length of 1 to 3. 

References and Further Readings

Parsing

Regular Expression

String Comparators

Fellegi-Sunter Record Linkage