This IPython notebook explains a basic workflow for matching two tables using py_entitymatching. Our goal is to come up with a workflow to match restaurants from the Fodors and Zagat sites; specifically, we want to maximize F1. The datasets contain information about the restaurants.
First, we need to import py_entitymatching package and other libraries as follows:
import sys
sys.path.append('/Users/pradap/Documents/Research/Python-Package/anhaid/py_entitymatching/')
import py_entitymatching as em
import pandas as pd
import os
# Display the versions
print('python version: ' + sys.version )
print('pandas version: ' + pd.__version__ )
print('magellan version: ' + em.__version__ )
python version: 2.7.13 | packaged by conda-forge | (default, May 2 2017, 13:29:36) [GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.53)]
pandas version: 0.20.3
magellan version: 0.2.0
Matching two tables typically consists of the following three steps:
**1. Reading the input tables**
**2. Blocking the input tables to get a candidate set**
**3. Matching the tuple pairs in the candidate set**
We begin by loading the input tables. For the purpose of this guide, we use the datasets that are included with the package.
# Get the paths
path_A = em.get_install_path() + os.sep + 'datasets' + os.sep + 'end-to-end' + os.sep + 'restaurants/fodors.csv'
path_B = em.get_install_path() + os.sep + 'datasets' + os.sep + 'end-to-end' + os.sep + 'restaurants/zagats.csv'
# Load csv files as dataframes and set the key attribute in the dataframe
A = em.read_csv_metadata(path_A, key='id')
B = em.read_csv_metadata(path_B, key='id')
print('Number of tuples in A: ' + str(len(A)))
print('Number of tuples in B: ' + str(len(B)))
print('Number of tuples in A X B (i.e the cartesian product): ' + str(len(A)*len(B)))
Number of tuples in A: 533
Number of tuples in B: 331
Number of tuples in A X B (i.e the cartesian product): 176423
A.head(2)
| | id | name | addr | city | phone | type |
|---|---|---|---|---|---|---|
| 0 | 534 | arnie mortons of chicago | 435 s. la cienega blv. | los angeles | 310/246-1501 | american |
| 1 | 535 | arts delicatessen | 12224 ventura blvd. | studio city | 818/762-1221 | american |
B.head(2)
| | id | name | addr | city | phone | type |
|---|---|---|---|---|---|---|
| 0 | 1 | apple pan the | 10801 w. pico blvd. | west la | 310-475-3585 | american |
| 1 | 2 | asahi ramen | 2027 sawtelle blvd. | west la | 310-479-2231 | noodle shops |
# Display the keys of the input tables
em.get_key(A), em.get_key(B)
('id', 'id')
# If the tables are large, we can downsample them like this.
# But for the purposes of this notebook, we will use the entire tables A and B.
A1, B1 = em.down_sample(A, B, 200, 1, show_progress=False)
len(A1), len(B1)
(147, 200)
Before we do the matching, we would like to remove the obviously non-matching tuple pairs from the input tables. This reduces the number of tuple pairs considered for matching. py_entitymatching provides four different blockers: (1) attribute equivalence, (2) overlap, (3) rule-based, and (4) black-box. The user can mix and match these blockers to form a blocking sequence that is applied to the input tables.
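To make the idea concrete, here is a minimal pure-Python sketch of attribute-equivalence blocking, independent of the py_entitymatching API (the row data here is made up for illustration):

```python
def attr_equiv_block(table_a, table_b, l_attr, r_attr):
    # Keep only pairs whose blocking attributes are equal,
    # instead of enumerating the full cartesian product.
    index = {}
    for a in table_a:
        index.setdefault(a[l_attr], []).append(a)
    candidates = []
    for b in table_b:
        for a in index.get(b[r_attr], []):
            candidates.append((a['id'], b['id']))
    return candidates

A_rows = [{'id': 1, 'city': 'los angeles'}, {'id': 2, 'city': 'new york'}]
B_rows = [{'id': 10, 'city': 'los angeles'}, {'id': 11, 'city': 'chicago'}]
print(attr_equiv_block(A_rows, B_rows, 'city', 'city'))  # [(1, 10)]
```

The hash-based index makes the cost roughly linear in the table sizes, which is the whole point of blocking.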
For the matching problem at hand, we know that two restaurants in different cities will not match. So we decide to apply blocking over the city attribute:
# Blocking plan
# A, B -- attribute equiv. blocker [city] --------------------|---> candidate set
# Create attribute equivalence blocker
ab = em.AttrEquivalenceBlocker()
# Block using city attribute
C1 = ab.block_tables(A, B, 'city', 'city',
l_output_attrs=['name', 'addr', 'city', 'phone'],
r_output_attrs=['name', 'addr', 'city', 'phone']
)
len(C1)
10165
The number of tuple pairs considered for matching is reduced to 10165 (from 176423), but we would want to make sure that the blocker did not drop any potential matches. We could debug the blocker output in py_entitymatching as follows:
# Debug blocker output
dbg = em.debug_blocker(C1, A, B, output_size=200)
# Display first few tuple pairs from the debug_blocker's output
dbg.head()
| | _id | similarity | ltable_id | rtable_id | ltable_name | ltable_addr | ltable_city | ltable_phone | ltable_type | rtable_name | rtable_addr | rtable_city | rtable_phone | rtable_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.800000 | 569 | 254 | gotham bar & grill | 12 e. 12th st. | new york | 212/620-4020 | american | gotham bar & grill | 12 e. 12th st. | new york city | 212-620-4020 | american (new) |
| 1 | 1 | 0.777778 | 561 | 246 | cafe des artistes | 1 w. 67th st. | new york | 212/877-3500 | continental | cafe des artistes | 1 w. 67th st. | new york city | 212-877-3500 | french (classic) |
| 2 | 2 | 0.777778 | 599 | 284 | union square cafe | 21 e. 16th st. | new york | 212/243-4020 | american | union square cafe | 21 e. 16th st. | new york city | 212-243-4020 | american (new) |
| 3 | 3 | 0.750000 | 571 | 256 | island spice | 402 w. 44th st. | new york | 212/765-1737 | tel caribbean | island spice | 402 w. 44th st. | new york city | 212-765-1737 | caribbean |
| 4 | 4 | 0.750000 | 572 | 257 | jo jo | 160 e. 64th st. | new york | 212/223-5656 | american | jo jo | 160 e. 64th st. | new york city | 212-223-5656 | french bistro |
From the debug blocker's output we observe that the current blocker drops quite a few potential matches. We would want to update the blocking sequence to avoid dropping these potential matches.
For the considered dataset, we know that for two restaurants to match, their names must overlap. We could use an overlap blocker for this purpose. Finally, we union the outputs of the attribute equivalence blocker and the overlap blocker to get a consolidated candidate set.
# Updated blocking sequence
# A, B ------ attribute equivalence [city] -----> C1--
# |----> C
# A, B ------ overlap blocker [name] --------> C2--
# Create overlap blocker
ob = em.OverlapBlocker()
# Block tables using 'name' attribute
C2 = ob.block_tables(A, B, 'name', 'name',
l_output_attrs=['name', 'addr', 'city', 'phone'],
r_output_attrs=['name', 'addr', 'city', 'phone'],
overlap_size=1,
show_progress=False
)
len(C2)
2915
# Display first two rows from C2
C2.head(2)
| | _id | ltable_id | rtable_id | ltable_name | ltable_addr | ltable_city | ltable_phone | rtable_name | rtable_addr | rtable_city | rtable_phone |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 543 | 1 | grill on the alley | 9560 dayton way | los angeles | 310/276-0615 | apple pan the | 10801 w. pico blvd. | west la | 310-475-3585 |
| 1 | 1 | 549 | 1 | the palm | 9001 santa monica blvd. | los angeles | 310/550-8811 | apple pan the | 10801 w. pico blvd. | west la | 310-475-3585 |
# Combine blocker outputs
C = em.combine_blocker_outputs_via_union([C1, C2])
len(C)
12530
We observe that the number of tuple pairs considered for matching has increased to 12530 (from 10165). Now let us debug the blocker output again to check whether the current blocker sequence is still dropping any potential matches.
# Debug again
dbg = em.debug_blocker(C, A, B)
# Display first few rows from the debugger output
dbg.head(3)
| | _id | similarity | ltable_id | rtable_id | ltable_name | ltable_addr | ltable_city | ltable_phone | ltable_type | rtable_name | rtable_addr | rtable_city | rtable_phone | rtable_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.4 | 667 | 230 | drais | 730 n. la cienega blvd. | los angeles | 310/358-8585 | french | lorangerie | 903 n. la cienega blvd. | w. hollywood | 310-652-9770 | french (classic) |
| 1 | 1 | 0.4 | 548 | 230 | matsuhisa | 129 n. la cienega blvd. | beverly hills | 310/659-9639 | asian | lorangerie | 903 n. la cienega blvd. | w. hollywood | 310-652-9770 | french (classic) |
| 2 | 2 | 0.4 | 545 | 233 | lorangerie | 903 n. la cienega blvd. | los angeles | 310/652-9770 | french | matsuhisa | 129 n. la cienega blvd. | beverly hills | 310-659-9639 | seafood |
We observe that the current blocker sequence does not drop obvious potential matches, so we can now proceed to the matching step. A subtle point to note here is that debugging the blocker output provides a practical stopping criterion for modifying the blocker sequence.
In this step, we would want to match the tuple pairs in the candidate set. Specifically, we use a learning-based method for matching. This typically involves the following steps:
First, we randomly sample 450 tuple pairs for labeling purposes.
# Sample candidate set
S = em.sample_table(C, 450)
Next, we label the sampled candidate set. Specifically, we would enter 1 for a match and 0 for a non-match.
# Label S
G = em.label_table(S, 'gold')
For the purposes of this guide, we will load in a pre-labeled dataset (of 450 tuple pairs) included in this package.
# Load the pre-labeled data
path_G = em.get_install_path() + os.sep + 'datasets' + os.sep + 'end-to-end' + os.sep + 'restaurants/lbl_restnt_wf1.csv'
G = em.read_csv_metadata(path_G,
key='_id',
ltable=A, rtable=B,
fk_ltable='ltable_id', fk_rtable='rtable_id')
len(G)
450
In this step, we split the labeled data into two sets: development (I) and evaluation (J). Specifically, the development set is used to come up with the best learning-based matcher, and the evaluation set is used to evaluate the selected matcher on unseen data.
# Split S into development set (I) and evaluation set (J)
IJ = em.split_train_test(G, train_proportion=0.7, random_state=0)
I = IJ['train']
J = IJ['test']
Selecting the best learning-based matcher typically involves the following steps:
# Create a set of ML-matchers
dt = em.DTMatcher(name='DecisionTree', random_state=0)
svm = em.SVMMatcher(name='SVM', random_state=0)
rf = em.RFMatcher(name='RF', random_state=0)
lg = em.LogRegMatcher(name='LogReg', random_state=0)
ln = em.LinRegMatcher(name='LinReg')
nb = em.NBMatcher(name='NaiveBayes')
Next, we need to create a set of features for the development set. py_entitymatching provides a way to automatically generate features based on the attributes in the input tables. For the purposes of this guide, we use the automatically generated features.
# Generate features
feature_table = em.get_features_for_matching(A, B, validate_inferred_attr_types=False)
# List the names of the features generated
feature_table['feature_name']
0 id_id_exm
1 id_id_anm
2 id_id_lev_dist
3 id_id_lev_sim
4 name_name_jac_qgm_3_qgm_3
5 name_name_cos_dlm_dc0_dlm_dc0
6 name_name_jac_dlm_dc0_dlm_dc0
7 name_name_mel
8 name_name_lev_dist
9 name_name_lev_sim
10 name_name_nmw
11 name_name_sw
12 addr_addr_jac_qgm_3_qgm_3
13 addr_addr_cos_dlm_dc0_dlm_dc0
14 addr_addr_jac_dlm_dc0_dlm_dc0
15 addr_addr_mel
16 addr_addr_lev_dist
17 addr_addr_lev_sim
18 addr_addr_nmw
19 addr_addr_sw
20 city_city_jac_qgm_3_qgm_3
21 city_city_cos_dlm_dc0_dlm_dc0
22 city_city_jac_dlm_dc0_dlm_dc0
23 city_city_mel
24 city_city_lev_dist
25 city_city_lev_sim
26 city_city_nmw
27 city_city_sw
28 type_type_jac_qgm_3_qgm_3
29 type_type_cos_dlm_dc0_dlm_dc0
30 type_type_jac_dlm_dc0_dlm_dc0
31 type_type_mel
32 type_type_lev_dist
33 type_type_lev_sim
34 type_type_nmw
35 type_type_sw
Name: feature_name, dtype: object
# Convert I into a set of feature vectors using the feature table
H = em.extract_feature_vecs(I,
feature_table=feature_table,
attrs_after='gold',
show_progress=False)
# Display first few rows
H.head(3)
| | _id | ltable_id | rtable_id | id_id_exm | id_id_anm | id_id_lev_dist | id_id_lev_sim | name_name_jac_qgm_3_qgm_3 | name_name_cos_dlm_dc0_dlm_dc0 | name_name_jac_dlm_dc0_dlm_dc0 | ... | city_city_sw | type_type_jac_qgm_3_qgm_3 | type_type_cos_dlm_dc0_dlm_dc0 | type_type_jac_dlm_dc0_dlm_dc0 | type_type_mel | type_type_lev_dist | type_type_lev_sim | type_type_nmw | type_type_sw | gold |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 221 | 1790 | 563 | 248 | 0 | 0.440497 | 3 | 0.0 | 1.000000 | 1.000000 | 1.000000 | ... | 8.0 | 0.235294 | 0.0 | 0.0 | 0.883333 | 7.0 | 0.416667 | -2.0 | 4.0 | 1 |
| 439 | 794 | 544 | 116 | 0 | 0.213235 | 3 | 0.0 | 0.258065 | 0.500000 | 0.333333 | ... | 1.0 | 0.000000 | 0.0 | 0.0 | 0.466667 | 4.0 | 0.200000 | 0.0 | 1.0 | 0 |
| 191 | 2315 | 589 | 305 | 0 | 0.517827 | 3 | 0.0 | 0.172414 | 0.353553 | 0.200000 | ... | 1.0 | 0.000000 | 0.0 | 0.0 | 0.451923 | 24.0 | 0.000000 | -11.0 | 2.0 | 0 |

3 rows × 40 columns
Now, we select the best matcher using k-fold cross-validation. For the purposes of this guide, we use five-fold cross-validation and the F1 metric to select the best matcher.
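The selection logic can be sketched independently of py_entitymatching using scikit-learn's `cross_val_score` (a hedged sketch on synthetic data, not the notebook's actual feature vectors):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the labeled feature vectors H
X, y = make_classification(n_samples=300, random_state=0)

matchers = {'RF': RandomForestClassifier(random_state=0),
            'LogReg': LogisticRegression(random_state=0)}

# Average F1 over 5 folds for each matcher, then pick the best
scores = {name: cross_val_score(clf, X, y, cv=5, scoring='f1').mean()
          for name, clf in matchers.items()}
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

`em.select_matcher` below performs essentially this loop over the supplied matchers, reporting per-matcher precision, recall, and F1.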
# Select the best ML matcher using CV
result = em.select_matcher([dt, rf, svm, ln, lg, nb], table=H,
exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'gold'],
k=5,
target_attr='gold', metric_to_select_matcher='f1', random_state=0)
result['cv_stats']
| | Matcher | Average precision | Average recall | Average f1 |
|---|---|---|---|---|
| 0 | DecisionTree | 0.976471 | 0.936450 | 0.954078 |
| 1 | RF | 0.975735 | 0.963235 | 0.968450 |
| 2 | SVM | 1.000000 | 0.297437 | 0.440952 |
| 3 | LinReg | 0.944444 | 0.920497 | 0.929657 |
| 4 | LogReg | 0.974603 | 0.948214 | 0.960212 |
| 5 | NaiveBayes | 0.953947 | 0.973214 | 0.962530 |
We observe that even the best matcher is not maximizing F1. We debug the matcher to see what might be wrong. To do this, we first split the feature vectors into train and test sets.
# Split feature vectors into train and test
UV = em.split_train_test(H, train_proportion=0.5)
U = UV['train']
V = UV['test']
Next, we debug the matcher using the GUI. For the purposes of this guide, we use the random forest matcher for debugging.
# Debug random forest using GUI
em.vis_debug_rf(rf, U, V,
exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'gold'],
target_attr='gold')
From the GUI, we observe that phone numbers seem to be an important attribute, but they are in different formats (e.g., '310/246-1501' vs. '310-246-1501'). The current features do not capture this, and adding a feature that accounts for the difference in format can potentially improve F1.
def phone_phone_feature(ltuple, rtuple):
    # Normalize the phone formats (replace '/' with '-' and drop spaces)
    # before checking for an exact match
    p1 = ltuple.phone
    p2 = rtuple.phone
    p1 = p1.replace('/', '-')
    p1 = p1.replace(' ', '')
    p2 = p2.replace('/', '-')
    p2 = p2.replace(' ', '')
    if p1 == p2:
        return 1.0
    else:
        return 0.0
feature_table = em.get_features_for_matching(A, B)
em.add_blackbox_feature(feature_table, 'phone_phone_feature', phone_phone_feature)
The table shows the corresponding attributes along with their respective types. Please confirm that the information has been correctly inferred. If you would like to skip this validation process in the future, please set the flag validate_inferred_attr_types equal to false.
| | Left Attribute | Right Attribute | Left Attribute Type | Right Attribute Type | Example Features |
|---|---|---|---|---|---|
| 0 | id | id | numeric | numeric | Exact Match; Absolute Norm |
| 1 | name | name | short string (1 word to 5 words) | short string (1 word to 5 words) | Jaccard Similarity [3-grams, 3-grams]; Cosine Similarity [Space Delimiter, Space Delimiter] |
| 2 | addr | addr | short string (1 word to 5 words) | short string (1 word to 5 words) | Jaccard Similarity [3-grams, 3-grams]; Cosine Similarity [Space Delimiter, Space Delimiter] |
| 3 | city | city | short string (1 word to 5 words) | short string (1 word to 5 words) | Jaccard Similarity [3-grams, 3-grams]; Cosine Similarity [Space Delimiter, Space Delimiter] |
| 4 | phone | phone | short string (1 word to 5 words) | short string (1 word) | Not Applicable: Types do not match |
| 5 | type | type | short string (1 word to 5 words) | short string (1 word to 5 words) | Jaccard Similarity [3-grams, 3-grams]; Cosine Similarity [Space Delimiter, Space Delimiter] |
Do you want to proceed? (y/n):y
True
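As a quick sanity check (not part of the original notebook), we can exercise the normalization logic of the black-box feature on a pair of hypothetical tuples; the function is redefined here so the snippet is self-contained:

```python
from collections import namedtuple

# Same logic as phone_phone_feature above: normalize separators, then exact-match
def phone_phone_feature(ltuple, rtuple):
    p1 = ltuple.phone.replace('/', '-').replace(' ', '')
    p2 = rtuple.phone.replace('/', '-').replace(' ', '')
    return 1.0 if p1 == p2 else 0.0

# Stand-in for the tuples py_entitymatching passes to a black-box feature
Tuple = namedtuple('Tuple', ['phone'])
print(phone_phone_feature(Tuple('310/246-1501'), Tuple('310-246-1501')))  # 1.0
print(phone_phone_feature(Tuple('310/246-1501'), Tuple('818-762-1221')))  # 0.0
```

The feature fires exactly when the two phone numbers agree after normalization, which is the format mismatch observed in the GUI.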
Now, we repeat extracting the feature vectors (this time with the updated feature table) and select the best matcher again using cross-validation.
H = em.extract_feature_vecs(I, feature_table=feature_table, attrs_after='gold', show_progress=False)
# Select the best ML matcher using CV
result = em.select_matcher([dt, rf, svm, ln, lg, nb], table=H,
exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'gold'],
k=5,
target_attr='gold', metric_to_select_matcher='f1', random_state=0)
result['cv_stats']
| | Matcher | Average precision | Average recall | Average f1 |
|---|---|---|---|---|
| 0 | DecisionTree | 0.986667 | 0.974167 | 0.979755 |
| 1 | RF | 0.988889 | 0.987500 | 0.987834 |
| 2 | SVM | 1.000000 | 0.297437 | 0.440952 |
| 3 | LinReg | 1.000000 | 0.974167 | 0.986652 |
| 4 | LogReg | 0.974603 | 0.948214 | 0.960212 |
| 5 | NaiveBayes | 0.976389 | 0.985714 | 0.980427 |
Now we observe that the best matcher achieves a better F1. Let us stop here and proceed to evaluating the best matcher on unseen data (the evaluation set).
Evaluating the matching outputs for the evaluation set typically involves the following four steps:
As before, we convert the evaluation set into feature vectors (using the feature table).
# Convert J into a set of feature vectors using feature table
L = em.extract_feature_vecs(J, feature_table=feature_table,
attrs_after='gold', show_progress=False)
Now, we train the matcher using all of the feature vectors from the development set. For the purposes of this guide, we use random forest as the selected matcher.
# Train using feature vectors from I
rf.fit(table=H,
exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'gold'],
target_attr='gold')
Next, we predict the matches for the evaluation set (using the feature vectors extracted from it).
# Predict on L
predictions = rf.predict(table=L, exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'gold'],
append=True, target_attr='predicted', inplace=False)
Finally, we evaluate the accuracy of the predicted outputs.
# Evaluate the predictions
eval_result = em.eval_matches(predictions, 'gold', 'predicted')
em.print_eval_summary(eval_result)
Precision : 100.0% (34/34)
Recall : 100.0% (34/34)
F1 : 100.0%
False positives : 0 (out of 34 positive predictions)
False negatives : 0 (out of 101 negative predictions)
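To make the reported numbers concrete, here is a small sketch of how precision, recall, and F1 follow from the true-positive, false-positive, and false-negative counts above:

```python
def prf1(tp, fp, fn):
    # Precision: fraction of positive predictions that are correct
    precision = tp / float(tp + fp)
    # Recall: fraction of actual matches that are found
    recall = tp / float(tp + fn)
    # F1: harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# 34 true positives, 0 false positives, 0 false negatives, as reported above
print(prf1(34, 0, 0))  # (1.0, 1.0, 1.0)
```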