Introduction

In this tutorial we show how to use deepmatcher, a part of the Magellan ecosystem, together with Magellan to perform an end-to-end Entity Matching (EM) task. Specifically, given two tables, Magellan performs EM in two stages:

  1. Perform blocking on the two tables by removing obvious non-matching tuple pairs to get a candidate set K.
  2. Perform matching to predict each pair in K as match or non-match. This stage consists of the following substeps:
    1. Magellan first helps the user take a sample S from K.
    2. The user labels all pairs in S as the training data.
    3. A classifier L will be learned on S.
    4. The user applies L to K to predict each pair as match or non-match.

Given the above workflow, deepmatcher fits into substep 3 of the matching stage: it takes advantage of deep learning to learn the classifier. For the rest of the tutorial, we will walk through this workflow on a real example: matching songs across two tables from iTunes and Amazon Music.

For more information on Magellan, please go to the project website: https://sites.google.com/site/anhaidgroup/projects/magellan.

Step 0. Preparation

In order to use Magellan, we first need to install it. The easiest way is to use "pip" as follows (please consult the package website for other installation options).

In [1]:
try:
    import py_entitymatching as em
except:
    !pip install py_entitymatching
In [2]:
import sys
In [3]:
sys.path.append('/Users/pradap/Documents/Research/Python-Package/anhaid/deepmatcher')
In [4]:
import deepmatcher as dm
import py_entitymatching as em

import os

Step 1. Load data

We first use Magellan to load the two input tables, in csv format, that contain songs from iTunes and Amazon Music. These two tables are in the "sample_data" directory included in our package.

In [5]:
# The path to the two input tables.
path_A = os.path.join('.', 'sample_data', 'itunes-amazon', 'tableA.csv')
path_B = os.path.join('.', 'sample_data', 'itunes-amazon', 'tableB.csv')
In [6]:
# Load the two tables.
A = em.read_csv_metadata(path_A, key='id')
B = em.read_csv_metadata(path_B, key='id')
WARNING:py_entitymatching.io.parsers:Metadata file is not present in the given path; proceeding to read the csv file.
WARNING:py_entitymatching.io.parsers:Metadata file is not present in the given path; proceeding to read the csv file.
In [7]:
# Basic information about the tables.
print('Number of tuples in A: ' + str(len(A)))
print('Number of tuples in B: ' + str(len(B)))
print('Number of tuples in A X B (i.e. the Cartesian product): ' + str(len(A)*len(B)))
Number of tuples in A: 6907
Number of tuples in B: 55923
Number of tuples in A X B (i.e. the Cartesian product): 386260161
In [8]:
# The first few tuples in table A.
A.head()
Out[8]:
id Song_Name Artist_Name Album_Name Genre Price CopyRight Time Released
0 0 Runaway Train Cam Welcome to Cam Country - EP Country , Music , Contemporary Country , Honky Tonk $ 0.99 2015 Sony Music Entertainment 3:01 31-Mar-15
1 1 Track 14 Omi Me 4 U Pop/Rock , Music , Pop , Dance , R&B / Soul Album Only 2015 Ultra Records , LLC under exclusive license to Columbia Records , a Division of Sony Music ... 3:41 NaN
2 2 Lips Are Movin Meghan Trainor Title Pop , Music , Pop/Rock , Dance , Rock , Teen Pop $ 1.29 2014 , 2015 Epic Records , a division of Sony Music Entertainment 3:01 9-Jan-15
3 3 I Want a Hippopotamus for Christmas A Great Big World I 'll Be Home For Christmas Holiday , Music $ 1.29 Compilation ( P ) 2014 Epic Records , a division of Sony Music Entertainment 2:20 24-Nov-14
4 4 Credit Meghan Trainor Title ( Deluxe ) Pop , Music , Rock , Pop/Rock , Dance , Teen Pop $ 1.29 2014 , 2015 Epic Records , a division of Sony Music Entertainment 2:51 9-Jan-15
In [9]:
# The first few tuples in table B.
B.head()
Out[9]:
id Song_Name Artist_Name Album_Name Genre Price CopyRight Time Released
0 0 Saxophone Stomp [ Explicit ] Rusko ! ( Volume 2 ) [ Explicit ] Dance & Electronic , Dubstep $ 1.29 ( C ) 2014 FMLY Under Exclusive License To Universal Music Canada Inc. 3:20 September 16 , 2014
1 1 I Wan na Mingle [ feat . Pusher ] Rusko ! ( Volume 2 ) [ Explicit ] Dance & Electronic , Dubstep $ 1.29 ( C ) 2014 FMLY Under Exclusive License To Universal Music Canada Inc. 2:36 September 16 , 2014
2 2 Lytah Rusko ! ( Volume 2 ) [ Explicit ] Dance & Electronic , Dubstep $ 1.29 ( C ) 2014 FMLY Under Exclusive License To Universal Music Canada Inc. 3:48 September 16 , 2014
3 3 Slappy Pappy Rusko ! ( Volume 2 ) [ Explicit ] Dance & Electronic , Dubstep $ 1.29 ( C ) 2014 FMLY Under Exclusive License To Universal Music Canada Inc. 3:55 September 16 , 2014
4 4 Fushion Jam Rusko ! ( Volume 2 ) [ Explicit ] Dance & Electronic , Dubstep $ 1.29 ( C ) 2014 FMLY Under Exclusive License To Universal Music Canada Inc. 1:48 September 16 , 2014

Step 2. Block tables to get the candidate set

We first perform blocking on A and B to get a candidate set K, by removing obvious non-matching tuple pairs. Magellan supports four types of blockers: (1) attribute equivalence, (2) overlap, (3) rule-based, and (4) black-box. Typically, users need to mix and match these blockers, with the debugging functionality provided in Magellan, to get a good candidate set. (Developing a good blocker is not the focus of this tutorial. For more information on developing and debugging a blocker, please consult the user manual of Magellan.)

Here we show an example of blocking. Observing that matching tuple pairs should share some words in the album name, we first create an overlap blocker on the attribute "Album_Name" with threshold 2, to remove all pairs whose word overlap on that attribute is less than 2.
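The core test an overlap blocker applies can be sketched in plain Python. This is only an illustration of the idea, assuming simple whitespace tokenization; Magellan's actual OverlapBlocker supports more options (e.g. q-gram tokenization and stopword removal), so details may differ.

```python
# Minimal sketch of word-overlap blocking (assumes whitespace tokenization;
# Magellan's OverlapBlocker has more options than shown here).
def word_overlap(left_value, right_value):
    """Number of distinct words shared by the two attribute values."""
    return len(set(str(left_value).lower().split()) &
               set(str(right_value).lower().split()))

def survives_blocking(left_value, right_value, overlap_size=2):
    """Keep the pair only if the attributes share at least overlap_size words."""
    return word_overlap(left_value, right_value) >= overlap_size

# A likely matching pair of album names shares two words and survives:
print(survives_blocking('Title ( Deluxe )', 'Title Deluxe Edition'))  # True
# An obvious non-match shares no words and is dropped:
print(survives_blocking('Me and You', 'Welcome to Cam Country'))      # False
```

Pairs dropped at this stage are never seen by the matcher, which is why the threshold should be conservative: a false drop here is unrecoverable.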

In [10]:
# Create an overlap blocker in Magellan and apply it to A and B to get the candidate set K1, which is
# a dataframe. The "l_output_attrs" and "r_output_attrs" parameters indicate the columns from A and B,
# respectively, that will be included in K1.
ob = em.OverlapBlocker()
K1 = ob.block_tables(A, B, 'Album_Name', 'Album_Name',
                    l_output_attrs=['Song_Name', 'Artist_Name', 'Album_Name', 'Genre', 'Price', 'CopyRight', 'Time', 'Released'], 
                    r_output_attrs=['Song_Name', 'Artist_Name', 'Album_Name', 'Genre', 'Price', 'CopyRight', 'Time', 'Released'],
                    overlap_size=2)
0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:18
In [11]:
# The number of tuple pairs in K1.
len(K1)
Out[11]:
3152021
In [12]:
# The first few tuple pairs in K1
K1.head()
Out[12]:
_id ltable_id rtable_id ltable_Song_Name ltable_Artist_Name ltable_Album_Name ltable_Genre ltable_Price ltable_CopyRight ltable_Time ltable_Released rtable_Song_Name rtable_Artist_Name rtable_Album_Name rtable_Genre rtable_Price rtable_CopyRight rtable_Time rtable_Released
0 0 3995 0 Rudolph Blue Brian Fechino A Rock By The Sea Christmas : : Volume 2 Holiday , Music , Rock $ 0.99 2011 Hubbub ! 1:57 8-Nov-11 Saxophone Stomp [ Explicit ] Rusko ! ( Volume 2 ) [ Explicit ] Dance & Electronic , Dubstep $ 1.29 ( C ) 2014 FMLY Under Exclusive License To Universal Music Canada Inc. 3:20 September 16 , 2014
1 1 3995 1 Rudolph Blue Brian Fechino A Rock By The Sea Christmas : : Volume 2 Holiday , Music , Rock $ 0.99 2011 Hubbub ! 1:57 8-Nov-11 I Wan na Mingle [ feat . Pusher ] Rusko ! ( Volume 2 ) [ Explicit ] Dance & Electronic , Dubstep $ 1.29 ( C ) 2014 FMLY Under Exclusive License To Universal Music Canada Inc. 2:36 September 16 , 2014
2 2 3995 2 Rudolph Blue Brian Fechino A Rock By The Sea Christmas : : Volume 2 Holiday , Music , Rock $ 0.99 2011 Hubbub ! 1:57 8-Nov-11 Lytah Rusko ! ( Volume 2 ) [ Explicit ] Dance & Electronic , Dubstep $ 1.29 ( C ) 2014 FMLY Under Exclusive License To Universal Music Canada Inc. 3:48 September 16 , 2014
3 3 3995 3 Rudolph Blue Brian Fechino A Rock By The Sea Christmas : : Volume 2 Holiday , Music , Rock $ 0.99 2011 Hubbub ! 1:57 8-Nov-11 Slappy Pappy Rusko ! ( Volume 2 ) [ Explicit ] Dance & Electronic , Dubstep $ 1.29 ( C ) 2014 FMLY Under Exclusive License To Universal Music Canada Inc. 3:55 September 16 , 2014
4 4 3995 4 Rudolph Blue Brian Fechino A Rock By The Sea Christmas : : Volume 2 Holiday , Music , Rock $ 0.99 2011 Hubbub ! 1:57 8-Nov-11 Fushion Jam Rusko ! ( Volume 2 ) [ Explicit ] Dance & Electronic , Dubstep $ 1.29 ( C ) 2014 FMLY Under Exclusive License To Universal Music Canada Inc. 1:48 September 16 , 2014

Notice that K1 has more than 3 million pairs, which is too large to be considered for matching. So we apply the overlap blocker again, this time on the attribute "Artist_Name" with threshold 1, to filter out all pairs in K1 that do not share any word in "Artist_Name". This gives us the candidate set K2.

In [13]:
# Create a new overlap blocker to remove pairs from K1 that have no common word in "Artist_Name".
K2 = ob.block_candset(K1, 'Artist_Name', 'Artist_Name', overlap_size=1)
0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:29
In [14]:
# The number of tuple pairs in K2.
len(K2)
Out[14]:
167180

After the second blocking step, we have a candidate set K2 with about 170K pairs. But this is still a bit too large to consider for matching, so we apply a third blocker, an overlap blocker on the attribute "Song_Name", to further reduce the size of the candidate set.

In [15]:
# Apply the third overlap blocker.
K3 = ob.block_candset(K2, 'Song_Name', 'Song_Name', overlap_size=1)
0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:01
In [16]:
# The number of tuples pairs in K3.
len(K3)
Out[16]:
38015

Now we have a candidate set with 38K pairs, which is a reasonable size, so we take K3 as the final candidate set. We save it to disk in csv format for future reuse.

In [17]:
path_K = os.path.join('.', 'sample_data', 'itunes-amazon', 'e2e-tutorial', 'candidate.csv')
K3.to_csv(path_K, index=False)

Step 3. Match tuple pairs in the candidate set

In this stage we match the tuple pairs in the candidate set, predicting each of them as match or non-match. This is the part where deepmatcher is involved. Specifically, it consists of the following steps:

  1. Take a sample S from the candidate set K, and label all pairs in S.
  2. Train a classifier L using S. Specifically, we will train a classifier using deepmatcher.
  3. Apply L to the candidate set K.

Sample and label the candidate set

We first take a random sample S from the candidate set K using Magellan. Here, for example, we sample 500 pairs from K. Then we label each pair as match (enter 1) or non-match (enter 0) and use S as the training data.

In [ ]:
# Take a sample of 500 pairs from the candidate set.
S = em.sample_table(K3, 500)
In [ ]:
# Label the sample S in a GUI. Enter 1 for match and 0 for non-match.
G = em.label_table(S, 'gold')
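Under the hood, em.sample_table draws a random sample of the requested size from the candidate set. A rough pandas equivalent is sketched below, using a hypothetical stand-in dataframe (this is not Magellan's actual implementation, which also maintains catalog metadata).

```python
import pandas as pd

# Hypothetical stand-in for the candidate set K3; only its size matters here.
K3 = pd.DataFrame({'ltable_id': range(38015), 'rtable_id': range(38015)})

# Draw a random sample of 500 pairs without replacement. Fixing random_state
# makes the sample (and hence the labeling effort) reproducible across runs.
S = K3.sample(n=500, random_state=42)
print(len(S))  # 500
```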

For the purposes of this tutorial, we will load in a pre-labeled dataset (of 539 tuple pairs) included in this package.

In [18]:
# The path to the labeled data file.
path_G = os.path.join('.', 'sample_data', 'itunes-amazon', 'e2e-tutorial', 'gold.csv')
In [19]:
# Load the labeled data into a dataframe.
G = em.read_csv_metadata(path_G, 
                         key='_id',
                         ltable=A, rtable=B, 
                         fk_ltable='ltable_id', fk_rtable='rtable_id')
print('Number of labeled pairs:', len(G))
WARNING:py_entitymatching.io.parsers:Metadata file is not present in the given path; proceeding to read the csv file.
Number of labeled pairs: 539

Train a classifier using labeled data

Once we have the labeled data, we use deepmatcher to train a classifier. The first thing we need to do is split the data for training. In this example, we split the labeled data into three parts: training, validation, and test data, with a ratio of 3:1:1. (For now we only support splitting the labeled data into three parts, train/valid/test, where the validation set is used to select the best model across the training epochs.) To enable data caching and progressive training, we first save the split parts to disk as csv files, then load them back in. A cache file is saved during the loading procedure. Subsequent training runs use the cache file to save preprocessing time on the raw csv files, unless the csv files are modified (in which case a new cache file is generated).
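The 3:1:1 proportions can be sketched as follows, on hypothetical toy data (dm.data.split additionally writes the csv files and preserves Magellan metadata; here we only show how the ratio translates into split sizes):

```python
import pandas as pd

# Hypothetical labeled data standing in for G; only its shape matters here.
G = pd.DataFrame({'_id': range(500), 'gold': [i % 2 for i in range(500)]})

# Shuffle once, then cut into 3:1:1 = 60% / 20% / 20% slices.
shuffled = G.sample(frac=1, random_state=0).reset_index(drop=True)
n = len(shuffled)
n_train = n * 3 // 5
n_valid = n // 5
train = shuffled.iloc[:n_train]
valid = shuffled.iloc[n_train:n_train + n_valid]
test = shuffled.iloc[n_train + n_valid:]
print(len(train), len(valid), len(test))  # 300 100 100
```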

In [20]:
# The directory where the data splits will be saved.
split_path = os.path.join('.', 'sample_data', 'itunes-amazon', 'e2e-tutorial')
In [21]:
# Split the labeled data into train, valid, and test csv files on disk, with a split ratio of 3:1:1.
dm.data.split(G, split_path, 'train.csv', 'valid.csv', 'test.csv',
              [3, 1, 1])
In [22]:
# Load the training data files from the disk. Ignore the "left_id" and "right_id" 
# columns for data preprocessing.
# The 'use_magellan_convention' parameter asks deepmatcher to use Magellan's 
# naming convention for the left and right table column name prefixes 
# ("ltable_", and "rtable_"), and also to consider "_id" as the ID column.
train, validation, test = dm.data.process(
    path=os.path.join('.', 'sample_data', 'itunes-amazon', 'e2e-tutorial'),
    cache='train_cache.pth',
    train='train.csv',
    validation='valid.csv',
    test='test.csv',
    use_magellan_convention=True,
    ignore_columns=('ltable_id', 'rtable_id'))
WARNING:deepmatcher.data.dataset:Rebuilding data cache because: ['One or more data files have been modified.']

Reading and processing data from "./sample_data/itunes-amazon/e2e-tutorial/train.csv"
0% [############################# ] 100% | ETA: 00:00:00
Reading and processing data from "./sample_data/itunes-amazon/e2e-tutorial/valid.csv"
0% [############################# ] 100% | ETA: 00:00:00
Reading and processing data from "./sample_data/itunes-amazon/e2e-tutorial/test.csv"
0% [############################# ] 100% | ETA: 00:00:00
Building vocabulary
0% [#] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00

Computing principal components
0% [#] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00

After we get the training data, we can use deepmatcher to train a classifier. Here we train a hybrid model.

In [23]:
# Create a hybrid model.
model = dm.MatchingModel(attr_summarizer='hybrid')
In [24]:
# Train the hybrid model for 10 epochs, with batch size 16 and a positive-to-negative 
# ratio of 3. We save the best model (with the 
# highest F1 score on the validation set) to 'hybrid_model.pth'.
model.run_train(
    train,
    validation,
    epochs=10,
    batch_size=16,
    best_save_path='hybrid_model.pth',
    pos_neg_ratio=3)
* Number of trainable parameters: 17757810
===>  TRAIN Epoch 1
0% [████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:04
Finished Epoch 1 || Run Time:    4.2 | Load Time:    0.4 || F1:  39.60 | Prec:  31.25 | Rec:  54.05 || Ex/s:  69.89

===>  EVAL Epoch 1
0% [█] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00
Finished Epoch 1 || Run Time:    0.6 | Load Time:    0.2 || F1:  60.61 | Prec:  50.00 | Rec:  76.92 || Ex/s: 146.30

* Best F1: 60.60606060606061
Saving best model...
Done.
---------------------

===>  TRAIN Epoch 2
0% [████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:04
Finished Epoch 2 || Run Time:    3.9 | Load Time:    0.4 || F1:  65.62 | Prec:  53.39 | Rec:  85.14 || Ex/s:  75.12

===>  EVAL Epoch 2
0% [█] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00
Finished Epoch 2 || Run Time:    0.6 | Load Time:    0.2 || F1:  76.00 | Prec:  79.17 | Rec:  73.08 || Ex/s: 146.34

* Best F1: 76.0
Saving best model...
Done.
---------------------

===>  TRAIN Epoch 3
0% [████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:03
Finished Epoch 3 || Run Time:    3.8 | Load Time:    0.4 || F1:  87.27 | Prec:  79.12 | Rec:  97.30 || Ex/s:  76.32

===>  EVAL Epoch 3
0% [█] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00
Finished Epoch 3 || Run Time:    0.6 | Load Time:    0.2 || F1:  84.21 | Prec:  77.42 | Rec:  92.31 || Ex/s: 143.25

* Best F1: 84.21052631578948
Saving best model...
Done.
---------------------

===>  TRAIN Epoch 4
0% [████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:03
Finished Epoch 4 || Run Time:    3.8 | Load Time:    0.4 || F1:  90.57 | Prec:  84.71 | Rec:  97.30 || Ex/s:  76.22

===>  EVAL Epoch 4
0% [█] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00
Finished Epoch 4 || Run Time:    0.6 | Load Time:    0.2 || F1:  82.76 | Prec:  75.00 | Rec:  92.31 || Ex/s: 146.11

---------------------

===>  TRAIN Epoch 5
0% [████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:04
Finished Epoch 5 || Run Time:    3.9 | Load Time:    0.4 || F1:  97.37 | Prec:  94.87 | Rec: 100.00 || Ex/s:  74.96

===>  EVAL Epoch 5
0% [█] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00
Finished Epoch 5 || Run Time:    0.6 | Load Time:    0.2 || F1:  90.57 | Prec:  88.89 | Rec:  92.31 || Ex/s: 146.11

* Best F1: 90.56603773584906
Saving best model...
Done.
---------------------

===>  TRAIN Epoch 6
0% [████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:04
Finished Epoch 6 || Run Time:    3.9 | Load Time:    0.4 || F1:  98.67 | Prec:  97.37 | Rec: 100.00 || Ex/s:  74.94

===>  EVAL Epoch 6
0% [█] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00
Finished Epoch 6 || Run Time:    0.6 | Load Time:    0.2 || F1:  90.20 | Prec:  92.00 | Rec:  88.46 || Ex/s: 145.73

---------------------

===>  TRAIN Epoch 7
0% [████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:03
Finished Epoch 7 || Run Time:    3.8 | Load Time:    0.4 || F1:  99.33 | Prec:  98.67 | Rec: 100.00 || Ex/s:  75.88

===>  EVAL Epoch 7
0% [█] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00
Finished Epoch 7 || Run Time:    0.6 | Load Time:    0.2 || F1:  90.20 | Prec:  92.00 | Rec:  88.46 || Ex/s: 145.56

---------------------

===>  TRAIN Epoch 8
0% [████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:03
Finished Epoch 8 || Run Time:    3.9 | Load Time:    0.4 || F1:  99.33 | Prec:  98.67 | Rec: 100.00 || Ex/s:  75.16

===>  EVAL Epoch 8
0% [█] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00
Finished Epoch 8 || Run Time:    0.6 | Load Time:    0.2 || F1:  92.00 | Prec:  95.83 | Rec:  88.46 || Ex/s: 145.76

* Best F1: 92.00000000000001
Saving best model...
Done.
---------------------

===>  TRAIN Epoch 9
0% [████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:03
Finished Epoch 9 || Run Time:    3.7 | Load Time:    0.4 || F1:  99.33 | Prec:  98.67 | Rec: 100.00 || Ex/s:  77.50

===>  EVAL Epoch 9
0% [█] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00
Finished Epoch 9 || Run Time:    0.6 | Load Time:    0.2 || F1:  92.00 | Prec:  95.83 | Rec:  88.46 || Ex/s: 145.27

---------------------

===>  TRAIN Epoch 10
0% [████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:04
Finished Epoch 10 || Run Time:    3.9 | Load Time:    0.4 || F1:  99.33 | Prec:  98.67 | Rec: 100.00 || Ex/s:  75.28

===>  EVAL Epoch 10
0% [█] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00
Finished Epoch 10 || Run Time:    0.6 | Load Time:    0.2 || F1:  92.00 | Prec:  95.83 | Rec:  88.46 || Ex/s: 145.95

---------------------

Loading best model...
Training done.
Out[24]:
92.00000000000001

Once we have the trained classifier, we can evaluate the accuracy using the test data.

In [25]:
# Evaluate the accuracy on the test data.
model.run_eval(test)
===>  EVAL Epoch 8
Finished Epoch 8 || Run Time:    0.3 | Load Time:    0.2 || F1:  92.31 | Prec:  90.91 | Rec:  93.75 || Ex/s: 208.04

Out[25]:
92.3076923076923
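The F1 numbers reported above combine precision and recall in the standard way. A minimal self-contained sketch of the computation, on toy labels rather than the tutorial's data:

```python
# Standard F1 from binary predictions against gold labels:
# precision = TP / (TP + FP), recall = TP / (TP + FN),
# F1 = harmonic mean of precision and recall.
def f1_score(gold, pred):
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy example: 3 true matches; the predictor finds 2 of them plus 1 false positive,
# so precision = recall = 2/3 and F1 = 66.67 (on deepmatcher's 0-100 scale).
gold = [1, 1, 1, 0, 0, 0]
pred = [1, 1, 0, 1, 0, 0]
print(round(100 * f1_score(gold, pred), 2))  # 66.67
```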

Apply the trained classifier to the candidate set

Now that we have a trained classifier, we can apply it to the candidate set to predict each pair as match or non-match. To do this, we first load the trained model and the candidate set.

In [26]:
# Load the model.
model = dm.MatchingModel(attr_summarizer='hybrid')
model.load_state('hybrid_model.pth')
In [27]:
# Load the candidate set. Note that the trained model is an input parameter, as we need the 
# trained model for candidate set preprocessing.
candidate = dm.data.process_unlabeled(
    path=os.path.join('.', 'sample_data', 'itunes-amazon', 'e2e-tutorial', 'candidate.csv'),
    trained_model=model,
    ignore_columns=('ltable_id', 'rtable_id'))
Reading and processing data from "./sample_data/itunes-amazon/e2e-tutorial/candidate.csv"
0% [##############################] 100% | ETA: 00:00:00

After loading the candidate set, we make predictions using the run_prediction function. The output_attributes argument indicates the columns to include in the prediction table. Here we use all of the columns in the candidate set.

In [28]:
# Predict the pairs in the candidate set and return a dataframe containing the pair id with 
# the score of being a match.
predictions = model.run_prediction(candidate, output_attributes=list(candidate.get_raw_table().columns))
===>  PREDICT Epoch 8
0% [██████████████████████████████] 100% | ETA: 00:00:00
Total time elapsed: 00:02:18
Finished Epoch 8 || Run Time:   97.2 | Load Time:   41.5 || F1:   0.00 | Prec:   0.00 | Rec:   0.00 || Ex/s:   0.00

Below are a few pairs from the prediction table. Note that this table is not sorted in descending order of match score.

In [29]:
predictions.head()
Out[29]:
match_score ltable_id rtable_id ltable_Song_Name ltable_Artist_Name ltable_Album_Name ltable_Genre ltable_Price ltable_CopyRight ltable_Time ltable_Released rtable_Song_Name rtable_Artist_Name rtable_Album_Name rtable_Genre rtable_Price rtable_CopyRight rtable_Time rtable_Released
_id
3237 0.960461 2583 228 #Selfie ( Instrumental Mix ) The Chainsmokers #Selfie ( Instrumental Mix ) - Single Dance , Music , Electronic $ 1.29 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordings , Inc. 3:03 20-Mar-14 #SELFIE ( Instrumental Mix ) The Chainsmokers #SELFIE ( Instrumental Mix ) Dance & Electronic $ 1.29 ( C ) 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordin... 3:03 March 20 , 2014
3274 0.294473 2003 229 #SELFIE ( Caked Up Remix ) The Chainsmokers #SELFIE ( The Remixes ) - Single Electronic , Music , Dance $ 1.29 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordings , I... 3:15 17-Jun-14 #Selfie ( Will Sparks Remix ) The Chainsmokers #Selfie ( The Remixes ) Electronica , Dance & Electronic $ 1.29 ( C ) 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordin... 3:52 June 17 , 2014
3275 0.294473 2574 229 #SELFIE ( Caked Up Remix ) The Chainsmokers #SELFIE ( The Remixes ) - Single Electronic , Music , Dance $ 1.29 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordings , I... 3:15 17-Jun-14 #Selfie ( Will Sparks Remix ) The Chainsmokers #Selfie ( The Remixes ) Electronica , Dance & Electronic $ 1.29 ( C ) 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordin... 3:52 June 17 , 2014
3276 0.294473 2575 229 #SELFIE ( Caked Up Remix ) The Chainsmokers #SELFIE ( The Remixes ) - Single Electronic , Music , Dance $ 1.29 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordings , I... 3:15 17-Jun-14 #Selfie ( Will Sparks Remix ) The Chainsmokers #Selfie ( The Remixes ) Electronica , Dance & Electronic $ 1.29 ( C ) 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordin... 3:52 June 17 , 2014
3359 0.161323 2003 230 #SELFIE ( Caked Up Remix ) The Chainsmokers #SELFIE ( The Remixes ) - Single Electronic , Music , Dance $ 1.29 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordings , I... 3:15 17-Jun-14 #Selfie ( Botnek Remix ) The Chainsmokers #Selfie ( The Remixes ) Electronica , Dance & Electronic $ 1.29 ( C ) 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordin... 3:36 June 17 , 2014

We can manipulate the prediction table above to get the pairs we want. Suppose we are only interested in pairs with a match score of at least 0.9, and we want to display them sorted in descending order of match score. This can be done as follows.

In [30]:
high_score_pairs = predictions[predictions['match_score'] >= 0.9].sort_values(by=['match_score'], ascending=False)
high_score_pairs.head()
Out[30]:
match_score ltable_id rtable_id ltable_Song_Name ltable_Artist_Name ltable_Album_Name ltable_Genre ltable_Price ltable_CopyRight ltable_Time ltable_Released rtable_Song_Name rtable_Artist_Name rtable_Album_Name rtable_Genre rtable_Price rtable_CopyRight rtable_Time rtable_Released
_id
1472466 0.995653 2462 30798 It 's Never Easy to Say Goodbye Kenny Chesney Me and You Country , Music , Urban Cowboy , Contemporary Country $ 0.99 1996 BMG Entertainment 4:44 1-Jan-90 It 's Never Easy To Say Goodbye Kenny Chesney Me And You Country $ 0.99 ( C ) 2011 MPL Communications Ltd/Inc under exclusive license to StarCon LLC 4:44 January 1 , 1990
1472465 0.995653 2453 30798 It 's Never Easy to Say Goodbye Kenny Chesney Me and You Country , Music , Urban Cowboy , Contemporary Country $ 0.99 1996 BMG Entertainment 4:44 1-Jan-90 It 's Never Easy To Say Goodbye Kenny Chesney Me And You Country $ 0.99 ( C ) 2011 MPL Communications Ltd/Inc under exclusive license to StarCon LLC 4:44 January 1 , 1990
1472464 0.995653 2438 30798 It 's Never Easy to Say Goodbye Kenny Chesney Me and You Country , Music , Urban Cowboy , Contemporary Country $ 0.99 1996 BMG Entertainment 4:44 1-Jan-90 It 's Never Easy To Say Goodbye Kenny Chesney Me And You Country $ 0.99 ( C ) 2011 MPL Communications Ltd/Inc under exclusive license to StarCon LLC 4:44 January 1 , 1990
1472463 0.995653 2435 30798 It 's Never Easy to Say Goodbye Kenny Chesney Me and You Country , Music , Urban Cowboy , Contemporary Country $ 0.99 1996 BMG Entertainment 4:44 1-Jan-90 It 's Never Easy To Say Goodbye Kenny Chesney Me And You Country $ 0.99 ( C ) 2011 MPL Communications Ltd/Inc under exclusive license to StarCon LLC 4:44 January 1 , 1990
1472462 0.995653 2430 30798 It 's Never Easy to Say Goodbye Kenny Chesney Me and You Country , Music , Urban Cowboy , Contemporary Country $ 0.99 1996 BMG Entertainment 4:44 1-Jan-90 It 's Never Easy To Say Goodbye Kenny Chesney Me And You Country $ 0.99 ( C ) 2011 MPL Communications Ltd/Inc under exclusive license to StarCon LLC 4:44 January 1 , 1990

As you have seen, deepmatcher does not include a column of match / non-match predictions in the output table. You can easily create such a column by thresholding match_score at 0.5 (which is the threshold deepmatcher uses to compute F1), or at a threshold value of your choice.

In [32]:
predictions['match_prediction'] = predictions['match_score'].apply(lambda score: 1 if score >= 0.5 else 0)

# Reorder columns to avoid scrolling...
predictions = predictions[['match_score', 'match_prediction'] + predictions.columns.values[1:-1].tolist()]
predictions.head()
Out[32]:
match_score match_prediction ltable_id rtable_id ltable_Song_Name ltable_Artist_Name ltable_Album_Name ltable_Genre ltable_Price ltable_CopyRight ltable_Time ltable_Released rtable_Song_Name rtable_Artist_Name rtable_Album_Name rtable_Genre rtable_Price rtable_CopyRight rtable_Time rtable_Released
_id
3237 0.960461 1 2583 228 #Selfie ( Instrumental Mix ) The Chainsmokers #Selfie ( Instrumental Mix ) - Single Dance , Music , Electronic $ 1.29 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordings , Inc. 3:03 20-Mar-14 #SELFIE ( Instrumental Mix ) The Chainsmokers #SELFIE ( Instrumental Mix ) Dance & Electronic $ 1.29 ( C ) 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordin... 3:03 March 20 , 2014
3274 0.294473 0 2003 229 #SELFIE ( Caked Up Remix ) The Chainsmokers #SELFIE ( The Remixes ) - Single Electronic , Music , Dance $ 1.29 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordings , I... 3:15 17-Jun-14 #Selfie ( Will Sparks Remix ) The Chainsmokers #Selfie ( The Remixes ) Electronica , Dance & Electronic $ 1.29 ( C ) 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordin... 3:52 June 17 , 2014
3275 0.294473 0 2574 229 #SELFIE ( Caked Up Remix ) The Chainsmokers #SELFIE ( The Remixes ) - Single Electronic , Music , Dance $ 1.29 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordings , I... 3:15 17-Jun-14 #Selfie ( Will Sparks Remix ) The Chainsmokers #Selfie ( The Remixes ) Electronica , Dance & Electronic $ 1.29 ( C ) 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordin... 3:52 June 17 , 2014
3276 0.294473 0 2575 229 #SELFIE ( Caked Up Remix ) The Chainsmokers #SELFIE ( The Remixes ) - Single Electronic , Music , Dance $ 1.29 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordings , I... 3:15 17-Jun-14 #Selfie ( Will Sparks Remix ) The Chainsmokers #Selfie ( The Remixes ) Electronica , Dance & Electronic $ 1.29 ( C ) 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordin... 3:52 June 17 , 2014
3359 0.161323 0 2003 230 #SELFIE ( Caked Up Remix ) The Chainsmokers #SELFIE ( The Remixes ) - Single Electronic , Music , Dance $ 1.29 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordings , I... 3:15 17-Jun-14 #Selfie ( Botnek Remix ) The Chainsmokers #Selfie ( The Remixes ) Electronica , Dance & Electronic $ 1.29 ( C ) 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordin... 3:36 June 17 , 2014
In [33]:
# Reset index as Magellan requires the key to be a column in the table
predictions.reset_index(inplace=True)
In [34]:
# Update metadata in the catalog. This information can later be used by triggers to modify the labels from 
# the learning-based matcher 
em.set_key(predictions, '_id')
em.set_fk_ltable(predictions, 'ltable_id')
em.set_fk_rtable(predictions, 'rtable_id')
em.set_ltable(predictions, A)
em.set_rtable(predictions, B)
Out[34]:
True