As part of the *Magellan* ecosystem, this tutorial shows how to use deepmatcher together with Magellan to perform an end-to-end Entity Matching (EM) task. Given two tables, Magellan performs EM in two stages: a blocking stage that removes obviously non-matching tuple pairs to produce a candidate set, and a matching stage that predicts each remaining pair as a match or a non-match. deepmatcher fits into the matching stage, using deep learning to learn the classifier. For the rest of the tutorial, we will walk through this workflow on a real example: matching songs across two tables from iTunes and Amazon Music.
For more information on Magellan, please see the project website: https://sites.google.com/site/anhaidgroup/projects/magellan.
In order to use *Magellan*, we first need to install it. The easiest way is to use `pip` as follows (please consult the package website for other installation options).
try:
    import py_entitymatching as em
except ImportError:
    !pip install py_entitymatching
import sys
sys.path.append('/Users/pradap/Documents/Research/Python-Package/anhaid/deepmatcher')
import deepmatcher as dm
import py_entitymatching as em
import os
We first use Magellan to load the two input tables, which contain songs from iTunes and Amazon Music, from CSV files. These two tables are in the "sample_data" directory included in our package.
# The path to the two input tables.
path_A = os.path.join('.', 'sample_data', 'itunes-amazon', 'tableA.csv')
path_B = os.path.join('.', 'sample_data', 'itunes-amazon', 'tableB.csv')
# Load the two tables.
A = em.read_csv_metadata(path_A, key='id')
B = em.read_csv_metadata(path_B, key='id')
WARNING:py_entitymatching.io.parsers:Metadata file is not present in the given path; proceeding to read the csv file.
WARNING:py_entitymatching.io.parsers:Metadata file is not present in the given path; proceeding to read the csv file.
# Basic information about the tables.
print('Number of tuples in A: ' + str(len(A)))
print('Number of tuples in B: ' + str(len(B)))
print('Number of tuples in A X B (i.e the cartesian product): ' + str(len(A)*len(B)))
Number of tuples in A: 6907
Number of tuples in B: 55923
Number of tuples in A X B (i.e the cartesian product): 386260161
# The first few tuples in table A.
A.head()
id | Song_Name | Artist_Name | Album_Name | Genre | Price | CopyRight | Time | Released | |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | Runaway Train | Cam | Welcome to Cam Country - EP | Country , Music , Contemporary Country , Honky Tonk | $ 0.99 | 2015 Sony Music Entertainment | 3:01 | 31-Mar-15 |
1 | 1 | Track 14 | Omi | Me 4 U | Pop/Rock , Music , Pop , Dance , R&B / Soul | Album Only | 2015 Ultra Records , LLC under exclusive license to Columbia Records , a Division of Sony Music ... | 3:41 | NaN |
2 | 2 | Lips Are Movin | Meghan Trainor | Title | Pop , Music , Pop/Rock , Dance , Rock , Teen Pop | $ 1.29 | 2014 , 2015 Epic Records , a division of Sony Music Entertainment | 3:01 | 9-Jan-15 |
3 | 3 | I Want a Hippopotamus for Christmas | A Great Big World | I 'll Be Home For Christmas | Holiday , Music | $ 1.29 | Compilation ( P ) 2014 Epic Records , a division of Sony Music Entertainment | 2:20 | 24-Nov-14 |
4 | 4 | Credit | Meghan Trainor | Title ( Deluxe ) | Pop , Music , Rock , Pop/Rock , Dance , Teen Pop | $ 1.29 | 2014 , 2015 Epic Records , a division of Sony Music Entertainment | 2:51 | 9-Jan-15 |
# The first few tuples in table B.
B.head()
id | Song_Name | Artist_Name | Album_Name | Genre | Price | CopyRight | Time | Released | |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | Saxophone Stomp [ Explicit ] | Rusko | ! ( Volume 2 ) [ Explicit ] | Dance & Electronic , Dubstep | $ 1.29 | ( C ) 2014 FMLY Under Exclusive License To Universal Music Canada Inc. | 3:20 | September 16 , 2014 |
1 | 1 | I Wan na Mingle [ feat . Pusher ] | Rusko | ! ( Volume 2 ) [ Explicit ] | Dance & Electronic , Dubstep | $ 1.29 | ( C ) 2014 FMLY Under Exclusive License To Universal Music Canada Inc. | 2:36 | September 16 , 2014 |
2 | 2 | Lytah | Rusko | ! ( Volume 2 ) [ Explicit ] | Dance & Electronic , Dubstep | $ 1.29 | ( C ) 2014 FMLY Under Exclusive License To Universal Music Canada Inc. | 3:48 | September 16 , 2014 |
3 | 3 | Slappy Pappy | Rusko | ! ( Volume 2 ) [ Explicit ] | Dance & Electronic , Dubstep | $ 1.29 | ( C ) 2014 FMLY Under Exclusive License To Universal Music Canada Inc. | 3:55 | September 16 , 2014 |
4 | 4 | Fushion Jam | Rusko | ! ( Volume 2 ) [ Explicit ] | Dance & Electronic , Dubstep | $ 1.29 | ( C ) 2014 FMLY Under Exclusive License To Universal Music Canada Inc. | 1:48 | September 16 , 2014 |
We first perform blocking on A and B, removing obvious non-matching tuple pairs to get a candidate set K. Magellan supports four different types of blockers: (1) attribute equivalence, (2) overlap, (3) rule-based, and (4) black-box. Typically, users need to mix and match these blockers, together with the debugging functionality provided in Magellan, to get a good candidate set. (Developing a good blocker is not the focus of this tutorial. For more information on developing and debugging a blocker, please consult the Magellan user manual.)
Here we show an example of blocking. Observe that matching tuple pairs should have some common words in the album name, so we first create an overlap blocker on the attribute "Album_Name" with threshold 2, to remove all pairs whose word overlap on that attribute is less than 2.
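To make the blocker's semantics concrete, here is a minimal pure-Python sketch of the overlap test. The `word_overlap` helper is ours, not a Magellan API, and Magellan's actual tokenization and optional stop-word handling may differ:

```python
# Hypothetical sketch of the test an overlap blocker applies: a pair
# survives blocking when the two attribute values share at least
# `overlap_size` whitespace-delimited tokens (case-insensitive here).
def word_overlap(left, right):
    return len(set(left.lower().split()) & set(right.lower().split()))

# A matching pair of album names shares several tokens...
print(word_overlap('Welcome to Cam Country - EP', 'Welcome to Cam Country'))  # 4
# ...while an unrelated pair falls below a threshold of 2 and is dropped.
print(word_overlap('Title ( Deluxe )', 'Me 4 U') >= 2)  # False
```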
# Create an overlap blocker in Magellan and apply it to A and B to get the candidate set K1,
# which is a dataframe. The "l_output_attrs" and "r_output_attrs" parameters indicate the
# columns from A and B, respectively, that will be included in K1.
ob = em.OverlapBlocker()
K1 = ob.block_tables(A, B, 'Album_Name', 'Album_Name',
l_output_attrs=['Song_Name', 'Artist_Name', 'Album_Name', 'Genre', 'Price', 'CopyRight', 'Time', 'Released'],
r_output_attrs=['Song_Name', 'Artist_Name', 'Album_Name', 'Genre', 'Price', 'CopyRight', 'Time', 'Released'],
overlap_size=2)
0% [##############################] 100% | ETA: 00:00:00 Total time elapsed: 00:00:18
# The number of tuple pairs in K1.
len(K1)
3152021
# The first few tuple pairs in K1
K1.head()
_id | ltable_id | rtable_id | ltable_Song_Name | ltable_Artist_Name | ltable_Album_Name | ltable_Genre | ltable_Price | ltable_CopyRight | ltable_Time | ltable_Released | rtable_Song_Name | rtable_Artist_Name | rtable_Album_Name | rtable_Genre | rtable_Price | rtable_CopyRight | rtable_Time | rtable_Released | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3995 | 0 | Rudolph Blue | Brian Fechino | A Rock By The Sea Christmas : : Volume 2 | Holiday , Music , Rock | $ 0.99 | 2011 Hubbub ! | 1:57 | 8-Nov-11 | Saxophone Stomp [ Explicit ] | Rusko | ! ( Volume 2 ) [ Explicit ] | Dance & Electronic , Dubstep | $ 1.29 | ( C ) 2014 FMLY Under Exclusive License To Universal Music Canada Inc. | 3:20 | September 16 , 2014 |
1 | 1 | 3995 | 1 | Rudolph Blue | Brian Fechino | A Rock By The Sea Christmas : : Volume 2 | Holiday , Music , Rock | $ 0.99 | 2011 Hubbub ! | 1:57 | 8-Nov-11 | I Wan na Mingle [ feat . Pusher ] | Rusko | ! ( Volume 2 ) [ Explicit ] | Dance & Electronic , Dubstep | $ 1.29 | ( C ) 2014 FMLY Under Exclusive License To Universal Music Canada Inc. | 2:36 | September 16 , 2014 |
2 | 2 | 3995 | 2 | Rudolph Blue | Brian Fechino | A Rock By The Sea Christmas : : Volume 2 | Holiday , Music , Rock | $ 0.99 | 2011 Hubbub ! | 1:57 | 8-Nov-11 | Lytah | Rusko | ! ( Volume 2 ) [ Explicit ] | Dance & Electronic , Dubstep | $ 1.29 | ( C ) 2014 FMLY Under Exclusive License To Universal Music Canada Inc. | 3:48 | September 16 , 2014 |
3 | 3 | 3995 | 3 | Rudolph Blue | Brian Fechino | A Rock By The Sea Christmas : : Volume 2 | Holiday , Music , Rock | $ 0.99 | 2011 Hubbub ! | 1:57 | 8-Nov-11 | Slappy Pappy | Rusko | ! ( Volume 2 ) [ Explicit ] | Dance & Electronic , Dubstep | $ 1.29 | ( C ) 2014 FMLY Under Exclusive License To Universal Music Canada Inc. | 3:55 | September 16 , 2014 |
4 | 4 | 3995 | 4 | Rudolph Blue | Brian Fechino | A Rock By The Sea Christmas : : Volume 2 | Holiday , Music , Rock | $ 0.99 | 2011 Hubbub ! | 1:57 | 8-Nov-11 | Fushion Jam | Rusko | ! ( Volume 2 ) [ Explicit ] | Dance & Electronic , Dubstep | $ 1.29 | ( C ) 2014 FMLY Under Exclusive License To Universal Music Canada Inc. | 1:48 | September 16 , 2014 |
Notice that K1 has more than 3 million pairs in it, which is too large to be considered for matching. So we create an overlap blocker with threshold 1 on the attribute "Artist_Name" and apply it to K1, to filter out all pairs in K1 that don't share any word in "Artist_Name". This gives us the candidate set K2.
# Create a new overlap blocker to remove pairs from K1 that have no common word in "Artist_Name".
K2 = ob.block_candset(K1, 'Artist_Name', 'Artist_Name', overlap_size=1)
0% [##############################] 100% | ETA: 00:00:00 Total time elapsed: 00:00:29
# The number of tuple pairs in K2.
len(K2)
167180
After the second blocking step, we have a candidate set K2 with about 170K pairs. This is still a bit too large to consider for matching, so we apply a third blocker, an overlap blocker on the attribute "Song_Name", to further reduce the size of the candidate set.
# Apply the third overlap blocker.
K3 = ob.block_candset(K2, 'Song_Name', 'Song_Name', overlap_size=1)
0% [##############################] 100% | ETA: 00:00:00 Total time elapsed: 00:00:01
# The number of tuples pairs in K3.
len(K3)
38015
Now we have a candidate set with 38K pairs, which is a reasonable size, so we take K3 as the final candidate set. We save it to disk as a CSV file for future reuse.
path_K = os.path.join('.', 'sample_data', 'itunes-amazon', 'e2e-tutorial', 'candidate.csv')
K3.to_csv(path_K, index=False)
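As a quick sanity check (not part of the original tutorial code), we can compute the reduction ratio of the blocking pipeline, a standard blocking metric: the fraction of the cartesian product eliminated before matching. The sizes below are taken from the outputs shown earlier.

```python
# Table and candidate-set sizes reported earlier in this tutorial.
size_A, size_B = 6907, 55923
size_K3 = 38015

# Reduction ratio: fraction of A x B removed by the three blocking steps.
reduction_ratio = 1 - size_K3 / (size_A * size_B)
print(f'Reduction ratio: {reduction_ratio:.4%}')  # roughly 99.99% of pairs eliminated
```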
In this stage we will match the tuple pairs in the candidate set, predicting each of them as a match or a non-match. This is the part where deepmatcher is involved.
We first take a random sample S from the candidate set K using Magellan. Here, for example, we sample 500 pairs from K. Then we label each pair as match (enter 1) or non-match (enter 0) and use S as the training data.
# Take a sample of 500 pairs from the candidate set.
S = em.sample_table(K3, 500)
# Label the sample S in a GUI. Enter 1 for match and 0 for non-match.
G = em.label_table(S, 'gold')
For the purposes of this tutorial, we will load in a pre-labeled dataset (of 539 tuple pairs) included in this package.
# The path to the labeled data file.
path_G = os.path.join('.', 'sample_data', 'itunes-amazon', 'e2e-tutorial', 'gold.csv')
# Load the labeled data into a dataframe.
G = em.read_csv_metadata(path_G,
key='_id',
ltable=A, rtable=B,
fk_ltable='ltable_id', fk_rtable='rtable_id')
print('Number of labeled pairs:', len(G))
WARNING:py_entitymatching.io.parsers:Metadata file is not present in the given path; proceeding to read the csv file.
Number of labeled pairs: 539
Once we have the labeled data, we use deepmatcher to train a classifier. The first thing we need to do is split the data for training. In this example, we split the labeled data into three parts: training, validation, and test data, with a ratio of 3:1:1. (For now we only support splitting the labeled data into the three parts train/valid/test, where the validation set is used for selecting the best model across the training epochs.) For the purpose of caching data and progressive training, we first save the split parts to disk as CSV files, then load them back in. A cache file is saved during the loading procedure. For subsequent training runs, the cache file will be used to save preprocessing time on the raw CSV files, unless the CSV files have been modified (in which case a new cache file will be generated).
# The directory where the data splits will be saved.
split_path = os.path.join('.', 'sample_data', 'itunes-amazon', 'e2e-tutorial')
# Split labeled data into train, valid, and test csv files to disk, with the split ratio of 3:1:1.
dm.data.split(G, split_path, 'train.csv', 'valid.csv', 'test.csv',
[3, 1, 1])
# Load the training data files from the disk. Ignore the "left_id" and "right_id"
# columns for data preprocessing.
# The 'use_magellan_convention' parameter asks deepmatcher to use Magellan's
# naming convention for the left and right table column name prefixes
# ("ltable_", and "rtable_"), and also to consider "_id" as the ID column.
train, validation, test = dm.data.process(
path=os.path.join('.', 'sample_data', 'itunes-amazon', 'e2e-tutorial'),
cache='train_cache.pth',
train='train.csv',
validation='valid.csv',
test='test.csv',
use_magellan_convention=True,
ignore_columns=('ltable_id', 'rtable_id'))
WARNING:deepmatcher.data.dataset:Rebuilding data cache because: ['One or more data files have been modified.'] Reading and processing data from "./sample_data/itunes-amazon/e2e-tutorial/train.csv" 0% [############################# ] 100% | ETA: 00:00:00 Reading and processing data from "./sample_data/itunes-amazon/e2e-tutorial/valid.csv" 0% [############################# ] 100% | ETA: 00:00:00 Reading and processing data from "./sample_data/itunes-amazon/e2e-tutorial/test.csv" 0% [############################# ] 100% | ETA: 00:00:00 Building vocabulary 0% [#] 100% | ETA: 00:00:00 Total time elapsed: 00:00:00 Computing principal components 0% [#] 100% | ETA: 00:00:00 Total time elapsed: 00:00:00
After we get the training data, we can use deepmatcher to train a classifier. Here we train a hybrid model.
# Create a hybrid model.
model = dm.MatchingModel(attr_summarizer='hybrid')
# Train the hybrid model for 10 epochs, with a batch size of 16 and a
# positive-to-negative ratio of 3. We save the best model (with the
# highest F1 score on the validation set) to 'hybrid_model.pth'.
model.run_train(
train,
validation,
epochs=10,
batch_size=16,
best_save_path='hybrid_model.pth',
pos_neg_ratio=3)
* Number of trainable parameters: 17757810 ===> TRAIN Epoch 1
0% [████] 100% | ETA: 00:00:00 Total time elapsed: 00:00:04
Finished Epoch 1 || Run Time: 4.2 | Load Time: 0.4 || F1: 39.60 | Prec: 31.25 | Rec: 54.05 || Ex/s: 69.89 ===> EVAL Epoch 1
0% [█] 100% | ETA: 00:00:00 Total time elapsed: 00:00:00
Finished Epoch 1 || Run Time: 0.6 | Load Time: 0.2 || F1: 60.61 | Prec: 50.00 | Rec: 76.92 || Ex/s: 146.30 * Best F1: 60.60606060606061 Saving best model... Done. --------------------- ===> TRAIN Epoch 2
0% [████] 100% | ETA: 00:00:00 Total time elapsed: 00:00:04
Finished Epoch 2 || Run Time: 3.9 | Load Time: 0.4 || F1: 65.62 | Prec: 53.39 | Rec: 85.14 || Ex/s: 75.12 ===> EVAL Epoch 2
0% [█] 100% | ETA: 00:00:00 Total time elapsed: 00:00:00
Finished Epoch 2 || Run Time: 0.6 | Load Time: 0.2 || F1: 76.00 | Prec: 79.17 | Rec: 73.08 || Ex/s: 146.34 * Best F1: 76.0 Saving best model... Done. --------------------- ===> TRAIN Epoch 3
0% [████] 100% | ETA: 00:00:00 Total time elapsed: 00:00:03
Finished Epoch 3 || Run Time: 3.8 | Load Time: 0.4 || F1: 87.27 | Prec: 79.12 | Rec: 97.30 || Ex/s: 76.32 ===> EVAL Epoch 3
0% [█] 100% | ETA: 00:00:00 Total time elapsed: 00:00:00
Finished Epoch 3 || Run Time: 0.6 | Load Time: 0.2 || F1: 84.21 | Prec: 77.42 | Rec: 92.31 || Ex/s: 143.25 * Best F1: 84.21052631578948 Saving best model... Done. --------------------- ===> TRAIN Epoch 4
0% [████] 100% | ETA: 00:00:00 Total time elapsed: 00:00:03
Finished Epoch 4 || Run Time: 3.8 | Load Time: 0.4 || F1: 90.57 | Prec: 84.71 | Rec: 97.30 || Ex/s: 76.22 ===> EVAL Epoch 4
0% [█] 100% | ETA: 00:00:00 Total time elapsed: 00:00:00
Finished Epoch 4 || Run Time: 0.6 | Load Time: 0.2 || F1: 82.76 | Prec: 75.00 | Rec: 92.31 || Ex/s: 146.11 --------------------- ===> TRAIN Epoch 5
0% [████] 100% | ETA: 00:00:00 Total time elapsed: 00:00:04
Finished Epoch 5 || Run Time: 3.9 | Load Time: 0.4 || F1: 97.37 | Prec: 94.87 | Rec: 100.00 || Ex/s: 74.96 ===> EVAL Epoch 5
0% [█] 100% | ETA: 00:00:00 Total time elapsed: 00:00:00
Finished Epoch 5 || Run Time: 0.6 | Load Time: 0.2 || F1: 90.57 | Prec: 88.89 | Rec: 92.31 || Ex/s: 146.11 * Best F1: 90.56603773584906 Saving best model... Done. --------------------- ===> TRAIN Epoch 6
0% [████] 100% | ETA: 00:00:00 Total time elapsed: 00:00:04
Finished Epoch 6 || Run Time: 3.9 | Load Time: 0.4 || F1: 98.67 | Prec: 97.37 | Rec: 100.00 || Ex/s: 74.94 ===> EVAL Epoch 6
0% [█] 100% | ETA: 00:00:00 Total time elapsed: 00:00:00
Finished Epoch 6 || Run Time: 0.6 | Load Time: 0.2 || F1: 90.20 | Prec: 92.00 | Rec: 88.46 || Ex/s: 145.73 --------------------- ===> TRAIN Epoch 7
0% [████] 100% | ETA: 00:00:00 Total time elapsed: 00:00:03
Finished Epoch 7 || Run Time: 3.8 | Load Time: 0.4 || F1: 99.33 | Prec: 98.67 | Rec: 100.00 || Ex/s: 75.88 ===> EVAL Epoch 7
0% [█] 100% | ETA: 00:00:00 Total time elapsed: 00:00:00
Finished Epoch 7 || Run Time: 0.6 | Load Time: 0.2 || F1: 90.20 | Prec: 92.00 | Rec: 88.46 || Ex/s: 145.56 --------------------- ===> TRAIN Epoch 8
0% [████] 100% | ETA: 00:00:00 Total time elapsed: 00:00:03
Finished Epoch 8 || Run Time: 3.9 | Load Time: 0.4 || F1: 99.33 | Prec: 98.67 | Rec: 100.00 || Ex/s: 75.16 ===> EVAL Epoch 8
0% [█] 100% | ETA: 00:00:00 Total time elapsed: 00:00:00
Finished Epoch 8 || Run Time: 0.6 | Load Time: 0.2 || F1: 92.00 | Prec: 95.83 | Rec: 88.46 || Ex/s: 145.76 * Best F1: 92.00000000000001 Saving best model... Done. --------------------- ===> TRAIN Epoch 9
0% [████] 100% | ETA: 00:00:00 Total time elapsed: 00:00:03
Finished Epoch 9 || Run Time: 3.7 | Load Time: 0.4 || F1: 99.33 | Prec: 98.67 | Rec: 100.00 || Ex/s: 77.50 ===> EVAL Epoch 9
0% [█] 100% | ETA: 00:00:00 Total time elapsed: 00:00:00
Finished Epoch 9 || Run Time: 0.6 | Load Time: 0.2 || F1: 92.00 | Prec: 95.83 | Rec: 88.46 || Ex/s: 145.27 --------------------- ===> TRAIN Epoch 10
0% [████] 100% | ETA: 00:00:00 Total time elapsed: 00:00:04
Finished Epoch 10 || Run Time: 3.9 | Load Time: 0.4 || F1: 99.33 | Prec: 98.67 | Rec: 100.00 || Ex/s: 75.28 ===> EVAL Epoch 10
0% [█] 100% | ETA: 00:00:00 Total time elapsed: 00:00:00
Finished Epoch 10 || Run Time: 0.6 | Load Time: 0.2 || F1: 92.00 | Prec: 95.83 | Rec: 88.46 || Ex/s: 145.95 --------------------- Loading best model... Training done.
92.00000000000001
Once we have the trained classifier, we can evaluate its accuracy on the test data.
# Evaluate the accuracy on the test data.
model.run_eval(test)
===> EVAL Epoch 8 Finished Epoch 8 || Run Time: 0.3 | Load Time: 0.2 || F1: 92.31 | Prec: 90.91 | Rec: 93.75 || Ex/s: 208.04
92.3076923076923
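As a quick check on the numbers above, F1 is the harmonic mean of precision and recall. The `f1` helper below is ours, not a deepmatcher API; it reproduces the test-set score from the reported precision and recall:

```python
# F1 as the harmonic mean of precision and recall.
def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

# Test-set numbers reported above: Prec 90.91, Rec 93.75.
print(round(f1(90.91, 93.75), 2))  # 92.31
```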
Now that we have a trained classifier, we can apply it to the candidate set to predict each pair as a match or a non-match. To achieve this, we first load the trained model and the candidate set.
# Load the model.
model = dm.MatchingModel(attr_summarizer='hybrid')
model.load_state('hybrid_model.pth')
# Load the candidate set. Note that the trained model is an input parameter, as we need the
# trained model for candidate set preprocessing.
candidate = dm.data.process_unlabeled(
path=os.path.join('.', 'sample_data', 'itunes-amazon', 'e2e-tutorial', 'candidate.csv'),
trained_model=model,
ignore_columns=('ltable_id', 'rtable_id'))
Reading and processing data from "./sample_data/itunes-amazon/e2e-tutorial/candidate.csv" 0% [##############################] 100% | ETA: 00:00:00
After we load the candidate set, we make the predictions using the run_prediction
function. The output_attributes argument indicates the columns that will be included in the prediction table. Here
we use all of the columns in the candidate set.
# Predict the pairs in the candidate set and return a dataframe containing the pair id with
# the score of being a match.
predictions = model.run_prediction(candidate, output_attributes=list(candidate.get_raw_table().columns))
===> PREDICT Epoch 8
0% [██████████████████████████████] 100% | ETA: 00:00:00 Total time elapsed: 00:02:18
Finished Epoch 8 || Run Time: 97.2 | Load Time: 41.5 || F1: 0.00 | Prec: 0.00 | Rec: 0.00 || Ex/s: 0.00
Below are the first few pairs in the prediction table. Note that this table is not sorted in descending order of match score.
predictions.head()
match_score | ltable_id | rtable_id | ltable_Song_Name | ltable_Artist_Name | ltable_Album_Name | ltable_Genre | ltable_Price | ltable_CopyRight | ltable_Time | ltable_Released | rtable_Song_Name | rtable_Artist_Name | rtable_Album_Name | rtable_Genre | rtable_Price | rtable_CopyRight | rtable_Time | rtable_Released | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
_id | |||||||||||||||||||
3237 | 0.960461 | 2583 | 228 | #Selfie ( Instrumental Mix ) | The Chainsmokers | #Selfie ( Instrumental Mix ) - Single | Dance , Music , Electronic | $ 1.29 | 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordings , Inc. | 3:03 | 20-Mar-14 | #SELFIE ( Instrumental Mix ) | The Chainsmokers | #SELFIE ( Instrumental Mix ) | Dance & Electronic | $ 1.29 | ( C ) 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordin... | 3:03 | March 20 , 2014 |
3274 | 0.294473 | 2003 | 229 | #SELFIE ( Caked Up Remix ) | The Chainsmokers | #SELFIE ( The Remixes ) - Single | Electronic , Music , Dance | $ 1.29 | 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordings , I... | 3:15 | 17-Jun-14 | #Selfie ( Will Sparks Remix ) | The Chainsmokers | #Selfie ( The Remixes ) | Electronica , Dance & Electronic | $ 1.29 | ( C ) 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordin... | 3:52 | June 17 , 2014 |
3275 | 0.294473 | 2574 | 229 | #SELFIE ( Caked Up Remix ) | The Chainsmokers | #SELFIE ( The Remixes ) - Single | Electronic , Music , Dance | $ 1.29 | 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordings , I... | 3:15 | 17-Jun-14 | #Selfie ( Will Sparks Remix ) | The Chainsmokers | #Selfie ( The Remixes ) | Electronica , Dance & Electronic | $ 1.29 | ( C ) 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordin... | 3:52 | June 17 , 2014 |
3276 | 0.294473 | 2575 | 229 | #SELFIE ( Caked Up Remix ) | The Chainsmokers | #SELFIE ( The Remixes ) - Single | Electronic , Music , Dance | $ 1.29 | 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordings , I... | 3:15 | 17-Jun-14 | #Selfie ( Will Sparks Remix ) | The Chainsmokers | #Selfie ( The Remixes ) | Electronica , Dance & Electronic | $ 1.29 | ( C ) 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordin... | 3:52 | June 17 , 2014 |
3359 | 0.161323 | 2003 | 230 | #SELFIE ( Caked Up Remix ) | The Chainsmokers | #SELFIE ( The Remixes ) - Single | Electronic , Music , Dance | $ 1.29 | 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordings , I... | 3:15 | 17-Jun-14 | #Selfie ( Botnek Remix ) | The Chainsmokers | #Selfie ( The Remixes ) | Electronica , Dance & Electronic | $ 1.29 | ( C ) 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordin... | 3:36 | June 17 , 2014 |
We can manipulate the prediction table above to get the pairs we want. Suppose we are only interested in pairs with a match score over 0.9, and we want to display them sorted in descending order of match score. This can be achieved as follows.
high_score_pairs = predictions[predictions['match_score'] >= 0.9].sort_values(by=['match_score'], ascending=False)
high_score_pairs.head()
match_score | ltable_id | rtable_id | ltable_Song_Name | ltable_Artist_Name | ltable_Album_Name | ltable_Genre | ltable_Price | ltable_CopyRight | ltable_Time | ltable_Released | rtable_Song_Name | rtable_Artist_Name | rtable_Album_Name | rtable_Genre | rtable_Price | rtable_CopyRight | rtable_Time | rtable_Released | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
_id | |||||||||||||||||||
1472466 | 0.995653 | 2462 | 30798 | It 's Never Easy to Say Goodbye | Kenny Chesney | Me and You | Country , Music , Urban Cowboy , Contemporary Country | $ 0.99 | 1996 BMG Entertainment | 4:44 | 1-Jan-90 | It 's Never Easy To Say Goodbye | Kenny Chesney | Me And You | Country | $ 0.99 | ( C ) 2011 MPL Communications Ltd/Inc under exclusive license to StarCon LLC | 4:44 | January 1 , 1990 |
1472465 | 0.995653 | 2453 | 30798 | It 's Never Easy to Say Goodbye | Kenny Chesney | Me and You | Country , Music , Urban Cowboy , Contemporary Country | $ 0.99 | 1996 BMG Entertainment | 4:44 | 1-Jan-90 | It 's Never Easy To Say Goodbye | Kenny Chesney | Me And You | Country | $ 0.99 | ( C ) 2011 MPL Communications Ltd/Inc under exclusive license to StarCon LLC | 4:44 | January 1 , 1990 |
1472464 | 0.995653 | 2438 | 30798 | It 's Never Easy to Say Goodbye | Kenny Chesney | Me and You | Country , Music , Urban Cowboy , Contemporary Country | $ 0.99 | 1996 BMG Entertainment | 4:44 | 1-Jan-90 | It 's Never Easy To Say Goodbye | Kenny Chesney | Me And You | Country | $ 0.99 | ( C ) 2011 MPL Communications Ltd/Inc under exclusive license to StarCon LLC | 4:44 | January 1 , 1990 |
1472463 | 0.995653 | 2435 | 30798 | It 's Never Easy to Say Goodbye | Kenny Chesney | Me and You | Country , Music , Urban Cowboy , Contemporary Country | $ 0.99 | 1996 BMG Entertainment | 4:44 | 1-Jan-90 | It 's Never Easy To Say Goodbye | Kenny Chesney | Me And You | Country | $ 0.99 | ( C ) 2011 MPL Communications Ltd/Inc under exclusive license to StarCon LLC | 4:44 | January 1 , 1990 |
1472462 | 0.995653 | 2430 | 30798 | It 's Never Easy to Say Goodbye | Kenny Chesney | Me and You | Country , Music , Urban Cowboy , Contemporary Country | $ 0.99 | 1996 BMG Entertainment | 4:44 | 1-Jan-90 | It 's Never Easy To Say Goodbye | Kenny Chesney | Me And You | Country | $ 0.99 | ( C ) 2011 MPL Communications Ltd/Inc under exclusive license to StarCon LLC | 4:44 | January 1 , 1990 |
As you have seen, deepmatcher does not include a column indicating match / non-match predictions in the output table. You can easily create such a column by thresholding the match_score at 0.5 (which is what deepmatcher uses as the threshold to compute F1), or using a threshold value of your choice.
predictions['match_prediction'] = predictions['match_score'].apply(lambda score: 1 if score >= 0.5 else 0)
# Reorder columns to avoid scrolling...
predictions = predictions[['match_score', 'match_prediction'] + predictions.columns.values[1:-1].tolist()]
predictions.head()
match_score | match_prediction | ltable_id | rtable_id | ltable_Song_Name | ltable_Artist_Name | ltable_Album_Name | ltable_Genre | ltable_Price | ltable_CopyRight | ltable_Time | ltable_Released | rtable_Song_Name | rtable_Artist_Name | rtable_Album_Name | rtable_Genre | rtable_Price | rtable_CopyRight | rtable_Time | rtable_Released | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
_id | ||||||||||||||||||||
3237 | 0.960461 | 1 | 2583 | 228 | #Selfie ( Instrumental Mix ) | The Chainsmokers | #Selfie ( Instrumental Mix ) - Single | Dance , Music , Electronic | $ 1.29 | 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordings , Inc. | 3:03 | 20-Mar-14 | #SELFIE ( Instrumental Mix ) | The Chainsmokers | #SELFIE ( Instrumental Mix ) | Dance & Electronic | $ 1.29 | ( C ) 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordin... | 3:03 | March 20 , 2014 |
3274 | 0.294473 | 0 | 2003 | 229 | #SELFIE ( Caked Up Remix ) | The Chainsmokers | #SELFIE ( The Remixes ) - Single | Electronic , Music , Dance | $ 1.29 | 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordings , I... | 3:15 | 17-Jun-14 | #Selfie ( Will Sparks Remix ) | The Chainsmokers | #Selfie ( The Remixes ) | Electronica , Dance & Electronic | $ 1.29 | ( C ) 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordin... | 3:52 | June 17 , 2014 |
3275 | 0.294473 | 0 | 2574 | 229 | #SELFIE ( Caked Up Remix ) | The Chainsmokers | #SELFIE ( The Remixes ) - Single | Electronic , Music , Dance | $ 1.29 | 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordings , I... | 3:15 | 17-Jun-14 | #Selfie ( Will Sparks Remix ) | The Chainsmokers | #Selfie ( The Remixes ) | Electronica , Dance & Electronic | $ 1.29 | ( C ) 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordin... | 3:52 | June 17 , 2014 |
3276 | 0.294473 | 0 | 2575 | 229 | #SELFIE ( Caked Up Remix ) | The Chainsmokers | #SELFIE ( The Remixes ) - Single | Electronic , Music , Dance | $ 1.29 | 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordings , I... | 3:15 | 17-Jun-14 | #Selfie ( Will Sparks Remix ) | The Chainsmokers | #Selfie ( The Remixes ) | Electronica , Dance & Electronic | $ 1.29 | ( C ) 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordin... | 3:52 | June 17 , 2014 |
3359 | 0.161323 | 0 | 2003 | 230 | #SELFIE ( Caked Up Remix ) | The Chainsmokers | #SELFIE ( The Remixes ) - Single | Electronic , Music , Dance | $ 1.29 | 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordings , I... | 3:15 | 17-Jun-14 | #Selfie ( Botnek Remix ) | The Chainsmokers | #Selfie ( The Remixes ) | Electronica , Dance & Electronic | $ 1.29 | ( C ) 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordin... | 3:36 | June 17 , 2014 |
# Reset index as Magellan requires the key to be a column in the table
predictions.reset_index(inplace=True)
# Update metadata in the catalog. This information can later be used by triggers to modify the labels from
# the learning-based matcher
em.set_key(predictions, '_id')
em.set_fk_ltable(predictions, 'ltable_id')
em.set_fk_rtable(predictions, 'rtable_id')
em.set_ltable(predictions, A)
em.set_rtable(predictions, B)
True