As part of the *Magellan* ecosystem, this tutorial shows how to use deepmatcher together with Magellan to perform an end-to-end Entity Matching (EM) task. Given two tables, Magellan performs EM in two stages: a blocking stage that removes obviously non-matching tuple pairs to produce a candidate set, and a matching stage that predicts each remaining pair as a match or a non-match. deepmatcher fits into the matching stage, using deep learning to learn the classifier. For the rest of the tutorial, we will walk through this workflow on a real example: matching songs across two tables from iTunes and Amazon Music.
For more information on Magellan, please see the project website: https://sites.google.com/site/anhaidgroup/projects/magellan.
In order to use *Magellan*, we first need to install it. The easiest way is to use `pip` as follows (please consult the package website for other installation options).
try:
    import py_entitymatching as em
except ImportError:
    !pip install py_entitymatching
import sys
sys.path.append('/Users/pradap/Documents/Research/Python-Package/anhaid/deepmatcher')
import deepmatcher as dm
import py_entitymatching as em
import os
We first use Magellan to load the two input tables, which contain songs from iTunes and Amazon Music, from CSV files. These two tables are in the "sample_data" directory included in our package.
# The path to the two input tables.
path_A = os.path.join('.', 'sample_data', 'itunes-amazon', 'tableA.csv')
path_B = os.path.join('.', 'sample_data', 'itunes-amazon', 'tableB.csv')
# Load the two tables.
A = em.read_csv_metadata(path_A, key='id')
B = em.read_csv_metadata(path_B, key='id')
WARNING:py_entitymatching.io.parsers:Metadata file is not present in the given path; proceeding to read the csv file.
WARNING:py_entitymatching.io.parsers:Metadata file is not present in the given path; proceeding to read the csv file.
# Basic information about the tables.
print('Number of tuples in A: ' + str(len(A)))
print('Number of tuples in B: ' + str(len(B)))
print('Number of tuples in A X B (i.e the cartesian product): ' + str(len(A)*len(B)))
Number of tuples in A: 6907
Number of tuples in B: 55923
Number of tuples in A X B (i.e the cartesian product): 386260161
# The first few tuples in table A.
A.head()
id | Song_Name | Artist_Name | Album_Name | Genre | Price | CopyRight | Time | Released | |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | Runaway Train | Cam | Welcome to Cam Country - EP | Country , Music , Contemporary Country , Honky Tonk | $ 0.99 | 2015 Sony Music Entertainment | 3:01 | 31-Mar-15 |
1 | 1 | Track 14 | Omi | Me 4 U | Pop/Rock , Music , Pop , Dance , R&B / Soul | Album Only | 2015 Ultra Records , LLC under exclusive license to Columbia Records , a Division of Sony Music ... | 3:41 | NaN |
2 | 2 | Lips Are Movin | Meghan Trainor | Title | Pop , Music , Pop/Rock , Dance , Rock , Teen Pop | $ 1.29 | 2014 , 2015 Epic Records , a division of Sony Music Entertainment | 3:01 | 9-Jan-15 |
3 | 3 | I Want a Hippopotamus for Christmas | A Great Big World | I 'll Be Home For Christmas | Holiday , Music | $ 1.29 | Compilation ( P ) 2014 Epic Records , a division of Sony Music Entertainment | 2:20 | 24-Nov-14 |
4 | 4 | Credit | Meghan Trainor | Title ( Deluxe ) | Pop , Music , Rock , Pop/Rock , Dance , Teen Pop | $ 1.29 | 2014 , 2015 Epic Records , a division of Sony Music Entertainment | 2:51 | 9-Jan-15 |
# The first few tuples in table B.
B.head()
id | Song_Name | Artist_Name | Album_Name | Genre | Price | CopyRight | Time | Released | |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | Saxophone Stomp [ Explicit ] | Rusko | ! ( Volume 2 ) [ Explicit ] | Dance & Electronic , Dubstep | $ 1.29 | ( C ) 2014 FMLY Under Exclusive License To Universal Music Canada Inc. | 3:20 | September 16 , 2014 |
1 | 1 | I Wan na Mingle [ feat . Pusher ] | Rusko | ! ( Volume 2 ) [ Explicit ] | Dance & Electronic , Dubstep | $ 1.29 | ( C ) 2014 FMLY Under Exclusive License To Universal Music Canada Inc. | 2:36 | September 16 , 2014 |
2 | 2 | Lytah | Rusko | ! ( Volume 2 ) [ Explicit ] | Dance & Electronic , Dubstep | $ 1.29 | ( C ) 2014 FMLY Under Exclusive License To Universal Music Canada Inc. | 3:48 | September 16 , 2014 |
3 | 3 | Slappy Pappy | Rusko | ! ( Volume 2 ) [ Explicit ] | Dance & Electronic , Dubstep | $ 1.29 | ( C ) 2014 FMLY Under Exclusive License To Universal Music Canada Inc. | 3:55 | September 16 , 2014 |
4 | 4 | Fushion Jam | Rusko | ! ( Volume 2 ) [ Explicit ] | Dance & Electronic , Dubstep | $ 1.29 | ( C ) 2014 FMLY Under Exclusive License To Universal Music Canada Inc. | 1:48 | September 16 , 2014 |
We first perform blocking on A and B, removing obvious non-matching tuple pairs to get a candidate set K. Magellan supports four different types of blockers: (1) attribute equivalence, (2) overlap, (3) rule-based, and (4) black-box. Typically, users need to mix and match these blockers, together with the debugging functionality provided in Magellan, to get a good candidate set. (Developing a good blocker is not the focus of this tutorial. For more information on developing and debugging a blocker, please consult the Magellan user manual.)
Here we show an example of blocking. Observe that matching tuple pairs should have some common words in the album name, so we first create an overlap blocker on the attribute "Album_Name" with threshold 2, to remove all pairs whose word overlap on that attribute is less than 2.
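To make the blocker's semantics concrete, here is a minimal pure-Python sketch of the overlap test. The `word_overlap` helper is ours, not a Magellan API, and Magellan's actual tokenization and optional stop-word handling may differ:

```python
# Hypothetical sketch of the test an overlap blocker applies: a pair
# survives blocking when the two attribute values share at least
# `overlap_size` whitespace-delimited tokens (case-insensitive here).
def word_overlap(left, right):
    return len(set(left.lower().split()) & set(right.lower().split()))

# A matching pair of album names shares several tokens...
print(word_overlap('Welcome to Cam Country - EP', 'Welcome to Cam Country'))  # 4
# ...while an unrelated pair falls below a threshold of 2 and is dropped.
print(word_overlap('Title ( Deluxe )', 'Me 4 U') >= 2)  # False
```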
# Create an overlap blocker in Magellan and apply it to A and B to get the candidate set K1,
# which is a dataframe. The "l_output_attrs" and "r_output_attrs" parameters indicate the
# columns from A and B, respectively, that will be included in K1.
ob = em.OverlapBlocker()
K1 = ob.block_tables(A, B, 'Album_Name', 'Album_Name',
l_output_attrs=['Song_Name', 'Artist_Name', 'Album_Name', 'Genre', 'Price', 'CopyRight', 'Time', 'Released'],
r_output_attrs=['Song_Name', 'Artist_Name', 'Album_Name', 'Genre', 'Price', 'CopyRight', 'Time', 'Released'],
overlap_size=2)
0% [##############################] 100% | ETA: 00:00:00 Total time elapsed: 00:00:18
# The number of tuple pairs in K1.
len(K1)
3152021
# The first few tuple pairs in K1
K1.head()
_id | ltable_id | rtable_id | ltable_Song_Name | ltable_Artist_Name | ltable_Album_Name | ltable_Genre | ltable_Price | ltable_CopyRight | ltable_Time | ltable_Released | rtable_Song_Name | rtable_Artist_Name | rtable_Album_Name | rtable_Genre | rtable_Price | rtable_CopyRight | rtable_Time | rtable_Released | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3995 | 0 | Rudolph Blue | Brian Fechino | A Rock By The Sea Christmas : : Volume 2 | Holiday , Music , Rock | $ 0.99 | 2011 Hubbub ! | 1:57 | 8-Nov-11 | Saxophone Stomp [ Explicit ] | Rusko | ! ( Volume 2 ) [ Explicit ] | Dance & Electronic , Dubstep | $ 1.29 | ( C ) 2014 FMLY Under Exclusive License To Universal Music Canada Inc. | 3:20 | September 16 , 2014 |
1 | 1 | 3995 | 1 | Rudolph Blue | Brian Fechino | A Rock By The Sea Christmas : : Volume 2 | Holiday , Music , Rock | $ 0.99 | 2011 Hubbub ! | 1:57 | 8-Nov-11 | I Wan na Mingle [ feat . Pusher ] | Rusko | ! ( Volume 2 ) [ Explicit ] | Dance & Electronic , Dubstep | $ 1.29 | ( C ) 2014 FMLY Under Exclusive License To Universal Music Canada Inc. | 2:36 | September 16 , 2014 |
2 | 2 | 3995 | 2 | Rudolph Blue | Brian Fechino | A Rock By The Sea Christmas : : Volume 2 | Holiday , Music , Rock | $ 0.99 | 2011 Hubbub ! | 1:57 | 8-Nov-11 | Lytah | Rusko | ! ( Volume 2 ) [ Explicit ] | Dance & Electronic , Dubstep | $ 1.29 | ( C ) 2014 FMLY Under Exclusive License To Universal Music Canada Inc. | 3:48 | September 16 , 2014 |
3 | 3 | 3995 | 3 | Rudolph Blue | Brian Fechino | A Rock By The Sea Christmas : : Volume 2 | Holiday , Music , Rock | $ 0.99 | 2011 Hubbub ! | 1:57 | 8-Nov-11 | Slappy Pappy | Rusko | ! ( Volume 2 ) [ Explicit ] | Dance & Electronic , Dubstep | $ 1.29 | ( C ) 2014 FMLY Under Exclusive License To Universal Music Canada Inc. | 3:55 | September 16 , 2014 |
4 | 4 | 3995 | 4 | Rudolph Blue | Brian Fechino | A Rock By The Sea Christmas : : Volume 2 | Holiday , Music , Rock | $ 0.99 | 2011 Hubbub ! | 1:57 | 8-Nov-11 | Fushion Jam | Rusko | ! ( Volume 2 ) [ Explicit ] | Dance & Electronic , Dubstep | $ 1.29 | ( C ) 2014 FMLY Under Exclusive License To Universal Music Canada Inc. | 1:48 | September 16 , 2014 |
Notice that K1 has more than 3 million pairs in it, which is too large to be considered for matching. So we create an overlap blocker with threshold 1 on the attribute "Artist_Name" and apply it to K1, to filter out all pairs in K1 that don't share any word in "Artist_Name". This gives us the candidate set K2.
# Create a new overlap blocker to remove pairs from K1 that have no common word in "Artist_Name".
K2 = ob.block_candset(K1, 'Artist_Name', 'Artist_Name', overlap_size=1)
0% [##############################] 100% | ETA: 00:00:00 Total time elapsed: 00:00:29
# The number of tuple pairs in K2.
len(K2)
167180
After the second blocking step, we have a candidate set K2 with about 170K pairs. This is still a bit too large to consider for matching, so we apply a third blocker, an overlap blocker on the attribute "Song_Name", to further reduce the size of the candidate set.
# Apply the third overlap blocker.
K3 = ob.block_candset(K2, 'Song_Name', 'Song_Name', overlap_size=1)
0% [##############################] 100% | ETA: 00:00:00 Total time elapsed: 00:00:01
# The number of tuples pairs in K3.
len(K3)
38015
Now we have a candidate set with 38K pairs, which is a reasonable size, so we take K3 as the final candidate set. We save it to disk as a CSV file for future reuse.
path_K = os.path.join('.', 'sample_data', 'itunes-amazon', 'e2e-tutorial', 'candidate.csv')
K3.to_csv(path_K, index=False)
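As a quick sanity check (not part of the original tutorial code), we can compute the reduction ratio of the blocking pipeline, a standard blocking metric: the fraction of the cartesian product eliminated before matching. The sizes below are taken from the outputs shown earlier.

```python
# Table and candidate-set sizes reported earlier in this tutorial.
size_A, size_B = 6907, 55923
size_K3 = 38015

# Reduction ratio: fraction of A x B removed by the three blocking steps.
reduction_ratio = 1 - size_K3 / (size_A * size_B)
print(f'Reduction ratio: {reduction_ratio:.4%}')  # roughly 99.99% of pairs eliminated
```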
In this stage we will match the tuple pairs in the candidate set, predicting each of them as a match or a non-match. This is the part where deepmatcher is involved.
We first take a random sample S from the candidate set K using Magellan. Here, for example, we sample 500 pairs from K. Then we label each pair as match (enter 1) or non-match (enter 0) and use S as the training data.
# Take a sample of 500 pairs from the candidate set.
S = em.sample_table(K3, 500)
# Label the sample S in a GUI. Enter 1 for match and 0 for non-match.
G = em.label_table(S, 'gold')
For the purposes of this tutorial, we will load in a pre-labeled dataset (of 539 tuple pairs) included in this package.
# The path to the labeled data file.
path_G = os.path.join('.', 'sample_data', 'itunes-amazon', 'e2e-tutorial', 'gold.csv')
# Load the labeled data into a dataframe.
G = em.read_csv_metadata(path_G,
key='_id',
ltable=A, rtable=B,
fk_ltable='ltable_id', fk_rtable='rtable_id')
print('Number of labeled pairs:', len(G))
WARNING:py_entitymatching.io.parsers:Metadata file is not present in the given path; proceeding to read the csv file.
Number of labeled pairs: 539
Once we have the labeled data, we use deepmatcher to train a classifier. The first thing we need to do is split the data for training. In this example, we split the labeled data into three parts: training, validation, and test data, with a ratio of 3:1:1. (For now we only support splitting the labeled data into the three parts train/valid/test, where the validation set is used for selecting the best model across the training epochs.) For the purpose of caching data and progressive training, we first save the split parts to disk as CSV files, then load them back in. A cache file is saved during the loading procedure. For subsequent training runs, the cache file will be used to save preprocessing time on the raw CSV files, unless the CSV files have been modified (in which case a new cache file will be generated).
# The directory where the data splits will be saved.
split_path = os.path.join('.', 'sample_data', 'itunes-amazon', 'e2e-tutorial')
# Split labeled data into train, valid, and test csv files to disk, with the split ratio of 3:1:1.
dm.data.split(G, split_path, 'train.csv', 'valid.csv', 'test.csv',
[3, 1, 1])
# Load the training data files from the disk. Ignore the "left_id" and "right_id"
# columns for data preprocessing.
# The 'use_magellan_convention' parameter asks deepmatcher to use Magellan's
# naming convention for the left and right table column name prefixes
# ("ltable_", and "rtable_"), and also to consider "_id" as the ID column.
train, validation, test = dm.data.process(
path=os.path.join('.', 'sample_data', 'itunes-amazon', 'e2e-tutorial'),
cache='train_cache.pth',
train='train.csv',
validation='valid.csv',
test='test.csv',
use_magellan_convention=True,
ignore_columns=('ltable_id', 'rtable_id'))
WARNING:deepmatcher.data.dataset:Rebuilding data cache because: ['One or more data files have been modified.'] Reading and processing data from "./sample_data/itunes-amazon/e2e-tutorial/train.csv" 0% [############################# ] 100% | ETA: 00:00:00 Reading and processing data from "./sample_data/itunes-amazon/e2e-tutorial/valid.csv" 0% [############################# ] 100% | ETA: 00:00:00 Reading and processing data from "./sample_data/itunes-amazon/e2e-tutorial/test.csv" 0% [############################# ] 100% | ETA: 00:00:00 Building vocabulary 0% [#] 100% | ETA: 00:00:00 Total time elapsed: 00:00:00 Computing principal components 0% [#] 100% | ETA: 00:00:00 Total time elapsed: 00:00:00
After we get the training data, we can use deepmatcher to train a classifier. Here we train a hybrid model.
# Create a hybrid model.
model = dm.MatchingModel(attr_summarizer='hybrid')
# Train the hybrid model for 10 epochs, with a batch size of 16 and a
# positive-to-negative ratio of 3. We save the best model (with the
# highest F1 score on the validation set) to 'hybrid_model.pth'.
model.run_train(
train,
validation,
epochs=10,
batch_size=16,
best_save_path='hybrid_model.pth',
pos_neg_ratio=3)
* Number of trainable parameters: 17757810 ===> TRAIN Epoch 1
0% [████] 100% | ETA: 00:00:00 Total time elapsed: 00:00:04
Finished Epoch 1 || Run Time: 4.2 | Load Time: 0.4 || F1: 39.60 | Prec: 31.25 | Rec: 54.05 || Ex/s: 69.89 ===> EVAL Epoch 1
0% [█] 100% | ETA: 00:00:00 Total time elapsed: 00:00:00
Finished Epoch 1 || Run Time: 0.6 | Load Time: 0.2 || F1: 60.61 | Prec: 50.00 | Rec: 76.92 || Ex/s: 146.30 * Best F1: 60.60606060606061 Saving best model... Done. --------------------- ===> TRAIN Epoch 2
0% [████] 100% | ETA: 00:00:00 Total time elapsed: 00:00:04
Finished Epoch 2 || Run Time: 3.9 | Load Time: 0.4 || F1: 65.62 | Prec: 53.39 | Rec: 85.14 || Ex/s: 75.12 ===> EVAL Epoch 2
0% [█] 100% | ETA: 00:00:00 Total time elapsed: 00:00:00
Finished Epoch 2 || Run Time: 0.6 | Load Time: 0.2 || F1: 76.00 | Prec: 79.17 | Rec: 73.08 || Ex/s: 146.34 * Best F1: 76.0 Saving best model... Done. --------------------- ===> TRAIN Epoch 3
0% [████] 100% | ETA: 00:00:00 Total time elapsed: 00:00:03
Finished Epoch 3 || Run Time: 3.8 | Load Time: 0.4 || F1: 87.27 | Prec: 79.12 | Rec: 97.30 || Ex/s: 76.32 ===> EVAL Epoch 3
0% [█] 100% | ETA: 00:00:00 Total time elapsed: 00:00:00
Finished Epoch 3 || Run Time: 0.6 | Load Time: 0.2 || F1: 84.21 | Prec: 77.42 | Rec: 92.31 || Ex/s: 143.25 * Best F1: 84.21052631578948 Saving best model... Done. --------------------- ===> TRAIN Epoch 4
0% [████] 100% | ETA: 00:00:00 Total time elapsed: 00:00:03
Finished Epoch 4 || Run Time: 3.8 | Load Time: 0.4 || F1: 90.57 | Prec: 84.71 | Rec: 97.30 || Ex/s: 76.22 ===> EVAL Epoch 4
0% [█] 100% | ETA: 00:00:00 Total time elapsed: 00:00:00
Finished Epoch 4 || Run Time: 0.6 | Load Time: 0.2 || F1: 82.76 | Prec: 75.00 | Rec: 92.31 || Ex/s: 146.11 --------------------- ===> TRAIN Epoch 5
0% [████] 100% | ETA: 00:00:00 Total time elapsed: 00:00:04
Finished Epoch 5 || Run Time: 3.9 | Load Time: 0.4 || F1: 97.37 | Prec: 94.87 | Rec: 100.00 || Ex/s: 74.96 ===> EVAL Epoch 5
0% [█] 100% | ETA: 00:00:00 Total time elapsed: 00:00:00
Finished Epoch 5 || Run Time: 0.6 | Load Time: 0.2 || F1: 90.57 | Prec: 88.89 | Rec: 92.31 || Ex/s: 146.11 * Best F1: 90.56603773584906 Saving best model... Done. --------------------- ===> TRAIN Epoch 6
0% [████] 100% | ETA: 00:00:00 Total time elapsed: 00:00:04
Finished Epoch 6 || Run Time: 3.9 | Load Time: 0.4 || F1: 98.67 | Prec: 97.37 | Rec: 100.00 || Ex/s: 74.94 ===> EVAL Epoch 6
0% [█] 100% | ETA: 00:00:00 Total time elapsed: 00:00:00
Finished Epoch 6 || Run Time: 0.6 | Load Time: 0.2 || F1: 90.20 | Prec: 92.00 | Rec: 88.46 || Ex/s: 145.73 --------------------- ===> TRAIN Epoch 7
0% [████] 100% | ETA: 00:00:00 Total time elapsed: 00:00:03
Finished Epoch 7 || Run Time: 3.8 | Load Time: 0.4 || F1: 99.33 | Prec: 98.67 | Rec: 100.00 || Ex/s: 75.88 ===> EVAL Epoch 7
0% [█] 100% | ETA: 00:00:00 Total time elapsed: 00:00:00
Finished Epoch 7 || Run Time: 0.6 | Load Time: 0.2 || F1: 90.20 | Prec: 92.00 | Rec: 88.46 || Ex/s: 145.56 --------------------- ===> TRAIN Epoch 8
0% [████] 100% | ETA: 00:00:00 Total time elapsed: 00:00:03
Finished Epoch 8 || Run Time: 3.9 | Load Time: 0.4 || F1: 99.33 | Prec: 98.67 | Rec: 100.00 || Ex/s: 75.16 ===> EVAL Epoch 8
0% [█] 100% | ETA: 00:00:00 Total time elapsed: 00:00:00
Finished Epoch 8 || Run Time: 0.6 | Load Time: 0.2 || F1: 92.00 | Prec: 95.83 | Rec: 88.46 || Ex/s: 145.76 * Best F1: 92.00000000000001 Saving best model... Done. --------------------- ===> TRAIN Epoch 9
0% [████] 100% | ETA: 00:00:00 Total time elapsed: 00:00:03
Finished Epoch 9 || Run Time: 3.7 | Load Time: 0.4 || F1: 99.33 | Prec: 98.67 | Rec: 100.00 || Ex/s: 77.50 ===> EVAL Epoch 9
0% [█] 100% | ETA: 00:00:00 Total time elapsed: 00:00:00
Finished Epoch 9 || Run Time: 0.6 | Load Time: 0.2 || F1: 92.00 | Prec: 95.83 | Rec: 88.46 || Ex/s: 145.27 --------------------- ===> TRAIN Epoch 10
0% [████] 100% | ETA: 00:00:00 Total time elapsed: 00:00:04
Finished Epoch 10 || Run Time: 3.9 | Load Time: 0.4 || F1: 99.33 | Prec: 98.67 | Rec: 100.00 || Ex/s: 75.28 ===> EVAL Epoch 10
0% [█] 100% | ETA: 00:00:00 Total time elapsed: 00:00:00
Finished Epoch 10 || Run Time: 0.6 | Load Time: 0.2 || F1: 92.00 | Prec: 95.83 | Rec: 88.46 || Ex/s: 145.95 --------------------- Loading best model... Training done.
92.00000000000001
Once we have the trained classifier, we can evaluate its accuracy on the test data.
# Evaluate the accuracy on the test data.
model.run_eval(test)
===> EVAL Epoch 8 Finished Epoch 8 || Run Time: 0.3 | Load Time: 0.2 || F1: 92.31 | Prec: 90.91 | Rec: 93.75 || Ex/s: 208.04
92.3076923076923
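As a quick check on the numbers above, F1 is the harmonic mean of precision and recall. The `f1` helper below is ours, not a deepmatcher API; it reproduces the test-set score from the reported precision and recall:

```python
# F1 as the harmonic mean of precision and recall.
def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

# Test-set numbers reported above: Prec 90.91, Rec 93.75.
print(round(f1(90.91, 93.75), 2))  # 92.31
```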
Now that we have a trained classifier, we can apply it to the candidate set to predict each pair as a match or a non-match. To achieve this, we first load the trained model and the candidate set.
# Load the model.
model = dm.MatchingModel(attr_summarizer='hybrid')
model.load_state('hybrid_model.pth')
# Load the candidate set. Note that the trained model is an input parameter, as we need the
# trained model for candidate set preprocessing.
candidate = dm.data.process_unlabeled(
path=os.path.join('.', 'sample_data', 'itunes-amazon', 'e2e-tutorial', 'candidate.csv'),
trained_model=model,
ignore_columns=('ltable_id', 'rtable_id'))
Reading and processing data from "./sample_data/itunes-amazon/e2e-tutorial/candidate.csv" 0% [##############################] 100% | ETA: 00:00:00
After we load the candidate set, we make the predictions using the run_prediction
function. The output_attributes argument indicates the columns that will be included in the prediction table. Here
we use all of the columns in the candidate set.
# Predict the pairs in the candidate set and return a dataframe containing the pair id with
# the score of being a match.
predictions = model.run_prediction(candidate, output_attributes=list(candidate.get_raw_table().columns))
===> PREDICT Epoch 8
0% [██████████████████████████████] 100% | ETA: 00:00:00 Total time elapsed: 00:02:18
Finished Epoch 8 || Run Time: 97.2 | Load Time: 41.5 || F1: 0.00 | Prec: 0.00 | Rec: 0.00 || Ex/s: 0.00
Below are the first few pairs in the prediction table. Note that this table is not sorted in descending order of match score.
predictions.head()
match_score | ltable_id | rtable_id | ltable_Song_Name | ltable_Artist_Name | ltable_Album_Name | ltable_Genre | ltable_Price | ltable_CopyRight | ltable_Time | ltable_Released | rtable_Song_Name | rtable_Artist_Name | rtable_Album_Name | rtable_Genre | rtable_Price | rtable_CopyRight | rtable_Time | rtable_Released | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
_id | |||||||||||||||||||
3237 | 0.960461 | 2583 | 228 | #Selfie ( Instrumental Mix ) | The Chainsmokers | #Selfie ( Instrumental Mix ) - Single | Dance , Music , Electronic | $ 1.29 | 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordings , Inc. | 3:03 | 20-Mar-14 | #SELFIE ( Instrumental Mix ) | The Chainsmokers | #SELFIE ( Instrumental Mix ) | Dance & Electronic | $ 1.29 | ( C ) 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordin... | 3:03 | March 20 , 2014 |
3274 | 0.294473 | 2003 | 229 | #SELFIE ( Caked Up Remix ) | The Chainsmokers | #SELFIE ( The Remixes ) - Single | Electronic , Music , Dance | $ 1.29 | 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordings , I... | 3:15 | 17-Jun-14 | #Selfie ( Will Sparks Remix ) | The Chainsmokers | #Selfie ( The Remixes ) | Electronica , Dance & Electronic | $ 1.29 | ( C ) 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordin... | 3:52 | June 17 , 2014 |
3275 | 0.294473 | 2574 | 229 | #SELFIE ( Caked Up Remix ) | The Chainsmokers | #SELFIE ( The Remixes ) - Single | Electronic , Music , Dance | $ 1.29 | 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordings , I... | 3:15 | 17-Jun-14 | #Selfie ( Will Sparks Remix ) | The Chainsmokers | #Selfie ( The Remixes ) | Electronica , Dance & Electronic | $ 1.29 | ( C ) 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordin... | 3:52 | June 17 , 2014 |
3276 | 0.294473 | 2575 | 229 | #SELFIE ( Caked Up Remix ) | The Chainsmokers | #SELFIE ( The Remixes ) - Single | Electronic , Music , Dance | $ 1.29 | 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordings , I... | 3:15 | 17-Jun-14 | #Selfie ( Will Sparks Remix ) | The Chainsmokers | #Selfie ( The Remixes ) | Electronica , Dance & Electronic | $ 1.29 | ( C ) 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordin... | 3:52 | June 17 , 2014 |
3359 | 0.161323 | 2003 | 230 | #SELFIE ( Caked Up Remix ) | The Chainsmokers | #SELFIE ( The Remixes ) - Single | Electronic , Music , Dance | $ 1.29 | 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordings , I... | 3:15 | 17-Jun-14 | #Selfie ( Botnek Remix ) | The Chainsmokers | #Selfie ( The Remixes ) | Electronica , Dance & Electronic | $ 1.29 | ( C ) 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordin... | 3:36 | June 17 , 2014 |
We can manipulate the prediction table above to get the pairs we want. Suppose we are only interested in pairs with a match score over 0.9, and we want to display them sorted in descending order of match score. This can be achieved as follows.
high_score_pairs = predictions[predictions['match_score'] >= 0.9].sort_values(by=['match_score'], ascending=False)
high_score_pairs.head()
match_score | ltable_id | rtable_id | ltable_Song_Name | ltable_Artist_Name | ltable_Album_Name | ltable_Genre | ltable_Price | ltable_CopyRight | ltable_Time | ltable_Released | rtable_Song_Name | rtable_Artist_Name | rtable_Album_Name | rtable_Genre | rtable_Price | rtable_CopyRight | rtable_Time | rtable_Released | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
_id | |||||||||||||||||||
1472466 | 0.995653 | 2462 | 30798 | It 's Never Easy to Say Goodbye | Kenny Chesney | Me and You | Country , Music , Urban Cowboy , Contemporary Country | $ 0.99 | 1996 BMG Entertainment | 4:44 | 1-Jan-90 | It 's Never Easy To Say Goodbye | Kenny Chesney | Me And You | Country | $ 0.99 | ( C ) 2011 MPL Communications Ltd/Inc under exclusive license to StarCon LLC | 4:44 | January 1 , 1990 |
1472465 | 0.995653 | 2453 | 30798 | It 's Never Easy to Say Goodbye | Kenny Chesney | Me and You | Country , Music , Urban Cowboy , Contemporary Country | $ 0.99 | 1996 BMG Entertainment | 4:44 | 1-Jan-90 | It 's Never Easy To Say Goodbye | Kenny Chesney | Me And You | Country | $ 0.99 | ( C ) 2011 MPL Communications Ltd/Inc under exclusive license to StarCon LLC | 4:44 | January 1 , 1990 |
1472464 | 0.995653 | 2438 | 30798 | It 's Never Easy to Say Goodbye | Kenny Chesney | Me and You | Country , Music , Urban Cowboy , Contemporary Country | $ 0.99 | 1996 BMG Entertainment | 4:44 | 1-Jan-90 | It 's Never Easy To Say Goodbye | Kenny Chesney | Me And You | Country | $ 0.99 | ( C ) 2011 MPL Communications Ltd/Inc under exclusive license to StarCon LLC | 4:44 | January 1 , 1990 |
1472463 | 0.995653 | 2435 | 30798 | It 's Never Easy to Say Goodbye | Kenny Chesney | Me and You | Country , Music , Urban Cowboy , Contemporary Country | $ 0.99 | 1996 BMG Entertainment | 4:44 | 1-Jan-90 | It 's Never Easy To Say Goodbye | Kenny Chesney | Me And You | Country | $ 0.99 | ( C ) 2011 MPL Communications Ltd/Inc under exclusive license to StarCon LLC | 4:44 | January 1 , 1990 |
1472462 | 0.995653 | 2430 | 30798 | It 's Never Easy to Say Goodbye | Kenny Chesney | Me and You | Country , Music , Urban Cowboy , Contemporary Country | $ 0.99 | 1996 BMG Entertainment | 4:44 | 1-Jan-90 | It 's Never Easy To Say Goodbye | Kenny Chesney | Me And You | Country | $ 0.99 | ( C ) 2011 MPL Communications Ltd/Inc under exclusive license to StarCon LLC | 4:44 | January 1 , 1990 |
As you have seen, deepmatcher does not include a column indicating match / non-match predictions in the output table. You can easily create such a column by thresholding the match_score at 0.5 (which is what deepmatcher uses as the threshold to compute F1), or using a threshold value of your choice.
predictions['match_prediction'] = predictions['match_score'].apply(lambda score: 1 if score >= 0.5 else 0)
# Reorder columns to avoid scrolling...
predictions = predictions[['match_score', 'match_prediction'] + predictions.columns.values[1:-1].tolist()]
predictions.head()
match_score | match_prediction | ltable_id | rtable_id | ltable_Song_Name | ltable_Artist_Name | ltable_Album_Name | ltable_Genre | ltable_Price | ltable_CopyRight | ltable_Time | ltable_Released | rtable_Song_Name | rtable_Artist_Name | rtable_Album_Name | rtable_Genre | rtable_Price | rtable_CopyRight | rtable_Time | rtable_Released | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
_id | ||||||||||||||||||||
3237 | 0.960461 | 1 | 2583 | 228 | #Selfie ( Instrumental Mix ) | The Chainsmokers | #Selfie ( Instrumental Mix ) - Single | Dance , Music , Electronic | $ 1.29 | 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordings , Inc. | 3:03 | 20-Mar-14 | #SELFIE ( Instrumental Mix ) | The Chainsmokers | #SELFIE ( Instrumental Mix ) | Dance & Electronic | $ 1.29 | ( C ) 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordin... | 3:03 | March 20 , 2014 |
3274 | 0.294473 | 0 | 2003 | 229 | #SELFIE ( Caked Up Remix ) | The Chainsmokers | #SELFIE ( The Remixes ) - Single | Electronic , Music , Dance | $ 1.29 | 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordings , I... | 3:15 | 17-Jun-14 | #Selfie ( Will Sparks Remix ) | The Chainsmokers | #Selfie ( The Remixes ) | Electronica , Dance & Electronic | $ 1.29 | ( C ) 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordin... | 3:52 | June 17 , 2014 |
3275 | 0.294473 | 0 | 2574 | 229 | #SELFIE ( Caked Up Remix ) | The Chainsmokers | #SELFIE ( The Remixes ) - Single | Electronic , Music , Dance | $ 1.29 | 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordings , I... | 3:15 | 17-Jun-14 | #Selfie ( Will Sparks Remix ) | The Chainsmokers | #Selfie ( The Remixes ) | Electronica , Dance & Electronic | $ 1.29 | ( C ) 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordin... | 3:52 | June 17 , 2014 |
3276 | 0.294473 | 0 | 2575 | 229 | #SELFIE ( Caked Up Remix ) | The Chainsmokers | #SELFIE ( The Remixes ) - Single | Electronic , Music , Dance | $ 1.29 | 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordings , I... | 3:15 | 17-Jun-14 | #Selfie ( Will Sparks Remix ) | The Chainsmokers | #Selfie ( The Remixes ) | Electronica , Dance & Electronic | $ 1.29 | ( C ) 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordin... | 3:52 | June 17 , 2014 |
3359 | 0.161323 | 0 | 2003 | 230 | #SELFIE ( Caked Up Remix ) | The Chainsmokers | #SELFIE ( The Remixes ) - Single | Electronic , Music , Dance | $ 1.29 | 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordings , I... | 3:15 | 17-Jun-14 | #Selfie ( Botnek Remix ) | The Chainsmokers | #Selfie ( The Remixes ) | Electronica , Dance & Electronic | $ 1.29 | ( C ) 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordin... | 3:36 | June 17 , 2014 |
# Reset index as Magellan requires the key to be a column in the table
predictions.reset_index(inplace=True)
# Update metadata in the catalog. This information can later be used by triggers to modify the labels from
# the learning-based matcher
em.set_key(predictions, '_id')
em.set_fk_ltable(predictions, 'ltable_id')
em.set_fk_rtable(predictions, 'rtable_id')
em.set_ltable(predictions, A)
em.set_rtable(predictions, B)
True