A matching model (an instance of class MatchingModel) is a neural network that performs entity matching. It takes the contents of a tuple pair, i.e., two sequences of words for each attribute, as input and produces a match score as output. This tutorial describes the structure of this network and presents the options available for each of its components.
Important Note: Be aware that creating a matching model (MatchingModel) object does not immediately instantiate all its components - deepmatcher uses a lazy initialization paradigm where components are instantiated just before training. Hence, the code examples in this tutorial perform this initialization manually in order to demonstrate model customization meaningfully.
import deepmatcher as dm
import logging
import torch
logging.getLogger('deepmatcher.core').setLevel(logging.INFO)
# Download sample data.
!mkdir -p sample_data/itunes-amazon
!wget -qnc -P sample_data/itunes-amazon https://raw.githubusercontent.com/sidharthms/deepmatcher/master/examples/sample_data/itunes-amazon/train.csv
!wget -qnc -P sample_data/itunes-amazon https://raw.githubusercontent.com/sidharthms/deepmatcher/master/examples/sample_data/itunes-amazon/validation.csv
!wget -qnc -P sample_data/itunes-amazon https://raw.githubusercontent.com/sidharthms/deepmatcher/master/examples/sample_data/itunes-amazon/test.csv
train_dataset, validation_dataset, test_dataset = dm.data.process(
path='sample_data/itunes-amazon',
train='train.csv',
validation='validation.csv',
test='test.csv',
ignore_columns=('left_id', 'right_id'))
model = dm.MatchingModel()
model.initialize(train_dataset) # Explicitly initialize model.
INFO:deepmatcher.core:Successfully initialized MatchingModel with 17757810 trainable parameters.
For more details on lazy initialization, please refer to the Lazy Initialization section of this tutorial.
At its core, a matching model has 3 main components: 1. Attribute Embedding, 2. Attribute Similarity Representation, and 3. Classifier. This is illustrated in the figure below:
We briefly describe these components below. For a more in-depth explanation, please take a look at our paper.
The 3 components are further broken down into sub-modules as shown:
The Attribute Embedding component (AE) takes in two sequences of words corresponding to the two values of each attribute and converts each word in them to a word embedding (a vector representation of a word). This produces two sequences of word embeddings as output for each attribute, as illustrated in the figure below. For an intuitive explanation of word embeddings, please refer to this blog post. The Attribute Embedding component is also presented in more detail in our talk. Note that this component is shared across all attributes - the same AE model is used for all attributes.
This component uses word embeddings that were loaded as part of data processing. To customize it, you can set the embeddings
parameter in dm.data.process
, as described in the tutorial on data processing.
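To make this concrete, here is a toy, framework-free sketch of what attribute embedding computes. The vocabulary and vector values below are made up for illustration; real models use pre-trained embeddings (e.g., fastText) loaded during data processing.

```python
# Toy word-embedding table; real models use pre-trained vectors
# loaded as part of data processing.
EMBEDDINGS = {
    "sharp": [0.1, 0.3],
    "dressed": [0.2, 0.1],
    "man": [0.4, 0.2],
}

def embed_attribute(words, table, dim=2):
    """Map a sequence of words to a sequence of vectors.

    Out-of-vocabulary words get a zero vector in this sketch; real
    systems use subword information or a trained UNK embedding instead.
    """
    return [table.get(w, [0.0] * dim) for w in words]

left = embed_attribute(["sharp", "dressed", "man"], EMBEDDINGS)
print(left)  # three 2-d vectors, one per word
```

The real component maps each word to a several-hundred-dimensional vector, but the word-to-vector lookup structure is the same.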
This component (ASR) takes attribute value embeddings, i.e., two sequences of word embeddings, and encodes them into a representation that captures their similarities and differences. Its operations are split between two modules as described below and as illustrated in the following figure:
2.1 Attribute Summarization (AS): This module takes as input the two word embedding sequences and summarizes the information in them to produce two summary vectors as output. The role of attribute summarization is to aggregate information across all tokens in an attribute value sequence of an entity mention. This summarization process may consider both sequences of an attribute jointly to perform more sophisticated operations such as alignment. A note for folks in NLP: this has nothing to do with text summarization.
2.2 Attribute Comparison (AC): This module takes as input the two summary vectors and applies a comparison function over those summaries to obtain the final similarity representation of the two attribute values.
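As a plain-Python sketch of what a comparison function does, here are toy analogues of the 'abs-diff' and 'concat' merge styles (the summary vectors below are made up; this is not deepmatcher's implementation):

```python
def abs_diff(u, v):
    """Element-wise absolute difference of two summary vectors."""
    return [abs(a - b) for a, b in zip(u, v)]

def concat(u, v):
    """Concatenation of two summary vectors (output size doubles)."""
    return u + v

left_summary = [0.2, 0.9, 0.1]
right_summary = [0.25, 0.8, 0.4]
print(abs_diff(left_summary, right_summary))
print(concat(left_summary, right_summary))
```

Note how 'abs-diff' keeps the output size equal to the input size while 'concat' doubles it; this is why output_size need not match input_size downstream.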
Note that unlike Attribute Embedding, this component is not shared across attributes - each attribute has its own dedicated ASR model. They share the same structure but do not share their parameters.
The ASR can be customized by specifying the attr_summarizer and, optionally, the attr_comparator parameters as follows:
model = dm.MatchingModel(attr_summarizer='sif', attr_comparator='diff')
model.initialize(train_dataset) # Explicitly initialize model.
INFO:deepmatcher.core:Successfully initialized MatchingModel with 662602 trainable parameters.
attr_summarizer can be set to one of the following:
- A string: One of the following 4 string literals:
  - 'sif': Same as setting attr_summarizer = dm.attr_summarizers.SIF().
  - 'rnn': Same as setting attr_summarizer = dm.attr_summarizers.RNN().
  - 'attention': Same as setting attr_summarizer = dm.attr_summarizers.Attention().
  - 'hybrid': Same as setting attr_summarizer = dm.attr_summarizers.Hybrid().
- An instance of dm.AttrSummarizer or one of its subclasses:
  - dm.attr_summarizers.SIF: Use the SIF attribute summarizer.
  - dm.attr_summarizers.RNN: Use the RNN attribute summarizer.
  - dm.attr_summarizers.Attention: Use the Attention attribute summarizer.
  - dm.attr_summarizers.Hybrid: Use the Hybrid attribute summarizer.
- A callable: Put simply, a function that returns a PyTorch module. The module must behave like an attribute summarizer, i.e., take two word embedding sequences, summarize the information in them, and return two vectors as output. Note that we cannot accept a PyTorch module directly because we may need to create multiple instances of this module, one for each attribute. Thus, custom modules must be specified via a callable.
  - Input: Two tensors of shape (batch, seq1_len, input_size) and (batch, seq2_len, input_size). These tensors will be wrapped within AttrTensors, which contain metadata about the batch.
  - Output: Two tensors of shape (batch, output_size), wrapped within AttrTensors (with the metadata information unchanged). output_size need not be the same as input_size.
string arg example:
model = dm.MatchingModel(attr_summarizer='sif')
model.initialize(train_dataset) # Explicitly initialize model.
INFO:deepmatcher.core:Successfully initialized MatchingModel with 662602 trainable parameters.
dm.AttrSummarizer arg example:
model = dm.MatchingModel(attr_summarizer=dm.attr_summarizers.RNN())
model.initialize(train_dataset) # Explicitly initialize model.
INFO:deepmatcher.core:Successfully initialized MatchingModel with 3917002 trainable parameters.
callable arg example: We create a custom attribute summarizer, one that simply sums up all the word embeddings in each sequence. To do this, we use two helper modules:
- dm.modules.Lambda: Used to create a PyTorch module from a lambda function without having to define a class.
- dm.modules.NoMeta: Used to remove metadata information from the input and restore it in the output.
Note that since we are using a custom attr_summarizer, the attr_comparator must be specified.
my_attr_summarizer_module = dm.modules.NoMeta(dm.modules.Lambda(
lambda x, y: (x.sum(dim=1), y.sum(dim=1))))
model = dm.MatchingModel(attr_summarizer=
lambda: my_attr_summarizer_module, attr_comparator='abs-diff')
model.initialize(train_dataset) # Explicitly initialize model.
INFO:deepmatcher.core:Successfully initialized MatchingModel with 662602 trainable parameters.
attr_comparator can be set to one of the following:
- A string: One of the styles supported by the dm.modules.Merge module (e.g., 'concat', 'diff', 'abs-diff', 'mul').
- An instance of dm.modules.Merge.
- A callable: Put simply, a function that returns a PyTorch module. The module must take two vectors as input and produce one vector as output.
  - Input: Two tensors of shape (batch, input_size).
  - Output: One tensor of shape (batch, output_size). output_size need not be the same as input_size.
string arg example:
model = dm.MatchingModel(attr_comparator='concat')
model.initialize(train_dataset) # Explicitly initialize model.
INFO:deepmatcher.core:Successfully initialized MatchingModel with 17517810 trainable parameters.
dm.modules.Merge arg example:
model = dm.MatchingModel(attr_comparator=dm.modules.Merge('mul'))
model.initialize(train_dataset) # Explicitly initialize model.
INFO:deepmatcher.core:Successfully initialized MatchingModel with 17277810 trainable parameters.
callable arg example: We create a custom attribute comparator, one that concatenates the two attribute summaries and their element-wise product. We use the dm.modules.Lambda helper module again to create a PyTorch module from a lambda function.
my_attr_comparator_module = dm.modules.Lambda(
lambda x, y: torch.cat((x, y, x * y), dim=x.dim() - 1))
model = dm.MatchingModel(attr_comparator=lambda: my_attr_comparator_module)
model.initialize(train_dataset) # Explicitly initialize model.
INFO:deepmatcher.core:Successfully initialized MatchingModel with 17757810 trainable parameters.
If attr_comparator is not set, deepmatcher will try to set it automatically based on the attr_summarizer specified. The following mapping shows the attr_comparator that will be used for the various kinds of attribute summarizers:
- dm.attr_summarizers.SIF: attr_comparator='abs-diff'
- dm.attr_summarizers.RNN: attr_comparator='abs-diff'
- dm.attr_summarizers.Attention: attr_comparator='concat'
- dm.attr_summarizers.Hybrid: attr_comparator='concat-abs-diff'
If the specified attr_summarizer is not a supported string and is not an instance of any of these classes, then attr_comparator must be specified.
The Attribute Summarization module is the most critical component of a matching model. As mentioned earlier, it takes in two word embedding sequences and summarizes the information in them to produce two summary vectors as output. It consists of 3 sub-modules, described below and illustrated in the following figure:
This is an optional module that takes a word embedding sequence as input and produces a context-aware word embedding sequence as output. For example, consider the raw word embedding sequences for "Brand : Orange" and "Color : Orange". In the first case, the output word embedding for "Orange" may be adjusted to represent the brand Orange, and in the second case, it may be adjusted to represent the color orange. This module is shared for both word embedding sequences, i.e., the same neural network is used for both sequences.
This is an optional module that takes as input two word embedding sequences (which may or may not be context-aware), one of which is treated as the primary sequence while the other is treated as context. Intuitively, for each word in the primary sequence, this module finds the relevant information in the context sequence and compares the word against it.
The output of this module is the sequence of word comparison vectors, i.e., one word comparison vector for each word in the primary word embedding sequence. This module is shared for both word embedding sequences, i.e., the 1st word embedding sequence is compared to the 2nd to obtain a word comparison vector sequence for the 1st sequence, and the same network is used to compare the 2nd sequence to the 1st to obtain a word comparison vector sequence for the 2nd sequence.
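The alignment intuition can be sketched in plain Python: each primary word attends to the context words via softmax-normalized dot-product scores, and its comparison vector combines the word with its aligned context. This is a simplified illustration of attention-style alignment, not deepmatcher's exact computation:

```python
import math

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def word_comparison_vectors(primary, context):
    """For each primary word vector, softly align it against the
    context sequence and compare it to the aligned vector."""
    out = []
    for p in primary:
        # Attention weights: how relevant each context word is to p.
        weights = softmax([dot(p, c) for c in context])
        # Aligned context: weighted average of context word vectors.
        aligned = [sum(w * c[i] for w, c in zip(weights, context))
                   for i in range(len(p))]
        # Comparison vector: the word concatenated with the
        # element-wise difference from its aligned context.
        out.append(p + [a - b for a, b in zip(p, aligned)])
    return out

primary = [[1.0, 0.0], [0.0, 1.0]]
context = [[1.0, 0.0], [0.0, 1.0]]
print(word_comparison_vectors(primary, context))
```

Running the same network in both directions, as described above, yields one comparison vector sequence per input sequence.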
This module takes as input a sequence of vectors - either a sequence of word embeddings or a sequence of word comparison vectors. It aggregates this sequence to produce a single vector summarizing it, optionally making use of the other sequence as context. This module is shared for both word embedding sequences, i.e., the same neural network is used for both sequences.
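As a toy illustration, the simplest aggregation strategies pool over the sequence dimension. The sketch below mirrors the spirit of the '-pool' aggregators in plain Python; it is not deepmatcher's implementation:

```python
def avg_pool(vectors):
    """Average a sequence of vectors into a single vector."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def max_pool(vectors):
    """Element-wise max over a sequence of vectors."""
    return [max(v[i] for v in vectors) for i in range(len(vectors[0]))]

seq = [[1.0, 4.0], [3.0, 2.0]]
print(avg_pool(seq))  # [2.0, 3.0]
print(max_pool(seq))  # [3.0, 4.0]
```

Either way, a variable-length sequence collapses into one fixed-size summary vector, which is what the downstream comparison step requires.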
Attribute Summarization can be customized by specifying the word_contextualizer
, word_comparator
, and word_aggregator
parameters while creating a dm.AttrSummarizer
or any of its four sub-classes discussed above. For example,
model = dm.MatchingModel(
attr_summarizer=dm.attr_summarizers.Hybrid(
word_contextualizer='self-attention',
word_comparator='general-attention',
word_aggregator='inv-freq-avg-pool'))
model.initialize(train_dataset) # Explicitly initialize model.
INFO:deepmatcher.core:Successfully initialized MatchingModel with 11973802 trainable parameters.
word_contextualizer can be set to one of the following:
- A string: One of the unit_types supported by dm.word_contextualizers.RNN, or 'self-attention':
  - 'gru': Same as setting word_contextualizer = dm.word_contextualizers.RNN(unit_type='gru').
  - 'lstm': Same as setting word_contextualizer = dm.word_contextualizers.RNN(unit_type='lstm').
  - 'rnn': Same as setting word_contextualizer = dm.word_contextualizers.RNN(unit_type='rnn').
  - 'self-attention': Same as setting word_contextualizer = dm.word_contextualizers.SelfAttention().
- An instance of dm.WordContextualizer or one of its subclasses:
  - dm.word_contextualizers.RNN: Use the RNN word contextualizer.
  - dm.word_contextualizers.SelfAttention: Use the Self-Attention word contextualizer.
- A callable: Put simply, a function that returns a PyTorch module.
  - Input: One tensor of shape (batch, seq_len, input_size), wrapped within an AttrTensor, which contains metadata about the batch.
  - Output: One tensor of shape (batch, seq_len, output_size), wrapped within an AttrTensor (with the metadata information unchanged). output_size need not be the same as input_size.
We show some examples of how to customize word contextualizers for the Hybrid attribute summarizer (dm.attr_summarizers.Hybrid) below, but these are also applicable to other attribute summarizers:
string arg example:
model = dm.MatchingModel(
attr_summarizer=dm.attr_summarizers.Hybrid(word_contextualizer='gru'))
model.initialize(train_dataset) # Explicitly initialize model.
INFO:deepmatcher.core:Successfully initialized MatchingModel with 20645010 trainable parameters.
dm.WordContextualizer arg example:
model = dm.MatchingModel(
attr_summarizer = dm.attr_summarizers.Hybrid(
word_contextualizer=dm.word_contextualizers.SelfAttention(heads=2)))
model.initialize(train_dataset) # Explicitly initialize model.
INFO:deepmatcher.core:Successfully initialized MatchingModel with 26059410 trainable parameters.
callable arg example: We create a custom convolutional word contextualizer. To do this, we need the input_size dimension, which will be provided if the callable takes one argument named input_size, as shown. We then use this input size to create a convolutional layer. Since the convolutional layer expects the sequence length dimension to be last, we swap the 2nd and 3rd dimensions of the tensor before and after the convolution. We also use dm.modules.Lambda and dm.modules.NoMeta as in earlier examples.
def my_word_contextualizer(input_size):
return dm.modules.NoMeta(torch.nn.Sequential(
dm.modules.Lambda(lambda x: x.transpose(1, 2)),
torch.nn.Conv1d(in_channels=input_size, out_channels=512,
kernel_size=3, padding=1),
dm.modules.Lambda(lambda x: x.transpose(1, 2))))
model = dm.MatchingModel(attr_summarizer=dm.attr_summarizers.Hybrid(
word_contextualizer=my_word_contextualizer))
model.initialize(train_dataset) # Explicitly initialize model.
INFO:deepmatcher.core:Successfully initialized MatchingModel with 38093682 trainable parameters.
word_comparator can be set to one of the following:
- A string: One of the following 3 string literals:
  - 'decomposable-attention': Same as setting word_comparator = dm.word_comparators.Attention(alignment_network='decomposable').
  - 'general-attention': Same as setting word_comparator = dm.word_comparators.Attention(alignment_network='general').
  - 'dot-attention': Same as setting word_comparator = dm.word_comparators.Attention(alignment_network='dot').
- An instance of dm.WordComparator or one of its subclasses:
  - dm.word_comparators.Attention: Use the Attention word comparator.
- A callable: Put simply, a function that returns a PyTorch module.
  - Input: Four tensors wrapped within AttrTensors, which contain metadata about the batch:
    - The primary word embedding sequence, of shape (batch, seq1_len, input_size).
    - The context word embedding sequence, of shape (batch, seq2_len, input_size).
    - The raw (pre-contextualization) primary sequence, of shape (batch, seq1_len, raw_input_size).
    - The raw (pre-contextualization) context sequence, of shape (batch, seq2_len, raw_input_size).
  - Output: One tensor of shape (batch, seq1_len, output_size), wrapped within an AttrTensor (with the same metadata information as the first input tensor). output_size need not be the same as input_size.
We show some examples of how to customize word comparators for the Hybrid attribute summarizer (dm.attr_summarizers.Hybrid) below, but these are also applicable to other attribute summarizers:
string arg example:
model = dm.MatchingModel(
attr_summarizer=dm.attr_summarizers.Hybrid(word_comparator='dot-attention'))
model.initialize(train_dataset) # Explicitly initialize model.
INFO:deepmatcher.core:Successfully initialized MatchingModel with 20645010 trainable parameters.
dm.WordComparator arg example:
model = dm.MatchingModel(
attr_summarizer = dm.attr_summarizers.Hybrid(
word_comparator=dm.word_comparators.Attention(heads=4, input_dropout=0.2)))
model.initialize(train_dataset) # Explicitly initialize model.
INFO:deepmatcher.core:Successfully initialized MatchingModel with 29675010 trainable parameters.
callable arg example: We create a custom word comparator, one that uses the Attention word comparator followed by a 2-layer RNN. Since there are multiple inputs, we cannot use the standard torch.nn.Sequential; instead, we use the dm.modules.MultiSequential utility module.
model = dm.MatchingModel(attr_summarizer=dm.attr_summarizers.Hybrid(
word_comparator=lambda: dm.modules.MultiSequential(
dm.word_comparators.Attention(),
dm.modules.RNN(unit_type='gru', layers=2))))
model.initialize(train_dataset) # Explicitly initialize model.
INFO:deepmatcher.core:Successfully initialized MatchingModel with 27153810 trainable parameters.
word_aggregator can be set to one of the following:
- A string: One of the following string literals:
  - One of the styles supported by the dm.modules.Pool module, suffixed by '-pool', e.g., 'avg-pool', 'sif-pool', 'divsqrt-pool', etc. This is the same as setting word_aggregator = dm.word_aggregators.Pool(<pool_type>); e.g., 'avg-pool' is the same as word_aggregator = dm.word_aggregators.Pool('avg').
  - 'attention-with-rnn': Same as setting word_aggregator = dm.word_aggregators.AttentionWithRNN().
- An instance of dm.WordAggregator or one of its subclasses:
  - dm.word_aggregators.Pool: Use the Pool word aggregator.
  - dm.word_aggregators.AttentionWithRNN: Use the AttentionWithRNN word aggregator.
- A callable: Put simply, a function that returns a PyTorch module.
  - Input: Two tensors of shape (batch, seq1_len, input_size) and (batch, seq2_len, input_size). The tensors will be wrapped within AttrTensors, which contain metadata about the batch.
  - Output: One tensor of shape (batch, output_size), wrapped within an AttrTensor (with the same metadata information as the first input tensor). This output must be the aggregation of the sequence of vectors in the first (primary) input, optionally taking the second (context) input into account. output_size need not be the same as input_size.
We show some examples of how to customize word aggregators for the Hybrid attribute summarizer (dm.attr_summarizers.Hybrid) below, but these are also applicable to other attribute summarizers:
string arg example:
model = dm.MatchingModel(
attr_summarizer=dm.attr_summarizers.Hybrid(word_aggregator='max-pool'))
model.initialize(train_dataset) # Explicitly initialize model.
INFO:deepmatcher.core:Successfully initialized MatchingModel with 11616202 trainable parameters.
dm.WordAggregator arg example:
model = dm.MatchingModel(
attr_summarizer = dm.attr_summarizers.Hybrid(
word_aggregator=dm.word_aggregators.AttentionWithRNN(
rnn='lstm', rnn_pool_style='max')))
model.initialize(train_dataset) # Explicitly initialize model.
INFO:deepmatcher.core:Successfully initialized MatchingModel with 21729810 trainable parameters.
callable arg example: We create a custom word aggregator, one that concatenates the average and the max of the given input sequence. We also use dm.modules.Lambda and dm.modules.NoMeta as in earlier examples.
my_word_aggregator_module = dm.modules.NoMeta(dm.modules.Lambda(
lambda x, y: torch.cat((x.mean(dim=1), x.max(dim=1)[0]), dim=-1)))
# Next, create the matching model.
model = dm.MatchingModel(
attr_summarizer = dm.attr_summarizers.Hybrid(word_aggregator=
lambda: my_word_aggregator_module))
model.initialize(train_dataset) # Explicitly initialize model.
INFO:deepmatcher.core:Successfully initialized MatchingModel with 11856202 trainable parameters.
This component takes the attribute similarity representations and uses those as features for a classifier that determines whether the input tuple pair refers to the same real-world entity.
The classifier can be customized by specifying the classifier parameter as follows:
model = dm.MatchingModel(classifier='3-layer-residual-relu')
model.initialize(train_dataset) # Explicitly initialize model.
INFO:deepmatcher.core:Successfully initialized MatchingModel with 17938410 trainable parameters.
classifier can be set to one of the following:
- A string: A style string supported by the dm.modules.Transform module.
- An instance of dm.Classifier.
- A callable: Put simply, a function that returns a PyTorch module. The module must take one vector as input and produce the log probabilities of non-match and match as output. Two outputs are used instead of one to work around a numerical stability issue in torch.
  - Input: One tensor of shape (batch, input_size).
  - Output: One tensor of shape (batch, 2). The second dimension must contain the non-match and match class probabilities, in that order.
string arg example:
model = dm.MatchingModel(classifier='2-layer-highway-tanh')
model.initialize(train_dataset) # Explicitly initialize model.
INFO:deepmatcher.core:Successfully initialized MatchingModel with 17757810 trainable parameters.
dm.Classifier arg example:
model = dm.MatchingModel(classifier=dm.Classifier(
dm.modules.Transform('3-layer-residual', non_linearity=None,
hidden_size=512)))
model.initialize(train_dataset) # Explicitly initialize model.
INFO:deepmatcher.core:Successfully initialized MatchingModel with 19137482 trainable parameters.
callable arg example: We create a custom classifier, one that outputs 3 class probabilities (e.g., for text entailment). We use the dm.modules.Transform helper module as a building block.
my_classifier_module = torch.nn.Sequential(
dm.modules.Transform('2-layer-highway', hidden_size=300),
dm.modules.Transform('1-layer', non_linearity=None, output_size=3),
torch.nn.LogSoftmax(dim=1))
model = dm.MatchingModel(classifier=lambda: my_classifier_module)
model.initialize(train_dataset) # Explicitly initialize model.
INFO:deepmatcher.core:Successfully initialized MatchingModel with 17758111 trainable parameters.
As mentioned earlier, deepmatcher follows a lazy initialization paradigm. This enables it to:
1. Infer input sizes automatically: During initialization, deepmatcher performs one full forward pass through the model. In this process, each component is initialized only after all its parent modules in the computational graph have been initialized. This makes automatic input size inference for modules possible. As a result, plugging custom modules into the middle of the network is much easier, since you do not have to compute input sizes manually.
2. Verify output shapes: During this forward pass, deepmatcher verifies that all modules output tensors with the correct shapes. This verification is done only once, during initialization, to avoid slowing down training.
It is because of reasons 1 and 2 above that deepmatcher does not permit custom modules to be specified directly and requires them to be specified via functions.
The core module that enables lazy initialization is dm.modules.LazyModule, which is a base class for most modules in deepmatcher.
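The idea behind this pattern can be sketched independently of deepmatcher: a module defers allocating its parameters until the first forward pass, when the input size is finally known. The class below is a hypothetical, simplified illustration in plain Python, not deepmatcher's actual code:

```python
class LazyLinear:
    """A toy linear layer that infers its input size on first use."""

    def __init__(self, output_size):
        self.output_size = output_size
        self.weights = None  # not allocated until the first forward pass

    def forward(self, x):
        if self.weights is None:
            # Lazy initialization: the input size is inferred from the
            # first batch of data rather than declared up front.
            input_size = len(x)
            self.weights = [[0.1] * input_size
                            for _ in range(self.output_size)]
        return [sum(w_i * x_i for w_i, x_i in zip(row, x))
                for row in self.weights]

layer = LazyLinear(output_size=2)
out = layer.forward([1.0, 2.0, 3.0])  # input size 3 is inferred here
print(len(out))  # 2
```

This is why the initialization forward pass described above can size every module in topological order: each module sees its actual input before it allocates parameters.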