Matching Models

A matching model (an instance of class MatchingModel) is a neural network that performs entity matching. It takes the contents of a tuple pair, i.e., two sequences of words for each attribute, as input and produces a match score as output. This tutorial describes the structure of this network and presents the options available for each of its components.

Important Note: Be aware that creating a matching model (MatchingModel) object does not immediately instantiate all its components - deepmatcher follows a lazy initialization paradigm in which components are instantiated just before training. Hence, the code examples in this tutorial explicitly perform this initialization so that model customization can be demonstrated meaningfully.

In [1]:
import deepmatcher as dm
import logging
import torch

logging.getLogger('deepmatcher.core').setLevel(logging.INFO)

# Download sample data.
!mkdir -p sample_data/itunes-amazon
!wget -qnc -P sample_data/itunes-amazon https://raw.githubusercontent.com/sidharthms/deepmatcher/master/examples/sample_data/itunes-amazon/train.csv
!wget -qnc -P sample_data/itunes-amazon https://raw.githubusercontent.com/sidharthms/deepmatcher/master/examples/sample_data/itunes-amazon/validation.csv
!wget -qnc -P sample_data/itunes-amazon https://raw.githubusercontent.com/sidharthms/deepmatcher/master/examples/sample_data/itunes-amazon/test.csv

train_dataset, validation_dataset, test_dataset = dm.data.process(
    path='sample_data/itunes-amazon',
    train='train.csv',
    validation='validation.csv',
    test='test.csv',
    ignore_columns=('left_id', 'right_id'))

model = dm.MatchingModel()
model.initialize(train_dataset)  # Explicitly initialize model.
INFO:deepmatcher.core:Successfully initialized MatchingModel with 17757810 trainable parameters.

For more details on lazy initialization, please refer to the Lazy Initialization section of this tutorial.

At its core, a matching model has 3 main components: 1. Attribute Embedding, 2. Attribute Similarity Representation, and 3. Classifier. This is illustrated in the figure below:

Matching model structure

We briefly describe these components below. For a more in-depth explanation, please take a look at our paper.

The 3 components are further broken down into sub-modules as shown:

  1. Attribute Embedding
  2. Attribute Similarity Representation
    1. Attr Summarizer
      1. Word Contextualizer
      2. Word Comparator
      3. Word Aggregator
    2. Attr Comparator
  3. Classifier

1. Attribute Embedding

The Attribute Embedding component (AE) takes in two sequences of words corresponding to the value of each attribute and converts each word in them to a word embedding (vector representation of a word). This produces two sequences of word embeddings as output for each attribute. This is illustrated in the figure below. For an intuitive explanation of word embeddings, please refer to this blog post. The Attribute Embedding component is also presented in more detail in our talk. Note that this component is shared across all attributes - the same AE model is used for all attributes.

Attribute Embedding
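To make the shapes concrete, here is a toy sketch - it uses a plain torch.nn.Embedding over a made-up three-word vocabulary, not the pretrained embeddings deepmatcher actually loads - showing how a single attribute value becomes a sequence of word vectors:

# Toy illustration only: a plain torch.nn.Embedding with a made-up vocabulary,
# not the pretrained embeddings deepmatcher loads during data processing.
toy_vocab = {'hotel': 0, 'california': 1, 'eagles': 2}
toy_embedding = torch.nn.Embedding(num_embeddings=len(toy_vocab), embedding_dim=300)
word_ids = torch.tensor([[toy_vocab['hotel'], toy_vocab['california'], toy_vocab['eagles']]])
word_vectors = toy_embedding(word_ids)  # Shape: (batch=1, seq_len=3, embedding_dim=300).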

Customizing Attribute Embedding

This component uses word embeddings that were loaded as part of data processing. To customize it, you can set the embeddings parameter in dm.data.process, as described in the tutorial on data processing.
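For example, the following sketch passes a different pretrained embedding when processing the data. The 'glove.42B.300d' value is shown purely as an assumed example; consult the data processing tutorial for the embedding identifiers your deepmatcher version actually supports.

# Sketch only: same processing call as above, but with an explicit (assumed)
# pretrained embedding choice instead of the default.
train_dataset, validation_dataset, test_dataset = dm.data.process(
    path='sample_data/itunes-amazon',
    train='train.csv',
    validation='validation.csv',
    test='test.csv',
    ignore_columns=('left_id', 'right_id'),
    embeddings='glove.42B.300d')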

2. Attribute Similarity Representation

This component (ASR) takes attribute value embeddings, i.e., two sequences of word embeddings, and encodes them into a representation that captures their similarities and differences. Its operations are split between two modules as described below and as illustrated in the following figure:

2.1 Attribute Summarization (AS): This module takes as input the two word embedding sequences and summarizes the information in them to produce two summary vectors as output. The role of attribute summarization is to aggregate information across all tokens in an attribute value sequence of an entity mention. This summarization process may consider the pair of sequences of an attribute jointly to perform more sophisticated operations such as alignment. Folks in NLP: this has nothing to do with text summarization.

2.2 Attribute Comparison (AC): This module takes as input the two summary vectors and applies a comparison function over those summaries to obtain the final similarity representation of the two attribute values.

Matching model structure

Note that unlike Attribute Embedding, this component is not shared across attributes - each attribute has its own dedicated ASR model. They share the same structure but do not share their parameters.

Customizing Attribute Similarity Representation

The ASR can be customized by specifying the attr_summarizer parameter and, optionally, the attr_comparator parameter, as follows:

In [2]:
model = dm.MatchingModel(attr_summarizer='sif', attr_comparator='diff')
model.initialize(train_dataset)  # Explicitly initialize model.
INFO:deepmatcher.core:Successfully initialized MatchingModel with 662602 trainable parameters.

attr_summarizer can be set to one of the following:

  • A string: One of the following 4 string literals
    1. 'sif': Use the SIF attribute summarizer (refer our paper for details on SIF and other attribute summarizers). Equivalent to setting attr_summarizer = dm.attr_summarizers.SIF().
    2. 'rnn': Use the RNN attribute summarizer. Equivalent to setting attr_summarizer = dm.attr_summarizers.RNN().
    3. 'attention': Use the Attention attribute summarizer. Equivalent to setting attr_summarizer = dm.attr_summarizers.Attention().
    4. 'hybrid': Use the Hybrid attribute summarizer. Equivalent to setting attr_summarizer = dm.attr_summarizers.Hybrid().
  • A callable: Put simply, a function that returns a PyTorch Module. The module must behave like an Attribute Summarizer, i.e., it must take two word embedding sequences, summarize the information in them, and return two vectors as output. Note that we cannot accept a PyTorch module directly as input because we may need to create multiple instances of this module, one for each attribute. Thus, we require that you specify custom modules via a callable.
    • Input to module: Two 3d tensors of shape (batch, seq1_len, input_size) and (batch, seq2_len, input_size). These tensors will be wrapped within AttrTensors which will contain metadata about the batch.
    • Expected output from module: Two 2d tensors of shape (batch, output_size), wrapped within AttrTensors (with metadata information unchanged). output_size need not be the same as input_size.

string arg example:

In [3]:
model = dm.MatchingModel(attr_summarizer='sif')
model.initialize(train_dataset)  # Explicitly initialize model.
INFO:deepmatcher.core:Successfully initialized MatchingModel with 662602 trainable parameters.

dm.AttrSummarizer arg example:

In [4]:
model = dm.MatchingModel(attr_summarizer=dm.attr_summarizers.RNN())
model.initialize(train_dataset)  # Explicitly initialize model.
INFO:deepmatcher.core:Successfully initialized MatchingModel with 3917002 trainable parameters.

callable arg example: We create a custom attribute summarizer, one that simply sums up all the word embeddings in each sequence. To do this we use two helper modules:

  • dm.modules.Lambda: Used to create a PyTorch module from a lambda function without having to define a class.
  • dm.modules.NoMeta: Used to remove metadata information from the input and restore it in the output.

Note that since we are using a custom attr_summarizer, the attr_comparator must be specified.

In [5]:
my_attr_summarizer_module = dm.modules.NoMeta(dm.modules.Lambda(
    lambda x, y: (x.sum(dim=1), y.sum(dim=1))))

model = dm.MatchingModel(attr_summarizer=
    lambda: my_attr_summarizer_module, attr_comparator='abs-diff')
model.initialize(train_dataset)  # Explicitly initialize model.
INFO:deepmatcher.core:Successfully initialized MatchingModel with 662602 trainable parameters.

attr_comparator can be set to one of the following:

  • A string: One of the styles supported by the dm.modules.Merge module.
  • An instance of dm.modules.Merge
  • A callable: Put simply, a function that returns a PyTorch Module. The module must take in two vectors as input and produce one vector as output.
    • Input to module: Two 2d tensors of shape (batch, input_size).
    • Expected output from module: One 2d tensor of shape (batch, output_size). output_size need not be the same as input_size.

string arg example:

In [6]:
model = dm.MatchingModel(attr_comparator='concat')
model.initialize(train_dataset)  # Explicitly initialize model.
INFO:deepmatcher.core:Successfully initialized MatchingModel with 17517810 trainable parameters.

dm.modules.Merge arg example:

In [7]:
model = dm.MatchingModel(attr_comparator=dm.modules.Merge('mul'))
model.initialize(train_dataset)  # Explicitly initialize model.
INFO:deepmatcher.core:Successfully initialized MatchingModel with 17277810 trainable parameters.

callable arg example: We create a custom attribute comparator, one that concatenates the two attribute summaries and their element-wise product. We use the dm.modules.Lambda helper module again to create a PyTorch module from a lambda function.

In [8]:
my_attr_comparator_module = dm.modules.Lambda(
    lambda x, y: torch.cat((x, y, x * y), dim=x.dim() - 1))

model = dm.MatchingModel(attr_comparator=lambda: my_attr_comparator_module)
model.initialize(train_dataset)  # Explicitly initialize model.
INFO:deepmatcher.core:Successfully initialized MatchingModel with 17757810 trainable parameters.

If attr_comparator is not set, deepmatcher will try to automatically set it based on the attr_summarizer specified. The following mapping shows the attr_comparator that will be used for various kinds of attribute summarizers:

  • Instance of dm.attr_summarizers.SIF: attr_comparator='abs-diff'
  • Instance of dm.attr_summarizers.RNN: attr_comparator='abs-diff'
  • Instance of dm.attr_summarizers.Attention: attr_comparator='concat'
  • Instance of dm.attr_summarizers.Hybrid: attr_comparator='concat-abs-diff'

If the specified attr_summarizer is not a supported string and is not an instance of any of these classes, then attr_comparator must be specified.
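For example, the following sketch relies on the default mapping above: since 'hybrid' corresponds to dm.attr_summarizers.Hybrid, the 'concat-abs-diff' comparator is selected automatically.

# No attr_comparator specified: the Hybrid default, 'concat-abs-diff', is used.
model = dm.MatchingModel(attr_summarizer='hybrid')
model.initialize(train_dataset)  # Explicitly initialize model.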

2.1. Attribute Summarization

The Attribute Summarization module is the most critical component of a matching model. As mentioned earlier, it takes in two word embedding sequences and summarizes the information in them to produce two summary vectors as output. It consists of 3 sub-modules, described below and illustrated in the following figure:

Attribute Summarization

2.1.1. Word Contextualizer

This is an optional module that takes as input a word embedding sequence and produces a context-aware word embedding sequence as output. For example, consider the raw word embedding sequences for the sentences "Brand : Orange" and "Color : Orange". In the first case, the output word embedding for "Orange" may be adjusted to represent the brand Orange, and in the second case, it may be adjusted to represent the color orange. This module is shared for both word embedding sequences, i.e., the same neural network is used for both sequences.

2.1.2. Word Comparator

This is an optional module that takes as input two word embedding sequences (which may or may not be context-aware), one of which is treated as the primary sequence and the other as context. Intuitively, this module does the following:

  • For each word in the primary sequence, find the corresponding aligning word in the context sequence.
  • Compare each word in the primary sequence with its corresponding word in the context sequence to obtain a word comparison vector for each word.

The output of this module is the sequence of word comparison vectors, i.e., one word comparison vector for each word in the primary word embedding sequence. This module is shared for both word embedding sequences, i.e., the 1st word embedding sequence is compared to the 2nd to obtain a word comparison vector sequence for the 1st sequence, and the same network is used to compare the 2nd sequence to the 1st to obtain a word comparison vector sequence for the 2nd sequence.

2.1.3. Word Aggregator

This module takes as input a sequence of vectors - either a sequence of word embeddings or a sequence of word comparison vectors. It aggregates this sequence to produce a single vector summarizing it. This module may optionally make use of the other sequence as context. This module is shared for both word embedding sequences, i.e., the same neural network is used for both sequences.

Customizing Attribute Summarization

Attribute Summarization can be customized by specifying the word_contextualizer, word_comparator, and word_aggregator parameters while creating a dm.AttrSummarizer or any of its four sub-classes discussed above. For example,

In [9]:
model = dm.MatchingModel(
    attr_summarizer=dm.attr_summarizers.Hybrid(
        word_contextualizer='self-attention',
        word_comparator='general-attention',
        word_aggregator='inv-freq-avg-pool'))
model.initialize(train_dataset)  # Explicitly initialize model.
INFO:deepmatcher.core:Successfully initialized MatchingModel with 11973802 trainable parameters.

word_contextualizer can be set to one of the following:

  • A string: One of the unit_type values supported by dm.word_contextualizers.RNN, or 'self-attention':
    1. 'gru': Equivalent to setting word_contextualizer = dm.word_contextualizers.RNN(unit_type='gru').
    2. 'lstm': Equivalent to setting word_contextualizer = dm.word_contextualizers.RNN(unit_type='lstm').
    3. 'rnn': Equivalent to setting word_contextualizer = dm.word_contextualizers.RNN(unit_type='rnn').
    4. 'self-attention': Equivalent to setting word_contextualizer = dm.word_contextualizers.SelfAttention().
  • A callable: Put simply, a function that returns a PyTorch Module.
    • Input to module: One 3d tensor of shape (batch, seq_len, input_size). The tensor will be wrapped within AttrTensor which will contain metadata about the batch.
    • Expected output from module: One 3d tensor of shape (batch, seq_len, output_size), wrapped within an AttrTensor (with metadata information unchanged). output_size need not be the same as input_size.

We show some examples of how to customize word contextualizers for Hybrid attribute summarization modules (dm.attr_summarizers.Hybrid) below, but these are also applicable to other attribute summarizers:

string arg example:

In [10]:
model = dm.MatchingModel(
    attr_summarizer=dm.attr_summarizers.Hybrid(word_contextualizer='gru'))
model.initialize(train_dataset)  # Explicitly initialize model.
INFO:deepmatcher.core:Successfully initialized MatchingModel with 20645010 trainable parameters.

dm.WordContextualizer arg example:

In [11]:
model = dm.MatchingModel(
    attr_summarizer = dm.attr_summarizers.Hybrid(
        word_contextualizer=dm.word_contextualizers.SelfAttention(heads=2)))
model.initialize(train_dataset)  # Explicitly initialize model.
INFO:deepmatcher.core:Successfully initialized MatchingModel with 26059410 trainable parameters.

callable arg example: We create a custom convolutional word contextualizer. To do this, we need the input_size dimension. This will be provided if the callable takes in one argument named input_size as shown. We then use this input size to create a convolutional layer. But the convolutional layer expects the sequence length dimension to be last. To deal with this we swap the 2nd and 3rd dimensions of the tensor before and after convolution. We also use dm.modules.Lambda and dm.modules.NoMeta as in earlier examples.

In [12]:
def my_word_contextualizer(input_size):
    return dm.modules.NoMeta(torch.nn.Sequential(
        dm.modules.Lambda(lambda x: x.transpose(1, 2)),
        torch.nn.Conv1d(in_channels=input_size, out_channels=512, 
                        kernel_size=3, padding=1),
        dm.modules.Lambda(lambda x: x.transpose(1, 2))))

model = dm.MatchingModel(attr_summarizer=dm.attr_summarizers.Hybrid(
    word_contextualizer=my_word_contextualizer))
model.initialize(train_dataset)  # Explicitly initialize model.
INFO:deepmatcher.core:Successfully initialized MatchingModel with 38093682 trainable parameters.

word_comparator can be set to one of the following:

  • A string: One of the following 3 string literals:
    1. 'decomposable-attention': Equivalent to setting word_comparator = dm.word_comparators.Attention(alignment_network='decomposable').
    2. 'general-attention': Equivalent to setting word_comparator = dm.word_comparators.Attention(alignment_network='general').
    3. 'dot-attention': Equivalent to setting word_comparator = dm.word_comparators.Attention(alignment_network='dot').
  • An instance of dm.WordComparator or one of its subclasses:
    1. An instance of dm.word_comparators.Attention: Use the Attention word comparator.
  • A callable: Put simply, a function that returns a PyTorch Module.
    • Inputs to module: Four input tensors, all of which will be wrapped within AttrTensors.
      1. Primary context-aware word embedding sequence. Shape: (batch, seq1_len, input_size)
      2. Secondary context-aware word embedding sequence, i.e., the sequence to compare the primary sequence with. Shape: (batch, seq2_len, input_size)
      3. Raw primary word embedding sequence (context-unaware). Shape: (batch, seq1_len, raw_input_size)
      4. Raw secondary word embedding sequence (context-unaware). Shape: (batch, seq2_len, raw_input_size)
    • Expected output from module: One 3d tensor of shape (batch, seq1_len, output_size), wrapped within an AttrTensor (with the same metadata information as the first input tensor). output_size need not be the same as input_size.
    • Notes:
      • The custom module may choose to ignore the last two raw context-unaware inputs if they are deemed unnecessary.
      • If no Word Contextualizer is used, the last two inputs will be the same as the first two inputs.

We show some examples of how to customize word comparators for Hybrid attribute summarization modules (dm.attr_summarizers.Hybrid) below, but these are also applicable to other attribute summarizers:

string arg example:

In [13]:
model = dm.MatchingModel(
    attr_summarizer=dm.attr_summarizers.Hybrid(word_comparator='dot-attention'))
model.initialize(train_dataset)  # Explicitly initialize model.
INFO:deepmatcher.core:Successfully initialized MatchingModel with 20645010 trainable parameters.

dm.WordComparator arg example:

In [14]:
model = dm.MatchingModel(
    attr_summarizer = dm.attr_summarizers.Hybrid(
        word_comparator=dm.word_comparators.Attention(heads=4, input_dropout=0.2)))
model.initialize(train_dataset)  # Explicitly initialize model.
INFO:deepmatcher.core:Successfully initialized MatchingModel with 29675010 trainable parameters.

callable arg example: We create a custom word comparator, one that uses the Attention word comparator but has a 2 layer RNN following it. Since there are multiple inputs, we cannot use the standard torch.nn.Sequential, but we can instead use the dm.modules.MultiSequential utility module.

In [15]:
model = dm.MatchingModel(attr_summarizer=dm.attr_summarizers.Hybrid(
    word_comparator=lambda: dm.modules.MultiSequential(
        dm.word_comparators.Attention(),
        dm.modules.RNN(unit_type='gru', layers=2))))
model.initialize(train_dataset)  # Explicitly initialize model.
INFO:deepmatcher.core:Successfully initialized MatchingModel with 27153810 trainable parameters.

word_aggregator can be set to one of the following:

  • A string: One of the following string literals:
    • One of the styles supported by the dm.modules.Pool module suffixed by '-pool', e.g., 'avg-pool', 'sif-pool', 'divsqrt-pool', etc.
      • Equivalent to setting word_aggregator = dm.word_aggregators.Pool(<pool_type>)
      • E.g., 'avg-pool' is equivalent to setting word_aggregator = dm.word_aggregators.Pool('avg')
    • 'attention-with-rnn': Equivalent to setting word_aggregator = dm.word_aggregators.AttentionWithRNN()
  • A callable: Put simply, a function that returns a PyTorch Module.
    • Input to module: Two 3d tensors of shape (batch, seq1_len, input_size) and (batch, seq2_len, input_size). The tensors will be wrapped within AttrTensors which will contain metadata about the batch.
    • Expected output from module: One 2d tensor of shape (batch, output_size), wrapped within an AttrTensor (with the same metadata information as the first input tensor). This output must be the aggregation of the sequence of vectors in the first input (primary input), optionally taking into account the context input, i.e., the second input. output_size need not be the same as input_size.

We show some examples of how to customize word aggregators for Hybrid attribute summarization modules (dm.attr_summarizers.Hybrid) below, but these are also applicable to other attribute summarizers:

string arg example:

In [16]:
model = dm.MatchingModel(
    attr_summarizer=dm.attr_summarizers.Hybrid(word_aggregator='max-pool'))
model.initialize(train_dataset)  # Explicitly initialize model.
INFO:deepmatcher.core:Successfully initialized MatchingModel with 11616202 trainable parameters.

dm.WordAggregator arg example:

In [17]:
model = dm.MatchingModel(
    attr_summarizer = dm.attr_summarizers.Hybrid(
        word_aggregator=dm.word_aggregators.AttentionWithRNN(
            rnn='lstm', rnn_pool_style='max')))
model.initialize(train_dataset)  # Explicitly initialize model.
INFO:deepmatcher.core:Successfully initialized MatchingModel with 21729810 trainable parameters.

callable arg example: We create a custom word aggregator, one that concatenates the average and max of the given input sequence. We also use dm.modules.Lambda and dm.modules.NoMeta as in earlier examples.

In [18]:
my_word_aggregator_module = dm.modules.NoMeta(dm.modules.Lambda(
    lambda x, y: torch.cat((x.mean(dim=1), x.max(dim=1)[0]), dim=-1)))

# Next, create the matching model.
model = dm.MatchingModel(
    attr_summarizer = dm.attr_summarizers.Hybrid(word_aggregator=
        lambda: my_word_aggregator_module))
model.initialize(train_dataset)  # Explicitly initialize model.
INFO:deepmatcher.core:Successfully initialized MatchingModel with 11856202 trainable parameters.

3. Classifier

This component takes the attribute similarity representations and uses those as features for a classifier that determines whether the input tuple pair refers to the same real-world entity.

Customizing Classifier

The classifier can be customized by specifying the classifier parameter as follows:

In [19]:
model = dm.MatchingModel(classifier='3-layer-residual-relu')
model.initialize(train_dataset)  # Explicitly initialize model.
INFO:deepmatcher.core:Successfully initialized MatchingModel with 17938410 trainable parameters.

classifier can be set to one of the following:

  • A string: A valid style string supported by the dm.modules.Transform module.
  • An instance of dm.Classifier.
  • A callable: Put simply, a function that returns a PyTorch Module. The module must take in one vector as input and produce the log probabilities of non-match and match as output. Two outputs are used instead of one to work around a numerical stability issue in torch.
    • Input to module: One 2d tensor of shape (batch, input_size).
    • Expected output from module: One 2d tensor of shape (batch, 2). The second dimension must contain the non-match and match class log probabilities, in that order.

string arg example:

In [20]:
model = dm.MatchingModel(classifier='2-layer-highway-tanh')
model.initialize(train_dataset)  # Explicitly initialize model.
INFO:deepmatcher.core:Successfully initialized MatchingModel with 17757810 trainable parameters.

dm.Classifier arg example:

In [21]:
model = dm.MatchingModel(classifier=dm.Classifier(
    dm.modules.Transform('3-layer-residual', non_linearity=None, 
                         hidden_size=512)))
model.initialize(train_dataset)  # Explicitly initialize model.
INFO:deepmatcher.core:Successfully initialized MatchingModel with 19137482 trainable parameters.

callable arg example: We create a custom classifier, one that outputs 3 class log probabilities (e.g., for a 3-class task such as text entailment). Here we use dm.modules.Transform to build the layers and torch.nn.LogSoftmax to produce the log probabilities.

In [22]:
my_classifier_module = torch.nn.Sequential(
    dm.modules.Transform('2-layer-highway', hidden_size=300),
    dm.modules.Transform('1-layer', non_linearity=None, output_size=3),
    torch.nn.LogSoftmax(dim=1))

model = dm.MatchingModel(classifier=lambda: my_classifier_module)
model.initialize(train_dataset)  # Explicitly initialize model.
INFO:deepmatcher.core:Successfully initialized MatchingModel with 17758111 trainable parameters.

Design Note: Lazy Initialization

As mentioned earlier, deepmatcher follows a lazy initialization paradigm. This enables it to:

  1. Easily create clones of modules: These clones share the same structure but have their own separate trainable parameters.
  2. Automatically infer input sizes: In order to initialize the model, deepmatcher performs one full forward pass through the model. In this process, each component is initialized only after initializing all its parent modules in the computational graph. This makes automatic input size inference for modules possible. As a result, plugging in custom modules in the middle of the network is much easier as you do not have to manually compute the input size.
  3. Verify module output shapes: Having incorrect output shapes in custom modules can introduce subtle bugs that are difficult to catch. As part of initialization, deepmatcher verifies that all modules output tensors with the correct output shapes. This verification is done only once during initialization to avoid slowing down training.

It is because of reasons 1 and 2 above that deepmatcher does not permit custom modules to be specified directly and instead requires them to be specified via functions.
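As a small illustrative sketch (the attribute names in the comments are hypothetical), the callable acts as a factory: deepmatcher can invoke it once per attribute to obtain structurally identical modules with independent parameters.

# Illustration only: each call to the factory returns a fresh module instance,
# so per-attribute clones share structure but not parameters.
def make_word_aggregator():
    return dm.word_aggregators.Pool('avg')

aggregator_for_song_name = make_word_aggregator()    # e.g., for a 'Song_Name' attribute
aggregator_for_artist_name = make_word_aggregator()  # e.g., for an 'Artist_Name' attribute
assert aggregator_for_song_name is not aggregator_for_artist_name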

The core module that enables lazy initialization is dm.modules.LazyModule, which is a base class for most modules in deepmatcher.