Data Processing

Data processing for labeled datasets involves the following operations:

  1. Read and preprocess train, validation and test CSV files (involves data alterations, e.g., tokenization).
  2. Compute global vocabulary (set of all unique words in all 3 datasets).
  3. Compute word embeddings for all words in vocabulary.
  4. Compute metadata on training set (e.g., word frequencies).
  5. Save all 3 processed datasets to cache along with data processing info.
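
All five of these steps are performed by a single deepmatcher.data.process call. Below is a minimal sketch using the sample dataset from this tutorial (the individual arguments are explained in the sections that follow):

import os
import deepmatcher as dm

# One call reads and tokenizes the CSV files, builds the vocabulary, looks up
# embeddings, computes training-set metadata, and caches the result.
train, validation, test = dm.data.process(
    path='sample_data/itunes-amazon',
    train='train.csv',
    validation='validation.csv',
    test='test.csv',
    ignore_columns=('left_id', 'right_id'))

# Step 5 writes the processed datasets next to the CSV files; the default
# cache file name is 'cacheddata.pth'.
print(os.path.exists('sample_data/itunes-amazon/cacheddata.pth'))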

Data processing for unlabeled datasets involves the following operations:

  1. Read and preprocess the unlabeled CSV file (involves data alterations, e.g., tokenization).
  2. Compute vocabulary (set of all unique words in the unlabeled dataset).
  3. Compute word embeddings for words in vocabulary that are not present in labeled datasets.
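
Unlabeled data is handled by a separate call, deepmatcher.data.process_unlabeled. A minimal sketch, assuming `model` is a deepmatcher MatchingModel that has already been trained on the labeled datasets (the unlabeled file name here is just an example):

unlabeled = dm.data.process_unlabeled(
    path='sample_data/itunes-amazon/unlabeled.csv',
    trained_model=model,
    ignore_columns=('left_id', 'right_id'))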

In deepmatcher we aim to simplify and abstract away the complexity of data processing as much as possible. In some cases, however, you may need to customize some aspects of it.

This tutorial is structured into four sections, each describing one kind of customization:

  1. CSV format
  2. Data alterations
  3. Word embeddings
  4. Caching

1. CSV format

As described in the getting started tutorial, each CSV file is assumed to have the following kinds of columns:

  • "Left" attributes (required): Remember our goal is to match tuple pairs across two tables (e.g., table A and B). "Left" attributes are columns that correspond to the "left" tuple or the first tuple (in table A) in an tuple pair. These column names are expected to be prefixed with "left_" by default. This can be customized by setting the left_prefix parameter (e.g., use "ltable_" as the prefix).
  • "Right" attributes (required): "Right" attributes are columns that correspond to the "right" tuple or the second tuple (in table B) in an tuple pair. These column names are expected to be prefixed with "right_" by default. This can be customized by setting the right_prefix parameter (e.g., use "rtable_" as the prefix).
  • Label column (required for train, validation, test): Column containing the labels (match or non-match) for each tuple pair. Expected to be named "label" by default. This can be customized by setting the label_attr parameter.
  • ID column (required): Column containing a unique ID for each tuple pair. This is for evaluation convenience. Expected to be named "id" by default. This can be customized by setting the id_attr parameter.

An example of this is shown below:

In [1]:
import deepmatcher as dm
train, validation, test = dm.data.process(
    path='sample_data/itunes-amazon',
    train='train.csv',
    validation='validation.csv',
    test='test.csv',
    ignore_columns=('left_id', 'right_id'),
    left_prefix='left_',
    right_prefix='right_',
    label_attr='label',
    id_attr='id')

2. Data alterations

By default, data processing involves performing the following two modifications to all data:

Tokenization: Tokenization involves dividing text into a sequence of tokens, which roughly correspond to "words". E.g., "This ain't funny. It's actually hilarious." will be converted to the following sequence after tokenization: ['This', 'ain', "'t", 'funny', '.', 'It', "'s", 'actually', 'hilarious', '.']. The tokenizer can be set by specifying the tokenize parameter. By default, this is set to "nltk", which will use the default nltk tokenizer. Alternatively, you may set this to "spacy", which will use the tokenizer provided by the spacy package. You will need to install and set up spacy first in order to do this.
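
To get a quick feel for what tokenization produces, you can run an nltk tokenizer directly on a string. A minimal sketch, assuming nltk and its "punkt" tokenizer data are installed (the exact splits may differ slightly from deepmatcher's internal tokenizer):

import nltk

nltk.download('punkt', quiet=True)  # fetch tokenizer models if not already present
print(nltk.word_tokenize("This ain't funny. It's actually hilarious."))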

In [2]:
import deepmatcher as dm
train, validation, test = dm.data.process(
    path='sample_data/itunes-amazon',
    train='train.csv',
    validation='validation.csv',
    test='test.csv',
    ignore_columns=('left_id', 'right_id'),
    tokenize='spacy')
Rebuilding data cache because: {'Field arguments have changed.'}
Load time: 0.747850532643497
Vocab time: 13.160778391174972
Metadata time: 0.3041890738531947
Cache time: 0.5588693134486675

Lowercasing: By default all data is lowercased to improve generalization. This can be disabled by setting the lowercase parameter to False.

In [3]:
train, validation, test = dm.data.process(
    path='sample_data/itunes-amazon',
    train='train.csv',
    validation='validation.csv',
    test='test.csv',
    ignore_columns=('left_id', 'right_id'),
    lowercase=False)
Rebuilding data cache because: {'Field arguments have changed.'}
Load time: 0.8287883093580604
Vocab time: 0.07724822405725718
Metadata time: 0.2948675565421581
Cache time: 0.5322669483721256

3. Word Embeddings

Word embeddings are obtained using fastText by default. This can be customized by setting the embeddings parameter to any of the pre-trained word embedding models listed in the API docs for that parameter. We list a few common settings for embeddings below:

  • fasttext.en.bin: Uncased character-level fastText embeddings trained on English Wikipedia. This is the default.
  • fasttext.crawl.vec: Uncased word-level fastText embeddings trained on Common Crawl.
  • fasttext.wiki.vec: Uncased word-level fastText embeddings trained on English Wikipedia and news data.
  • glove.42B.300d: Uncased GloVe word embeddings trained on Common Crawl.
  • glove.6B.300d: Uncased GloVe word embeddings trained on English Wikipedia and news data.

An example follows:

In [7]:
train, validation, test = dm.data.process(
    path='sample_data/itunes-amazon',
    train='train.csv',
    validation='validation.csv',
    test='test.csv',
    ignore_columns=('left_id', 'right_id'),
    embeddings='glove.42B.300d')
Rebuilding data cache because: {'Field arguments have changed.'}
INFO:torchtext.vocab:Downloading vectors from http://nlp.stanford.edu/data/glove.42B.300d.zip
/u/s/i/sidharth/.vector_cache/glove.42B.300d.zip: 0.00B [00:00, ?B/s]
Load time: 0.6479603135958314
/u/s/i/sidharth/.vector_cache/glove.42B.300d.zip: 1.88GB [04:29, 6.96MB/s]                               
INFO:torchtext.vocab:Extracting vectors into /u/s/i/sidharth/.vector_cache
INFO:torchtext.vocab:Loading vectors from /u/s/i/sidharth/.vector_cache/glove.42B.300d.txt
100%|██████████| 1917494/1917494 [02:42<00:00, 11825.27it/s]
INFO:torchtext.vocab:Saving vectors to /u/s/i/sidharth/.vector_cache/glove.42B.300d.txt.pt
Vocab time: 647.3909602137282
Metadata time: 0.33549712132662535
Cache time: 0.5913551291450858

To avoid re-downloading pre-trained embeddings on each run, the downloads are saved to a shared directory that serves as a cache for word embeddings. By default this is ~/.vector_cache, but you may customize this location by setting the embeddings_cache_path parameter as follows:

In [8]:
# First, remove data cache file. Otherwise `embeddings_cache_path` won't be used - cache 
# already contains embeddings information.
!rm -f sample_data/itunes-amazon/*.pth

# Also reset in-memory vector cache. Otherwise in-memory fastText embeddings will be used 
# instead of loading them from disk.
dm.data.reset_vector_cache()

# Then, re-process.
train, validation, test = dm.data.process(
    path='sample_data/itunes-amazon',
    train='train.csv',
    validation='validation.csv',
    test='test.csv',
    ignore_columns=('left_id', 'right_id'),
    embeddings_cache_path='~/custom_embeddings_cache_dir')
INFO:deepmatcher.data.field:Downloading vectors from https://drive.google.com/uc?export=download&id=1Vih8gAmgBnuYDxfblbT94P6WjB7s1ZSh
Load time: 0.6334623005241156
downloading from Google Drive; may take a few minutes
/afs/cs.wisc.edu/u/s/i/sidharth/private/deepmatcher/deepmatcher/data/field.py:62: ResourceWarning: unclosed <socket.socket fd=53, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('128.105.19.33', 59380), raddr=('172.217.4.110', 443)>
  download_from_url(url, self.destination)
/afs/cs.wisc.edu/u/s/i/sidharth/private/deepmatcher/deepmatcher/data/field.py:62: ResourceWarning: unclosed <socket.socket fd=54, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('128.105.19.33', 48138), raddr=('172.217.4.97', 443)>
  download_from_url(url, self.destination)
INFO:deepmatcher.data.field:Extracting vectors into /u/s/i/sidharth/custom_embeddings_cache_dir
Vocab time: 435.4614539416507
Metadata time: 0.34553063195198774
Cache time: 0.6381127825006843

4. Caching

Processing data is time consuming. To reduce the time spent on processing, deepmatcher automatically caches the result of the data processing step for all labeled datasets. It is designed so that users do not need to be aware of its existence at all - you only need to call deepmatcher.data.process as you normally would. On the first such call, deepmatcher does the processing and caches the result. On subsequent calls, unless there are changes that necessitate re-processing the data, e.g., a modification to any of the data files or a change in the tokenizer used, deepmatcher re-uses the cached results. If there are changes that make the cache invalid, it will automatically rebuild the cache. The caching behavior can be customized by setting these parameters in the deepmatcher.data.process call:

  • cache: The filename of the cache file, which is stored in the same directory as the datasets. This file stores the processed train, validation and test data, along with all relevant information about data processing (e.g., columns to ignore, tokenizer, lowercasing, etc.). This defaults to cacheddata.pth.
  • check_cached_data: Whether to check the contents of the cache file to ensure its compatibility with the specified data processing arguments. This defaults to True, and we strongly recommend against disabling the check.
  • auto_rebuild_cache: Whether to automatically rebuild the cache file if the cache is stale, i.e., if the data processing arguments have changed in a way that makes the previously processed data invalid. If this is False and check_cached_data is enabled, then deepmatcher will throw an exception if the cache is stale. This defaults to True.

An example of using these parameters is shown below:

In [11]:
train, validation, test = dm.data.process(
    path='sample_data/itunes-amazon',
    train='train.csv',
    validation='validation.csv',
    test='test.csv',
    ignore_columns=('left_id', 'right_id'),
    cache='my_itunes_cache.pth',
    check_cached_data=True,
    auto_rebuild_cache=False)
Load time: 0.6869373824447393
Vocab time: 0.05685392860323191
Metadata time: 0.2930362243205309
Cache time: 0.5228710686787963

Now, if you change a processing parameter while auto_rebuild_cache is set to False, you will get an error instead of a silent cache rebuild.
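
For example, changing the lowercase setting while keeping auto_rebuild_cache=False should trigger this error. A minimal sketch (the except clause is deliberately broad, since the exact exception class is not specified here):

try:
    train, validation, test = dm.data.process(
        path='sample_data/itunes-amazon',
        train='train.csv',
        validation='validation.csv',
        test='test.csv',
        ignore_columns=('left_id', 'right_id'),
        lowercase=False,  # differs from the arguments used when the cache was built
        cache='my_itunes_cache.pth',
        check_cached_data=True,
        auto_rebuild_cache=False)
except Exception as e:
    # With a stale cache and auto_rebuild_cache=False, deepmatcher raises an
    # error instead of silently re-processing the data.
    print('Cache is stale:', e)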