Data processing for labeled and unlabeled datasets involves operations such as reading the CSV files, tokenization, lowercasing, vocabulary construction, and word embedding lookup.
In deepmatcher, we aim to simplify and abstract away the complexity of data processing as much as possible. In some cases, however, you may need to customize some aspects of it. This tutorial is structured into four sections, each describing one kind of customization:

1. CSV file format
2. Tokenization and lowercasing
3. Word embeddings
4. Caching
As described in the getting started tutorial, each CSV file is assumed to have the following kinds of columns:

- Left attribute columns: Attributes of the left entity in each pair. These column names must share a common prefix, "left_" by default. The prefix can be customized using the left_prefix parameter (e.g., use "ltable_" as the prefix).
- Right attribute columns: Attributes of the right entity in each pair. These column names must share a common prefix, "right_" by default. The prefix can be customized using the right_prefix parameter (e.g., use "rtable_" as the prefix).
- Label column: Contains the match / non-match label for each pair. Its name, "label" by default, can be customized using the label_attr parameter.
- ID column: Contains a unique identifier for each pair. Its name, "id" by default, can be customized using the id_attr parameter.

An example of this is shown below:
import deepmatcher as dm
train, validation, test = dm.data.process(
path='sample_data/itunes-amazon',
train='train.csv',
validation='validation.csv',
test='test.csv',
ignore_columns=('left_id', 'right_id'),
left_prefix='left_',
right_prefix='right_',
label_attr='label',
id_attr='id')
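The call above relies on the default column naming. If your CSV files use different conventions, the same parameters can be pointed at your own column names. The snippet below is a hypothetical sketch: it assumes a dataset (not included in the sample data) whose attribute columns are prefixed with "ltable_" and "rtable_", whose label column is named "gold_label", and whose ID column is named "pair_id":

import deepmatcher as dm

# Hypothetical dataset with non-default column naming (directory and file names assumed).
train, validation, test = dm.data.process(
    path='my_custom_data',
    train='train.csv',
    validation='validation.csv',
    test='test.csv',
    left_prefix='ltable_',      # left attribute columns start with "ltable_"
    right_prefix='rtable_',     # right attribute columns start with "rtable_"
    label_attr='gold_label',    # label column is named "gold_label"
    id_attr='pair_id')          # ID column is named "pair_id"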
By default, data processing involves performing the following two modifications to all data:
Tokenization: Tokenization involves dividing text into a sequence of tokens, which roughly correspond to "words". E.g., "This ain't funny. It's actually hillarious." will be converted to the following sequence after tokenization: ['This', 'ain', "'t", 'funny', '.', 'It', "'s", 'actually', 'hillarious', '.']. The tokenizer can be set by specifying the tokenize parameter. By default, this is set to "nltk", which will use the default nltk tokenizer. Alternatively, you may set this to "spacy", which will use the tokenizer provided by the spacy package. You need to first install and set up spacy to do this.
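Installing spacy is independent of deepmatcher; in a notebook, a typical setup looks roughly like the following (these are standard spacy installation commands, not part of deepmatcher, and the exact model name to download depends on your spacy version):

# Install spacy and download an English model so its tokenizer can be used.
# Depending on your spacy version, the model may be named "en" or "en_core_web_sm".
!pip install spacy
!python -m spacy download en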
import deepmatcher as dm
train, validation, test = dm.data.process(
path='sample_data/itunes-amazon',
train='train.csv',
validation='validation.csv',
test='test.csv',
ignore_columns=('left_id', 'right_id'),
tokenize='spacy')
Rebuilding data cache because: {'Field arguments have changed.'}
Load time: 0.747850532643497
Vocab time: 13.160778391174972
Metadata time: 0.3041890738531947
Cache time: 0.5588693134486675
Lowercasing: By default, all data is lowercased to improve generalization. This can be disabled by setting the lowercase parameter to False.
train, validation, test = dm.data.process(
path='sample_data/itunes-amazon',
train='train.csv',
validation='validation.csv',
test='test.csv',
ignore_columns=('left_id', 'right_id'),
lowercase=False)
Rebuilding data cache because: {'Field arguments have changed.'}
Load time: 0.8287883093580604
Vocab time: 0.07724822405725718
Metadata time: 0.2948675565421581
Cache time: 0.5322669483721256
Word embeddings are obtained using fastText by default. This can be customized by setting the embeddings parameter to any of the pre-trained word embedding models described in the API docs under the embeddings parameter. We list a few common settings for embeddings below:
- fasttext.en.bin: Uncased character-level fastText embeddings trained on English Wikipedia. This is the default.
- fasttext.crawl.vec: Uncased word-level fastText embeddings trained on Common Crawl.
- fasttext.wiki.vec: Uncased word-level fastText embeddings trained on English Wikipedia and news data.
- glove.42B.300d: Uncased GloVe word embeddings trained on Common Crawl.
- glove.6B.300d: Uncased GloVe word embeddings trained on English Wikipedia and news data.

An example follows:
train, validation, test = dm.data.process(
path='sample_data/itunes-amazon',
train='train.csv',
validation='validation.csv',
test='test.csv',
ignore_columns=('left_id', 'right_id'),
embeddings='glove.42B.300d')
Rebuilding data cache because: {'Field arguments have changed.'}
INFO:torchtext.vocab:Downloading vectors from http://nlp.stanford.edu/data/glove.42B.300d.zip
/u/s/i/sidharth/.vector_cache/glove.42B.300d.zip: 0.00B [00:00, ?B/s]
Load time: 0.6479603135958314
/u/s/i/sidharth/.vector_cache/glove.42B.300d.zip: 1.88GB [04:29, 6.96MB/s]
INFO:torchtext.vocab:Extracting vectors into /u/s/i/sidharth/.vector_cache
INFO:torchtext.vocab:Loading vectors from /u/s/i/sidharth/.vector_cache/glove.42B.300d.txt
100%|██████████| 1917494/1917494 [02:42<00:00, 11825.27it/s]
INFO:torchtext.vocab:Saving vectors to /u/s/i/sidharth/.vector_cache/glove.42B.300d.txt.pt
Vocab time: 647.3909602137282
Metadata time: 0.33549712132662535
Cache time: 0.5913551291450858
In order to avoid re-downloading pre-trained embeddings during each run, the downloads are saved to a shared directory which serves as a cache for word embeddings. By default this is set to ~/.vector_cache, but you may customize this location by setting the embeddings_cache_path parameter as follows:
# First, remove data cache file. Otherwise `embeddings_cache_path` won't be used - cache
# already contains embeddings information.
!rm -f sample_data/itunes-amazon/*.pth
# Also reset in-memory vector cache. Otherwise in-memory fastText embeddings will be used
# instead of loading them from disk.
dm.data.reset_vector_cache()
# Then, re-process.
train, validation, test = dm.data.process(
path='sample_data/itunes-amazon',
train='train.csv',
validation='validation.csv',
test='test.csv',
ignore_columns=('left_id', 'right_id'),
embeddings_cache_path='~/custom_embeddings_cache_dir')
INFO:deepmatcher.data.field:Downloading vectors from https://drive.google.com/uc?export=download&id=1Vih8gAmgBnuYDxfblbT94P6WjB7s1ZSh
Load time: 0.6334623005241156
downloading from Google Drive; may take a few minutes
/afs/cs.wisc.edu/u/s/i/sidharth/private/deepmatcher/deepmatcher/data/field.py:62: ResourceWarning: unclosed <socket.socket fd=53, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('128.105.19.33', 59380), raddr=('172.217.4.110', 443)>
  download_from_url(url, self.destination)
/afs/cs.wisc.edu/u/s/i/sidharth/private/deepmatcher/deepmatcher/data/field.py:62: ResourceWarning: unclosed <socket.socket fd=54, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('128.105.19.33', 48138), raddr=('172.217.4.97', 443)>
  download_from_url(url, self.destination)
INFO:deepmatcher.data.field:Extracting vectors into /u/s/i/sidharth/custom_embeddings_cache_dir
Vocab time: 435.4614539416507
Metadata time: 0.34553063195198774
Cache time: 0.6381127825006843
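To confirm that the vectors were actually placed in the custom directory, a quick optional check is to list its contents; this is just an illustrative snippet and not part of deepmatcher:

import os

# List the files downloaded into the custom embeddings cache directory.
cache_dir = os.path.expanduser('~/custom_embeddings_cache_dir')
print(os.listdir(cache_dir))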
Processing data is time consuming. In order to reduce the time spent in processing, deepmatcher automatically caches the result of the data processing step for all labeled datasets. It is designed such that users do not need to be aware of its existence at all - you only need to call deepmatcher.data.process as you would normally. In the first such call, deepmatcher will do the processing and cache the result. In subsequent calls, unless there are changes that would necessitate re-processing the data (e.g., a modification to any of the data files or a change in the tokenizer used), deepmatcher will re-use the cached results. If there are changes that make the cache invalid, it will automatically rebuild the cache. The caching behavior can be customized by setting these parameters in the deepmatcher.data.process call:
- cache: The filename of the cache file, which will be stored in the same path as the data sets. This file will store the processed train, validation and test data, along with all relevant information about data processing (e.g., columns to ignore, tokenizer, lowercasing, etc.). This defaults to cacheddata.pth.
- check_cached_data: Whether to check that the contents of the cache file are consistent with the current data processing arguments. This defaults to True, and we strongly recommend against disabling the check.
- auto_rebuild_cache: Whether to automatically rebuild the cache if it becomes stale. If this is set to False and check_cached_data is enabled, then deepmatcher will throw an exception if the cache is stale. This defaults to True.

An example of using these parameters is shown below:
train, validation, test = dm.data.process(
path='sample_data/itunes-amazon',
train='train.csv',
validation='validation.csv',
test='test.csv',
ignore_columns=('left_id', 'right_id'),
cache='my_itunes_cache.pth',
check_cached_data=True,
auto_rebuild_cache=False)
Load time: 0.6869373824447393
Vocab time: 0.05685392860323191
Metadata time: 0.2930362243205309
Cache time: 0.5228710686787963
Now, when you change a processing parameter with auto_rebuild_cache set to False, you will get an error.
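For example, the call below changes the lowercase setting while keeping auto_rebuild_cache=False, so deepmatcher should refuse to silently rebuild the now-stale cache and raise an error instead. The exact exception class is internal to deepmatcher, so this sketch simply catches Exception:

try:
    train, validation, test = dm.data.process(
        path='sample_data/itunes-amazon',
        train='train.csv',
        validation='validation.csv',
        test='test.csv',
        ignore_columns=('left_id', 'right_id'),
        lowercase=False,  # changed processing parameter makes the cache stale
        cache='my_itunes_cache.pth',
        check_cached_data=True,
        auto_rebuild_cache=False)
except Exception as e:
    # With auto_rebuild_cache disabled, the stale cache is not rebuilt automatically.
    print('Cache is stale:', e)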