This IPython notebook shows some features of the Python Nazca library:
Once you have created your datasets and defined your preprocessings and blockings, you can use the BaseAligner object to perform the alignment.
The BaseAligner is defined as:
class BaseAligner(object):

    def register_ref_normalizer(self, normalizer):
        """ Register normalizers to be applied before alignment """

    def register_target_normalizer(self, normalizer):
        """ Register normalizers to be applied before alignment """

    def register_blocking(self, blocking):
        self.blocking = blocking

    def align(self, refset, targetset, get_matrix=True):
        """ Perform the alignment on the referenceset and the targetset """

    def get_aligned_pairs(self, refset, targetset, unique=True):
        """ Get the pairs of aligned elements """
The align() function returns the global distance matrix and the matched elements as a dictionary, whose keys are the indices of the reference records and whose values are the lists of aligned target set records.
from nazca.utils.distances import GeographicalProcessing
from nazca.rl.aligner import BaseAligner
refset = [['R1', 'ref1', (6.14194444444, 48.67)],
          ['R2', 'ref2', (6.2, 49)],
          ['R3', 'ref3', (5.1, 48)],
          ['R4', 'ref4', (5.2, 48.1)]]
targetset = [['T1', 'target1', (6.17, 48.7)],
             ['T2', 'target2', (5.3, 48.2)],
             ['T3', 'target3', (6.25, 48.91)]]
processings = (GeographicalProcessing(2, 2, units='km'),)
aligner = BaseAligner(threshold=30, processings=processings)
mat, matched = aligner.align(refset, targetset)
print(mat)
print(matched)
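To make the two return values concrete, here is a minimal sketch in plain Python that does not use Nazca. It assumes the coordinate tuples are (latitude, longitude) in degrees and uses the haversine formula, which may differ from what GeographicalProcessing computes internally. The names haversine_km and naive_align are hypothetical helpers, not part of the library.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.009  # mean Earth radius, an assumption of this sketch


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (a[0], a[1], b[0], b[1]))
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def naive_align(refset, targetset, attr_index, threshold):
    """Return (matrix, matched) for an exhaustive pairwise comparison."""
    # matrix[i][j] is the distance between reference i and target j
    matrix = [[haversine_km(ref[attr_index], target[attr_index])
               for target in targetset] for ref in refset]
    # matched maps a reference index to the target indices under the threshold
    matched = {}
    for i, row in enumerate(matrix):
        close = [j for j, dist in enumerate(row) if dist <= threshold]
        if close:
            matched[i] = close
    return matrix, matched


refset = [['R1', 'ref1', (6.14194444444, 48.67)],
          ['R2', 'ref2', (6.2, 49)],
          ['R3', 'ref3', (5.1, 48)],
          ['R4', 'ref4', (5.2, 48.1)]]
targetset = [['T1', 'target1', (6.17, 48.7)],
             ['T2', 'target2', (5.3, 48.2)],
             ['T3', 'target3', (6.25, 48.91)]]

mat, matched = naive_align(refset, targetset, attr_index=2, threshold=30)
for ref_index in sorted(matched):
    print(ref_index, matched[ref_index])
```

In this sketch, a reference record with no target under the threshold does not appear in the dictionary at all.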
The get_aligned_pairs() method directly yields the aligned pairs found, along with their distance.
aligner = BaseAligner(threshold=30, processings=processings)
for pair in aligner.get_aligned_pairs(refset, targetset):
    print(pair)
We can plug in the preprocessings using register_ref_normalizer() and register_target_normalizer(), and the blocking using register_blocking(). Only ONE blocking is allowed, so you should use PipelineBlocking to combine multiple blockings.
import nazca.utils.normalize as nno
from nazca.rl import blocking as nrb
normalizer = nno.SimplifyNormalizer(attr_index=1)
blocking = nrb.KdTreeBlocking(ref_attr_index=2, target_attr_index=2, threshold=0.3)
aligner = BaseAligner(threshold=30, processings=processings)
aligner.register_ref_normalizer(normalizer)
aligner.register_target_normalizer(normalizer)
aligner.register_blocking(blocking)
for pair in aligner.get_aligned_pairs(refset, targetset):
    print(pair)
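The idea behind these two hooks can be sketched without Nazca: a normalizer rewrites one attribute of every record before comparison, and a blocking restricts the comparison to candidate records that fall in the same block. The sketch below uses a lowercasing normalizer and a coarse grid blocking as stand-ins. It is not how SimplifyNormalizer or KdTreeBlocking are implemented, and lowercase_attr and grid_blocks are hypothetical names.

```python
from collections import defaultdict


def lowercase_attr(records, attr_index):
    """Return copies of the records with one attribute lowercased."""
    normalized = []
    for record in records:
        record = list(record)  # copy so the input is left untouched
        record[attr_index] = record[attr_index].lower()
        normalized.append(record)
    return normalized


def grid_blocks(refset, targetset, attr_index, cell_size):
    """Yield (ref_indexes, target_indexes) for records sharing a grid cell."""
    cells = defaultdict(lambda: ([], []))
    for side, records in enumerate((refset, targetset)):
        for index, record in enumerate(records):
            x, y = record[attr_index]
            key = (int(x // cell_size), int(y // cell_size))
            cells[key][side].append(index)
    for key in sorted(cells):
        ref_indexes, target_indexes = cells[key]
        # Only cells holding records from both sets produce candidates
        if ref_indexes and target_indexes:
            yield ref_indexes, target_indexes


# Same coordinates as above; names capitalized to give the normalizer work
refset = [['R1', 'Ref1', (6.14194444444, 48.67)],
          ['R2', 'Ref2', (6.2, 49)],
          ['R3', 'Ref3', (5.1, 48)],
          ['R4', 'Ref4', (5.2, 48.1)]]
targetset = [['T1', 'Target1', (6.17, 48.7)],
             ['T2', 'Target2', (5.3, 48.2)],
             ['T3', 'Target3', (6.25, 48.91)]]

clean_refset = lowercase_attr(refset, 1)
clean_targetset = lowercase_attr(targetset, 1)
blocks = list(grid_blocks(clean_refset, clean_targetset, 2, cell_size=1.0))
for ref_indexes, target_indexes in blocks:
    print(ref_indexes, target_indexes)
```

Only the records inside each block need to be compared. A fixed grid can separate close records that sit on either side of a cell boundary (R2 and T3 here), which a neighbour search such as a k-d tree avoids.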
The unique boolean can be set to False to get all the alignments, not just the unique one on the target set.
for pair in aligner.get_aligned_pairs(refset, targetset, unique=False):
    print(pair)
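The difference can be illustrated in plain Python on a hand-written matched dictionary. The sketch assumes that unique=True keeps, for each reference record, only the closest of its aligned target records. That is an assumption about the behaviour rather than something stated above, and iter_pairs and the distances are hypothetical, chosen only for illustration.

```python
def iter_pairs(matched, unique=True):
    """Yield (ref_index, target_index, distance) triples.

    `matched` maps a reference index to a list of (target_index, distance).
    """
    for ref_index in sorted(matched):
        candidates = sorted(matched[ref_index], key=lambda item: item[1])
        if unique:
            candidates = candidates[:1]  # keep only the closest target
        for target_index, distance in candidates:
            yield ref_index, target_index, distance


# Hypothetical alignment result: reference 0 is close to two targets
matched = {0: [(2, 29.1), (0, 4.5)], 1: [(2, 11.4)]}

unique_pairs = list(iter_pairs(matched))
all_pairs = list(iter_pairs(matched, unique=False))
print(unique_pairs)  # [(0, 0, 4.5), (1, 2, 11.4)]
print(all_pairs)     # [(0, 0, 4.5), (0, 2, 29.1), (1, 2, 11.4)]
```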
A pipeline of aligners can be created using PipelineAligner.
from nazca.utils.distances import LevenshteinProcessing, GeographicalProcessing
from nazca.rl.aligner import PipelineAligner
processings = (GeographicalProcessing(2, 2, units='km'),)
aligner_1 = BaseAligner(threshold=30, processings=processings)
processings = (LevenshteinProcessing(1, 1),)
aligner_2 = BaseAligner(threshold=1, processings=processings)
pipeline = PipelineAligner((aligner_1, aligner_2))
for pair in pipeline.get_aligned_pairs(refset, targetset):
    print(pair)
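One way to picture such a pipeline is as a cascade in which each aligner only sees the records that earlier aligners left unaligned. The sketch below implements that idea in plain Python. Whether PipelineAligner works exactly this way is an assumption, and run_pipeline, exact_match and prefix_match are hypothetical stand-ins for real aligners, applied to hypothetical records.

```python
def exact_match(ref, target):
    """Strict stage: names must be identical."""
    return ref[1] == target[1]


def prefix_match(ref, target):
    """Looser stage: names must share their first three characters."""
    return ref[1][:3] == target[1][:3]


def run_pipeline(matchers, refset, targetset):
    """Apply matchers in order; aligned records are removed between stages."""
    ref_left = list(range(len(refset)))
    target_left = list(range(len(targetset)))
    pairs = []
    for matcher in matchers:
        for i in list(ref_left):
            for j in list(target_left):
                if matcher(refset[i], targetset[j]):
                    pairs.append((refset[i][0], targetset[j][0]))
                    # Aligned records are hidden from the later stages
                    ref_left.remove(i)
                    target_left.remove(j)
                    break
    return pairs


# Hypothetical records: [identifier, name]
refset = [['R1', 'paris'], ['R2', 'lyon'], ['R3', 'marseille']]
targetset = [['T1', 'marseilles'], ['T2', 'paris'], ['T3', 'nice']]

pairs = run_pipeline((exact_match, prefix_match), refset, targetset)
print(pairs)  # [('R1', 'T2'), ('R3', 'T1')]
```

Putting the strictest comparison first means the looser stages only have to deal with the records that are still unaligned.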