#!/usr/bin/env python
# coding: utf-8

# This quickstart guide explains how to join two tables A and B using edit distance measure. First, you need to import the required packages as follows (if you have installed **py_stringsimjoin** it will automatically install the dependencies **py_stringmatching** and **pandas**):

# In[20]:


# Import libraries
import py_stringsimjoin as ssj
import py_stringmatching as sm
import pandas as pd
import os, sys


# In[21]:


print('python version: ' + sys.version)
print('py_stringsimjoin version: ' + ssj.__version__)
print('py_stringmatching version: ' + sm.__version__)
print('pandas version: ' + pd.__version__)


# Joining two tables using edit distance measure typically consists of three steps:
# 1. Loading the input tables
# 2. Profiling the tables
# 3. Performing the join

# # 1. Loading the input tables

# We begin by loading the two tables. For the purpose of this guide, 
# we use the sample dataset that comes with the package. 

# In[22]:


# construct the path of the tables to be loaded. Since we are loading a 
# dataset from the package, we need to access the data from the path 
# where the package is installed. If you need to load your own data, you can directly
# provide your table path to the read_csv command.

table_A_path = os.sep.join([ssj.get_install_path(), 'datasets', 'data', 'person_table_A.csv'])
table_B_path = os.sep.join([ssj.get_install_path(), 'datasets', 'data', 'person_table_B.csv'])


# In[23]:


# Load csv files as dataframes.
A = pd.read_csv(table_A_path)
B = pd.read_csv(table_B_path)
print('Number of records in A: ' + str(len(A)))
print('Number of records in B: ' + str(len(B)))


# In[24]:


A


# In[25]:


B


# # 2. Profiling the tables

# Before performing the join, we may want to profile the tables to 
# know about the characteristics of the attributes. This can help identify:
# 
# a) unique attributes in the table which can be used as key attribute when performing 
#    the join. A key attribute is needed to uniquely identify a tuple. 
#    
# b) the number of missing values present in each attribute. This can 
#    help you in deciding the attribute on which to perform the join. 
#    For example, an attribute with a lot of missing values may not be a good 
#    join attribute. Further, based on the missing value information you 
#    need to decide on how to handle missing values when performing the join 
#    (See the section below on 'Handling missing values' to know more about
#    the options available for handling missing values when performing the join).
#    
# You can profile the attributes in a table using the following command:

# In[26]:


# profile attributes in table A
ssj.profile_table_for_join(A)


# In[27]:


# profile attributes in table B
ssj.profile_table_for_join(B)


# If the input tables does not contain any key attribute, then you need 
# to create a key attribute. In the current example, both the input tables
# A and B have key attributes, and hence you can proceed to the next step.
# In the case the table does not have any key attribute, you can 
# add a key attribute using the following command:

# In[28]:


B['new_key_attr'] = range(0, len(B))
B


# For the purpose of this guide, we will now join tables A and B on 
# 'name' attribute using edit distance measure. Next, we need to decide on what 
# threshold to use for the join. For this guide, we will use a threshold of 5. 
# Specifically, the join will now find tuple pairs from A and B such that 
# the edit distance over the 'name' attributes is at most 5.

# # 3. Performing the join

# The next step is to perform the edit distance join using the following command:

# In[29]:


# find all pairs from A and B such that the edit distance
# on 'name' is at most 5.
# l_out_attrs and r_out_attrs denote the attributes from the 
# left table (A) and right table (B) that need to be included in the output.

output_pairs = ssj.edit_distance_join(A, B, 'A.id', 'B.id', 'A.name', 'B.name', 5, 
                                      l_out_attrs=['A.name'], r_out_attrs=['B.name'])


# In[30]:


len(output_pairs)


# In[31]:


# examine the output pairs
output_pairs


# # Handling missing values

# By default, pairs with missing values are not included
# in the output. This is because a string with a missing value
# can potentially match with all strings in the other table and 
# hence the number of output pairs can become huge. If you want 
# to include pairs with missing value in the output, you need to 
# set the **allow_missing** flag to True, as shown below:

# In[32]:


output_pairs = ssj.edit_distance_join(A, B, 'A.id', 'B.id', 'A.name', 'B.name', 5, allow_missing=True, 
                                      l_out_attrs=['A.name'], r_out_attrs=['B.name'])


# In[33]:


output_pairs


# # Enabling parallel processing

# If you have multiple cores which you want to exploit for performing the 
# join, you need to use the **n_jobs** option. If n_jobs is -1, all CPUs 
# are used. If 1 is given, no parallel computing code is used at all, 
# which is useful for debugging and is the default option. For n_jobs below 
# -1, (n_cpus + 1 + n_jobs) are used (where n_cpus is the total number of 
# CPUs in the machine). Thus for n_jobs = -2, all CPUs but one are used. If 
# (n_cpus + 1 + n_jobs) becomes less than 1, then no parallel computing code 
# will be used (i.e., equivalent to the default).
# 
# The following command exploits all the cores available to perform the join:

# In[34]:


output_pairs = ssj.edit_distance_join(A, B, 'A.id', 'B.id', 'A.name', 'B.name', 5, 
                                      l_out_attrs=['A.name'], r_out_attrs=['B.name'], n_jobs=-1)


# In[35]:


len(output_pairs)


# You need to set n_jobs to 1 when you are debugging or you do not want 
# to use any parallel computing code. If you want to execute the join as 
# fast as possible, you need to set n_jobs to -1 which will exploit all 
# the CPUs in your machine. In case there are other concurrent processes 
# running in your machine and you do not want to halt them, then you may 
# need to set n_jobs to a value below -1.

# # Performing join on numeric attributes

# The join method expects the join attributes to be of string type. 
# If you need to perform the join over numeric attributes, then you need 
# to first convert the attributes to string type and then perform the join.
# For example, if you need to join 'A.zipcode' in table A with 'B.zipcode' in
# table B, you need to first convert the attributes to string type using 
# the following command:

# In[36]:


ssj.dataframe_column_to_str(A, 'A.zipcode', inplace=True)
ssj.dataframe_column_to_str(B, 'B.zipcode', inplace=True)


# Note that the above command preserves the NaN values while converting the numeric column to string type. Next, you can perform the join as shown below:

# In[37]:


output_pairs = ssj.edit_distance_join(A, B, 'A.id', 'B.id', 'A.zipcode', 'B.zipcode', 1, 
                                      l_out_attrs=['A.zipcode'], r_out_attrs=['B.zipcode'])
output_pairs


# # Additional options

# You can find all the options available for the edit distance 
# join function using the **help** command as shown below:

# In[ ]:


help(ssj.edit_distance_join)


# # More information

# Similar to edit distance measure, you can use the package to perform 
# join using other measures such as cosine, Dice, Jaccard, overlap and 
# overlap coefficient. For measures such as TF-IDF which are not 
# directly supported, you can perform the join using the filters provided 
# in the package. To know more about other join methods as well as how to 
# use filters, refer to the how-to guide (available from the 
# [package homepage](https://sites.google.com/site/anhaidgroup/projects/py_stringsimjoin)).