Developing A Fuzzy String Strategy Pt. 2 - Clustering

In [2]:
Image(r'C:\Users\rcreedon\Dropbox\Rory Notes\Notes\JellyFish\StringStrategy\clusters.png')
Out[2]:

Author: Rory Creedon

Date: 5 December 2013

Purpose: Building upon a previous notebook (Fuzzy Strategy Part 1), this notebook continues to look at how we might match fuzzy style data strings (and by extension any set of strings). In particular the previous notebook looked at how to strip out generic components of strings in order that matching may happen on a less generic core ID substring. This notebook starts where we left off with the next step which is clustering the strings into matched 'groups'.

The question of clustering is somewhat technical in nature, and I had never looked into it before writing this notebook. Therefore the potential for error is quite large. If you find mistakes please do notify me. It seems possible that this notebook will become as much about clustering as it is about matching strings.

A good primer, that got me understanding the language of clustering is the following student dissertation:

http://members.tripod.com/asim_saeed/paper.htm

A lot of this notebook may be pretty much directly plagerised from that paper.

As ever this notebook has been developed for my work, and so the issues may not be totally intelligible to all readers (if there are any). However, I take pains to explain what I am doing, and so a careful reading will help those facing similar issues.

Comments/Questions - [email protected]

In [1]:
import pandas as pd
import numpy as np
import datetime
import os
from pandas import DataFrame
from numpy import nan as NA
from IPython.core.display import HTML
from IPython.core.display import Image
from IPython.display import Math
from IPython.display import Latex
import collections
import jellyfish as jf
import re
import random
import itertools
%qtconsole


Introduction to Clustering

Meaning of Clustering
From Wikipedia:

"Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters)."

So in the context of string matching, where more than one match may be possible per string this means clustering the data into groups of strings that are all potential matches for each other.

Clustering Methods
There are many different types of clustering method, and it is hard to know which is most appropriate for strings (hard at least for me to divine at this stage). Before looking at any particular methods for clustering I plagerise a brief introduction to the techniques that seem relevant to this particular problem (drawn from above referenced paper and wikipedia).

Clustering methods vary in their approach to partitioning. Partitioning refers to how a data set of N observations is partitioned into clusters. There are at least four different types of partitioning:

  1. Strict Partitioning - each object belongs to exactly one cluster
  2. Strict Partitioning with Outliers - as above but outliers can belong to no cluster and are therefore considered outliers
  3. Overlapping - (sometimes referred to as clumping) - objects may belong to more than one cluster
  4. Hierarchical - objects that belong to a child cluster also belong to the parent cluster

In all methods, an object is a member of the cluster(s) with which it is most similar (however that is to be defined). Generally each cluster has a centroid that represents the cluster. Typically this will be some summary description/statistic. The form of this statistic will be dependent upon the nature of the object being clustered.

When deciding upon which method is to be used, clearly the structure of the problem is the most pertinent factor.


Single Pass Partitioning Method

This is the simplest method I could find, and could well be sufficient for my purposes. It partitions the data strictly. The program can be described as follows:

  1. Make the first object the centroid for the first cluster
  2. For the next object, calculate the similarity with each existing cluster centroid
  3. If the highest calculated similarity is over some threshold value then add the object to the relevant cluster and re-determine the centroid of that cluster, otherwise use the object to initiate a new cluster. If any object remains to be clustered then return to step 2.

This method only requires a single pass through the data set which is desirable from a computational power point of view. One issue is that the clusters are not independent of the order in which the data are processed, which leads me to conclude that that data should be randomly shuffled before being processed.

In the method as applied here the similarity between objects is calculated as the Jaro-Distance.

Below is a function that I believe implements the partitioning method as described above. There are comments that are numbers that link to explanations below the code

In [3]:
def SLINK(SList, Threshold):
    #1
    random.shuffle(SList)
    Clusters = []
    Centroid = []
    Scores = []
    
    for string in SList:     
        SPScores = []
        Matched = 0
        
        #2
        if len(Clusters) == 0:
            Clusters.append([string])
            Centroid.append([string])
            Scores.append([])
            continue
        
        #3
        for ClustNum in xrange(len(Clusters)):
            Dist = jf.jaro_distance(string, Centroid[ClustNum][0])
            SPScores.append(Dist)
        
        #4
        MaxVal = max(SPScores)
        MaxInd = SPScores.index(max(SPScores))
        
        #5
        if MaxVal >= Threshold:
            Clusters[MaxInd].append(string)
            
            #6
            if len(Scores[MaxInd]) == 0:
                Scores[MaxInd].append(MaxVal)               
            else:
                #7
                if MaxVal > Scores[MaxInd][0]:
                    Scores[MaxInd][0] = MaxVal
                    Centroid[MaxInd][0] = string    
            Matched = 1
        
        #8
        if Matched ==0:       
            Clusters.append([string])
            Centroid.append([string])
            Scores.append([])
    
    return Clusters

The function accepts two arguments: Firstly a list of strings to be clustered, secondly the threshold which must be reached in order to consider a string matched. It returns a list of lists, where each list represents a cluster.

Comments are as follows:

  1. The shuffle command randomly orders the list of strings passed as the argument to the function. The objects 'Clusters' 'Centroid' and 'Scores' are list objects that will store information about the clusters created. The information will be stored as lists, which means these objects will ultimately be lists of lists. By ensuring that these objects are modified in the same manner at all stages of the function, it is possible to ensure that for cluster $i$ the Cluster itself, the cluster centroid, and the cluster centroid score will be located within the Cluster, Centroid and Scores objects (respectively) at index position $j$.
  2. This is the behaviour when the string being evaluated is the first such string. I.e. the length of the Clusters object is 0. In this case the first string is used to initialize the first cluster.
  3. Loop through the clusters and calculate the jaro-distance between the string being evaluated, and the centroid of each cluster and store the distance measures in a list object names SPScores
  4. Find the max value of the calculated jaro-distances, and the index position of that value in the SPScores list. This index value will also be the index value for the corresponding cluster, centroid and score in the list objects as discussed above
  5. If the max value is above a given threshold, then append the string being evaluated to the relevant cluster.
  6. If the length of the corresponding list item in the Scores object is equal to zero, then this is the first match that has been made with centroid of the relevant cluster. In that case, append the max value of the jaro distance to the empty list within Scores.
  7. If the length as discussed above is not zero, then this is not the first such match. If then the max jaro score for the string being evaluated is larger than the previous scored match, then the relevant list item is updated to reflect the increase in score, and the centroid for that cluster is set as the string being evaluated. This is an attempt to have the centroid as the 'truest' version of that string. However, this method is totally arbitrary with respect to the order that that strings are evaluated.
  8. If Matched = 0 then no jaro distance score was above the threshold for the string being evaluated. In this case the string then becomes the initializer of a new centroid.

Now, let's see this at work with some real data. The following are (modified) style names as observed in some work conducted here in Bangladesh. The above function will be used to group those style names into clusters of strings that potentially represent the same style, but have minor differences in the string value assigned to them.

Read in some data

In [4]:
Styles = list(pd.Series(pd.read_csv(r'C:\Users\rcreedon\Dropbox\GIZSupervisor\DATA\Production_Data\STP_Data\Data_Sets\Wave1\1005\1005_all_merged.csv')['style'].unique()))

Delete missing values, remove multiple white spaces, make lower case, remove special characters

In [5]:
del Styles[Styles.index(np.nan)]
Styles = [re.sub('[^A-Za-z0-9 ' ']+', '', style) for style in Styles]
Styles = [style.lower() for style in Styles]
Styles = [" ".join(style.split()) for style in Styles]
Styles
Out[5]:
['denim jacket',
 'denim vest',
 'jonas long',
 '2618 longer denim',
 'lilen trs',
 'ride',
 'tye dye jean',
 'jonas short',
 'jacket 2002',
 'dobby short',
 'parker chino',
 '550 pant',
 'ride 510',
 'chambrey117',
 'space treggins',
 'jhonny boxer',
 'new skinny',
 'clint',
 'todd jkt',
 'jacket 2001',
 'back to school',
 'jacket 2001',
 'reidun parka',
 'mixmatch',
 'reidun parkalining',
 'parko chino',
 'camden jkt',
 'cb32405510',
 'bomber jkt',
 '5 pkt cord',
 'shairt615',
 'shairt537',
 'shairt564',
 'love trs340',
 '1zb153tpant',
 '1zb153tjkt',
 'jacketswear',
 'jacketbk to schoolboys',
 'jacketbk to schoolgirls',
 'jacket253ke',
 '14011876']

Create a clusters object

In [6]:
Clusters = SLINK(Styles, 0.8)

Sort clusters object according to size of the elements

In [7]:
Clusters.sort(lambda x,y: cmp(len(y), len(x)))
Clusters
Out[7]:
[['jacket 2002', 'jacket 2001', 'jacket 2001'],
 ['shairt615', 'shairt537', 'shairt564'],
 ['jacketbk to schoolboys', 'back to school', 'jacketbk to schoolgirls'],
 ['1zb153tjkt', '1zb153tpant'],
 ['parker chino', 'parko chino'],
 ['reidun parkalining', 'reidun parka'],
 ['ride 510', 'ride'],
 ['denim jacket', 'denim vest'],
 ['jacketswear'],
 ['tye dye jean'],
 ['love trs340'],
 ['14011876'],
 ['bomber jkt'],
 ['2618 longer denim'],
 ['5 pkt cord'],
 ['todd jkt'],
 ['jacket253ke'],
 ['camden jkt'],
 ['lilen trs'],
 ['jonas short'],
 ['mixmatch'],
 ['chambrey117'],
 ['jonas long'],
 ['550 pant'],
 ['clint'],
 ['jhonny boxer'],
 ['cb32405510'],
 ['dobby short'],
 ['space treggins'],
 ['new skinny']]

It appears to have worked rather well. Admittedly I began with a prior that that threshold should be 0.85 (based on work done in previous notebooks), but this had to be lowered to achieve the above somewhat sensible clusters. Sensible, only in that they conform to what I would think of as potentially matched.

In fact this method seems to be pretty decent for what I am trying to do, therefore I am going to pause with the clustering methods, and move on to how the above method could be used in combination with the stripping of generic components method previously discussed.


Combining this Method with Generic Component Stripping

As discussed in the previous notebook, I believe that for the type of string I am working with, it is beneficial to strip out the generic components of the strings being work with. A recapitulation of that method is not going to be given here, but I will now combine the methods.

One of the somewhat unfortunate side effects of stripping away generic components, is that it is theoretically possible that the resulting list of strings will have multiple entries that are exactly the same. In order for the clustering to be useful it is important that after the clustering has happened, the original string can be identified in order that it can be modified if thought necessary. This means, some unbreakable link needs to be be formed between the original string and its stripped counterpart.

It is possible that this can be done using a hash table, however the method I prefer to explore here is of creating a new class of object that contains both the original string and the stripped version. This is achieved below:

In [8]:
class Stripped:
    'Common base class for all stripped stings'
    
    def __init__(self, original, GenericAll, GenericWhite = None, DelWhite = False):
        # Class attribute that is the string in its original format
       
        self.original = original
        StrVal = original.lower()
        StrVal = re.sub('[^A-Za-z0-9 ' ']+', ' ', StrVal)
        
        #strip out all occurences of sub-strings from GenericAll list that appear anywhere in the string
        for word in GenericAll:
            RegEx1 = re.compile('' + word)
            StrVal = re.sub(RegEx1, '', StrVal)
        
        # If provided as argument strip out all occurences of sub-string when that sub string is surrounded by 
        # whitespace (i.e. is not part of another substring-sequence)
        if not GenericWhite == None:
            for word in GenericWhite:
                RegEx2 = re.compile(r'\b' + word + r'\b')
                StrVal = re.sub(RegEx2, '', StrVal)
        
        # Removes special characters, removes all whitespace
        if DelWhite == True:
            StrVal = StrVal.replace(' ', '')
        
        # Class attribute that is the stipped string
        self.stripped = StrVal

Let's look at the components of this class object. Each time an instance of the class Stripped is created, the following arguments must be passed:

  • Original - this is the original string that is to be manipulated and stored as an attribute of the Stripped Class
  • GenericAll - this is a list of the words (i.e. sub-strings) that are to be stripped out of the string before the clustering occurs. These sub-strings will be stripped out no matter where they appear in the string sequence of characters (i.e. possibly from the middle of words/sub-strings) For information about to derive this list from the data, then see the previous notebook. In theory, this could all be done totally automatically, and this may be something I explore toward the end of this notebook.

The following arguments are optional:

  • GenericWhite - this is a list of the words (i.e sub-strings) that will be stripped out. Unlike the GenericAll list these sub strings will only be stripped out if they are surrounded by whitespace (i.e. they are not part of another sub string)
  • DelWhite - set DelWhite equal to True if it desired to compare strings that have no whitespace.

The StippedClass object has two attributes.

  1. The original string
  2. The stripped string

The stripped string is the original string in lower case with special characters removed, and all instances of the sub-strings listed in GenericAll removed. If GenericWhite is passed as an argument then those sub-strings are also removed (when standalone sub-sqtrings). If DelWhite = True then all whitespaces are also removed.

In order to use the SLINK function on this Stripped Class, some modifications are necessary:

In [9]:
def SlinkSC(ClassList, Threshold):
    #1
    random.shuffle(ClassList)
    
   
    Clusters = []
    ClustersStripped = []
    Centroid = []
    Scores = []
    
    for StrippedClass in ClassList:     
        SPScores = []
        Matched = 0
        
        if len(Clusters) == 0:
            Clusters.append([StrippedClass.original])
            ClustersStripped.append([StrippedClass.stripped])
            Centroid.append([StrippedClass.stripped, StrippedClass.original])
            Scores.append([])
            continue
        
        for ClustNum in xrange(len(Clusters)):
            Dist = jf.jaro_distance(StrippedClass.stripped, Centroid[ClustNum][0])
            SPScores.append(Dist)
        
        MaxVal = max(SPScores)
        MaxInd = SPScores.index(max(SPScores))
        
        if MaxVal >= Threshold:
            Clusters[MaxInd].append(StrippedClass.original)
            ClustersStripped[MaxInd].append(StrippedClass.stripped)
            
            if len(Scores[MaxInd]) == 0:
                Scores[MaxInd].append(MaxVal)               
            else:
                if MaxVal > Scores[MaxInd][0]:
                    Scores[MaxInd][0] = MaxVal
                    Centroid[MaxInd][0] = StrippedClass.stripped
                    Centroid[MaxInd][1] = StrippedClass.original
            Matched = 1
        
        if Matched ==0:       
            Clusters.append([StrippedClass.original])
            ClustersStripped.append([StrippedClass.stripped])
            Centroid.append([StrippedClass.stripped, StrippedClass.original])
            Scores.append([])
    
    return [Clusters, ClustersStripped, Centroid]

The function now accepts rather than a list of strings, a list of StrippedClass objects. The matching is done on the 'stripped' attribute of each object (i.e. the stripped strings). However, two cluster lists are now created and returned. The first one is a list of the clusters where the original string values are reported. It is primarily this one that will be of use. The second one is list of the cluster where the stripped string values are reported. This is primarily for information purposes.

The function returns a list of three lists, the first as noted above is the list of clusters that reports the original string names, the second is the list of clusters that reports the stripped string names, the third is the list of centroids in list format, the first element being the stripped version, the second the original string (to be used in setting naming conventions)

The method, including the stripping of generic components is now demonstrated with the same style data as above:

Step 1: Read Data & Determine Generic Components
NB for the sake of convenience I strip out special characters and make lower case at this stage, this is not strictly necessary as this will done when creating the Stripped Class objects

Read in some data, stip special characters etc. (as above)

In [10]:
Styles = list(pd.Series(pd.read_csv(r'C:\Users\rcreedon\Dropbox\GIZSupervisor\DATA\Production_Data\STP_Data\Data_Sets\Wave1\1005\1005_all_merged.csv')['style'].unique()))
del Styles[Styles.index(np.nan)]
Styles = [re.sub('[^A-Za-z0-9 ' ']+', '', style) for style in Styles]
Styles = [style.lower() for style in Styles]
Styles = [" ".join(style.split()) for style in Styles]

Create dictionary to determine generic components

In [11]:
WordDict = {}
for style in Styles:
    for word in style.split(' '):
        if word not in WordDict:
            WordDict[word] = 1
        else:
            WordDict[word] +=1
for word, value in WordDict.iteritems():
    if value > 1:
        print word, value
denim 3
reidun 2
2001 2
jacketbk 2
to 3
jkt 3
jonas 2
chino 2
short 2
ride 2
jacket 4

Step 2: Create Lists of Generic Words to be Stripped Out
NB in this instance I do not wish to create a GenericWhite list

In [12]:
GenericAll = ['denim', 'jkt', 'chino', 'short', 'jacket']

Step 3: Create list of Stripped Class Objects

In [13]:
ClassList = [Stripped(elem, GenericAll, DelWhite = True) for elem in Styles]

Step 4: Apply the SlinkSC Function to the list of Stripped Class Objects
NB I also separate out the lists, and sort them according to the length of the sub-lists

In [14]:
Clustered = SlinkSC(ClassList, 0.8)
ClustersOriginal = Clustered[0]
ClustersOriginal.sort(lambda x,y: cmp(len(y) , len(x)))
ClustersStripped = Clustered[1]
ClustersStripped.sort(lambda x,y: cmp(len(y) , len(x)))

Step 5: View the Clusters

In [15]:
#Original Clusters
ClustersOriginal[:8]
Out[15]:
[['jacket 2001', 'jacket 2001', 'jacket 2002'],
 ['shairt564', 'shairt537', 'shairt615'],
 ['parker chino', 'parko chino'],
 ['reidun parkalining', 'reidun parka'],
 ['1zb153tpant', '1zb153tjkt'],
 ['jonas short', 'jonas long'],
 ['ride 510', 'ride'],
 ['jacketbk to schoolboys', 'jacketbk to schoolgirls']]
In [16]:
#Stripped Clusers
ClustersStripped[:8]
Out[16]:
[['2001', '2001', '2002'],
 ['shairt564', 'shairt537', 'shairt615'],
 ['parker', 'parko'],
 ['reidunparkalining', 'reidunparka'],
 ['1zb153tpant', '1zb153t'],
 ['jonas', 'jonaslong'],
 ['ride510', 'ride'],
 ['bktoschoolboys', 'bktoschoolgirls']]

The clusters as above have been recreated, but the threshold was raised to 0.8 (this is not

So this system could be useful it seems. However, in order to investigate further, a more complex data set if needed. This is looked at below:


A More Complex Data Set

Read in some data

In [17]:
Bstyles = list(pd.read_csv('ExampleData.csv')['style'].unique())

List generic components to be removed (as per previous notebook)

In [18]:
GenericRemoveAll = ['bottom', 'boys', 'long', 'topb', 'tank', 'basic', 'polo', 'shorts', 'ssivts', \
               'top', 'sslvts', 'mens', 'nightware', 'lslvts', 'nightwear', 'msp', 'lsivts', 'tee', 'large', 'slv']

GenericWhiteSpace = ['l', 'pj', 's']

Create list of Stripped Class obejcts from the Bstyles data

In [19]:
# uses list comprehension
BClassList = [Stripped(elem, GenericRemoveAll, GenericWhiteSpace, True) for elem in Bstyles]

Apply function to the list of Stipped Class objects

In [20]:
BClustersSC = SlinkSC(BClassList, 0.85)
# Give separate names to each list in the ClusterSC object
BClustersOriginal = BClustersSC[0]
BClustersOriginal.sort(lambda x,y: cmp(len(y) , len(x)))
BClustersStripped = BClustersSC[1]
BClustersStripped.sort(lambda x,y: cmp(len(y) , len(x)))

Display first 5 clusters in original format

In [21]:
BClustersOriginal[:5]
Out[21]:
[['SR-560 SR-564 S-Siv-T-S',
  'SR-542 SR-565 S-Siv-T-S',
  'SR-560 S-Siv-T-S',
  'SR-567 S-Siv-T-S',
  'SR-568 L-Siv-T-S',
  'SR-562 S-Siv-T-S',
  'SR-564 S-Siv-T-S',
  'SR-561 S-Siv-T-S'],
 ['PJ-49131B Pj-49131 Nightware Bottom',
  'PJ-49131B PJ-49131 Nightware Bottom',
  'PJ-49131Tb Pj-49131 Nightware Top_b',
  'PJ-49131Tb PJ-49130/ PJ-2912 Nightware Top_b',
  'PJ-49131Tb PJ-49130/2913 Nightware Top_b',
  'PJ-49131Tb PJ-49130 Nightware Top_b',
  'PJ-49131Tb Pj-49131/2916 Nightware Top_b',
  'PJ-49131B Pj-49131 /pj-2913 Nightware Bottom'],
 ['UWC-931-006 194 L-Slv-T-S',
  'UWC-931-008 194 L-Slv-T-S',
  'UWC-933-006 194 L-Slv-T-S',
  'UWC-931-006 194/192 L-Slv-T-S S-Slv-T-S',
  'UWC-933-006 192/194 S-Slv-T-S L-Slv-T-S',
  'UWC-931-008 194 S-Slv-T-S/ L-Slv-T-S',
  'UWC-931-008 194/192 L-Slv-T-S/ S-Slv-T-S',
  'UWC-931-008 192/194 S-Slv-T-S L-Slv-T-S'],
 ['UWC-932-010 192/194 S-Slv-T-S L-Slv-T-S',
  'UWC-931-008 194/192 L-Slv-T-S S-Slv-T-S',
  'UWC-932-010 192 /M.S.P S-Slv-T-S /L-Slv-T-S',
  'UWC-931-008 192/194 S-Slv-T-S/ L-Slv-T-S',
  'UWC-933-006 194/192 L-Slv-T-S S-Slv-T-S',
  'UWC-932-010 192 S-Slv-T-S',
  'UWC-932-010 192/Serafeno S-Slv-T-S L-Slv-T-S',
  'UWC-931-008 192 S-Slv-T-S'],
 ['674262 Ronny S-Slv-T-S',
  '674265 Ronny S-Slv-T-S',
  '674263 Ronny S-Slv-T-S',
  '674262 Ronny S-Siv-T-S',
  '674261 Ronny S-Siv-T-S',
  '674262 Ronny Cocoon S-Siv-T-S Bottom',
  '694265 Ronny S-Slv-T-S',
  '674268 Ronny S-Siv-T-S']]

Display first 5 clusters in stripped format

In [22]:
BClustersStripped[:5]
Out[22]:
[['sr560sr564sivt',
  'sr542sr565sivt',
  'sr560sivt',
  'sr567sivt',
  'sr568sivt',
  'sr562sivt',
  'sr564sivt',
  'sr561sivt'],
 ['49131b49131',
  '49131b49131',
  '49131tb49131b',
  '49131tb491302912b',
  '49131tb491302913b',
  '49131tb49130b',
  '49131tb491312916b',
  '49131b491312913'],
 ['uwc931006194t',
  'uwc931008194t',
  'uwc933006194t',
  'uwc931006194192tt',
  'uwc933006192194tt',
  'uwc931008194tt',
  'uwc931008194192tt',
  'uwc931008192194tt'],
 ['uwc932010192194tt',
  'uwc931008194192tt',
  'uwc932010192mptt',
  'uwc931008192194tt',
  'uwc933006194192tt',
  'uwc932010192t',
  'uwc932010192serafenott',
  'uwc931008192t'],
 ['674262ronnyt',
  '674265ronnyt',
  '674263ronnyt',
  '674262ronnysivt',
  '674261ronnysivt',
  '674262ronnycocoonsivt',
  '694265ronnyt',
  '674268ronnysivt']]

Interim Conclusions

So how useful does this prove?

I think the answer to that depends to a great deal on the nature of the data being examined. If deviations from the 'true' string are expected to be small, the size of the clusters small etc. then this system could work quite smoothly. When the data are simple I think this system will get good results. Also, the clusters in small data sets can be viewed with the eye and judges as to suitability.

When the data are more complex, as above, the results are somewhat more debatable. For example, it seems likely that the 'polo' cluster above truly represents the same style. However it is not clear that the small differences in the ID component (such as a different digit) are not representative of genuine differences.

One possible strategy to deal with this is to simply increase the threshold of what is considered a matched string as below:

In [23]:
BLClustersSC = SlinkSC(BClassList, 0.95)
# Give separate names to each list in the ClusterSC object
BLClustersOriginal = BClustersSC[0]
BLClustersOriginal.sort(lambda x,y: cmp(len(y) , len(x)))
BLClustersStripped = BClustersSC[1]
BLClustersStripped.sort(lambda x,y: cmp(len(y) , len(x)))
In [24]:
BLClustersOriginal[:5]
Out[24]:
[['SR-560 SR-564 S-Siv-T-S',
  'SR-542 SR-565 S-Siv-T-S',
  'SR-560 S-Siv-T-S',
  'SR-567 S-Siv-T-S',
  'SR-568 L-Siv-T-S',
  'SR-562 S-Siv-T-S',
  'SR-564 S-Siv-T-S',
  'SR-561 S-Siv-T-S'],
 ['PJ-49131B Pj-49131 Nightware Bottom',
  'PJ-49131B PJ-49131 Nightware Bottom',
  'PJ-49131Tb Pj-49131 Nightware Top_b',
  'PJ-49131Tb PJ-49130/ PJ-2912 Nightware Top_b',
  'PJ-49131Tb PJ-49130/2913 Nightware Top_b',
  'PJ-49131Tb PJ-49130 Nightware Top_b',
  'PJ-49131Tb Pj-49131/2916 Nightware Top_b',
  'PJ-49131B Pj-49131 /pj-2913 Nightware Bottom'],
 ['UWC-931-006 194 L-Slv-T-S',
  'UWC-931-008 194 L-Slv-T-S',
  'UWC-933-006 194 L-Slv-T-S',
  'UWC-931-006 194/192 L-Slv-T-S S-Slv-T-S',
  'UWC-933-006 192/194 S-Slv-T-S L-Slv-T-S',
  'UWC-931-008 194 S-Slv-T-S/ L-Slv-T-S',
  'UWC-931-008 194/192 L-Slv-T-S/ S-Slv-T-S',
  'UWC-931-008 192/194 S-Slv-T-S L-Slv-T-S'],
 ['UWC-932-010 192/194 S-Slv-T-S L-Slv-T-S',
  'UWC-931-008 194/192 L-Slv-T-S S-Slv-T-S',
  'UWC-932-010 192 /M.S.P S-Slv-T-S /L-Slv-T-S',
  'UWC-931-008 192/194 S-Slv-T-S/ L-Slv-T-S',
  'UWC-933-006 194/192 L-Slv-T-S S-Slv-T-S',
  'UWC-932-010 192 S-Slv-T-S',
  'UWC-932-010 192/Serafeno S-Slv-T-S L-Slv-T-S',
  'UWC-931-008 192 S-Slv-T-S'],
 ['674262 Ronny S-Slv-T-S',
  '674265 Ronny S-Slv-T-S',
  '674263 Ronny S-Slv-T-S',
  '674262 Ronny S-Siv-T-S',
  '674261 Ronny S-Siv-T-S',
  '674262 Ronny Cocoon S-Siv-T-S Bottom',
  '694265 Ronny S-Slv-T-S',
  '674268 Ronny S-Siv-T-S']]

This makes what may in the end turn out to be more sensible clusters, but possibly increasing the threshold means that it is possible that too many clusters are created, thereby creating a situation where true matches are not clustered together. Notice the difference in the number of clusters at the different thresholds below:

In [25]:
# Number of clusters with threshold at 0.85
print len([elem for elem in BClustersOriginal if len(elem) > 1])

# Number of clusters with threshold at 0.95
print len([elem for elem in BLClustersOriginal if len(elem) > 1])
101
101

A second strategy that could prove useful is to use another data point to verify the clusters. This will be particularly useful when clustering strings that have other attributes that are consistent when the strings may be inconsistent. One possible strategy would be to set the threshold iteratively until such time as the maximum number of matches can be made within cluster by verifying with an alternative data point.

For the data currently being studied, the SMV would be a good candidate. For name matching for example, age or similar would be a sensible candidate.


Confirming Clusters with Alternate Varible

Read in some data

In [26]:
DF = DataFrame(pd.read_csv('ExampleData.csv'))

Before proceeding to look at possible functions, let's look at an example. Within the second cluster of the data, the unique SMVs for each of the styles that have been clustered together are as follows:

In [27]:
#Example 1
for style in BClustersOriginal[1]:
    print DF[DF.style == style]['smv'].unique()
print BClustersOriginal[1]
print BClustersStripped[1]
[ 8.97]
[ 8.97]
[ 8.02]
[ 8.02]
[ 8.02]
[ 8.02]
[ 8.02]
[ 8.97]
['PJ-49131B Pj-49131 Nightware Bottom', 'PJ-49131B PJ-49131 Nightware Bottom', 'PJ-49131Tb Pj-49131 Nightware Top_b', 'PJ-49131Tb PJ-49130/ PJ-2912 Nightware Top_b', 'PJ-49131Tb PJ-49130/2913 Nightware Top_b', 'PJ-49131Tb PJ-49130 Nightware Top_b', 'PJ-49131Tb Pj-49131/2916 Nightware Top_b', 'PJ-49131B Pj-49131 /pj-2913 Nightware Bottom']
['49131b49131', '49131b49131', '49131tb49131b', '49131tb491302912b', '49131tb491302913b', '49131tb49130b', '49131tb491312916b', '49131b491312913']
In [28]:
#Example 2
for style in BLClustersOriginal[0]:
    print DF[DF.style == style]['smv'].unique()
[ 17.05]
[ 10.27]
[ 17.05]
[ 18.61]
[ 12.96]
[ 24.61]
[ 17.05]
[ 22.28]

It seems clear to me that the example drawn from the higher threshold clustering (example 2) represents the same style.

The first example is a bit trickier. Within this cluster, it appears that the cluster has been formed such that in fact it captures strings that relate to two underlying 'true' styles. The question is, should the threshold be raised such that these styles are grouped separately (and would this even be possible?), or should I accept the overlap in the cluster an separately claim that within cluster there may be multiple styles?

I think the first thing to try would be to iteratively set the threshold to maximize the number of groupings that have consistent SWVs within each cluster.

In [29]:
def maxClusters(ClassList, startThreshold, stopThreshold, step):
    
    Threshold = startThreshold
    ConsistentClusters = []
    
    while Threshold <= stopThreshold:
        ConsistentCount = 0
        TotClusters = SlinkSC(ClassList, Threshold)
        MultiClusters = [elem for elem in TotClusters[0] if len(elem) > 1]
        for elem in MultiClusters:
            MultiSMV = list(itertools.chain(*[DF[DF.style == style]['smv'].unique() for style in elem]))
            if len(set(MultiSMV)) == 1:
                ConsistentCount +=1
        ConsistentClusters.append(ConsistentCount)
        Threshold += step
    
    return ConsistentClusters

        
        

The above function will effectively iteratively call the SlinkSC function on the list of Stripped Class objects using different threshold values, that fall between startThreshold, and stopThreshold, and move in steps as specified by the user in the function arguments.

For the clusters that are produced at iterations, the unique SMVs are pulled from the underlying DF and placed in a list. If all of the unique SMVs in the cluster form a set of length 1, then they are the same, and the cluster is therefore correctly specified (i.e. all strings in fact represent the same underlying style). In this case, then the ConsistentCount is incremented by one.

After each iteration the ConsistenCount variable is append to a list and after the iterations are complete this list is returned.

The returned list allows the user to identify at which threshold value the most number of consistent clusters are identified.

Let's see an example, in fact let's see three examples

In [30]:
print maxClusters(BClassList, 0.7, 0.95, 0.025)
print maxClusters(BClassList, 0.7, 0.95, 0.025)
print maxClusters(BClassList, 0.7, 0.95, 0.025)
[27, 27, 32, 35, 42, 47, 60, 50, 43, 48]
[25, 32, 31, 43, 42, 53, 58, 55, 42, 42]
[21, 26, 32, 43, 43, 53, 57, 54, 46, 41]

What is intriguing about the above is that even though the exact same commands were issued, the results were different. The reason for this is that the list of Stripped Class objects is shuffled in a different way upon each iteration, and as was noted above, the clusters this method generates are not independent of the way in which the data are sorted.

So, this is not necessarily a problem that can be dealt with. However, it would be sensible to run the above analysis a number of times in order to gauge where to best set the threshold. This is now done below

NB the computer is doing a lot of work here so it may take some time to compute! !

In [31]:
Optimum = []
for x in xrange(10):
    L = maxClusters(BClassList, 0.7, 0.95, 0.025)
    Ind = L.index(max(L))
    Val = 0.7 + (Ind*0.025)
    Optimum.append(Val)
In [32]:
Optimum
Out[32]:
[0.875, 0.85, 0.85, 0.825, 0.85, 0.85, 0.85, 0.85, 0.85, 0.85]

So it seems pretty clear that the optimum for this data is somewhere around 0.85. But where exactly? Well let's narrow the search:

In [33]:
Optimum2 = []
for x in xrange(10):
    L = maxClusters(BClassList, 0.82, 0.88, 0.01)
    Ind = L.index(max(L))
    Val = 0.82 + (Ind*0.01)
    Optimum2.append(Val)
In [34]:
print Optimum2
print np.mean(Optimum2)
[0.86, 0.86, 0.87, 0.85, 0.84, 0.87, 0.85, 0.86, 0.86, 0.86]
0.858

For this data set, it appears that 0.85 is the threshold that maximises the consistency of the clusters.

Now, let's run the analysis again, but this time to actually identify the clusters which are consistent, and their centroid name. These will be eligible for automatic renaming.

NB as this function is refined below, the arguments are described below rather than here.

In [35]:
def IdentifyClusters(ClassList, Threshold, df, col1, col2):

    ClustersDict = {}
    Clusters = SlinkSC(ClassList, Threshold)
    ClustersOriginal = Clusters[0]
    ClustersCentroid = Clusters[2]
    
    IndList = [x for x in xrange(len(ClustersOriginal)) if len(ClustersOriginal[x]) > 1]
    
    MultiClusters = [ClustersOriginal[Ind] for Ind in IndList]
    MultiCentroid = [ClustersCentroid[Ind] for Ind in IndList]
    
    for cluster in xrange(len(MultiClusters)):
        MultiSMV = list(itertools.chain(*[df[df[col1] == elem][col2].unique() for elem in MultiClusters[cluster]]))
        if len(set(MultiSMV)) == 1:
               ClustersDict[MultiCentroid[cluster][1]] = MultiClusters[cluster]
    
    return ClustersDict
        

Create a dictionary of matched clusters uding the IdentifyClusters function

In [36]:
MatchedDict = IdentifyClusters(BClassList, 0.85, DF, 'style', 'smv')

Display the length of th dictionary

In [37]:
len(MatchedDict)
Out[37]:
60

Count the number of clustered strings

In [38]:
Count = 0
for key, value in MatchedDict.iteritems():
    for elem in value:
        Count +=1
        
Count
Out[38]:
174

Display the first ten cluster names and associated cluster

In [39]:
for Key in MatchedDict.keys()[:10]:
    print 'The cluster name is ' + Key 
    print MatchedDict[Key]
    print 
The cluster name is 345750 Samira/Bruce Tee Samira nightgown
['345750 Samira/Bruce Tee Samira nightgown', '345750 Samira nightgown']

The cluster name is 615950 Benny-j / SR-561 Bottom S-Siv-T-S
['615950 Benny-J SR-541 Bottom S-Siv-T-S', '615950 Benny-J SR-560 Bottom S-Siv-T-S', '615950 Benny-j / SR-561 Bottom S-Siv-T-S', '615951 Benny-j/ SR-542 Bottom/ S-Siv-T-S']

The cluster name is PJ-2912 PJ-2912/2922 Top
['PJ-2912 PJ-2912/2922 Top', 'PJ-2912 PJ-2922 Top']

The cluster name is Nightwear s slv top (PJ-0405)
['Nightwear s slv top (PJ-0405)', 'Nightwear l slv top (PJ-0405)']

The cluster name is PJ-50307TV PJ-50306 /Nap-56/100 LSV
['PJ-50307TV PJ-50306 /Nap-56/100 LSV', 'PJ-50307TV PJ-50306 LSV']

The cluster name is PJ-2913T PJ-2923 Top
['PJ-2913T PJ-2913 Top', 'PJ-2913T PJ-2313 Top', 'PJ-2913T PJ-2923 Top']

The cluster name is 674262 Ronny B.P.D S-Slv-T-S
['674262 Ronny B.P.D S-Slv-T-S', '674262 Ronny S-Slv-T-S']

The cluster name is 979650 Art Tee/Art tank S-Siv-T-S Tank Top
['979650 Art Tee/Art tank S-Siv-T-S Tank Top', '979650 Art Tee S-Siv-T-S']

The cluster name is 436002 Vito/ Superman S-Siv-T-S Tank Top
['436002 Vito/ Superman S-Siv-T-S Tank Top', '436002 Vito S-Siv-T-S']

The cluster name is 674262 Cocoon pj/ Ronny Bottom S-Siv-T-S
['674262 Cocoon pj/ Ronny Bottom S-Siv-T-S', '674262 Ronny Cocoon S-Siv-T-S Bottom']

To my mind this is pretty successful. Based upon the initial running of the program (which will change with subsequent runs due to changing nature of the clusters) this reduces 145 style names and condenses them into 52.

It seems natural that once this stage has happened, then the analysis should be re-run with the new group of strings, until no more clusters are identified. I am not yet sure about this, so will investigate further. In any case it should be decided what to do with the clusters that show multiple SMVs.

Let's actually look at the failed clusters (i.e. those with multiple SMV measures). This may help in guiding future action.

In [40]:
ForFailed = SlinkSC(BClassList, 0.85)
Potential = [elem for elem in ForFailed[0] if len(elem) > 0]

# Print a selection of the failed clusters
for elem in Potential[:15]:
    SMVs = list(itertools.chain(*[DF[DF.style == style]['smv'].unique() for style in elem]))
    if len(set(SMVs)) != 1:
        print elem
        print SMVs
        print
['SR-560 S-Siv-T-S', 'SR-560 SR-564 S-Siv-T-S', 'SR-541 S-Siv-T-S', 'SR-561 S-Siv-T-S', 'SR-562 S-Siv-T-S', 'SR-564 S-Siv-T-S', 'SR-568 L-Siv-T-S', 'SR-567 S-Siv-T-S']
[17.050000000000001, 17.050000000000001, 12.960000000000001, 22.280000000000001, 24.609999999999999, 17.050000000000001, 12.960000000000001, 18.609999999999999]

['825211T Daniella Top', '825211B Daniella Bottom']
[6.4500000000000002, 6.75]

['185480Ba Cocoon Coocon Bottom', '354750B Cocoon Pj Coocon Bottom', '185480Bb Cocoon Pj Coocon Bottom', '185480Tb Cocoon Coocon Top_b', '185480Ta Cocoon Coocon Top_a', '185480Ba Cocoon Pj Coocon Bottom']
[4.9500000000000002, 5.4500000000000002, 5.4500000000000002, 5.7400000000000002, 5.2400000000000002, 4.9500000000000002]

['Randy shorts btm (820880)', 'Randy tank top (820880)']
[6.9500000000000002, 4.4400000000000004]

['UWC-932-010 192/194 S-Slv-T-S L-Slv-T-S', 'UWC-932-010 192 S-Slv-T-S', 'UWC-932-010 192 /M.S.P S-Slv-T-S /L-Slv-T-S', 'UWC-931-006 194/192 L-Slv-T-S S-Slv-T-S', 'UWC-933-006 192/194 S-Slv-T-S L-Slv-T-S', 'UWC-933-006 194/192 L-Slv-T-S S-Slv-T-S']
[26.030000000000001, 26.030000000000001, 26.030000000000001, 26.559999999999999, 26.559999999999999, 26.559999999999999]

['PJ-49131B Pj-49131 Nightware Bottom', 'PJ-49131Tb PJ-49130 Nightware Top_b', 'PJ-49131Tb Pj-49131/2916 Nightware Top_b', 'PJ-49131B PJ-49131 Nightware Bottom', 'PJ-49131Tb Pj-49131 Nightware Top_b', 'PJ-49131Tb PJ-49130/ PJ-2912 Nightware Top_b', 'PJ-49131B Pj-49131 /pj-2913 Nightware Bottom', 'PJ-49131Tb PJ-49130/2913 Nightware Top_b']
[8.9700000000000006, 8.0199999999999996, 8.0199999999999996, 8.9700000000000006, 8.0199999999999996, 8.0199999999999996, 8.9700000000000006, 8.0199999999999996]

['BS-1105 PJ-1105 Top', 'BS-1106 Pj-1106 Top', 'BS-1105 PJ-1105/49130 Top', 'BS-1106 Pj-1106/49131 Top']
[16.350000000000001, 18.149999999999999, 16.350000000000001, 18.149999999999999, 18.149999999999999, 18.149999999999999]

['674262 Ronny B.P.D S-Slv-T-S', '674262 Ronny S-Siv-T-S', '674262 Ronny S-Slv-T-S', '674262 Ronny Cocoon S-Siv-T-S Bottom', '674263 Ronny S-Slv-T-S', '674261 Ronny S-Siv-T-S', '674268 Ronny S-Siv-T-S', '674265 Ronny S-Slv-T-S', '674269 Ronny S-Slv-T-S', '694265 Ronny S-Slv-T-S']
[5.7800000000000002, 5.7800000000000002, 5.7800000000000002, 5.7800000000000002, 5.7800000000000002, 5.7800000000000002, 6.4000000000000004, 5.7800000000000002, 6.4000000000000004, 5.7800000000000002]

Let's look at the output (may have changed due to changing nature of the data). Observe the following:

Cluster: ['332860Pb M.w.P /M.S.p/194 Polo', '332860Pc M.w.P /M.S.p Polo'] 
SMVs: [17.949999999999999, 18.140000000000001]

Here the cluster has two style names in it, and two SMVs. Quite clearly these cannot be thought of as representing the same style. NB I am not sure what is going on with the floating points here.

However, now look at the following:

Cluster: ['PJ-2913T PJ-2913 Top', 'PJ-2913T PJ-2313 Top', 'PJ-2913B PJ-2913 Bottom', 'PJ-2913B pj-2913 Bottom', 'PJ-2913B PJ-2923 Bottom']
SMVs: [9.1600000000000001, 9.1600000000000001, 7.8700000000000001, 7.8700000000000001, 7.8700000000000001]

In this case it seems like the cluster has picked up two underlying styles one related to the top half of a pijama set, and the other the bottom half. That these were clustered together is a by-product of stripping out the generic components. I think that it is pretty clear that within this cluster there are two separate clusters of styles, that are cleaning identified by the SMVs.

How then to proceed?

I think given the high string matching threshold, I can be pretty comfortable that these stings represent, if not the same exact product, then at least the same product family. Therefore, a rule I propose to use, is to cluster within groups based upon the unique SMV being observed more than once. An attempt to do this is demonstrated below:

In [41]:
def ClustersRemainders(ClassList, Threshold, df, col1, col2):
    #1
    ClustersDict = {}
    Clusters = SlinkSC(ClassList, Threshold)
    ClustersOriginal = Clusters[0]
    ClustersCentroid = Clusters[2]
    
    #2
    IndList = [x for x in xrange(len(ClustersOriginal)) if len(ClustersOriginal[x]) > 1]   
    MultiClusters = [ClustersOriginal[Ind] for Ind in IndList]
    MultiCentroid = [ClustersCentroid[Ind] for Ind in IndList]
    
    #3
    for cluster in xrange(len(MultiClusters)):
        MultiSMV = list(itertools.chain(*[df[df[col1] == elem][col2].unique() for elem in MultiClusters[cluster]]))
        if len(set(MultiSMV)) == 1:
               ClustersDict[MultiCentroid[cluster][1]] = MultiClusters[cluster]
        else:
            
            #4
            if len(MultiSMV) == len(MultiClusters[cluster]):
                for smv in list(set(MultiSMV)):
                    if MultiSMV.count(smv) >= 2:
                        BoolList = [True if elem == smv else False for elem in MultiSMV]
                        StrList = [MultiClusters[cluster][x] for x in xrange(len(BoolList)) if BoolList[x] == True]
                        StrList.sort(lambda x, y: cmp(len(x), len(y)))
                        ClustersDict[StrList[0]] = StrList
                    
    
    return ClustersDict

I think this will be the final identification function and so the workings are explained here.

The function is for use in identifying correctly specified clusters of strings based upon verifying the cluster using some additional data source. The function only works if the strings being clustered and the additional data point are drawn from the same pandas data frame.

The function accepts five arguments:

  1. ClassList - a list of Stripped Class objects representing the strings to be clustered
  2. Threshold - the threshold Jaro-distance score on which the clusters are formed
  3. df - the data frame from which the strings being clustered ar drawn and from which the additional data is drawn
  4. col1 - the column name in the data frame from where the strings being clustered were drawn
  5. col2 - the column name in the data frame from where the additional data are drawn.

The function returns a dictionary where the key is the centroid name of the cluster and value is a list of the strings that form the cluster. Within each cluster the SMVs associated with evry style name are identical.

Comments:

  1. Here the SlinkSC function is called on the ClassList, and the relevant objects are created according to the output of that function
  2. The IndList represents the index points of the clusters that have length of above 1 (meaning that the cluster is in fact a cluster, not a lone string) Lists of clusters and their centroids are then created using this IndList.
  3. For each cluster a list of the unique SMVs associated with each of the strings (style names) is created. If the length of the set of the list is one, then all SMVs are the same, and the centroid and cluster are added to the dictionary of verified clusters.
  4. If the length of the set is not one, then if the number of SMVs found in the data exactly matches the length of the cluster, then there is one unique SMV per style name. For each SMV in that set, if the count of each SMV in the smv list is more than 2, then a cluster is formed by taking the shortest style name in the sub-cluster and making it the centroid, and then clustering all the strings that have the same unique SMV.

Create a dictionary of matched clusters uding the ClustersRemainders function

In [42]:
MatchedDict2 = ClustersRemainders(BClassList, 0.85, DF, 'style', 'smv')

Display length of the dictionary

In [43]:
len(MatchedDict2)
Out[43]:
94

Count the number of clustered strings

In [44]:
Count = 0
for key, value in MatchedDict2.iteritems():
    for elem in value:
        Count +=1
        
Count
Out[44]:
260

Based upon the way the code ran as I write this, this method has taken 258 style strings and clustered them in 97 style clusters, thus reducing the number of unique style names by c.150, which is a pretty big imporovement.


Modifying the String Values

Now that the string have been clustered, I can use the dictionary generated to map the clustered strings the centroid value of each cluster.

Create a dictionary for mapping

In [45]:
StringMap = {}
for key, value in MatchedDict2.iteritems():
    for elem in value:
        StringMap[elem] = key

Display current number of unique strings

In [46]:
len(DF.style.unique())
Out[46]:
454

Map the string clusters to the data

In [47]:
DF['style_new'] = DF.style
for row in DF.index:
    try:
        DF['style_new'][row] = StringMap[DF.style[row]]
    except KeyError:
        pass
    

Display new number of unique strings

In [48]:
len(DF.style_new.unique())
Out[48]:
288

The number of unique strings has been greatly reduced.

Due to the non-perfect nature of the clustering, it might be that some strings were clustered into different clusters when they in fact represent the same underlying reality. Therefore, once this first clustering has occurred, it seems like it could be sensible to go through the process again, in order to see if more clusters can be generated.

This is now explored:


A Second Clustering Pass

Repeat the above process a second time

In [49]:
SecondPass = list(DF.style_new.unique())
In [50]:
SecondClassList = [Stripped(elem, GenericRemoveAll, GenericWhiteSpace, DelWhite = True) for elem in SecondPass]
In [51]:
SecondClusters = SlinkSC(SecondClassList, 0.85)

# Give separate names to each list in the ClusterSC object
SecondClusOrig = BClustersSC[0]
SecondClusOrig.sort(lambda x,y: cmp(len(y) , len(x)))
SecondClusStrip = BClustersSC[1]
SecondClusStrip.sort(lambda x,y: cmp(len(y) , len(x)))
In [52]:
SecondMatchedDict = ClustersRemainders(SecondClassList, 0.85, DF, 'style', 'smv')
In [53]:
len(SecondMatchedDict)
Out[53]:
8
In [54]:
Count = 0
for key, value in SecondMatchedDict.iteritems():
    for elem in value:
        Count +=1
        
Count
Out[54]:
16
In [55]:
for key, value in SecondMatchedDict.iteritems():
    print key, value
    print
PJ-2913T PJ-2913 Top ['PJ-2913T PJ-2913 Top', 'PJ-2913T PJ-2913/2923 Top']

354750Ta Cocoon Coocon Top_a ['354750Ta Cocoon Coocon Top_a', '185480Ta Cocoon Coocon Top_a']

SR-541 S-Siv-T-S ['SR-541 S-Siv-T-S', 'SR-541 /SR-542 S-Siv-T-S']

BS-1105 PJ-1105/49130 Top ['BS-1105 PJ-1105/49130 Top', 'BS-1106 Pj-1106/49131 Top']

239851B Daniella Bottom ['239851B Daniella Bottom', '825211B Daniella Bottom']

UWC-931-007Ta 192 Top large ['UWC-931-007Ta 192 Top large', 'UWC-931-007Ta 194 Top large']

PJ-49131B PJ-49131 Nightware Bottom ['PJ-49131B PJ-49131 Nightware Bottom', 'PJ-49131B Pj-49131 /pj-2913 Nightware Bottom']

SR-565 SR-542 S-Siv-T-S ['SR-565 SR-542 S-Siv-T-S', 'SR-565 SR-542/SR-565 S-Siv-T-S']

What the above demonstrates is that clearly there is a benefit to running the clustering a second time. This would seem to call for a recursive function that continues to perform the clustering until no more clusters are possible. When I ran the program, this would have clustered 33 style names into 16 clusters.


A Recurcive Identifaction Function

In order to achieve a recurcive function unfortunatley some modifications will necessary, but in fact this may actually make calling the function somewhat simpler

In [56]:
def ClustersRecursive(Threshold, df, col1, col2, GenericAll, GenericWhite = None, DelWhite = False):
    Styles = df[col1].unique()
    ClassList = [Stripped(style, GenericAll, GenericWhite, DelWhite) for style in Styles]
    ClustersDict = {}
    Clusters = SlinkSC(ClassList, Threshold)
    ClustersOriginal = Clusters[0]
    ClustersCentroid = Clusters[2]
    IndList = [x for x in xrange(len(ClustersOriginal)) if len(ClustersOriginal[x]) > 1]   
    MultiClusters = [ClustersOriginal[Ind] for Ind in IndList]
    MultiCentroid = [ClustersCentroid[Ind] for Ind in IndList]
    
    if len(MultiClusters) == 0:
        print 'Finished1'
        return 
    else:
        Counter = 0
        for cluster in xrange(len(MultiClusters)):
            MultiSMV = list(itertools.chain(*[df[df[col1] == elem][col2].unique() for elem in MultiClusters[cluster]]))
            if len(set(MultiSMV)) == 1:
                ClustersDict[MultiCentroid[cluster][1]] = MultiClusters[cluster]
                Counter +=1
            else:
                if len(MultiSMV) == len(MultiClusters[cluster]):
                    for smv in list(set(MultiSMV)):
                        if MultiSMV.count(smv) >= 2:
                            BoolList = [True if elem == smv else False for elem in MultiSMV]
                            StrList = [MultiClusters[cluster][x] for x in xrange(len(BoolList)) if BoolList[x] == True]
                            StrList.sort(lambda x, y: cmp(len(x), len(y)))
                            ClustersDict[StrList[0]] = StrList
                            Counter +=1
    
    StringMap = {}
    for key, value in ClustersDict.iteritems():
        for elem in value:
            StringMap[elem] = key
                
    
    if Counter == 0:
        return 
    else:
        for row in DF.index:
            try:
                df[col1][row] = StringMap[df[col1][row]]
            except KeyError:
                pass
        

        ClustersRecursive(Threshold, df, col1, col2, GenericAll, GenericWhite = None, DelWhite = False) 

The function accepts arguments that should now be familar.

The innovation here is that the function not only creates a dictionary of clustered strings, but it maps the dictionary to the underlying data frame from where the strings are drawn, thereby modifying the string names according to the centorid of the cluster they have been placed in. The function then recusively calls itself on the (now) modified string data. This will continue until no more clusters are formed that can be verified as matched according to the method already enumerated at length.

In [57]:
DF['style_new2'] = DF.style

Call the function on the data

In [58]:
ClustersRecursive(0.85, DF, 'style_new2', 'smv', GenericRemoveAll, GenericWhiteSpace, True)

Display number of unqiue strings in the data

In [59]:
len(DF.style_new2.unique())
Out[59]:
265

In testing this function iterated through four rounds of clustering. As can be seen from the above output it has reduced the number of unique strings to 259 (numbers may change when program is re-run). It seems possible that by recursively clustering the strings could also stabilize the clusters that are actually formed. Whereas with one round of clustering the number of strings clustered seemed to vary quite sizably, here the number of unique strings after clustering seems quite stable.

I am quickly going to test how stable it is by seeing how many unique strings are left after multiple runs through the recursive function.

In [62]:
LenList = []
for x in xrange(20):
    DF['style_new2'] = DF.style
    ClustersRecursive(0.85, DF, 'style_new2', 'smv', GenericRemoveAll, GenericWhiteSpace, True)
    LenList.append(len(DF.style_new2.unique()))
    
In [63]:
print LenList
[262, 259, 256, 260, 259, 260, 257, 258, 258, 257, 261, 259, 257, 258, 258, 253, 260, 257, 254, 256]

As the above output shows, the method is not perfect and does not generate the same results each time it is run, although within a fairly small window.


Conclusions

Efficacy
The string clustering method described here, seems to work relatively well. With the data provided a list of 450+ string has been reduced to a group of strings nearly half that size.

If there is no alternative data available to the user by which the clusters can be verified, then some others means will have be sought. It is possible that using the SlinkSC function in a recursive manner (as displayed above in relation to the ClusterRecursive function), sensible clusters can be formed even without verifying data. Th threshold can be manipulated to ensure greater accuracy (at the cost of potentially missing true clusters). With small data sets where the data can be verified either by eye, or by some other method, this will work well.

For large more complex sets of fuzzy strings it is likely that some verifying data will be necessary or some other method of exploring the veracity of the clustered data.

Caveats
The clustering method described here is not perfect. In particular the clusters that are generated are not independent of the order that the data are evaluated. The effects of this can be mitigated (or at least randomized) by shuffling the list of data to be clustered.

There are doubtless many more problems of a technical nature that are somewhat beyond my limited understanding of clustering.

Use in the Future
Provided we can live with imperfect nature of the clustering, I am happy to move forward with using this method for matching strings into clusters in my work.

It is possible that this notebook is too unwieldy in size to be useful to anyone but me, possibly I will create an example notebook that just goes through the final method (without the workings), possibly I can't be bothered though.