Classifying Against Categorical Variables

If you are trying to build something like a k-NN classifier applied to items in a dataframe where some of the defining characteristics are categorical, rather than numerical variables, what do you do?

The problem with k-NN is that you need a distance metric, and it is not obvious how to define appropriate distance metrics for categorical variables.

Some categorical variables may have a natural mapping into a number space ('cold','warm','hot') so you could define mappings for those easily enough:

In [2]:
import pandas as pd

df = pd.DataFrame({'temp':['cold', 'warm', 'cold', 'veryhot', 'hot']})

tempval = {'cold': 0, 'warm': 50, 'hot':70,'veryhot':100}

df['tempvals'] = df['temp'].map(tempval)
df
      temp  tempvals
0     cold         0
1     warm        50
2     cold         0
3  veryhot       100
4      hot        70

But many categories are harder to map sensibly so that the distance metric makes sense.

One thing to note is that the scikit-learn k-NN classifier does support different distance measures, set via the metric parameter (the allowed values are listed in the scikit-learn documentation); different metrics have different properties, so you need to balance the choice of metric against the coding scheme.
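As a minimal sketch of setting the metric, assuming the classifier in question is scikit-learn's KNeighborsClassifier: the training labels below are invented purely for illustration, and the feature is the coded temperature from earlier.

```python
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical training data: the coded temperature as the only feature,
# plus a made-up label for each row
X_train = [[0], [50], [0], [100], [70]]
y_train = ['coat', 'jumper', 'coat', 't-shirt', 't-shirt']

# Same data, two different distance measures selected via `metric`
knn_euclid = KNeighborsClassifier(n_neighbors=1, metric='euclidean').fit(X_train, y_train)
knn_manhattan = KNeighborsClassifier(n_neighbors=1, metric='manhattan').fit(X_train, y_train)

# Classify a new item whose coded temperature is 65
pred_euclid = knn_euclid.predict([[65]])
pred_manhattan = knn_manhattan.predict([[65]])
```

With a single feature the two metrics agree, because both reduce to the absolute difference; they only start to diverge once you classify on several columns at once.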

You can explore what values different metrics give using this sort of pattern:

In [4]:
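#scikit-learn 1.0+ import; older releases use sklearn.neighbors
try:
    from sklearn.metrics import DistanceMetric
except ImportError:
    from sklearn.neighbors import DistanceMetric

#Define a metric
dist = DistanceMetric.get_metric('euclidean')

#Three one-dimensional items (values consistent with the output shown)
X = [[0], [20], [100]]

#Find the distance between pairs
#The first row is the distance from the first item to the first, second, and third item
#The second row is the distance from the second item to the first, second, and third item
dist.pairwise(X)
array([[  0.,  20., 100.],
       [ 20.,   0.,  80.],
       [100.,  80.,   0.]])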

To explore the measured values you get in a dataframe column, generate a list of the pairs of values:

In [5]:
import itertools
pairs = list(itertools.combinations(df['tempvals'].unique(), 2))
pairs
[(0, 50), (0, 100), (0, 70), (50, 100), (50, 70), (100, 70)]

then measure their distances under a particular measure to see if you are happy with it:

In [6]:
dist = DistanceMetric.get_metric('euclidean')

#Each pair is treated as a point in two dimensions,
#so the first row is the distance between (0,50) and each of the items, etc.
dist.pairwise(pairs)
array([[  0.        ,  50.        ,  20.        ,  70.71067812,
         53.85164807, 101.98039027],
       [ 50.        ,   0.        ,  30.        ,  50.        ,
         58.30951895, 104.40306509],
       [ 20.        ,  30.        ,   0.        ,  58.30951895,
         50.        , 100.        ],
       [ 70.71067812,  50.        ,  58.30951895,   0.        ,
         30.        ,  58.30951895],
       [ 53.85164807,  58.30951895,  50.        ,  30.        ,
          0.        ,  50.        ],
       [101.98039027, 104.40306509, 100.        ,  58.30951895,
         50.        ,   0.        ]])
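The matrix above treats each pair as a point in two dimensions. If what you want to inspect is instead the distance between the two values inside each pair, a sketch along the following lines may be closer to the mark. The dictionary name `pair_distances` is just an illustrative choice, and the import fallback is there because DistanceMetric has moved between modules across scikit-learn releases.

```python
import itertools

try:
    from sklearn.metrics import DistanceMetric    # scikit-learn 1.0+
except ImportError:
    from sklearn.neighbors import DistanceMetric  # older releases

values = [0, 50, 100, 70]  # the unique coded temperatures from above
dist = DistanceMetric.get_metric('euclidean')

pair_distances = {}
for a, b in itertools.combinations(values, 2):
    # pairwise() expects a 2-D array: one row per item, one column per feature
    matrix = dist.pairwise([[a], [b]])
    # Entry [0, 1] is the distance from the first item to the second
    pair_distances[(a, b)] = float(matrix[0, 1])

for pair, d in pair_distances.items():
    print(pair, d)
```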

Note you can use other metrics or define your own:

In [9]:
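def mydist(x, y):
    ''' Not necessarily a useful metric! '''
    #x and y arrive as 1-D arrays, one per item; return a single number
    return (x + y).sum()

#'pyfunc' with func= is the documented way to pass a user-defined metric
dist = DistanceMetric.get_metric('pyfunc', func=mydist)

#The same three one-dimensional items as before
X = [[0], [20], [100]]
dist.pairwise(X)
array([[  0.,  20., 100.],
       [ 20.,  40., 120.],
       [100., 120., 200.]])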

So you could define your own metric if you wanted to.

A lot of stuff classed as "AI" is based on making classifications based on quite naive (or just convenient to calculate) coding schemes.

Remember the Wizard of Oz?!

Categorical Variables - A Better Way: Dummy Variables

The k-NN classifier needs to measure the distance between values somehow, but with a categorical variable the problem we are faced with is: how do you sensibly measure the distance between any two items?

If you try to code a categorical variable (where the values are categories; e.g. for furniture: chair, table, settee), it is hard to map each value onto a single float or int in a way that gives a sensible distance between any two categories.

Yes, it's easy enough to give a numerical code to each item (chair=1, table=2, settee=3); but how do you arrange those items on a number line so you can find a meaningful distance between them?

Remember Stevens' NOIR: the ordering of nominal items is arbitrary.
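To see how arbitrary this is, here is a small sketch comparing two integer codings of the same furniture categories. Both codings and the helper `gaps` are made up for illustration; neither is more "correct" than the other.

```python
import itertools

items = ['chair', 'table', 'settee']

# Two equally arbitrary integer codings of the same nominal categories
coding_a = {'chair': 1, 'table': 2, 'settee': 3}
coding_b = {'chair': 3, 'table': 1, 'settee': 2}

def gaps(coding):
    """Absolute difference between the codes of every pair of items."""
    return {(p, q): abs(coding[p] - coding[q])
            for p, q in itertools.combinations(items, 2)}

gaps_a = gaps(coding_a)
gaps_b = gaps(coding_b)

for pair in gaps_a:
    print(pair, gaps_a[pair], gaps_b[pair])
```

Under the first coding a chair is "further" from a settee than from a table; under the second it is the other way round. Nothing about the furniture changed, only the labels.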

Instead, if you have a thing (a row in your dataset) with a furniture column, you could map items as identified in the furniture column onto a set of columns, one for each type of furniture.

Take the following data frame as an example:

In [14]:
import pandas as pd

df3 = pd.DataFrame({'temp':['cold', 'warm', 'cold', 'veryhot', 'hot'],
                    'weather':['rainy', 'sunny', 'overcast', 'sunny', 'sunny']})
df3
      temp   weather
0     cold     rainy
1     warm     sunny
2     cold  overcast
3  veryhot     sunny
4      hot     sunny

The weather is a categorical item that is harder to represent numerically than the categorical temperature variable.

But what if we define a column for each weather type, and then encode a row with a value that associates the row with that weather type?

These are dummy variables. For each row, set the dummy variable to 1 if it is true for the thing, 0 if it isn't.

And pandas has a convenient function for generating such variables: pd.get_dummies().

Simply pass it a column of string values and it will create a column for each value, with a 1 or a 0 saying whether the original column contained that categorical value or not.

In [13]:
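#dtype=int gives 0/1 columns; pandas 2.0+ defaults to True/False
pd.get_dummies(df3['weather'], dtype=int)
   overcast  rainy  sunny
0         0      1      0
1         0      0      1
2         1      0      0
3         0      0      1
4         0      0      1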

It's easy enough to add the dummy columns to the original data frame:

In [15]:
#dtype=int gives 0/1 columns; pandas 2.0+ defaults to True/False
df3 = pd.concat([df3, pd.get_dummies(df3['weather'], dtype=int)], axis=1)
df3
      temp   weather  overcast  rainy  sunny
0     cold     rainy         0      1      0
1     warm     sunny         0      0      1
2     cold  overcast         1      0      0
3  veryhot     sunny         0      0      1
4      hot     sunny         0      0      1

In the above case, you get a column for each type of weather, with a 1 or 0 identifying whether the weather was that type of weather (!).

The pd.get_dummies() function essentially takes a column in long format, creates a new column for each unique item in that column (that is, maps the long column onto a wide set of columns), and uses a binary code to associate / encode the wide columns with the value that appeared in the long column.

When you measure the distance between the weather items represented using dummy variables, you measure the distance across those three dimensions (or however many weather dimensions there are).
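A quick sketch of what those distances look like, computing Euclidean distance by hand with NumPy so that nothing is hidden. The variable names are illustrative; the weather values are the ones from the example data frame.

```python
import numpy as np
import pandas as pd

weather = pd.Series(['rainy', 'sunny', 'overcast', 'sunny', 'sunny'])

# One 0/1 column per weather type: overcast, rainy, sunny
dummies = pd.get_dummies(weather, dtype=int)
X = dummies.to_numpy()

# Euclidean distance between every pair of rows, via NumPy broadcasting:
# subtract each row from each other row, square, sum over columns, root
differences = X[:, None, :] - X[None, :, :]
distances = np.sqrt((differences ** 2).sum(axis=2))

print(np.round(distances, 3))
```

Rows with the same weather sit at distance 0, and any two different weather types sit at the same distance (the square root of 2) from each other, so no weather type is treated as "closer" to another. That is exactly the property a single integer code could not give you.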

You can now classify using those numerical dummy variable columns (which contain a meaningful number: 1 for it's that sort of weather, 0 for it isn't) rather than the nominal weather column. The k-NN classifier will accept as many dimensions as you give it.

Rather than classify on two dimensions (temp and some attempted numerical coding of weather), classify on four numerical columns (the numerically coded temp, rainy, sunny, overcast).
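Putting the pieces together, a hedged sketch of classifying on those four columns might look like this. The `activity` labels are invented purely so there is something to predict, and `n_neighbors=1` is chosen only because the toy data set has five rows.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

df = pd.DataFrame({'temp': ['cold', 'warm', 'cold', 'veryhot', 'hot'],
                   'weather': ['rainy', 'sunny', 'overcast', 'sunny', 'sunny'],
                   # Hypothetical target, made up for this illustration
                   'activity': ['cinema', 'picnic', 'cinema', 'beach', 'beach']})

tempval = {'cold': 0, 'warm': 50, 'hot': 70, 'veryhot': 100}

# Four numerical feature columns: tempvals, overcast, rainy, sunny
features = pd.concat([df['temp'].map(tempval).rename('tempvals'),
                      pd.get_dummies(df['weather'], dtype=int)], axis=1)

knn = KNeighborsClassifier(n_neighbors=1).fit(features, df['activity'])

# A new day: hot (coded 70) and sunny, expressed with the same four columns
new_day = pd.DataFrame([[70, 0, 0, 1]], columns=features.columns)
prediction = knn.predict(new_day)
print(prediction)
```

Note that the temperature coding runs from 0 to 100 while the dummy columns only run from 0 to 1, so with this coding temperature will tend to dominate the distance. That is one reason to think about the coding scheme and the metric together.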

If you're comparing shopping carts, for example, to try to identify different sorts of shopper, you might have a column for every possible item that a supermarket sells. For any given shopping trolley, each column then holds the number, N, of that item the person bought, or 0 if they bought none. That would be quite a naive encoding but it'd be a start. (For a readable overview of how they approached this with the original Tesco Clubcard, see Scoring Points: How Tesco Continues to Win Customer Loyalty, Terry Hunt, Clive Humby, Tim Phillips; my review here.)
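As a rough sketch of that naive shopping-trolley encoding, assuming the raw data arrives as one row per item scanned (the `purchases` frame and its contents are hypothetical), pd.crosstab() will build the wide table of counts:

```python
import pandas as pd

# Hypothetical till records: one row per (basket, item) purchase line
purchases = pd.DataFrame({
    'basket': ['A', 'A', 'A', 'B', 'B', 'C'],
    'item':   ['bread', 'milk', 'milk', 'wine', 'bread', 'milk'],
})

# One row per basket, one column per item; each cell is the number bought
# (0 where the basket did not contain that item)
basket_counts = pd.crosstab(purchases['basket'], purchases['item'])
print(basket_counts)
```

Each row of `basket_counts` is then a numerical vector that a distance metric can work with, in the same way as the dummy weather columns.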