%load_ext watermark
%watermark -a 'Sebastian Raschka' -v -p pandas,numpy,scikit-learn -d
Sebastian Raschka 08/02/2015 CPython 3.4.2 IPython 2.3.1 pandas 0.15.2 numpy 1.9.1 scikit-learn 0.15.2
Features can come in various different flavors. Typically, we distinguish between

- continuous and
- categorical (discrete) features.

And the categorical features can be categorized further into:

- ordinal and
- nominal features.
Now, most implementations of machine learning algorithms require numerical data as input, and we have to prepare our data accordingly. This notebook contains some useful tips for how to encode categorical features using Python pandas and scikit-learn.
First, let us create a simple example dataset with 3 different kinds of features:
import pandas as pd
df = pd.DataFrame([
['green', 'M', 10.1, 'class1'],
['red', 'L', 13.5, 'class2'],
['blue', 'XL', 15.3, 'class1']])
df.columns = ['color', 'size', 'prize', 'class label']
df
|   | color | size | prize | class label |
|---|-------|------|-------|-------------|
| 0 | green | M    | 10.1  | class1      |
| 1 | red   | L    | 13.5  | class2      |
| 2 | blue  | XL   | 15.3  | class1      |
"Typical" machine learning algorithms handle class labels with "no order implied" - unless we use a ranking classifier (e.g., SVM-rank). Thus, it is safe to use a simple enumeration over the set of labels to convert the class labels from a string representation into integers.
class_mapping = {label:idx for idx,label in enumerate(set(df['class label']))}
df['class label'] = df['class label'].map(class_mapping)
df
|   | color | size | prize | class label |
|---|-------|------|-------|-------------|
| 0 | green | M    | 10.1  | 0           |
| 1 | red   | L    | 13.5  | 1           |
| 2 | blue  | XL   | 15.3  | 0           |
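Note that `set` has no guaranteed iteration order, so the integer assigned to each class can vary between runs. If a reproducible mapping matters, one option is to sort the unique labels first - a minimal sketch of that variation:

```python
import pandas as pd

df = pd.DataFrame([
    ['green', 'M', 10.1, 'class1'],
    ['red', 'L', 13.5, 'class2'],
    ['blue', 'XL', 15.3, 'class1']],
    columns=['color', 'size', 'prize', 'class label'])

# sorting the unique labels first makes the integer assignment
# deterministic across runs (a plain set() has no guaranteed order)
class_mapping = {label: idx for idx, label
                 in enumerate(sorted(set(df['class label'])))}
print(class_mapping)  # {'class1': 0, 'class2': 1}
```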
Ordinal features need special attention: We have to make sure that the correct values are associated with the corresponding strings. Thus, we need to set up an explicit mapping dictionary:
size_mapping = {
'XL': 3,
'L': 2,
'M': 1}
df['size'] = df['size'].map(size_mapping)
df
|   | color | size | prize | class label |
|---|-------|------|-------|-------------|
| 0 | green | 1    | 10.1  | 0           |
| 1 | red   | 2    | 13.5  | 1           |
| 2 | blue  | 3    | 15.3  | 0           |
Unfortunately, we can't simply apply the same mapping scheme to the color column that we used for the size mapping above. However, we can use another simple trick and convert the colors into binary features: each possible color value becomes a feature column itself (with values 1 or 0).
color_mapping = {
'green': (0,0,1),
'red': (0,1,0),
'blue': (1,0,0)}
df['color'] = df['color'].map(color_mapping)
df
|   | color     | size | prize | class label |
|---|-----------|------|-------|-------------|
| 0 | (0, 0, 1) | 1    | 10.1  | 0           |
| 1 | (0, 1, 0) | 2    | 13.5  | 1           |
| 2 | (1, 0, 0) | 3    | 15.3  | 0           |
import numpy as np
y = df['class label'].values
X = df.iloc[:, :-1].values
# flatten the color tuple in the first column into separate feature columns
X = np.apply_along_axis(func1d=lambda x: np.array(list(x[0]) + list(x[1:])), axis=1, arr=X)
print('Class labels:', y)
print('\nFeatures:\n', X)
Class labels: [0 1 0]

Features:
 [[  0.    0.    1.    1.   10.1]
 [  0.    1.    0.    2.   13.5]
 [  1.    0.    0.    3.   15.3]]
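As an alternative to `np.apply_along_axis`, the tuple column can also be expanded via `np.hstack` - a sketch assuming the same DataFrame with the tuple-encoded color column:

```python
import numpy as np
import pandas as pd

# DataFrame after the color -> tuple mapping from above
df = pd.DataFrame([
    [(0, 0, 1), 1, 10.1, 0],
    [(0, 1, 0), 2, 13.5, 1],
    [(1, 0, 0), 3, 15.3, 0]],
    columns=['color', 'size', 'prize', 'class label'])

# expand the color tuples into a (3, 3) array, then stack
# them next to the remaining numeric feature columns
colors = np.array(df['color'].tolist(), dtype=float)
X = np.hstack((colors, df[['size', 'prize']].values))
print(X)
```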
If we want to convert the features back into their original representation, we can simply do so by using inverted mapping dictionaries:
inv_color_mapping = {v: k for k, v in color_mapping.items()}
inv_size_mapping = {v: k for k, v in size_mapping.items()}
inv_class_mapping = {v: k for k, v in class_mapping.items()}
df['color'] = df['color'].map(inv_color_mapping)
df['size'] = df['size'].map(inv_size_mapping)
df['class label'] = df['class label'].map(inv_class_mapping)
df
|   | color | size | prize | class label |
|---|-------|------|-------|-------------|
| 0 | green | M    | 10.1  | class1      |
| 1 | red   | L    | 13.5  | class2      |
| 2 | blue  | XL   | 15.3  | class1      |
The scikit-learn machine learning library comes with many useful preprocessing functions that we can use for our convenience.
from sklearn.preprocessing import LabelEncoder
class_le = LabelEncoder()
df['class label'] = class_le.fit_transform(df['class label'])
size_mapping = {
'XL': 3,
'L': 2,
'M': 1}
df['size'] = df['size'].map(size_mapping)
df
|   | color | size | prize | class label |
|---|-------|------|-------|-------------|
| 0 | green | 1    | 10.1  | 0           |
| 1 | red   | 2    | 13.5  | 1           |
| 2 | blue  | 3    | 15.3  | 0           |
The class labels can be converted back from integers to strings via the inverse_transform method:
class_le.inverse_transform(df['class label'])
array(['class1', 'class2', 'class1'], dtype=object)
The DictVectorizer is another handy tool for feature extraction. The DictVectorizer takes a list of dictionary entries (feature-value mappings) and transforms it into vectors. The expected input looks like this:
df.transpose().to_dict().values()
dict_values([{'class label': 0, 'color': 'green', 'size': 1, 'prize': 10.1}, {'class label': 1, 'color': 'red', 'size': 2, 'prize': 13.5}, {'class label': 0, 'color': 'blue', 'size': 3, 'prize': 15.3}])
Note that the dictionary keys in each row represent the feature column labels.
Now, we can use the DictVectorizer to turn this mapping into a matrix:
from sklearn.feature_extraction import DictVectorizer
dvec = DictVectorizer(sparse=False)
X = dvec.fit_transform(df.transpose().to_dict().values())
X
array([[  0. ,   0. ,   1. ,   0. ,  10.1,   1. ],
       [  1. ,   0. ,   0. ,   1. ,  13.5,   2. ],
       [  0. ,   1. ,   0. ,   0. ,  15.3,   3. ]])
As we can see in the array above, the columns were reordered during the conversion (due to the hash mapping when we used the dictionary). However, we can simply add back the column names via the get_feature_names method.
pd.DataFrame(X, columns=dvec.get_feature_names())
|   | class label | color=blue | color=green | color=red | prize | size |
|---|-------------|------------|-------------|-----------|-------|------|
| 0 | 0           | 0          | 1           | 0         | 10.1  | 1    |
| 1 | 1           | 0          | 0           | 1         | 13.5  | 2    |
| 2 | 0           | 1          | 0           | 0         | 15.3  | 3    |
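By default (sparse=True), the DictVectorizer returns a SciPy sparse matrix instead of a dense array, which saves memory when there are many distinct categories - a quick sketch using the same rows as above:

```python
from sklearn.feature_extraction import DictVectorizer

# the same feature-value mappings as produced by df.transpose().to_dict()
rows = [{'color': 'green', 'size': 1, 'prize': 10.1},
        {'color': 'red', 'size': 2, 'prize': 13.5},
        {'color': 'blue', 'size': 3, 'prize': 15.3}]

dvec = DictVectorizer(sparse=True)  # sparse=True is the default
X_sparse = dvec.fit_transform(rows)

# convert back to a dense array only when needed
print(X_sparse.toarray())
```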
Another useful tool in scikit-learn is the OneHotEncoder. The idea is the same as in the DictVectorizer example above; the only difference is that the OneHotEncoder takes integer columns as input. So, we use the LabelEncoder first to prepare the color column before we use the OneHotEncoder.
color_le = LabelEncoder()
df['color'] = color_le.fit_transform(df['color'])
df
|   | color | size | prize | class label |
|---|-------|------|-------|-------------|
| 0 | 1     | 1    | 10.1  | 0           |
| 1 | 2     | 2    | 13.5  | 1           |
| 2 | 0     | 3    | 15.3  | 0           |
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(sparse=False)
X = ohe.fit_transform(df[['color']].values)
X
array([[ 0.,  1.,  0.],
       [ 0.,  0.,  1.],
       [ 1.,  0.,  0.]])
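The one-hot encoded color columns can then be stitched back together with the remaining feature columns, for example via np.hstack. A sketch of the full sequence, assuming the fresh three-row example DataFrame (note: .toarray() is used here instead of sparse=False so the sketch also works with newer scikit-learn versions, where that parameter was renamed):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

df = pd.DataFrame([
    ['green', 1, 10.1],
    ['red', 2, 13.5],
    ['blue', 3, 15.3]],
    columns=['color', 'size', 'prize'])

# integer-encode the color strings first
color_le = LabelEncoder()
df['color'] = color_le.fit_transform(df['color'])

# one-hot encode the integer color column; the encoder returns a
# sparse matrix by default, so we densify it with .toarray()
ohe = OneHotEncoder()
color_ohe = ohe.fit_transform(df[['color']].values).toarray()

# stack the one-hot color columns next to size and prize
X = np.hstack((color_ohe, df[['size', 'prize']].values))
print(X)
```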
Also, pandas comes with a convenience function to create new categories for nominal features, namely get_dummies. But first, let us quickly regenerate a fresh example DataFrame where the size and class label columns are already taken care of.
import pandas as pd
df = pd.DataFrame([
['green', 'M', 10.1, 'class1'],
['red', 'L', 13.5, 'class2'],
['blue', 'XL', 15.3, 'class1']])
df.columns = ['color', 'size', 'prize', 'class label']
size_mapping = {
'XL': 3,
'L': 2,
'M': 1}
df['size'] = df['size'].map(size_mapping)
class_mapping = {label:idx for idx,label in enumerate(set(df['class label']))}
df['class label'] = df['class label'].map(class_mapping)
df
|   | color | size | prize | class label |
|---|-------|------|-------|-------------|
| 0 | green | 1    | 10.1  | 0           |
| 1 | red   | 2    | 13.5  | 1           |
| 2 | blue  | 3    | 15.3  | 0           |
Applying get_dummies will create a new column for every unique string in a certain column:
pd.get_dummies(df)
|   | size | prize | class label | color_blue | color_green | color_red |
|---|------|-------|-------------|------------|-------------|-----------|
| 0 | 1    | 10.1  | 0           | 0          | 1           | 0         |
| 1 | 2    | 13.5  | 1           | 0          | 0           | 1         |
| 2 | 3    | 15.3  | 0           | 1          | 0           | 0         |
Note that the get_dummies function leaves the numeric columns untouched. How convenient!
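If only the color column should be expanded, get_dummies can also be applied to a single Series and joined back onto the DataFrame - a quick sketch of that variation:

```python
import pandas as pd

df = pd.DataFrame([
    ['green', 1, 10.1, 0],
    ['red', 2, 13.5, 1],
    ['blue', 3, 15.3, 0]],
    columns=['color', 'size', 'prize', 'class label'])

# dummy-encode just the color column, then drop the original
# column and concatenate the new indicator columns
dummies = pd.get_dummies(df['color'], prefix='color')
df = pd.concat([df.drop('color', axis=1), dummies], axis=1)
print(df.columns.tolist())
```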