Guide To Encoding Categorical Values in Python

Supporting notebook for article on Practical Business Python.

Import the pandas, scikit-learn, numpy and category_encoders libraries.

In [1]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import LabelBinarizer, LabelEncoder

import category_encoders as ce

The data file does not contain a header row, so define the column names first.

In [2]:
headers = ["symboling", "normalized_losses", "make", "fuel_type", "aspiration", "num_doors", "body_style",
           "drive_wheels", "engine_location", "wheel_base", "length", "width", "height", "curb_weight",
           "engine_type", "num_cylinders", "engine_size", "fuel_system", "bore", "stroke", 
           "compression_ratio", "horsepower", "peak_rpm", "city_mpg", "highway_mpg", "price"]

Read in the data from the URL, add the headers and convert ? to NaN values.

In [3]:
df = pd.read_csv("http://mlr.cs.umass.edu/ml/machine-learning-databases/autos/imports-85.data",
                 header=None, names=headers, na_values="?" )
In [4]:
df.head()
Out[4]:
symboling normalized_losses make fuel_type aspiration num_doors body_style drive_wheels engine_location wheel_base ... engine_size fuel_system bore stroke compression_ratio horsepower peak_rpm city_mpg highway_mpg price
0 3 NaN alfa-romero gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111.0 5000.0 21 27 13495.0
1 3 NaN alfa-romero gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111.0 5000.0 21 27 16500.0
2 1 NaN alfa-romero gas std two hatchback rwd front 94.5 ... 152 mpfi 2.68 3.47 9.0 154.0 5000.0 19 26 16500.0
3 2 164.0 audi gas std four sedan fwd front 99.8 ... 109 mpfi 3.19 3.40 10.0 102.0 5500.0 24 30 13950.0
4 2 164.0 audi gas std four sedan 4wd front 99.4 ... 136 mpfi 3.19 3.40 8.0 115.0 5500.0 18 22 17450.0

5 rows × 26 columns
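
To see what na_values does without downloading anything, here is a minimal sketch that reads two rows from an in-memory string. The three-column layout and the sample_df name are assumptions made for illustration; the values mirror rows shown in the output above.

```python
import io

import pandas as pd

# Two sample rows with three of the columns; "?" marks a missing value.
sample = io.StringIO("3,?,alfa-romero\n2,164,audi\n")

sample_df = pd.read_csv(sample, header=None,
                        names=["symboling", "normalized_losses", "make"],
                        na_values="?")

# The "?" becomes NaN, so the column is parsed as float instead of object.
print(sample_df)
print(sample_df.dtypes)
```

Without na_values, the ? would force the whole column to be read as strings.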

Look at the data types contained in the dataframe

In [5]:
df.dtypes
Out[5]:
symboling              int64
normalized_losses    float64
make                  object
fuel_type             object
aspiration            object
num_doors             object
body_style            object
drive_wheels          object
engine_location       object
wheel_base           float64
length               float64
width                float64
height               float64
curb_weight            int64
engine_type           object
num_cylinders         object
engine_size            int64
fuel_system           object
bore                 float64
stroke               float64
compression_ratio    float64
horsepower           float64
peak_rpm             float64
city_mpg               int64
highway_mpg            int64
price                float64
dtype: object

Create a copy of the data with only the object columns.

In [6]:
obj_df = df.select_dtypes(include=['object']).copy()
In [7]:
obj_df.head()
Out[7]:
make fuel_type aspiration num_doors body_style drive_wheels engine_location engine_type num_cylinders fuel_system
0 alfa-romero gas std two convertible rwd front dohc four mpfi
1 alfa-romero gas std two convertible rwd front dohc four mpfi
2 alfa-romero gas std two hatchback rwd front ohcv six mpfi
3 audi gas std four sedan fwd front ohc four mpfi
4 audi gas std four sedan 4wd front ohc five mpfi

Check for null values in the data

In [8]:
obj_df[obj_df.isnull().any(axis=1)]
Out[8]:
make fuel_type aspiration num_doors body_style drive_wheels engine_location engine_type num_cylinders fuel_system
27 dodge gas turbo NaN sedan fwd front ohc four mpfi
63 mazda diesel std NaN sedan fwd front ohc four idi

Since only the num_doors column contains null values, look at which values it currently takes.

In [9]:
obj_df["num_doors"].value_counts()
Out[9]:
four    114
two      89
Name: num_doors, dtype: int64

We will fill in the missing num_doors values with the most common value, four.

In [10]:
obj_df = obj_df.fillna({"num_doors": "four"})
In [11]:
obj_df[obj_df.isnull().any(axis=1)]
Out[11]:
make fuel_type aspiration num_doors body_style drive_wheels engine_location engine_type num_cylinders fuel_system
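
The fill value above is typed in by hand. As a hedged alternative, the most common value can be looked up with mode() instead. The doors Series below is a made-up stand-in for the num_doors column.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the num_doors column, with two missing entries.
doors = pd.Series(["four", "two", np.nan, "four", np.nan], name="num_doors")

# mode() skips NaN by default and returns a Series; take the first entry.
most_common = doors.mode()[0]

filled = doors.fillna(most_common)
print(most_common)
print(filled.tolist())
```

If two values tie for most common, mode() returns both in sorted order, so taking the first entry is an arbitrary but repeatable choice.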

Encoding values using pandas

Convert the num_cylinders and num_doors values to numbers

In [12]:
obj_df["num_cylinders"].value_counts()
Out[12]:
four      159
six        24
five       11
eight       5
two         4
twelve      1
three       1
Name: num_cylinders, dtype: int64
In [13]:
cleanup_nums = {"num_doors":     {"four": 4, "two": 2},
                "num_cylinders": {"four": 4, "six": 6, "five": 5, "eight": 8,
                                  "two": 2, "twelve": 12, "three":3 }}
In [14]:
obj_df.replace(cleanup_nums, inplace=True)
In [15]:
obj_df.head()
Out[15]:
make fuel_type aspiration num_doors body_style drive_wheels engine_location engine_type num_cylinders fuel_system
0 alfa-romero gas std 2 convertible rwd front dohc 4 mpfi
1 alfa-romero gas std 2 convertible rwd front dohc 4 mpfi
2 alfa-romero gas std 2 hatchback rwd front ohcv 6 mpfi
3 audi gas std 4 sedan fwd front ohc 4 mpfi
4 audi gas std 4 sedan 4wd front ohc 5 mpfi

Check the data types to make sure they are coming through as numbers

In [16]:
obj_df.dtypes
Out[16]:
make               object
fuel_type          object
aspiration         object
num_doors           int64
body_style         object
drive_wheels       object
engine_location    object
engine_type        object
num_cylinders       int64
fuel_system        object
dtype: object
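
The nested dictionary passed to replace maps each column name to its own value mapping. A similar result can be sketched with Series.map and an explicit integer cast, one column at a time. The toy frame and the word_to_num name are assumptions for illustration.

```python
import pandas as pd

# Toy frame with the two text-number columns.
toy = pd.DataFrame({"num_doors": ["two", "four", "four"],
                    "num_cylinders": ["four", "six", "five"]})

# One shared lookup covering every label seen in either column.
word_to_num = {"two": 2, "three": 3, "four": 4, "five": 5,
               "six": 6, "eight": 8, "twelve": 12}

for col in ["num_doors", "num_cylinders"]:
    # map() turns unmapped labels into NaN, so the integer cast
    # raises an error instead of silently keeping an unexpected label.
    toy[col] = toy[col].map(word_to_num).astype("int64")

print(toy)
print(toy.dtypes)
```

Unlike replace, which leaves unknown labels untouched, this version fails loudly when a label is missing from the lookup.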

One approach to encoding labels is to convert the values to a pandas category

In [17]:
obj_df["body_style"].value_counts()
Out[17]:
sedan          96
hatchback      70
wagon          25
hardtop         8
convertible     6
Name: body_style, dtype: int64
In [18]:
obj_df["body_style"] = obj_df["body_style"].astype('category')
In [19]:
obj_df.dtypes
Out[19]:
make                 object
fuel_type            object
aspiration           object
num_doors             int64
body_style         category
drive_wheels         object
engine_location      object
engine_type          object
num_cylinders         int64
fuel_system          object
dtype: object

We can assign the category codes to a new column so we have a clean numeric representation

In [20]:
obj_df["body_style_cat"] = obj_df["body_style"].cat.codes
In [21]:
obj_df.head()
Out[21]:
make fuel_type aspiration num_doors body_style drive_wheels engine_location engine_type num_cylinders fuel_system body_style_cat
0 alfa-romero gas std 2 convertible rwd front dohc 4 mpfi 0
1 alfa-romero gas std 2 convertible rwd front dohc 4 mpfi 0
2 alfa-romero gas std 2 hatchback rwd front ohcv 6 mpfi 2
3 audi gas std 4 sedan fwd front ohc 4 mpfi 3
4 audi gas std 4 sedan 4wd front ohc 5 mpfi 3
In [22]:
obj_df.dtypes
Out[22]:
make                 object
fuel_type            object
aspiration           object
num_doors             int64
body_style         category
drive_wheels         object
engine_location      object
engine_type          object
num_cylinders         int64
fuel_system          object
body_style_cat         int8
dtype: object
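
To see which label each code stands for, the categories can be listed next to their positions. This is a small sketch on a made-up Series; the codes follow the sorted order of the labels present, so they differ from the full data set, where hardtop also appears.

```python
import pandas as pd

# Toy stand-in for the body_style column.
styles = pd.Series(["sedan", "convertible", "hatchback", "sedan"]).astype("category")

# Categories are stored in sorted order; a code is a position in that list.
code_to_label = dict(enumerate(styles.cat.categories))

print(code_to_label)
print(styles.cat.codes.tolist())
```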

In order to do one hot encoding, use pandas get_dummies

In [23]:
pd.get_dummies(obj_df, columns=["drive_wheels"]).head()
Out[23]:
make fuel_type aspiration num_doors body_style engine_location engine_type num_cylinders fuel_system body_style_cat drive_wheels_4wd drive_wheels_fwd drive_wheels_rwd
0 alfa-romero gas std 2 convertible front dohc 4 mpfi 0 0.0 0.0 1.0
1 alfa-romero gas std 2 convertible front dohc 4 mpfi 0 0.0 0.0 1.0
2 alfa-romero gas std 2 hatchback front ohcv 6 mpfi 2 0.0 0.0 1.0
3 audi gas std 4 sedan front ohc 4 mpfi 3 0.0 1.0 0.0
4 audi gas std 4 sedan front ohc 5 mpfi 3 1.0 0.0 0.0

get_dummies has options for selecting the columns and adding prefixes to make the resulting data easier to understand.

In [24]:
pd.get_dummies(obj_df, columns=["body_style", "drive_wheels"], prefix=["body", "drive"]).head()
Out[24]:
make fuel_type aspiration num_doors engine_location engine_type num_cylinders fuel_system body_style_cat body_convertible body_hardtop body_hatchback body_sedan body_wagon drive_4wd drive_fwd drive_rwd
0 alfa-romero gas std 2 front dohc 4 mpfi 0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
1 alfa-romero gas std 2 front dohc 4 mpfi 0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
2 alfa-romero gas std 2 front ohcv 6 mpfi 2 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0
3 audi gas std 4 front ohc 4 mpfi 3 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0
4 audi gas std 4 front ohc 5 mpfi 3 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0
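
A self-contained sketch of the same call on a toy frame. The dtype of the indicator columns has varied across pandas versions (the output above shows floats), so this example passes dtype=int to pin it down. The toy data is made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({"make": ["alfa-romero", "audi", "audi"],
                    "drive_wheels": ["rwd", "fwd", "4wd"]})

# Only the listed column is expanded; other columns pass through unchanged.
encoded = pd.get_dummies(toy, columns=["drive_wheels"], prefix=["drive"], dtype=int)

print(encoded)
```

Each row has exactly one 1 across the drive_ columns.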

Another approach to encoding values is to select an attribute and convert it to True or False. In this case, we can check if an engine is an OHC or not.

In [25]:
obj_df["engine_type"].value_counts()
Out[25]:
ohc      148
ohcf      15
ohcv      13
dohc      12
l         12
rotor      4
dohcv      1
Name: engine_type, dtype: int64

Use np.where and the str accessor to do this in one efficient line

In [26]:
obj_df["OHC_Code"] = np.where(obj_df["engine_type"].str.contains("ohc"), 1, 0)
In [27]:
obj_df[["make", "engine_type", "OHC_Code"]].head(20)
Out[27]:
make engine_type OHC_Code
0 alfa-romero dohc 1
1 alfa-romero dohc 1
2 alfa-romero ohcv 1
3 audi ohc 1
4 audi ohc 1
5 audi ohc 1
6 audi ohc 1
7 audi ohc 1
8 audi ohc 1
9 audi ohc 1
10 bmw ohc 1
11 bmw ohc 1
12 bmw ohc 1
13 bmw ohc 1
14 bmw ohc 1
15 bmw ohc 1
16 bmw ohc 1
17 bmw ohc 1
18 chevrolet l 0
19 chevrolet ohc 1
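
Note that str.contains does a substring match, so dohc, ohcv and the other variants all count as OHC. The sketch below contrasts that with an exact comparison on a made-up Series.

```python
import numpy as np
import pandas as pd

engines = pd.Series(["ohc", "dohc", "ohcv", "l", "rotor"])

# Substring match: any engine type containing "ohc" is flagged.
contains_ohc = np.where(engines.str.contains("ohc"), 1, 0)

# Exact match: only the plain "ohc" type is flagged.
exact_ohc = np.where(engines == "ohc", 1, 0)

print(contains_ohc.tolist())
print(exact_ohc.tolist())
```

Which of the two is appropriate depends on whether the variants should be grouped together.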

Encoding Values Using Scikit-learn

Instantiate the LabelEncoder

In [28]:
lb_make = LabelEncoder()
In [29]:
obj_df["make_code"] = lb_make.fit_transform(obj_df["make"])
In [30]:
obj_df[["make", "make_code"]].head(11)
Out[30]:
make make_code
0 alfa-romero 0
1 alfa-romero 0
2 alfa-romero 0
3 audi 1
4 audi 1
5 audi 1
6 audi 1
7 audi 1
8 audi 1
9 audi 1
10 bmw 2
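
The fitted encoder keeps the label list, so codes can be turned back into text. A minimal sketch on a made-up list of makes:

```python
from sklearn.preprocessing import LabelEncoder

makes = ["audi", "bmw", "alfa-romero", "audi"]

enc = LabelEncoder()
codes = enc.fit_transform(makes)

# classes_ holds the sorted labels; a code is an index into it.
print(list(enc.classes_))
print(list(codes))

# inverse_transform recovers the original labels from the codes.
print(list(enc.inverse_transform(codes)))
```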

To accomplish something similar to pandas get_dummies, use LabelBinarizer

In [31]:
lb_style = LabelBinarizer()
lb_results = lb_style.fit_transform(obj_df["body_style"])

The results are an array that needs to be converted to a DataFrame

In [32]:
lb_results
Out[32]:
array([[1, 0, 0, 0, 0],
       [1, 0, 0, 0, 0],
       [0, 0, 1, 0, 0],
       ..., 
       [0, 0, 0, 1, 0],
       [0, 0, 0, 1, 0],
       [0, 0, 0, 1, 0]])
In [33]:
pd.DataFrame(lb_results, columns=lb_style.classes_).head()
Out[33]:
convertible hardtop hatchback sedan wagon
0 1 0 0 0 0
1 1 0 0 0 0
2 0 0 1 0 0
3 0 0 0 1 0
4 0 0 0 1 0
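
When the new DataFrame has to be joined back to the original data, it can help to pass the original index along, since the array itself carries none. A hedged sketch with a made-up Series and a non-default index:

```python
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

# Toy column whose index does not start at 0.
styles = pd.Series(["sedan", "wagon", "hatchback"], index=[10, 11, 12])

lb = LabelBinarizer()
onehot = pd.DataFrame(lb.fit_transform(styles),
                      columns=lb.classes_,
                      index=styles.index)  # keep rows aligned with the source

print(onehot)
```

With only two distinct labels, LabelBinarizer returns a single column rather than two, so the columns=lb.classes_ step would need adjusting in that case.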

Advanced Encoding

category_encoders library

In [34]:
# Get a new clean dataframe
obj_df = df.select_dtypes(include=['object']).copy()
In [35]:
obj_df.head()
Out[35]:
make fuel_type aspiration num_doors body_style drive_wheels engine_location engine_type num_cylinders fuel_system
0 alfa-romero gas std two convertible rwd front dohc four mpfi
1 alfa-romero gas std two convertible rwd front dohc four mpfi
2 alfa-romero gas std two hatchback rwd front ohcv six mpfi
3 audi gas std four sedan fwd front ohc four mpfi
4 audi gas std four sedan 4wd front ohc five mpfi

Try out the Backward Difference Encoder on the engine_type column

In [36]:
encoder = ce.backward_difference.BackwardDifferenceEncoder(cols=["engine_type"])
encoder.fit(obj_df, verbose=1)
Out[36]:
BackwardDifferenceEncoder(cols=['engine_type'], drop_invariant=False,
             return_df=True, verbose=0)
In [37]:
encoder.transform(obj_df).iloc[:,0:7].head()
Out[37]:
col_engine_type_0 col_engine_type_1 col_engine_type_2 col_engine_type_3 col_engine_type_4 col_engine_type_5 col_engine_type_6
0 1.0 0.142857 0.285714 0.428571 0.571429 0.714286 -0.142857
1 1.0 0.142857 0.285714 0.428571 0.571429 0.714286 -0.142857
2 1.0 0.142857 0.285714 0.428571 0.571429 0.714286 0.857143
3 1.0 0.142857 -0.714286 -0.571429 -0.428571 -0.285714 -0.142857
4 1.0 0.142857 -0.714286 -0.571429 -0.428571 -0.285714 -0.142857
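
For some intuition about the fractional values above, here is a sketch of the textbook backward difference contrast matrix built with NumPy. It follows the standard statistical definition and is not taken from the category_encoders source; the function name is hypothetical, and the library's column layout and level ordering may differ between versions.

```python
import numpy as np

def backward_difference_contrasts(k):
    """Return the k x (k-1) backward difference contrast matrix (hypothetical helper)."""
    rows = np.arange(1, k + 1).reshape(-1, 1)  # level index i, as a column
    cols = np.arange(1, k).reshape(1, -1)      # contrast index j, as a row
    # Levels up to j get -(k - j) / k, levels after j get j / k.
    return np.where(rows <= cols, -(k - cols) / k, cols / k)

# Seven levels, matching the number of distinct engine_type values.
print(np.round(backward_difference_contrasts(7), 6))
```

Each column sums to zero, and with seven levels every entry is a multiple of 1/7, which is where values such as 0.142857 and -0.714286 come from.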

Another approach is to use a polynomial encoding.

In [38]:
encoder = ce.polynomial.PolynomialEncoder(cols=["engine_type"])
encoder.fit(obj_df, verbose=1)
Out[38]:
PolynomialEncoder(cols=['engine_type'], drop_invariant=False, return_df=True,
         verbose=0)
In [39]:
encoder.transform(obj_df).iloc[:,0:7].head()
Out[39]:
col_engine_type_0 col_engine_type_1 col_engine_type_2 col_engine_type_3 col_engine_type_4 col_engine_type_5 col_engine_type_6
0 1.0 -5.669467e-01 5.455447e-01 -4.082483e-01 0.241747 -1.091089e-01 0.032898
1 1.0 -5.669467e-01 5.455447e-01 -4.082483e-01 0.241747 -1.091089e-01 0.032898
2 1.0 3.779645e-01 3.970680e-17 -4.082483e-01 -0.564076 -4.364358e-01 -0.197386
3 1.0 1.347755e-17 -4.364358e-01 1.528598e-17 0.483494 8.990141e-18 -0.657952
4 1.0 1.347755e-17 -4.364358e-01 1.528598e-17 0.483494 8.990141e-18 -0.657952