Import the pandas, scikit-learn, numpy and category_encoders libraries.
import pandas as pd
import numpy as np
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score
import category_encoders as ce
Need to define the headers ourselves, since the data file does not contain any.
# The 26 column names for the auto dataset, in file order.
headers = [
    "symboling", "normalized_losses", "make", "fuel_type", "aspiration",
    "num_doors", "body_style", "drive_wheels", "engine_location", "wheel_base",
    "length", "width", "height", "curb_weight", "engine_type", "num_cylinders",
    "engine_size", "fuel_system", "bore", "stroke", "compression_ratio",
    "horsepower", "peak_rpm", "city_mpg", "highway_mpg", "price",
]
Read in the data from the url, add headers and convert ? to nan values
# Fetch the raw dataset, attach our column names, and treat "?" as missing.
data_url = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
            "autos/imports-85.data")
df = pd.read_csv(data_url, header=None, names=headers, na_values="?")
df.head()
symboling | normalized_losses | make | fuel_type | aspiration | num_doors | body_style | drive_wheels | engine_location | wheel_base | ... | engine_size | fuel_system | bore | stroke | compression_ratio | horsepower | peak_rpm | city_mpg | highway_mpg | price | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3 | NaN | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111.0 | 5000.0 | 21 | 27 | 13495.0 |
1 | 3 | NaN | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111.0 | 5000.0 | 21 | 27 | 16500.0 |
2 | 1 | NaN | alfa-romero | gas | std | two | hatchback | rwd | front | 94.5 | ... | 152 | mpfi | 2.68 | 3.47 | 9.0 | 154.0 | 5000.0 | 19 | 26 | 16500.0 |
3 | 2 | 164.0 | audi | gas | std | four | sedan | fwd | front | 99.8 | ... | 109 | mpfi | 3.19 | 3.40 | 10.0 | 102.0 | 5500.0 | 24 | 30 | 13950.0 |
4 | 2 | 164.0 | audi | gas | std | four | sedan | 4wd | front | 99.4 | ... | 136 | mpfi | 3.19 | 3.40 | 8.0 | 115.0 | 5500.0 | 18 | 22 | 17450.0 |
5 rows × 26 columns
Look at the data types contained in the dataframe
# Inspect each column's dtype; the text columns show up as ``object``.
df.dtypes
symboling int64 normalized_losses float64 make object fuel_type object aspiration object num_doors object body_style object drive_wheels object engine_location object wheel_base float64 length float64 width float64 height float64 curb_weight int64 engine_type object num_cylinders object engine_size int64 fuel_system object bore float64 stroke float64 compression_ratio float64 horsepower float64 peak_rpm float64 city_mpg int64 highway_mpg int64 price float64 dtype: object
Create a copy of the data with only the object columns.
# Work on a standalone copy holding only the text (object-dtype) columns.
object_columns = df.select_dtypes(include=['object'])
obj_df = object_columns.copy()
obj_df.head()
make | fuel_type | aspiration | num_doors | body_style | drive_wheels | engine_location | engine_type | num_cylinders | fuel_system | |
---|---|---|---|---|---|---|---|---|---|---|
0 | alfa-romero | gas | std | two | convertible | rwd | front | dohc | four | mpfi |
1 | alfa-romero | gas | std | two | convertible | rwd | front | dohc | four | mpfi |
2 | alfa-romero | gas | std | two | hatchback | rwd | front | ohcv | six | mpfi |
3 | audi | gas | std | four | sedan | fwd | front | ohc | four | mpfi |
4 | audi | gas | std | four | sedan | 4wd | front | ohc | five | mpfi |
Check for null values in the data
# Show every row that still contains at least one missing value.
obj_df.loc[obj_df.isnull().any(axis=1)]
make | fuel_type | aspiration | num_doors | body_style | drive_wheels | engine_location | engine_type | num_cylinders | fuel_system | |
---|---|---|---|---|---|---|---|---|---|---|
27 | dodge | gas | turbo | NaN | sedan | fwd | front | ohc | four | mpfi |
63 | mazda | diesel | std | NaN | sedan | fwd | front | ohc | four | idi |
Since the num_doors column contains the null values, look at what the current value options are.
obj_df["num_doors"].value_counts()
four 114 two 89 Name: num_doors, dtype: int64
We will fill in the doors value with the most common element - four.
# Impute the missing door counts with the most common value, "four",
# then re-run the null scan to confirm nothing is left.
obj_df["num_doors"] = obj_df["num_doors"].fillna("four")
obj_df.loc[obj_df.isnull().any(axis=1)]
make | fuel_type | aspiration | num_doors | body_style | drive_wheels | engine_location | engine_type | num_cylinders | fuel_system |
---|
Convert the num_cylinders and num_doors values to numbers
obj_df["num_cylinders"].value_counts()
four 159 six 24 five 11 eight 5 two 4 three 1 twelve 1 Name: num_cylinders, dtype: int64
cleanup_nums = {"num_doors": {"four": 4, "two": 2},
"num_cylinders": {"four": 4, "six": 6, "five": 5, "eight": 8,
"two": 2, "twelve": 12, "three":3 }}
obj_df = obj_df.replace(cleanup_nums)
obj_df.head()
make | fuel_type | aspiration | num_doors | body_style | drive_wheels | engine_location | engine_type | num_cylinders | fuel_system | |
---|---|---|---|---|---|---|---|---|---|---|
0 | alfa-romero | gas | std | 2 | convertible | rwd | front | dohc | 4 | mpfi |
1 | alfa-romero | gas | std | 2 | convertible | rwd | front | dohc | 4 | mpfi |
2 | alfa-romero | gas | std | 2 | hatchback | rwd | front | ohcv | 6 | mpfi |
3 | audi | gas | std | 4 | sedan | fwd | front | ohc | 4 | mpfi |
4 | audi | gas | std | 4 | sedan | 4wd | front | ohc | 5 | mpfi |
# Confirm num_doors and num_cylinders are now int64 after the replace().
obj_df.dtypes
make object fuel_type object aspiration object num_doors int64 body_style object drive_wheels object engine_location object engine_type object num_cylinders int64 fuel_system object dtype: object
One approach to encoding labels is to convert the values to a pandas category
obj_df["body_style"].value_counts()
sedan 96 hatchback 70 wagon 25 hardtop 8 convertible 6 Name: body_style, dtype: int64
obj_df["body_style"] = obj_df["body_style"].astype('category')
obj_df.dtypes
make object fuel_type object aspiration object num_doors int64 body_style category drive_wheels object engine_location object engine_type object num_cylinders int64 fuel_system object dtype: object
We can assign the category codes to a new column so we have a clean numeric representation
obj_df["body_style_cat"] = obj_df["body_style"].cat.codes
obj_df.head()
make | fuel_type | aspiration | num_doors | body_style | drive_wheels | engine_location | engine_type | num_cylinders | fuel_system | body_style_cat | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | alfa-romero | gas | std | 2 | convertible | rwd | front | dohc | 4 | mpfi | 0 |
1 | alfa-romero | gas | std | 2 | convertible | rwd | front | dohc | 4 | mpfi | 0 |
2 | alfa-romero | gas | std | 2 | hatchback | rwd | front | ohcv | 6 | mpfi | 2 |
3 | audi | gas | std | 4 | sedan | fwd | front | ohc | 4 | mpfi | 3 |
4 | audi | gas | std | 4 | sedan | 4wd | front | ohc | 5 | mpfi | 3 |
# body_style is category dtype and body_style_cat holds its int8 codes.
obj_df.dtypes
make object fuel_type object aspiration object num_doors int64 body_style category drive_wheels object engine_location object engine_type object num_cylinders int64 fuel_system object body_style_cat int8 dtype: object
In order to do one hot encoding, use pandas get_dummies
# One-hot encode drive_wheels: each category becomes its own indicator column.
drive_dummies = pd.get_dummies(obj_df, columns=["drive_wheels"])
drive_dummies.head()
make | fuel_type | aspiration | num_doors | body_style | engine_location | engine_type | num_cylinders | fuel_system | body_style_cat | drive_wheels_4wd | drive_wheels_fwd | drive_wheels_rwd | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | alfa-romero | gas | std | 2 | convertible | front | dohc | 4 | mpfi | 0 | 0 | 0 | 1 |
1 | alfa-romero | gas | std | 2 | convertible | front | dohc | 4 | mpfi | 0 | 0 | 0 | 1 |
2 | alfa-romero | gas | std | 2 | hatchback | front | ohcv | 6 | mpfi | 2 | 0 | 0 | 1 |
3 | audi | gas | std | 4 | sedan | front | ohc | 4 | mpfi | 3 | 0 | 1 | 0 |
4 | audi | gas | std | 4 | sedan | front | ohc | 5 | mpfi | 3 | 1 | 0 | 0 |
get_dummies has options for selecting the columns and adding prefixes to make the resulting data easier to understand.
# Encode two columns at once, shortening the generated names with prefixes.
encoded = pd.get_dummies(obj_df,
                         columns=["body_style", "drive_wheels"],
                         prefix=["body", "drive"])
encoded.head()
make | fuel_type | aspiration | num_doors | engine_location | engine_type | num_cylinders | fuel_system | body_style_cat | body_convertible | body_hardtop | body_hatchback | body_sedan | body_wagon | drive_4wd | drive_fwd | drive_rwd | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | alfa-romero | gas | std | 2 | front | dohc | 4 | mpfi | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
1 | alfa-romero | gas | std | 2 | front | dohc | 4 | mpfi | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
2 | alfa-romero | gas | std | 2 | front | ohcv | 6 | mpfi | 2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
3 | audi | gas | std | 4 | front | ohc | 4 | mpfi | 3 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
4 | audi | gas | std | 4 | front | ohc | 5 | mpfi | 3 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
obj_df["engine_type"].value_counts()
ohc 148 ohcf 15 ohcv 13 l 12 dohc 12 rotor 4 dohcv 1 Name: engine_type, dtype: int64
Use np.where and the str accessor to flag the overhead-cam (ohc) engine variants in one efficient line.
obj_df["OHC_Code"] = np.where(obj_df["engine_type"].str.contains("ohc"), 1, 0)
obj_df[["make", "engine_type", "OHC_Code"]].head(20)
make | engine_type | OHC_Code | |
---|---|---|---|
0 | alfa-romero | dohc | 1 |
1 | alfa-romero | dohc | 1 |
2 | alfa-romero | ohcv | 1 |
3 | audi | ohc | 1 |
4 | audi | ohc | 1 |
5 | audi | ohc | 1 |
6 | audi | ohc | 1 |
7 | audi | ohc | 1 |
8 | audi | ohc | 1 |
9 | audi | ohc | 1 |
10 | bmw | ohc | 1 |
11 | bmw | ohc | 1 |
12 | bmw | ohc | 1 |
13 | bmw | ohc | 1 |
14 | bmw | ohc | 1 |
15 | bmw | ohc | 1 |
16 | bmw | ohc | 1 |
17 | bmw | ohc | 1 |
18 | chevrolet | l | 0 |
19 | chevrolet | ohc | 1 |
Instantiate the OrdinalEncoder
# Fit an OrdinalEncoder on the make column: each manufacturer gets a float code.
ord_enc = OrdinalEncoder()
make_codes = ord_enc.fit_transform(obj_df[["make"]])
obj_df["make_code"] = make_codes
obj_df[["make", "make_code"]].head(11)
make | make_code | |
---|---|---|
0 | alfa-romero | 0.0 |
1 | alfa-romero | 0.0 |
2 | alfa-romero | 0.0 |
3 | audi | 1.0 |
4 | audi | 1.0 |
5 | audi | 1.0 |
6 | audi | 1.0 |
7 | audi | 1.0 |
8 | audi | 1.0 |
9 | audi | 1.0 |
10 | bmw | 2.0 |
To accomplish something similar to pandas get_dummies, use OneHotEncoder
# OneHotEncoder mirrors get_dummies but plugs into scikit-learn pipelines.
oe_style = OneHotEncoder()
# Fit and transform in one step; keep the fitted encoder for its categories_.
oe_results = oe_style.fit_transform(obj_df[["body_style"]])
The results are an array that needs to be converted to a DataFrame
# fit_transform returns a sparse matrix; densify it to see the indicator values.
oe_results.toarray()
array([[1., 0., 0., 0., 0.], [1., 0., 0., 0., 0.], [0., 0., 1., 0., 0.], ..., [0., 0., 0., 1., 0.], [0., 0., 0., 1., 0.], [0., 0., 0., 1., 0.]])
# Wrap the dense array in a DataFrame. categories_ is a list holding one array
# per encoded column, so take element 0 to get a flat column Index rather than
# an accidental single-level MultiIndex.
pd.DataFrame(oe_results.toarray(), columns=oe_style.categories_[0]).head()
convertible | hardtop | hatchback | sedan | wagon | |
---|---|---|---|---|---|
0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
3 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
4 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
category_encoders library
# Get a new clean dataframe: keep only the object-dtype (text) columns so the
# helper columns added above don't leak into the next examples.
text_cols = df.select_dtypes(include=['object'])
obj_df = text_cols.copy()
obj_df.head()
make | fuel_type | aspiration | num_doors | body_style | drive_wheels | engine_location | engine_type | num_cylinders | fuel_system | |
---|---|---|---|---|---|---|---|---|---|---|
0 | alfa-romero | gas | std | two | convertible | rwd | front | dohc | four | mpfi |
1 | alfa-romero | gas | std | two | convertible | rwd | front | dohc | four | mpfi |
2 | alfa-romero | gas | std | two | hatchback | rwd | front | ohcv | six | mpfi |
3 | audi | gas | std | four | sedan | fwd | front | ohc | four | mpfi |
4 | audi | gas | std | four | sedan | 4wd | front | ohc | five | mpfi |
Try out the Backward Difference Encoder on the engine_type column
# Specify the columns to encode then fit and transform
encoder = ce.BackwardDifferenceEncoder(cols=["engine_type"])
encoder.fit(obj_df, verbose=1)
/home/chris/miniconda3/envs/pbpcode/lib/python3.8/site-packages/category_encoders/utils.py:21: FutureWarning: is_categorical is deprecated and will be removed in a future version. Use is_categorical_dtype instead elif pd.api.types.is_categorical(cols):
BackwardDifferenceEncoder(cols=['engine_type'], mapping=[{'col': 'engine_type', 'mapping': engine_type_0 engine_type_1 engine_type_2 engine_type_3 engine_type_4 \ 1 -0.857143 -0.714286 -0.571429 -0.428571 -0.285714 2 0.142857 -0.714286 -0.571429 -0.428571 -0.285714 3 0.142857 0.285714 -0.571429 -0.428571 -0.285714 4 0.142857 0.285714 0.428571 -0.428571 -0.285714 5 0.142857 0.285714 0.428571 0.571429 -0.285714 6 0.142857 0.285714 0.428571 0.571429 0.714286 7 0.142857 0.285714 0.428571 0.571429 0.714286 -1 0.000000 0.000000 0.000000 0.000000 0.000000 -2 0.000000 0.000000 0.000000 0.000000 0.000000 engine_type_5 1 -0.142857 2 -0.142857 3 -0.142857 4 -0.142857 5 -0.142857 6 -0.142857 7 0.857143 -1 0.000000 -2 0.000000 }])
# Transform and slice out just the generated engine_type contrast columns.
encoder.fit_transform(obj_df).iloc[:,8:14].head()
/home/chris/miniconda3/envs/pbpcode/lib/python3.8/site-packages/category_encoders/utils.py:21: FutureWarning: is_categorical is deprecated and will be removed in a future version. Use is_categorical_dtype instead elif pd.api.types.is_categorical(cols):
engine_type_0 | engine_type_1 | engine_type_2 | engine_type_3 | engine_type_4 | engine_type_5 | |
---|---|---|---|---|---|---|
0 | -0.857143 | -0.714286 | -0.571429 | -0.428571 | -0.285714 | -0.142857 |
1 | -0.857143 | -0.714286 | -0.571429 | -0.428571 | -0.285714 | -0.142857 |
2 | 0.142857 | -0.714286 | -0.571429 | -0.428571 | -0.285714 | -0.142857 |
3 | 0.142857 | 0.285714 | -0.571429 | -0.428571 | -0.285714 | -0.142857 |
4 | 0.142857 | 0.285714 | -0.571429 | -0.428571 | -0.285714 | -0.142857 |
Another approach is to use a polynomial encoding.
# Polynomial (orthogonal) contrasts: another contrast-coding scheme, most
# meaningful when the category levels have a natural ordering.
encoder = ce.polynomial.PolynomialEncoder(cols=["engine_type"])
transformed = encoder.fit_transform(obj_df, verbose=1)
transformed.iloc[:, 8:14].head()
/home/chris/miniconda3/envs/pbpcode/lib/python3.8/site-packages/category_encoders/utils.py:21: FutureWarning: is_categorical is deprecated and will be removed in a future version. Use is_categorical_dtype instead elif pd.api.types.is_categorical(cols):
engine_type_0 | engine_type_1 | engine_type_2 | engine_type_3 | engine_type_4 | engine_type_5 | |
---|---|---|---|---|---|---|
0 | -0.566947 | 0.545545 | -0.408248 | 0.241747 | -0.109109 | 0.032898 |
1 | -0.566947 | 0.545545 | -0.408248 | 0.241747 | -0.109109 | 0.032898 |
2 | -0.377964 | 0.000000 | 0.408248 | -0.564076 | 0.436436 | -0.197386 |
3 | -0.188982 | -0.327327 | 0.408248 | 0.080582 | -0.545545 | 0.493464 |
4 | -0.188982 | -0.327327 | 0.408248 | 0.080582 | -0.545545 | 0.493464 |
Show an example of how to incorporate the encoding strategies into a scikit-learn pipeline
# for the purposes of this analysis, only use a small subset of features
feature_cols = ['fuel_type', 'make', 'aspiration', 'highway_mpg',
                'city_mpg', 'curb_weight', 'drive_wheels']
# Remove the empty price rows so the regression target has no missing values.
df_ml = df.dropna(subset=['price'])
X = df_ml[feature_cols]
y = df_ml['price']
# One-hot the nominal columns, ordinal-encode aspiration, and pass the
# remaining numeric columns through untouched.
column_trans = make_column_transformer(
    (OneHotEncoder(handle_unknown='ignore'), ['fuel_type', 'make', 'drive_wheels']),
    (OrdinalEncoder(), ['aspiration']),
    remainder='passthrough',
)
linreg = LinearRegression()
# Bundle the encoding and the model so both travel together during CV.
pipe = make_pipeline(column_trans, linreg)
# 10-fold CV scored as negative MAE (closer to 0 is better).
cross_val_score(pipe, X, y, cv=10, scoring='neg_mean_absolute_error')
array([-4476.0937653 , -1014.54842052, -4227.68553953, -4936.79899194, -1591.8291911 , -3716.06617255, -4293.79197464, -1390.00486495, -1600.57946369, -2124.30041954])
# Get the average of the errors after 10 iterations
fold_scores = cross_val_score(pipe, X, y, cv=10, scoring='neg_mean_absolute_error')
fold_scores.mean().round(2)
-2937.17