This particular Automobile Data Set includes a good mix of categorical values as well as continuous values and serves as a useful example that is relatively easy to understand. Since domain understanding is an important aspect when deciding how to encode various categorical values - this data set makes a good case study.
Read the data into Pandas
import pandas as pd
# Define the headers since the data does not have any
headers = ["symboling", "normalized_losses", "make", "fuel_type", "aspiration",
"num_doors", "body_style", "drive_wheels", "engine_location",
"wheel_base", "length", "width", "height", "curb_weight",
"engine_type", "num_cylinders", "engine_size", "fuel_system",
"bore", "stroke", "compression_ratio", "horsepower", "peak_rpm",
"city_mpg", "highway_mpg", "price"]
# Read in the CSV file and convert "?" to NaN
df = pd.read_csv("http://mlr.cs.umass.edu/ml/machine-learning-databases/autos/imports-85.data",
header=None, names=headers, na_values="?" )
df.head()
symboling | normalized_losses | make | fuel_type | aspiration | num_doors | body_style | drive_wheels | engine_location | wheel_base | ... | engine_size | fuel_system | bore | stroke | compression_ratio | horsepower | peak_rpm | city_mpg | highway_mpg | price | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3 | NaN | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111.0 | 5000.0 | 21 | 27 | 13495.0 |
1 | 3 | NaN | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111.0 | 5000.0 | 21 | 27 | 16500.0 |
2 | 1 | NaN | alfa-romero | gas | std | two | hatchback | rwd | front | 94.5 | ... | 152 | mpfi | 2.68 | 3.47 | 9.0 | 154.0 | 5000.0 | 19 | 26 | 16500.0 |
3 | 2 | 164.0 | audi | gas | std | four | sedan | fwd | front | 99.8 | ... | 109 | mpfi | 3.19 | 3.40 | 10.0 | 102.0 | 5500.0 | 24 | 30 | 13950.0 |
4 | 2 | 164.0 | audi | gas | std | four | sedan | 4wd | front | 99.4 | ... | 136 | mpfi | 3.19 | 3.40 | 8.0 | 115.0 | 5500.0 | 18 | 22 | 17450.0 |
5 rows × 26 columns
df.shape
(205, 26)
df.dtypes
symboling int64 normalized_losses float64 make object fuel_type object aspiration object num_doors object body_style object drive_wheels object engine_location object wheel_base float64 length float64 width float64 height float64 curb_weight int64 engine_type object num_cylinders object engine_size int64 fuel_system object bore float64 stroke float64 compression_ratio float64 horsepower float64 peak_rpm float64 city_mpg int64 highway_mpg int64 price float64 dtype: object
obj_df = df.select_dtypes(include=['object']).copy()
obj_df.head()
make | fuel_type | aspiration | num_doors | body_style | drive_wheels | engine_location | engine_type | num_cylinders | fuel_system | |
---|---|---|---|---|---|---|---|---|---|---|
0 | alfa-romero | gas | std | two | convertible | rwd | front | dohc | four | mpfi |
1 | alfa-romero | gas | std | two | convertible | rwd | front | dohc | four | mpfi |
2 | alfa-romero | gas | std | two | hatchback | rwd | front | ohcv | six | mpfi |
3 | audi | gas | std | four | sedan | fwd | front | ohc | four | mpfi |
4 | audi | gas | std | four | sedan | 4wd | front | ohc | five | mpfi |
Does the database contain missing values? If so, replace them using one of the methods explained in class
Split the data into training and testing sets
Train a Random Forest Regressor to predict the price of a car using the numeric features
Create dummy variables for the categorical features
Train a Random Forest Regressor and compare