"Alternative-fuel" vehicles in Delaware: Data cleaning

Peter Attia, petermattia.com

Last updated November 8, 2017

This notebook investigates alternative-fuel vehicle purchases in Delaware. Alternative-fuel vehicles include:

  • Battery electric vehicles (BEVs)
  • Plug-in hybrid electric vehicles (PHEVs)
  • Propane or natural gas vehicles

Rebates data downloaded on October 22, 2017 (data last updated October 2, 2017) from this link

More information about the program:

This notebook contains the data cleaning process for this dataset. Another notebook shows the data analysis.

Imports and defaults

In [1]:
# IMPORTS
import pandas as pd
from collections import Counter

# DEFAULTS
pd.set_option("display.max_rows",10)
pd.set_option("display.max_columns",20)

Load the rebates dataset

In [2]:
rebates = pd.read_csv('State_Rebates_for_Alternative-Fuel_Vehicles.csv')
rebates.columns = [c.replace(' ', '_') for c in rebates.columns] # add underscores to column names
rebates
Out[2]:
Award_Number Rebate_Amount City State County Zip Age Gender Date_of_Purchase Dealership Vehicle_Type Make Model Year Lease? Gasoline_Emissions Alt-Fuel_Emissions
0 CVR071501 $2200.00 Hockessin DE New Castle 19707 81.0 M 07/20/2015 Sheridan Ford Plug-in Hybrid Ford Fusion Energi 2016 No 14815.0 6575.0
1 CVR071502 $2200.00 Wilmington DE New Castle 19809 47.0 M 07/20/2015 Porter Ford Plug-in Hybrid Ford C-Max Energi 2015 Yes 14815.0 6575.0
2 CVR071503 $2200.00 Wilmington DE New Castle 19810 47.0 M 07/22/2015 Darcars of Lanham Severn. Inc. Electric Ford Focus Electric 2015 Yes 14815.0 5539.0
3 CVR071504 $2200.00 Wilmington DE New Castle 19808 66.0 M 07/20/2015 Sheridan Ford Plug-in Hybrid Ford C-Max Energi 2015 Yes 14815.0 6575.0
4 CVR071505 $2200.00 New Castle DE New Castle 19720 51.0 M 07/24/2015 Sheridan Ford Plug-in Hybrid Ford Fusion Energi 2015 Yes 14815.0 6575.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
510 CVR0616147 $1100.00 Dover DE Kent 19901 NaN NaN 04/20/2016 Coach and Equipment/ Coach Bus Sales Propane Ford E-450 Phoenix 2016 No NaN NaN
511 CVR0616148 $1100.00 Dover DE Kent 19901 NaN NaN 04/20/2016 Coach and Equipment/ Coach Bus Sales Propane Ford E-450 Phoenix 2016 No NaN NaN
512 CVR0616149 $1100.00 Dover DE Kent 19901 NaN NaN 04/20/2016 Coach and Equipment/ Coach Bus Sales Propane Ford E-450 Phoenix 2016 No NaN NaN
513 CVR0616150 $1100.00 Dover DE Kent 19901 NaN NaN 04/20/2016 Coach and Equipment/ Coach Bus Sales Propane Ford E-450 Phoenix 2016 No NaN NaN
514 CVR0616151 $1100.00 Dover DE Kent 19901 NaN NaN 04/20/2016 Coach and Equipment/ Coach Bus Sales Propane Ford E-450 Phoenix 2016 No NaN NaN

515 rows × 17 columns

Data cleaning

County

All alternative fuel vehicles

In [3]:
Counter(rebates["County"])
Out[3]:
Counter({'Kent': 76, 'Kent ': 1, 'New Castle': 330, 'Sussex': 108})

Remove whitespace in 'Kent' with str.strip()

In [4]:
rebates["County"] = rebates["County"].str.strip()
Counter(rebates["County"])
Out[4]:
Counter({'Kent': 77, 'New Castle': 330, 'Sussex': 108})

Makes

In [5]:
Counter(rebates["Make"])
Out[5]:
Counter({'Audi': 5,
         'BMW': 18,
         'BMW ': 7,
         'Cadillac': 1,
         'Chevrolet': 89,
         'FORD': 1,
         'Ford': 268,
         'Ford ': 2,
         'Hyundai': 3,
         'Kia': 1,
         'Mercedes-Benz': 2,
         'Nissan': 20,
         'Porsche ': 2,
         'Tesla': 69,
         'Tesla ': 1,
         'Toyota': 22,
         'Volvo': 2,
         'ford': 2})
In [6]:
rebates["Make"] = rebates["Make"].str.title() # change FORD and ford to Ford
rebates["Make"] = rebates["Make"].str.strip() # remove trailing whitespace
rebates["Make"] = rebates["Make"].str.replace("Bmw","BMW") # to be anal
Counter(rebates["Make"]).most_common()
Out[6]:
[('Ford', 273),
 ('Chevrolet', 89),
 ('Tesla', 70),
 ('BMW', 25),
 ('Toyota', 22),
 ('Nissan', 20),
 ('Audi', 5),
 ('Hyundai', 3),
 ('Mercedes-Benz', 2),
 ('Volvo', 2),
 ('Porsche', 2),
 ('Cadillac', 1),
 ('Kia', 1)]

Models

In [7]:
Counter(rebates["Model"])
Out[7]:
Counter({'330C': 1,
         '330e': 3,
         '530E': 1,
         'A3 Sportback e-tron': 5,
         'B250e': 1,
         'Bolt': 17,
         'C-Max Energi': 11,
         'C-Max Energi ': 106,
         'Cayenne S Hybrid': 2,
         'E-450 Phoenix': 5,
         'ELR': 1,
         'F150': 2,
         'F450': 2,
         'Focus Electric': 9,
         'Fusion Energi': 137,
         'GLE 550': 1,
         'Leaf': 20,
         'Model S': 49,
         'Model X': 20,
         'Prius Prime': 22,
         'S90D': 1,
         'Sonata': 3,
         'Soul EV': 1,
         'Volt': 68,
         'Volt ': 5,
         'X5 40E': 14,
         'XC90': 2,
         'i3': 5,
         'i3 Rex': 1})

This one looks good without any processing. We'll strip whitespace just in case

In [8]:
rebates["Model"] = rebates["Model"].str.strip() # clean whitespace
Counter(rebates["Model"]).most_common()
Out[8]:
[('Fusion Energi', 137),
 ('C-Max Energi', 117),
 ('Volt', 73),
 ('Model S', 49),
 ('Prius Prime', 22),
 ('Leaf', 20),
 ('Model X', 20),
 ('Bolt', 17),
 ('X5 40E', 14),
 ('Focus Electric', 9),
 ('i3', 5),
 ('A3 Sportback e-tron', 5),
 ('E-450 Phoenix', 5),
 ('330e', 3),
 ('Sonata', 3),
 ('XC90', 2),
 ('F150', 2),
 ('Cayenne S Hybrid', 2),
 ('F450', 2),
 ('ELR', 1),
 ('B250e', 1),
 ('Soul EV', 1),
 ('S90D', 1),
 ('330C', 1),
 ('i3 Rex', 1),
 ('GLE 550', 1),
 ('530E', 1)]

Vehicle type

In [9]:
Counter(rebates["Vehicle_Type"]).most_common()
Out[9]:
[('Plug-in Hybrid', 383),
 ('Electric', 123),
 ('Propane', 7),
 ('Propane (bi-fuel)', 2)]

Looks good

City

In [10]:
Counter(rebates["City"])
Out[10]:
Counter({'Bear': 14,
         'Bellefonte': 1,
         'Bethany Beach': 3,
         'Bridgeville': 5,
         'Camden': 3,
         'Camden ': 2,
         'Camden Wyoming': 4,
         'Claymont': 2,
         'Clayton': 6,
         'Clayton ': 1,
         'Dagsboro': 3,
         'Delaware City': 1,
         'Delmar': 1,
         'Dover': 41,
         'Elsmere': 1,
         'Felton': 3,
         'Frankford': 1,
         'Georgetown': 2,
         'Georgetown ': 1,
         'Greenville': 1,
         'Greenwood': 2,
         'Hartly': 1,
         'Hockessin': 14,
         'Kenton': 1,
         'Laurel': 1,
         'Lewes': 24,
         'Lincoln': 2,
         'Magnolia': 3,
         'Magnolia ': 1,
         'Marshallton': 1,
         'Middletown': 34,
         'Middletown ': 5,
         'Milford': 7,
         'Millsboro': 29,
         'Milton': 8,
         'Montchanin': 1,
         'New Castle': 13,
         'New Castle ': 3,
         'Newark': 65,
         'Newark ': 9,
         'Ocean View': 3,
         'Port Penn': 1,
         'Rehoboth Beach': 9,
         'Rockland': 1,
         'Seaford': 5,
         'Seaford ': 2,
         'Selbyville': 2,
         'Selbyville ': 1,
         'Smyrna': 14,
         'Smyrna ': 1,
         'Talleyville': 1,
         'Townsend': 14,
         'Wilmington': 127,
         'Wilmington ': 14})
In [11]:
rebates["City"] = rebates["City"].str.strip() # clean whitespace
Counter(rebates["City"]).most_common()
Out[11]:
[('Wilmington', 141),
 ('Newark', 74),
 ('Dover', 41),
 ('Middletown', 39),
 ('Millsboro', 29),
 ('Lewes', 24),
 ('New Castle', 16),
 ('Smyrna', 15),
 ('Hockessin', 14),
 ('Townsend', 14),
 ('Bear', 14),
 ('Rehoboth Beach', 9),
 ('Milton', 8),
 ('Clayton', 7),
 ('Seaford', 7),
 ('Milford', 7),
 ('Camden', 5),
 ('Bridgeville', 5),
 ('Magnolia', 4),
 ('Camden Wyoming', 4),
 ('Selbyville', 3),
 ('Georgetown', 3),
 ('Ocean View', 3),
 ('Dagsboro', 3),
 ('Bethany Beach', 3),
 ('Felton', 3),
 ('Greenwood', 2),
 ('Claymont', 2),
 ('Lincoln', 2),
 ('Hartly', 1),
 ('Bellefonte', 1),
 ('Talleyville', 1),
 ('Rockland', 1),
 ('Frankford', 1),
 ('Marshallton', 1),
 ('Montchanin', 1),
 ('Delmar', 1),
 ('Elsmere', 1),
 ('Port Penn', 1),
 ('Kenton', 1),
 ('Greenville', 1),
 ('Delaware City', 1),
 ('Laurel', 1)]

Gender

In [12]:
Counter(rebates["Gender"])
Out[12]:
Counter({'M': 335, 'F': 159, nan: 21})

This looks good. I don't trust this gender breakdown, since it's just the purchaser (i.e. the beneficiary could be a couple or a spouse)

Award type

In [13]:
Counter(rebates["Rebate_Amount"])
Out[13]:
Counter({'$1000.00': 37,
         '$1100.00': 7,
         '$1350.00': 2,
         '$1500.00': 142,
         '$2000.00': 1,
         '$2200.00': 292,
         '$3500.00': 34})

Dates

In [14]:
rebates["Date_of_Purchase"] = pd.to_datetime(rebates["Date_of_Purchase"])
rebates["Date_of_Purchase"] < '2016-10-31'
Out[14]:
0      True
1      True
2      True
3      True
4      True
       ... 
510    True
511    True
512    True
513    True
514    True
Name: Date_of_Purchase, Length: 515, dtype: bool

Save cleaned file

In [15]:
rebates.to_csv('State_Rebates_for_Alternative-Fuel_Vehicles_cleaned.csv',index=False)