import pandas as pd
import numpy as np
autos = pd.read_csv("autos.csv", encoding='Windows-1252')
autos.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
dateCrawled            50000 non-null object
name                   50000 non-null object
seller                 50000 non-null object
offerType              50000 non-null object
price                  50000 non-null object
abtest                 50000 non-null object
vehicleType            44905 non-null object
yearOfRegistration     50000 non-null int64
gearbox                47320 non-null object
powerPS                50000 non-null int64
model                  47242 non-null object
odometer               50000 non-null object
monthOfRegistration    50000 non-null int64
fuelType               45518 non-null object
brand                  50000 non-null object
notRepairedDamage      40171 non-null object
dateCreated            50000 non-null object
nrOfPictures           50000 non-null int64
postalCode             50000 non-null int64
lastSeen               50000 non-null object
dtypes: int64(5), object(15)
memory usage: 7.6+ MB
Here we see that vehicle type, gearbox, model, fuel type, and unrepaired damage all have missing values. These will have to be handled later to produce a cleaner, more complete dataset.
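One quick way to confirm which columns have missing values is to sum the nulls per column. A minimal sketch of the pattern, using a toy DataFrame with hypothetical values in place of the full autos data:

```python
import pandas as pd

# Toy frame standing in for the autos data (hypothetical values)
df = pd.DataFrame({
    "vehicleType": ["bus", None, "coupe"],
    "gearbox": ["manuell", "automatik", None],
    "price": ["$5,000", "$8,990", "$1,200"],
})

# Count missing values per column and keep only columns that have any
missing = df.isnull().sum()
print(missing[missing > 0].to_dict())  # → {'vehicleType': 1, 'gearbox': 1}
```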
Need to change the spelling of the gearbox values to manual and automatic
Need to remove extra characters like dollar signs
Need to remove km from odometer column
Names of columns are in camelCase rather than snake_case.
#autos.head()
col_names = autos.columns
print(col_names)
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model', 'odometer', 'monthOfRegistration', 'fuelType', 'brand', 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object')
def clean_str(name):
    name = name.replace("yearOfRegistration", "registration_year")
    name = name.replace("monthOfRegistration", "registration_month")
    name = name.replace("notRepairedDamage", "unrepaired_damage")
    name = name.replace("dateCreated", "ad_created")
    name = name.replace("offerType", "offer_type")
    name = name.replace("vehicleType", "vehicle_type")
    name = name.replace("powerPS", "power_ps")
    name = name.replace("fuelType", "fuel_type")
    name = name.replace("nrOfPictures", "nr_of_pictures")
    name = name.replace("postalCode", "postal_code")
    name = name.replace("lastSeen", "last_seen")
    name = name.replace("dateCrawled", "date_crawled")
    return name
new_col_names = []
for c in autos.columns:
    clean_c = clean_str(c)
    new_col_names.append(clean_c)
autos.columns = new_col_names
print(autos.columns)
Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest', 'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model', 'odometer', 'registration_month', 'fuel_type', 'brand', 'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code', 'last_seen'], dtype='object')
#autos.head()
There were no spaces in the column names to turn into underscores, so unfortunately most columns had to be renamed with explicit string replacements rather than a simple str.replace(" ", "_"). The remaining columns were already lowercase, or I would have used str.lower() as well. After creating this function, a for loop applied the changes to the original dataset, leaving cleaned column names.
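An alternative to the string-replacement function is to pass a mapping dict to DataFrame.rename, which renames only the columns listed and leaves the rest untouched. A sketch with a handful of the original column names:

```python
import pandas as pd

# Empty frame carrying a few of the original camelCase column names
df = pd.DataFrame(columns=["dateCrawled", "name", "offerType", "powerPS"])

# Map only the columns that need new names; unmapped ones pass through
rename_map = {
    "dateCrawled": "date_crawled",
    "offerType": "offer_type",
    "powerPS": "power_ps",
}
df = df.rename(columns=rename_map)
print(list(df.columns))  # → ['date_crawled', 'name', 'offer_type', 'power_ps']
```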
#autos.describe(include="all")
autos["nr_of_pictures"].unique()
array([0])
Some of these columns carry little information: nr_of_pictures, seller, and offer_type hold essentially the same value for every listing, and abtest doesn't seem necessary when analyzing this dataframe.
autos = autos.drop(["nr_of_pictures","seller","offer_type","abtest"], axis=1)
#autos.head()
autos["price"] = autos["price"].str.replace("$", "", regex=False).str.replace(",", "", regex=False).astype(int)
autos.rename({"price": "price_usd"}, axis=1, inplace=True)
autos["odometer"] = autos["odometer"].str.replace(",", "", regex=False).str.replace("km", "", regex=False).astype(int)
autos.rename({"odometer": "odometer_km"}, axis=1, inplace=True)
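Chaining str.replace calls works, but the same cleanup can be done in one pass with a regex that strips every non-digit character. A sketch on toy price strings (hypothetical values):

```python
import pandas as pd

prices = pd.Series(["$5,000", "$8,990", "$0"])

# Strip every non-digit character in one pass, then cast to int
cleaned = prices.str.replace(r"[^\d]", "", regex=True).astype(int)
print(cleaned.tolist())  # → [5000, 8990, 0]
```

Note that without regex=False, "$" in str.replace would be treated as a regex anchor in recent pandas versions, which is why the chained version above passes that flag explicitly.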
#autos.describe(include="all")
In the price_usd column, the min is 0, which doesn't seem realistic; neither does the max, a used car priced in the millions. The power_ps column has a min of 0 horsepower, also unrealistic. We also need to understand what registration_month = 0 means: is it January, or missing data?
autos["price_usd"].describe(percentiles=[0.05,0.10,0.25,0.5,0.75,0.9,0.95])
count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
5%       2.000000e+02
10%      5.000000e+02
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
90%      1.400000e+04
95%      1.990000e+04
max      1.000000e+08
Name: price_usd, dtype: float64
autos["price_usd"].value_counts().sort_index()
0           1421
1            156
2              3
3              1
5              2
8              1
9              1
10             7
11             2
12             3
13             2
14             1
15             2
17             3
18             1
20             4
25             5
29             1
30             7
35             1
40             6
45             4
47             1
49             4
50            49
55             2
59             1
60             9
65             5
66             1
            ...
151990         1
155000         1
163500         1
163991         1
169000         1
169999         1
175000         1
180000         1
190000         1
194000         1
197000         1
198000         1
220000         1
250000         1
259000         1
265000         1
295000         1
299000         1
345000         1
350000         1
999990         1
999999         2
1234566        1
1300000        1
3890000        1
10000000       1
11111111       2
12345678       3
27322222       1
99999999       1
Name: price_usd, Length: 2357, dtype: int64
I'm going to remove data points in the highest and lowest 5% of the data. The range is very large, and for used cars it seems that a handful of exorbitant prices is skewing the mean.
autos = autos[(autos["price_usd"].between(autos["price_usd"].quantile(0.05),autos["price_usd"].quantile(0.95)))]
autos["price_usd"].describe()
count    45207.000000
mean      4578.824076
std       4443.771737
min        200.000000
25%       1249.000000
50%       2900.000000
75%       6500.000000
max      19900.000000
Name: price_usd, dtype: float64
autos["odometer_km"].describe(percentiles=[0.05,0.10,0.25,0.5,0.75,0.9,0.95])
count     45207.000000
mean     128827.062181
std       36883.551984
min        5000.000000
5%        40000.000000
10%       70000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
90%      150000.000000
95%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64
There's a huge spike of cars reading exactly 150,000 km. The most realistic interpretation is that the odometer field is capped, so 150,000 really means 150,000 km or more.
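To see how dominant that cap is, one can compute the share of listings sitting exactly at 150,000 km. A sketch with toy odometer values (hypothetical, not from the dataset):

```python
import pandas as pd

odometer_km = pd.Series([150000, 150000, 125000, 90000, 150000])

# Fraction of listings sitting exactly at the 150,000 km cap
share_at_cap = (odometer_km == 150000).mean()
print(share_at_cap)  # → 0.6
```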
autos["registration_year"].describe(percentiles=[0.05,0.10,0.25,0.5,0.75,0.9,0.95])
#autos["registration_year"].value_counts().sort_index()
count    45207.000000
mean      2004.322848
std         82.234293
min       1000.000000
5%        1993.000000
10%       1996.000000
25%       1999.000000
50%       2003.000000
75%       2008.000000
90%       2012.000000
95%       2016.000000
max       9999.000000
Name: registration_year, dtype: float64
We should remove years after 2017. The steep drop-off in car registrations in 2018 is probably due to when the data was scraped, so it seems wise to eliminate 2018 as well, since it wouldn't show a complete picture of that year. I'll also remove cars registered before 1997: realistically speaking, when shopping for a used car it's probably best not to look at anything older than 20 years (1997-2017).
autos = autos[autos["registration_year"].between(1997,2017)]
autos["registration_year"].describe()
count    39206.000000
mean      2004.771515
std          5.241971
min       1997.000000
25%       2000.000000
50%       2004.000000
75%       2008.000000
max       2017.000000
Name: registration_year, dtype: float64
autos["registration_month"].describe()
count    39206.000000
mean         5.897031
std          3.654661
min          0.000000
25%          3.000000
50%          6.000000
75%          9.000000
max         12.000000
Name: registration_month, dtype: float64
A value of 0 doesn't make sense for a month, so we'll remove those data points.
autos = autos[autos["registration_month"].between(1,12)]
autos["power_ps"].describe(percentiles=[0.05,0.10,0.25,0.5,0.75,0.9,0.95])
count    36140.000000
mean       118.685529
std        206.662289
min          0.000000
5%           0.000000
10%         54.000000
25%         75.000000
50%        109.000000
75%        147.000000
90%        184.000000
95%        224.000000
max      17700.000000
Name: power_ps, dtype: float64
Googling some facts on horsepower suggests the "normal" range to expect (depending on car size and type) is roughly 70 to 200-300 PS. I'm going to eliminate all data points below the lower quartile and above the 95th percentile.
autos = autos[autos["power_ps"].between(autos["power_ps"].quantile(0.25),autos["power_ps"].quantile(0.95))]
We need to convert the string values that currently represent the dates into a proper date representation. It would also be good to ensure the date formatting is consistent across all of these columns.
#autos["ad_created"].value_counts(normalize=True,dropna=False).sort_index()
It seems ad_created doesn't carry a meaningful time component; every timestamp reads "00:00:00", so it may be useful to drop the time portion. The other two date columns appear to share the same format.
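A sketch of that conversion with pd.to_datetime, which parses the full timestamp and then drops the time component via the .dt.date accessor (toy values in the same format as the dataset's date columns):

```python
import pandas as pd

# Toy ad_created values in the dataset's "YYYY-MM-DD HH:MM:SS" format
ad_created = pd.Series(["2016-03-26 00:00:00", "2016-04-01 00:00:00"])

# Parse to datetime, then keep only the calendar date
dates = pd.to_datetime(ad_created).dt.date
print([str(d) for d in dates])  # → ['2016-03-26', '2016-04-01']
```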
top_brands = autos["brand"].value_counts(ascending=False).head(20)
top_brands.index
Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford', 'renault', 'peugeot', 'skoda', 'seat', 'mazda', 'fiat', 'toyota', 'nissan', 'mini', 'citroen', 'hyundai', 'volvo', 'mitsubishi', 'alfa_romeo'], dtype='object')
I've chosen the top 20 most popular brands in the dataset.
top_brands_price = {}
for b in top_brands.index:
    brand = autos[autos["brand"] == b]
    mean_price = brand["price_usd"].mean()
    top_brands_price[b] = mean_price
sorted(top_brands_price.items(), key=lambda x: x[1])
[('renault', 3225.7273489932886),
 ('alfa_romeo', 3330.1487603305786),
 ('mitsubishi', 3425.072463768116),
 ('fiat', 3496.433106575964),
 ('opel', 3824.672874209417),
 ('peugeot', 3835.035591274397),
 ('ford', 4072.5909580193756),
 ('mazda', 4303.145228215768),
 ('citroen', 4445.994285714286),
 ('volvo', 5227.584192439863),
 ('seat', 5372.139959432048),
 ('toyota', 5821.07397260274),
 ('nissan', 5823.0329670329675),
 ('volkswagen', 6128.094465476835),
 ('mercedes_benz', 6228.172503725782),
 ('hyundai', 6251.53144654088),
 ('skoda', 6543.923076923077),
 ('bmw', 6808.131986106726),
 ('audi', 7024.9432565789475),
 ('mini', 9659.374647887324)]
Here we can see the mean price for the top 20 brands on the eBay car website, sorted in ascending order. Renault is the least expensive overall of the top 20 brands, while Mini, Audi, and BMW are the three most expensive.
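The same per-brand means can be computed without an explicit loop using groupby. A sketch with toy data (hypothetical values; the column names match the cleaned dataset):

```python
import pandas as pd

df = pd.DataFrame({
    "brand": ["audi", "audi", "ford", "ford"],
    "price_usd": [8000, 6000, 4500, 3500],
})

# Mean price per brand, sorted ascending — equivalent to the dict-building loop
mean_price = df.groupby("brand")["price_usd"].mean().sort_values()
print(mean_price.to_dict())  # → {'ford': 4000.0, 'audi': 7000.0}
```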
top6 = autos["brand"].value_counts(ascending=False).head(6)
top6_mean_price = {}
top6_mean_mile = {}
for b in top6.index:
    brand = autos[autos["brand"] == b]
    mean_price = brand["price_usd"].mean()
    mean_mile = brand["odometer_km"].mean()
    top6_mean_price[b] = mean_price
    top6_mean_mile[b] = mean_mile
top6mp_series = pd.Series(top6_mean_price)
top6mm_series = pd.Series(top6_mean_mile)
#top6mm_series
#top6mp_series
top6_df = pd.DataFrame(top6mp_series, columns=["mean_price"])
#top6_df
top6_df.insert(1,"mean_mileage",top6mm_series)
top6_df
|  | mean_price | mean_mileage |
| --- | --- | --- |
| audi | 7024.943257 | 135326.891447 |
| bmw | 6808.131986 | 136071.992422 |
| ford | 4072.590958 | 127516.146394 |
| mercedes_benz | 6228.172504 | 136132.637854 |
| opel | 3824.672874 | 128411.806044 |
| volkswagen | 6128.094465 | 129640.346133 |
top6 = autos["brand"].value_counts(ascending=False).head(6)
top6_dict = {brand : {'price': [], 'mile': []} for brand in top6.index }
for b in top6.index:
    brand = autos[autos["brand"] == b]
    mean_price = brand["price_usd"].mean()
    mean_mile = brand["odometer_km"].mean()
    top6_dict[b]['price'] = mean_price
    top6_dict[b]['mile'] = mean_mile
#top6_dict
x = pd.DataFrame.from_dict(top6_dict).T
x
|  | mile | price |
| --- | --- | --- |
| audi | 135326.891447 | 7024.943257 |
| bmw | 136071.992422 | 6808.131986 |
| ford | 127516.146394 | 4072.590958 |
| mercedes_benz | 136132.637854 | 6228.172504 |
| opel | 128411.806044 | 3824.672874 |
| volkswagen | 129640.346133 | 6128.094465 |
The hypothesis going into this was that the lower the mileage on the car, the higher the price. However, we see that the top 3 brands in mean price are also the top 3 brands in mean mileage. One might argue the confound here is that these are all luxury car brands. Yet among the other three (Volkswagen, Opel, and Ford), VW has the highest mileage but also the highest price. Only between Ford and Opel do we see the correlation we initially expected, with more miles meaning a lower price. Granted, these variables can be heavily influenced by other factors such as the age of the car, its state of repair, and its horsepower.
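One way to quantify the price-mileage relationship discussed above is a correlation coefficient. A sketch with toy values (hypothetical, not from the dataset; the real column names are price_usd and odometer_km):

```python
import pandas as pd

df = pd.DataFrame({
    "odometer_km": [150000, 125000, 90000, 60000, 30000],
    "price_usd": [2000, 3500, 5000, 8000, 12000],
})

# Pearson correlation: a negative value supports "more miles, lower price"
corr = df["price_usd"].corr(df["odometer_km"])
print(round(corr, 2))  # → -0.97
```

Running the same one-liner on the cleaned autos frame, either overall or per brand via groupby, would show how strongly mileage and price actually move together in this data.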