This project will clean up and analyze the dataset from Ebay Car Sales Data.
The dataset contains 50,000 data points with the following information:
dateCrawled - When this ad was first crawled. All field-values are taken from this date.
name - Name of the car.
seller - Whether the seller is private or a dealer.
offerType - The type of listing.
price - The price on the ad to sell the car.
abtest - Whether the listing is included in an A/B test.
vehicleType - The vehicle Type.
yearOfRegistration - The year in which the car was first registered.
gearbox - The transmission type.
powerPS - The power of the car in PS.
model - The car model name.
kilometer - How many kilometers the car has driven.
monthOfRegistration - The month in which the car was first registered.
fuelType - What type of fuel the car uses.
brand - The brand of the car.
notRepairedDamage - If the car has a damage which is not yet repaired.
dateCreated - The date on which the eBay listing was created.
nrOfPictures - The number of pictures in the ad.
postalCode - The postal code for the location of the vehicle.
lastSeenOnline - When the crawler saw this ad last online.
import pandas as pd
import numpy as np
autos = pd.read_csv('autos.csv', encoding = 'latin-1')
autos
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
49995 | 2016-03-27 14:38:19 | Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon | privat | Angebot | $24,900 | control | limousine | 2011 | automatik | 239 | q5 | 100,000km | 1 | diesel | audi | nein | 2016-03-27 00:00:00 | 0 | 82131 | 2016-04-01 13:47:40 |
49996 | 2016-03-28 10:50:25 | Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... | privat | Angebot | $1,980 | control | cabrio | 1996 | manuell | 75 | astra | 150,000km | 5 | benzin | opel | nein | 2016-03-28 00:00:00 | 0 | 44807 | 2016-04-02 14:18:02 |
49997 | 2016-04-02 14:44:48 | Fiat_500_C_1.2_Dualogic_Lounge | privat | Angebot | $13,200 | test | cabrio | 2014 | automatik | 69 | 500 | 5,000km | 11 | benzin | fiat | nein | 2016-04-02 00:00:00 | 0 | 73430 | 2016-04-04 11:47:27 |
49998 | 2016-03-08 19:25:42 | Audi_A3_2.0_TDI_Sportback_Ambition | privat | Angebot | $22,900 | control | kombi | 2013 | manuell | 150 | a3 | 40,000km | 11 | diesel | audi | nein | 2016-03-08 00:00:00 | 0 | 35683 | 2016-04-05 16:45:07 |
49999 | 2016-03-14 00:42:12 | Opel_Vectra_1.6_16V | privat | Angebot | $1,250 | control | limousine | 1996 | manuell | 101 | vectra | 150,000km | 1 | benzin | opel | nein | 2016-03-13 00:00:00 | 0 | 45897 | 2016-04-06 21:18:48 |
50000 rows × 20 columns
autos.info()
autos.head(5)
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 dateCrawled 50000 non-null object 1 name 50000 non-null object 2 seller 50000 non-null object 3 offerType 50000 non-null object 4 price 50000 non-null object 5 abtest 50000 non-null object 6 vehicleType 44905 non-null object 7 yearOfRegistration 50000 non-null int64 8 gearbox 47320 non-null object 9 powerPS 50000 non-null int64 10 model 47242 non-null object 11 odometer 50000 non-null object 12 monthOfRegistration 50000 non-null int64 13 fuelType 45518 non-null object 14 brand 50000 non-null object 15 notRepairedDamage 40171 non-null object 16 dateCreated 50000 non-null object 17 nrOfPictures 50000 non-null int64 18 postalCode 50000 non-null int64 19 lastSeen 50000 non-null object dtypes: int64(5), object(15) memory usage: 7.6+ MB
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
The observation from the first few rows of the dataset:
vehicleType
gearbox
model
fuelType
notRepairedDamage
def camel_to_snake(text):
import re
string = re.sub('(.)([A-Z][a-z]+)', r'\1_\2',text)
return re.sub('([a-z0-9])([A-Z])', r'\1_\2', string).lower()
new_camel_columns = []
for col in autos.columns:
new_col = camel_to_snake(col)
new_camel_columns.append(new_col)
autos.columns = new_camel_columns
autos.head(1)
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | year_of_registration | gearbox | power_ps | model | odometer | month_of_registration | fuel_type | brand | not_repaired_damage | date_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
'''Rename some columns to be more intuitive'''
new_columns = ({'year_of_registration':'registration_year','month_of_registration':'registration_month',
'not_repaired_damage':'unrepaired_damage','date_created':'ad_created','nr_of_pictures':'num_photos'})
autos.rename(new_columns,axis = 1, inplace = True)
autos.head(1)
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | num_photos | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
autos.describe(include ='all')
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | num_photos | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-03-05 16:57:05 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
From the descriptive statistic above:
autos['num_photos'].value_counts()
0 50000 Name: num_photos, dtype: int64
Not any of the listings has photo, this column doesn't have useful information, we can drop it
autos['seller'].value_counts()
privat 49999 gewerblich 1 Name: seller, dtype: int64
autos['offer_type'].value_counts()
Angebot 49999 Gesuch 1 Name: offer_type, dtype: int64
Only a few of different values found in seller and offer_type columns, we can drop them as well.
autos = autos.drop(['num_photos','seller','offer_type'],axis=1)
First, we need to convert these columns to integer by removing special characters
autos['price'] = (autos['price']
.str.replace('$','')
.str.replace(',','')
.astype(int)
)
autos['odometer'] = (autos['odometer']
.str.replace(',','')
.str.replace('km','')
.astype(int)
)
autos.rename({'odometer':'odometer_km'},axis=1, inplace = True)
autos[['price','odometer_km']]
price | odometer_km | |
---|---|---|
0 | 5000 | 150000 |
1 | 8500 | 150000 |
2 | 8990 | 70000 |
3 | 4350 | 70000 |
4 | 1350 | 150000 |
... | ... | ... |
49995 | 24900 | 100000 |
49996 | 1980 | 150000 |
49997 | 13200 | 5000 |
49998 | 22900 | 40000 |
49999 | 1250 | 150000 |
50000 rows × 2 columns
autos['unrepaired_damage'].value_counts()
nein 35232 ja 4939 Name: unrepaired_damage, dtype: int64
Though unrepaired_damage has only 2 values, but we can't drop it as almost 5000 listings have 'ja' value. Dropping this column will impact our analysis.
autos['price'].describe()
count 5.000000e+04 mean 9.840044e+03 std 4.811044e+05 min 0.000000e+00 25% 1.100000e+03 50% 2.950000e+03 75% 7.200000e+03 max 1.000000e+08 Name: price, dtype: float64
There is at least one car is listed around 100 millions, it is way too high. We need to check it out!
autos['price'].value_counts().sort_index(ascending = False).head(20)
99999999 1 27322222 1 12345678 3 11111111 2 10000000 1 3890000 1 1300000 1 1234566 1 999999 2 999990 1 350000 1 345000 1 299000 1 295000 1 265000 1 259000 1 250000 1 220000 1 198000 1 197000 1 Name: price, dtype: int64
The prices increase gradually up to 350,000 and suddenly jump up to 999,990. It turns out that 14 cars has prices over $350,000 which makes it unrealistic. We can remove these rows.
autos['price'].value_counts().sort_index(ascending = False).tail(20)
35 1 30 7 29 1 25 5 20 4 18 1 17 3 15 2 14 1 13 2 12 3 11 2 10 7 9 1 8 1 5 2 3 1 2 3 1 156 0 1421 Name: price, dtype: int64
1,421 cars are marked at 0 which makes the listings look odd. We can remove these rows.
Some listings starts at 1 dollar and go up but less than 100 dollar, the prices do not look right for cars. However, given Ebay is an auction site, these prices could be open bid. We can keep it for now.
autos = autos[autos['price'].between(1,350000)]
autos.describe()
price | registration_year | power_ps | odometer_km | registration_month | postal_code | |
---|---|---|---|---|---|---|
count | 48565.000000 | 48565.000000 | 48565.000000 | 48565.000000 | 48565.000000 | 48565.000000 |
mean | 5888.935591 | 2004.755421 | 117.197158 | 125770.101925 | 5.782251 | 50975.745207 |
std | 9059.854754 | 88.643887 | 200.649618 | 39788.636804 | 3.685595 | 25746.968398 |
min | 1.000000 | 1000.000000 | 0.000000 | 5000.000000 | 0.000000 | 1067.000000 |
25% | 1200.000000 | 1999.000000 | 71.000000 | 125000.000000 | 3.000000 | 30657.000000 |
50% | 3000.000000 | 2004.000000 | 107.000000 | 150000.000000 | 6.000000 | 49716.000000 |
75% | 7490.000000 | 2008.000000 | 150.000000 | 150000.000000 | 9.000000 | 71665.000000 |
max | 350000.000000 | 9999.000000 | 17700.000000 | 150000.000000 | 12.000000 | 99998.000000 |
We have removed approximately 1,500 rows from the dataset. The price column now is good to use for analysis. Let's take a look at other columns
autos['odometer_km'].value_counts()
150000 31414 125000 5057 100000 2115 90000 1734 80000 1415 70000 1217 60000 1155 50000 1012 5000 836 40000 815 30000 780 20000 762 10000 253 Name: odometer_km, dtype: int64
The data in odometer_km column looks right.
autos[['date_crawled','ad_created','last_seen']][:5]
date_crawled | ad_created | last_seen | |
---|---|---|---|
0 | 2016-03-26 17:47:46 | 2016-03-26 00:00:00 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | 2016-04-04 00:00:00 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | 2016-03-26 00:00:00 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | 2016-03-12 00:00:00 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | 2016-04-01 00:00:00 | 2016-04-01 14:38:50 |
These are three columns look like datetime format but it is identified as string. We will access the first 10 of the strings to analyze the distribution of those dates.
autos['date_crawled'].str[:10].value_counts(normalize = True, dropna = False).sort_index(ascending = True)
2016-03-05 0.025327 2016-03-06 0.014043 2016-03-07 0.036014 2016-03-08 0.033296 2016-03-09 0.033090 2016-03-10 0.032184 2016-03-11 0.032575 2016-03-12 0.036920 2016-03-13 0.015670 2016-03-14 0.036549 2016-03-15 0.034284 2016-03-16 0.029610 2016-03-17 0.031628 2016-03-18 0.012911 2016-03-19 0.034778 2016-03-20 0.037887 2016-03-21 0.037373 2016-03-22 0.032987 2016-03-23 0.032225 2016-03-24 0.029342 2016-03-25 0.031607 2016-03-26 0.032204 2016-03-27 0.031092 2016-03-28 0.034860 2016-03-29 0.034099 2016-03-30 0.033687 2016-03-31 0.031834 2016-04-01 0.033687 2016-04-02 0.035478 2016-04-03 0.038608 2016-04-04 0.036487 2016-04-05 0.013096 2016-04-06 0.003171 2016-04-07 0.001400 Name: date_crawled, dtype: float64
All the listings were created in March and April 2016.
autos['ad_created'].str[:10].value_counts(normalize = True, dropna = False).sort_index(ascending = True)
2015-06-11 0.000021 2015-08-10 0.000021 2015-09-09 0.000021 2015-11-10 0.000021 2015-12-05 0.000021 ... 2016-04-03 0.038855 2016-04-04 0.036858 2016-04-05 0.011819 2016-04-06 0.003253 2016-04-07 0.001256 Name: ad_created, Length: 76, dtype: float64
Some ads were created in 2015 though the earliest date in date_crawled is March 2016. The data for these rows is not reliable.
autos['last_seen'].str[:10].value_counts(normalize = True, dropna = False).sort_index(ascending = True)
2016-03-05 0.001071 2016-03-06 0.004324 2016-03-07 0.005395 2016-03-08 0.007413 2016-03-09 0.009595 2016-03-10 0.010666 2016-03-11 0.012375 2016-03-12 0.023783 2016-03-13 0.008895 2016-03-14 0.012602 2016-03-15 0.015876 2016-03-16 0.016452 2016-03-17 0.028086 2016-03-18 0.007351 2016-03-19 0.015834 2016-03-20 0.020653 2016-03-21 0.020632 2016-03-22 0.021373 2016-03-23 0.018532 2016-03-24 0.019767 2016-03-25 0.019211 2016-03-26 0.016802 2016-03-27 0.015649 2016-03-28 0.020859 2016-03-29 0.022341 2016-03-30 0.024771 2016-03-31 0.023783 2016-04-01 0.022794 2016-04-02 0.024915 2016-04-03 0.025203 2016-04-04 0.024483 2016-04-05 0.124761 2016-04-06 0.221806 2016-04-07 0.131947 Name: last_seen, dtype: float64
Last_seen column possibly means the date car being sold. It is noticeable that the distribution for the last three days significantly increased, from 6 to 10 times the previous days. Does it really mean sales on the last three day? We need to keep an eye on it.
Recall from the descriptive information on the top, we need to check out the listings with registration_year = 1000 and 9999.
autos['registration_year'].describe()
count 48565.000000 mean 2004.755421 std 88.643887 min 1000.000000 25% 1999.000000 50% 2004.000000 75% 2008.000000 max 9999.000000 Name: registration_year, dtype: float64
autos[autos['registration_year'] == 1000]
date_crawled | name | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
22316 | 2016-03-29 16:56:41 | VW_Kaefer.__Zwei_zum_Preis_von_einem. | 1500 | control | NaN | 1000 | manuell | 0 | kaefer | 5000 | 0 | benzin | volkswagen | NaN | 2016-03-29 00:00:00 | 48324 | 2016-03-31 10:15:28 |
autos[autos['registration_year'] == 9999]
date_crawled | name | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
8012 | 2016-03-23 16:43:29 | Opel_GT_Karosserie_mit_Brief! | 700 | test | NaN | 9999 | NaN | 0 | andere | 10000 | 0 | NaN | opel | NaN | 2016-03-23 00:00:00 | 21769 | 2016-04-05 20:16:15 |
33950 | 2016-03-23 21:52:25 | 58er_karmann_ghia_lowlight_Kaefer__zum_restaur... | 7999 | test | NaN | 9999 | NaN | 0 | kaefer | 10000 | 0 | NaN | volkswagen | NaN | 2016-03-23 00:00:00 | 47638 | 2016-04-06 03:46:40 |
38076 | 2016-04-04 22:54:47 | Mercedes_Benz_A180 | 18000 | test | NaN | 9999 | NaN | 0 | a_klasse | 10000 | 0 | benzin | mercedes_benz | NaN | 2016-04-04 00:00:00 | 51379 | 2016-04-07 02:44:52 |
autos['registration_year'].value_counts().sort_index().head(20)
1000 1 1001 1 1111 1 1800 2 1910 5 1927 1 1929 1 1931 1 1934 2 1937 4 1938 1 1939 1 1941 2 1943 1 1948 1 1950 3 1951 2 1952 1 1953 1 1954 2 Name: registration_year, dtype: int64
5 listings found for car registered before the year 1900
autos['registration_year'].value_counts().sort_index().tail(20)
2010 1589 2011 1623 2012 1310 2013 803 2014 663 2015 392 2016 1220 2017 1392 2018 470 2019 2 2800 1 4100 1 4500 1 4800 1 5000 4 5911 1 6200 1 8888 1 9000 1 9999 3 Name: registration_year, dtype: int64
15 listings found for car registered in the future.
Again, all the listings were created in 2016, it is unreasonable that the car is registered after 2016 (register after the listing), so it is safe to remove any car from 2017.
It said, we can remove cars with registration year before 1900 and after 2016.
autos = autos[autos['registration_year'].between(1900,2016)]
autos['registration_year'].value_counts(normalize = True).head(10)
2000 0.067608 2005 0.062895 1999 0.062060 2004 0.057904 2003 0.057818 2006 0.057197 2001 0.056468 2002 0.053255 1998 0.050620 2007 0.048778 Name: registration_year, dtype: float64
After removing listings with registration year before 1900 and after 2016, we can see most of the listings have cars registered from 1998 to 2007.
brand_dist = autos['brand'].value_counts(normalize = True)
brand_dist
volkswagen 0.211264 bmw 0.110045 opel 0.107581 mercedes_benz 0.096463 audi 0.086566 ford 0.069900 renault 0.047150 peugeot 0.029841 fiat 0.025642 seat 0.018273 skoda 0.016409 nissan 0.015274 mazda 0.015188 smart 0.014160 citroen 0.014010 toyota 0.012703 hyundai 0.010025 sonstige_autos 0.009811 volvo 0.009147 mini 0.008762 mitsubishi 0.008226 honda 0.007840 kia 0.007069 alfa_romeo 0.006641 porsche 0.006127 suzuki 0.005934 chevrolet 0.005698 chrysler 0.003513 dacia 0.002635 daihatsu 0.002506 jeep 0.002271 subaru 0.002142 land_rover 0.002099 saab 0.001649 jaguar 0.001564 daewoo 0.001500 trabant 0.001392 rover 0.001328 lancia 0.001071 lada 0.000578 Name: brand, dtype: float64
u_brands = autos['brand'].unique()
major_brands = brand_dist[brand_dist > 0.05].index
major_brands
Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford'], dtype='object')
mean_price_brands = {}
for brand in major_brands:
mean_brand_price = int(autos.loc[autos['brand'] == brand,'price'].mean())
mean_price_brands[brand] = mean_brand_price
mean_price_brands
{'volkswagen': 5402, 'bmw': 8332, 'opel': 2975, 'mercedes_benz': 8628, 'audi': 9336, 'ford': 3749}
On average, audi, mercedes_benz and bmw are three most expensive brands.
It is noticeable that prices of these three brands are almost double or tripple prices of the others in top 50% of the listings.
The dataset includes milleage data, we would want to find out if any relation between the milleage and price.
mean_mileage_brands = {}
for brand in major_brands:
mean_brand_mileage = int(autos.loc[autos['brand']==brand,'odometer_km'].mean())
mean_mileage_brands[brand] = mean_brand_mileage
mean_mileage_brands
{'volkswagen': 128707, 'bmw': 132572, 'opel': 129310, 'mercedes_benz': 130788, 'audi': 129157, 'ford': 124266}
price_series = pd.Series(mean_price_brands)
price_series
volkswagen 5402 bmw 8332 opel 2975 mercedes_benz 8628 audi 9336 ford 3749 dtype: int64
mileage_series = pd.Series(mean_mileage_brands)
mileage_series
volkswagen 128707 bmw 132572 opel 129310 mercedes_benz 130788 audi 129157 ford 124266 dtype: int64
brand_df = pd.DataFrame(price_series,columns = ['mean_price'])
brand_df
mean_price | |
---|---|
volkswagen | 5402 |
bmw | 8332 |
opel | 2975 |
mercedes_benz | 8628 |
audi | 9336 |
ford | 3749 |
brand_df['mean_mileage'] = mileage_series
brand_df
mean_price | mean_mileage | |
---|---|---|
volkswagen | 5402 | 128707 |
bmw | 8332 | 132572 |
opel | 2975 | 129310 |
mercedes_benz | 8628 | 130788 |
audi | 9336 | 129157 |
ford | 3749 | 124266 |
There isn't a clear correlation between the price and mileage. The price itself is possibly driven by its brand. However, we can dig deeper by splitting the mileage into different groups to see if average price is different.
autos['odometer_km'].value_counts().sort_values()
10000 241 20000 742 30000 760 5000 785 40000 797 50000 993 60000 1128 70000 1187 80000 1375 90000 1673 100000 2058 125000 4857 150000 30085 Name: odometer_km, dtype: int64
low_miles = autos[autos['odometer_km'] < 50000]
med_miles = autos[autos['odometer_km'].between(50000,100000)]
high_miles = autos[autos['odometer_km'] > 100000]
def miles_price(dataframe):
miles_price = {}
for brand in u_brands:
mean_price = dataframe.loc[dataframe['brand'] == brand,'price'].mean()
miles_price[brand] = mean_price
series = pd.Series(miles_price)
return series
mile_price_group = pd.DataFrame(miles_price(low_miles),columns = ['50k_less_miles'])
mile_price_group['50k_100k_miles'] = miles_price(med_miles)
mile_price_group['100k_more_miles'] = miles_price(high_miles)
major_brands
Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford'], dtype='object')
filtered_group = mile_price_group[mile_price_group.index.isin(major_brands)]
filtered_group
50k_less_miles | 50k_100k_miles | 100k_more_miles | |
---|---|---|---|
bmw | 20832.881857 | 16504.314574 | 6282.580461 |
volkswagen | 13465.113354 | 10348.783013 | 3762.342238 |
ford | 11230.000000 | 5969.274809 | 2440.400000 |
mercedes_benz | 21584.037815 | 17417.409091 | 6164.055756 |
audi | 25629.538721 | 18152.376364 | 6303.623669 |
opel | 7748.087719 | 5762.218134 | 2099.701207 |
By just looking at popular brands, we can see the more mileage the car has, the cheaper it is.
price_change = ((filtered_group['50k_less_miles'] - filtered_group['100k_more_miles'])
/ filtered_group['50k_less_miles'])
price_change
bmw 0.698430 volkswagen 0.720586 ford 0.782689 mercedes_benz 0.714416 audi 0.754048 opel 0.729004 dtype: float64
Cars with more than 100,000 miles are likely listed at 70% less at price comparing to cars with less than 50,000 miles.
So far, we have explored and analyzed the trend of price based on brands and odometer.
The following brands are the most popular brands in all the listings. Brands in the first line has highest listed price. Volkswagen is also popular but is more affordable.
If we categorize the mileage into three different groups, we can clearly see that cars with high miles are listed at lower prices.
autos['gearbox'].unique()
array(['manuell', 'automatik', nan], dtype=object)
autos['gearbox'] = autos['gearbox'].str.replace('manuell','manual')
autos['gearbox'] = autos['gearbox'].str.replace('automatik','automatic')
autos.head()
date_crawled | name | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | 5000 | control | bus | 2004 | manual | 158 | andere | 150000 | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | 8500 | control | limousine | 1997 | automatic | 286 | 7er | 150000 | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | 8990 | test | limousine | 2009 | manual | 102 | golf | 70000 | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | 4350 | control | kleinwagen | 2007 | automatic | 71 | fortwo | 70000 | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | 1350 | test | kombi | 2003 | manual | 0 | focus | 150000 | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 39218 | 2016-04-01 14:38:50 |
new_crawled_date = (pd.to_numeric(pd
.to_datetime(autos['date_crawled'].str[:10], format = "%Y-%m-%d")
.dt.strftime('%Y%m%d')))
autos['date_crawled'] = new_crawled_date
new_ad_date = (pd.to_numeric(pd
.to_datetime(autos['ad_created'].str[:10],format="%Y-%m-%d")
.dt.strftime('%Y%m%d')))
autos['ad_created'] = new_ad_date
new_last_seen = (pd.to_numeric(pd
.to_datetime(autos['last_seen'].str[:10],format = "%Y-%m-%d")
.dt.strftime('%Y%m%d')))
autos['last_seen'] = new_last_seen
autos.head()
date_crawled | name | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 20160326 | Peugeot_807_160_NAVTECH_ON_BOARD | 5000 | control | bus | 2004 | manual | 158 | andere | 150000 | 3 | lpg | peugeot | nein | 20160326 | 79588 | 20160406 |
1 | 20160404 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | 8500 | control | limousine | 1997 | automatic | 286 | 7er | 150000 | 6 | benzin | bmw | nein | 20160404 | 71034 | 20160406 |
2 | 20160326 | Volkswagen_Golf_1.6_United | 8990 | test | limousine | 2009 | manual | 102 | golf | 70000 | 7 | benzin | volkswagen | nein | 20160326 | 35394 | 20160406 |
3 | 20160312 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | 4350 | control | kleinwagen | 2007 | automatic | 71 | fortwo | 70000 | 6 | benzin | smart | nein | 20160312 | 33729 | 20160315 |
4 | 20160401 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | 1350 | test | kombi | 2003 | manual | 0 | focus | 150000 | 7 | benzin | ford | nein | 20160401 | 39218 | 20160401 |
split = autos['name'].str.partition('_')
new_name = split.iloc[:,2]
autos['name'] = new_name
autos.head()
date_crawled | name | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 20160326 | 807_160_NAVTECH_ON_BOARD | 5000 | control | bus | 2004 | manual | 158 | andere | 150000 | 3 | lpg | peugeot | nein | 20160326 | 79588 | 20160406 |
1 | 20160404 | 740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | 8500 | control | limousine | 1997 | automatic | 286 | 7er | 150000 | 6 | benzin | bmw | nein | 20160404 | 71034 | 20160406 |
2 | 20160326 | Golf_1.6_United | 8990 | test | limousine | 2009 | manual | 102 | golf | 70000 | 7 | benzin | volkswagen | nein | 20160326 | 35394 | 20160406 |
3 | 20160312 | smart_fortwo_coupe_softouch/F1/Klima/Panorama | 4350 | control | kleinwagen | 2007 | automatic | 71 | fortwo | 70000 | 6 | benzin | smart | nein | 20160312 | 33729 | 20160315 |
4 | 20160401 | Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepflegt.mit... | 1350 | test | kombi | 2003 | manual | 0 | focus | 150000 | 7 | benzin | ford | nein | 20160401 | 39218 | 20160401 |
autos['brand_model'] = autos['brand'] +'_'+ autos['model']
brand_model = autos['brand_model'].value_counts(normalize = True).sort_values(ascending = False)
brand_model.head(1)
volkswagen_golf 0.083326 Name: brand_model, dtype: float64
Volkswagen's price is in the middle range of top six brands we found out ealier, it is understandable that this brand up again.