This is a guided project about a dataset of used cars from eBay Kleinanzeigen, a classified section of German eBay website. The dataset was originally scrapped and uploaded to Kaggle
The data dictionary provided from the dataset is as follow:
The aim of this project is to clean the data and analyze the used car listings. So let us begin by importing and reading the data.
import pandas as pd
import numpy as np
pd.set_option('float_format', '{:f}'. format) # this is to have float with 4 decimal points
autos = pd.read_csv('autos.csv', encoding = 'Latin-1')
autos.info()
autos.head()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): dateCrawled 50000 non-null object name 50000 non-null object seller 50000 non-null object offerType 50000 non-null object price 50000 non-null object abtest 50000 non-null object vehicleType 44905 non-null object yearOfRegistration 50000 non-null int64 gearbox 47320 non-null object powerPS 50000 non-null int64 model 47242 non-null object odometer 50000 non-null object monthOfRegistration 50000 non-null int64 fuelType 45518 non-null object brand 50000 non-null object notRepairedDamage 40171 non-null object dateCreated 50000 non-null object nrOfPictures 50000 non-null int64 postalCode 50000 non-null int64 lastSeen 50000 non-null object dtypes: int64(5), object(15) memory usage: 7.6+ MB
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
print(autos.columns)
autos_test = autos.copy()
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model', 'odometer', 'monthOfRegistration', 'fuelType', 'brand', 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object')
Looking at the columns titles of the autos dataframe, they are using camelcase. We are going to change these to snakecase so as to make it more descriptive. To do that, we copied the autos dataframe using Dataframe.copy() method and change the columns titles into snakecase and then assigne them back to autos.columns.
new_columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'ab_test',
'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
'odometer', 'registration_month', 'fuel_type', 'brand',
'unrepaired_damage', 'date_created', 'nr_of_pictures', 'postal_code',
'last_seen']
autos.columns = new_columns
autos.head()
date_crawled | name | seller | offer_type | price | ab_test | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | date_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
We need to see the value for each columns. We will take notes on columns with string and almost all values are the same. We might drop these columns because not much to be analyzed (i.e. 'seller', 'offer_type', 'unrepaired_damage'). We also take notes on numeric data stored as text such as 'price', 'odometer'. We would need to convert these into numeric values.
autos.describe(include = 'all')
date_crawled | name | seller | offer_type | price | ab_test | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | date_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.000000 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | nan | 2 | nan | 245 | 13 | nan | 7 | 40 | 2 | 76 | nan | nan | 39481 |
top | 2016-03-11 22:38:16 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | nan | manuell | nan | golf | 150,000km | nan | benzin | volkswagen | nein | 2016-04-03 00:00:00 | nan | nan | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | nan | 36993 | nan | 4024 | 32424 | nan | 30107 | 10687 | 35232 | 1946 | nan | nan | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.000000 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.000000 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.000000 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.000000 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.000000 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.000000 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.000000 | 99998.000000 | NaN |
The first row of the dataframe (df) show the number or count for each column of the df. There are several columns which the total count were less than 50,000; 'vehicle-type','gearbox', 'model', 'fuel_type' and 'unrepaired_damage'. We will investigate these column later. Several columns are also got the unique instances of only 2. We will check several columns contain numeric data but stored as text: 'price', 'registration_year', 'odometer', 'power_ps'and 'registration_month'.
autos['price']. head()
0 $5,000 1 $8,500 2 $8,990 3 $4,350 4 $1,350 Name: price, dtype: object
The 'price' column is text object. We will change this to numeric later.
autos['odometer'].head()
0 150,000km 1 150,000km 2 70,000km 3 70,000km 4 150,000km Name: odometer, dtype: object
The 'odometer'column is also text object.
autos['power_ps']. head()
0 158 1 286 2 102 3 71 4 0 Name: power_ps, dtype: int64
The 'power_ps'columns is integer.
autos['registration_year'].head()
0 2004 1 1997 2 2009 3 2007 4 2003 Name: registration_year, dtype: int64
Now, we are going to change the 'price' and 'odometer' columns as numeric by firstly removing '$' sign and 'km'.
autos['price'] = autos['price'].str.replace('$', '').str.replace(',','').astype(int)
autos['odometer']= autos['odometer'].str.replace('km', '').str.replace(',', '').astype(int)
# We rename the 'odometer' column to 'odometer_km'.
autos.rename({'odometer' : 'odometer_km'}, axis = 1, inplace = True)
autos.head()
date_crawled | name | seller | offer_type | price | ab_test | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | date_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | 5000 | control | bus | 2004 | manuell | 158 | andere | 150000 | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | 8500 | control | limousine | 1997 | automatik | 286 | 7er | 150000 | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | 8990 | test | limousine | 2009 | manuell | 102 | golf | 70000 | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | 4350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70000 | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | 1350 | test | kombi | 2003 | manuell | 0 | focus | 150000 | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
# Further exploration of 'price' column:
autos['price'].unique().shape
autos['price'].describe()
# We can see the minimum is $0 which is unlikely. Also max 100,000,000 which is unlikely also.
count 50000.000000 mean 9840.043760 std 481104.380500 min 0.000000 25% 1100.000000 50% 2950.000000 75% 7200.000000 max 99999999.000000 Name: price, dtype: float64
# We will get rid the outliers
autos['price'].value_counts().sort_index(ascending=False).head(20)
99999999 1 27322222 1 12345678 3 11111111 2 10000000 1 3890000 1 1300000 1 1234566 1 999999 2 999990 1 350000 1 345000 1 299000 1 295000 1 265000 1 259000 1 250000 1 220000 1 198000 1 197000 1 Name: price, dtype: int64
# We can see that the minimum price $0 which is unlikely, maximum price is $99,999,999 which is way too much for an auction site
# Therefore, we are going to take out all price below $1000 and any price above $350000
autos=autos[autos['price'].between(999, 350000)]
autos['price'].describe()
count 39060.000000 mean 7185.860983 std 9666.679528 min 999.000000 25% 2150.000000 50% 4250.000000 75% 8900.000000 max 350000.000000 Name: price, dtype: float64
# Let's do the same for 'odometer_km' column.
autos['odometer_km']. unique().shape
autos['odometer_km'].describe()
count 39060.000000 mean 122983.358935 std 40696.652150 min 5000.000000 25% 100000.000000 50% 150000.000000 75% 150000.000000 max 150000.000000 Name: odometer_km, dtype: float64
# The minimum value is 5000 km and maximum is 150,000 km.
# However, there seem to be the same 150,000 km for max, 75% and 50%. We need to check this.
autos['odometer_km'].value_counts().sort_index(ascending=False)
150000 23678 125000 4372 100000 1866 90000 1579 80000 1342 70000 1156 60000 1102 50000 988 40000 797 30000 748 20000 693 10000 229 5000 510 Name: odometer_km, dtype: int64
# Since the count for 150,000 km is 23,678, it is more than half. It's make sense.
# We don't do further analysis for the 'odometer_km'.
We are going to explore columns that having to do with date:
Now, we want to analyze the distribution of values in the 'date_crawled', 'date_created', and 'last_seen' columns (all string columns) as percentage. The objective is to see the movements of car selling within this Ebay Car Sales site.
autos[['date_crawled','date_created','last_seen']] [0:5]
date_crawled | date_created | last_seen | |
---|---|---|---|
0 | 2016-03-26 17:47:46 | 2016-03-26 00:00:00 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | 2016-04-04 00:00:00 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | 2016-03-26 00:00:00 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | 2016-03-12 00:00:00 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | 2016-04-01 00:00:00 | 2016-04-01 14:38:50 |
The date format is comprised of date and hour. We will extract the first 10 characters to isolate the date and ignore the hour.
# We use value_counts method (normalize = True) to get relative frequencies of unique values
print(autos['date_crawled'].str[:10].unique().shape)
autos['date_crawled']. str[:10].value_counts(normalize=True, dropna=False).sort_index()
(34,)
2016-03-05 0.025602 2016-03-06 0.013850 2016-03-07 0.035253 2016-03-08 0.032719 2016-03-09 0.032642 2016-03-10 0.033359 2016-03-11 0.032924 2016-03-12 0.037276 2016-03-13 0.016129 2016-03-14 0.036508 2016-03-15 0.033641 2016-03-16 0.029211 2016-03-17 0.030440 2016-03-18 0.012852 2016-03-19 0.035023 2016-03-20 0.038146 2016-03-21 0.037302 2016-03-22 0.032488 2016-03-23 0.031976 2016-03-24 0.028981 2016-03-25 0.030620 2016-03-26 0.033026 2016-03-27 0.031336 2016-03-28 0.035356 2016-03-29 0.033871 2016-03-30 0.033026 2016-03-31 0.031388 2016-04-01 0.034485 2016-04-02 0.036380 2016-04-03 0.039017 2016-04-04 0.036994 2016-04-05 0.013415 2016-04-06 0.003251 2016-04-07 0.001510 Name: date_crawled, dtype: float64
There was 34 days of crawling. The distribution percentage of crawlings almost equal.
autos['last_seen'].str[:10].value_counts(normalize=True, dropna=False). sort_index()
2016-03-05 0.001101 2016-03-06 0.003584 2016-03-07 0.004634 2016-03-08 0.006221 2016-03-09 0.008961 2016-03-10 0.009985 2016-03-11 0.011751 2016-03-12 0.022376 2016-03-13 0.008449 2016-03-14 0.012058 2016-03-15 0.015028 2016-03-16 0.015438 2016-03-17 0.026472 2016-03-18 0.007399 2016-03-19 0.014593 2016-03-20 0.019764 2016-03-21 0.019892 2016-03-22 0.020840 2016-03-23 0.017921 2016-03-24 0.018536 2016-03-25 0.017768 2016-03-26 0.016103 2016-03-27 0.014107 2016-03-28 0.019611 2016-03-29 0.020840 2016-03-30 0.023451 2016-03-31 0.022785 2016-04-01 0.023195 2016-04-02 0.024987 2016-04-03 0.024475 2016-04-04 0.023425 2016-04-05 0.130594 2016-04-06 0.234357 2016-04-07 0.139299 Name: last_seen, dtype: float64
The 'last_seen' column shows relative equal distribution of percentage except for 5 April 2016 to 7 April 2016, the frequency is 6-10 times more. We could not tell what make this big difference.
print(autos['date_created'].str[:10].unique().shape)
autos['date_created'].str[:10]. value_counts(normalize=False, dropna=False).sort_index()
(74,)
2015-06-11 1 2015-08-10 1 2015-09-09 1 2015-11-10 1 2015-12-30 1 2016-01-03 1 2016-01-07 1 2016-01-10 2 2016-01-13 1 2016-01-14 1 2016-01-16 1 2016-01-22 1 2016-01-27 3 2016-01-29 1 2016-02-01 1 2016-02-02 2 2016-02-05 2 2016-02-07 1 2016-02-09 1 2016-02-11 1 2016-02-12 2 2016-02-14 2 2016-02-16 1 2016-02-17 1 2016-02-18 2 2016-02-19 3 2016-02-20 1 2016-02-21 2 2016-02-22 1 2016-02-23 4 ... 2016-03-09 1281 2016-03-10 1290 2016-03-11 1296 2016-03-12 1447 2016-03-13 694 2016-03-14 1364 2016-03-15 1305 2016-03-16 1163 2016-03-17 1177 2016-03-18 519 2016-03-19 1328 2016-03-20 1492 2016-03-21 1468 2016-03-22 1261 2016-03-23 1240 2016-03-24 1132 2016-03-25 1201 2016-03-26 1294 2016-03-27 1219 2016-03-28 1382 2016-03-29 1324 2016-03-30 1283 2016-03-31 1232 2016-04-01 1341 2016-04-02 1408 2016-04-03 1536 2016-04-04 1461 2016-04-05 471 2016-04-06 131 2016-04-07 52 Name: date_created, Length: 74, dtype: int64
Adverts were created mainly in the month of March and beginning of April 2016 But there are some adverts created much earlier in June 2015.
autos['registration_year'].describe()
count 39060.000000 mean 2005.615668 std 86.207336 min 1000.000000 25% 2001.000000 50% 2005.000000 75% 2009.000000 max 9999.000000 Name: registration_year, dtype: float64
The registration year is the year when the car was firstly registered before use. There are 2 stange values which are year 1000 and 9999. These must be incorrect.
Car was invented in early 1900. Hence, registration year prior to this year is invalid. Secondly, cars must have been registered when they were advertised. Therefore, any registration date after 2016 must be invalid. We will see the proportion of values outside the range of year 1900 and 2016 the range that sounds realistic.
(~autos['registration_year']. between ( 1900,2016)). sum()/autos.shape[0]
0.036891961085509475
As the number is less than 4%, we will remove these rows.
autos=autos[autos['registration_year'].between(1900,2016)]
autos['registration_year']. value_counts(normalize=True). head(10)
2005 0.074377 2006 0.070549 2004 0.069592 2003 0.066190 2007 0.060049 2008 0.058587 2002 0.057524 2001 0.055743 2009 0.055185 2000 0.054361 Name: registration_year, dtype: float64
It appears that the registration year mostly happened within the past 16 year from 2016.
# Exploring Price by Brand
autos['brand']. value_counts(normalize=True)
volkswagen 0.210745 bmw 0.124512 mercedes_benz 0.111167 audi 0.097238 opel 0.089396 ford 0.059119 renault 0.037880 peugeot 0.027885 fiat 0.021292 skoda 0.018980 seat 0.017465 smart 0.016481 toyota 0.014594 mazda 0.014275 citroen 0.014009 nissan 0.013743 mini 0.010766 hyundai 0.010713 sonstige_autos 0.010394 volvo 0.008958 kia 0.007682 porsche 0.007390 honda 0.007310 mitsubishi 0.006965 chevrolet 0.006592 alfa_romeo 0.006194 suzuki 0.005742 chrysler 0.003243 dacia 0.003243 jeep 0.002738 land_rover 0.002605 jaguar 0.001834 subaru 0.001808 daihatsu 0.001728 saab 0.001409 daewoo 0.000930 trabant 0.000904 rover 0.000771 lancia 0.000665 lada 0.000638 Name: brand, dtype: float64
Top of the list are mostly German cars such as Volkswagen (21.3%), BMW (12.3%), Mercedes Benz (11.0%) and Audi (9.6%). They are more than 50% of the brand. Let's analyse 'price by brand' to know more about how brand affect the sale. We are going to take into account any brands at least 2% of all adverts.
major_brands = autos['brand'] . value_counts(normalize=True)
price_by_brand={}
for brand in major_brands.index:
if major_brands[brand]>= 0.02:
price_by_brand[brand] = autos.loc[autos['brand'] ==brand, 'price'].mean().astype(int)
print(price_by_brand)
{'opel': 4173, 'peugeot': 3924, 'audi': 10253, 'mercedes_benz': 9243, 'fiat': 3944, 'volkswagen': 6584, 'ford': 5253, 'bmw': 9084, 'renault': 3521}
As commonly perceived, we can see the mean price for brands like Audi (10,134), Mercedes Benz (9,167), BMW (9,021) are almost triple than the price from brands like Renault, Peugeot (cheapest brands) and then the next cheaper brands are Ford and Opel while Volkswagen is mid-price brand. We will assign price_by_brand Series to bmp_series then we would convert the dictionary to a series and then to a dataframe.
bmp_series= pd.Series(price_by_brand).sort_values(ascending=False)
pd.DataFrame(bmp_series, columns = ['mean_price'])
mean_price | |
---|---|
audi | 10253 |
mercedes_benz | 9243 |
bmw | 9084 |
volkswagen | 6584 |
ford | 5253 |
opel | 4173 |
fiat | 3944 |
peugeot | 3924 |
renault | 3521 |
# Now, let's have a look whether mileage has any bearing on price
mileage_by_brand={}
for brand in major_brands.index:
if major_brands[brand]>=0.02:
mileage_by_brand[brand] = autos.loc[autos['brand']==brand]['odometer_km'].mean(). astype(int)
print(mileage_by_brand)
{'opel': 124271, 'peugeot': 122526, 'audi': 127621, 'mercedes_benz': 130205, 'fiat': 108701, 'volkswagen': 125927, 'ford': 119867, 'bmw': 132052, 'renault': 121740}
mileage_by_brand_series = pd.Series(mileage_by_brand)
df=pd.DataFrame(bmp_series, columns=['mean_price'])
df['mean_mileage']=mileage_by_brand_series
df
mean_price | mean_mileage | |
---|---|---|
audi | 10253 | 127621 |
mercedes_benz | 9243 | 130205 |
bmw | 9084 | 132052 |
volkswagen | 6584 | 125927 |
ford | 5253 | 119867 |
opel | 4173 | 124271 |
fiat | 3944 | 108701 |
peugeot | 3924 | 122526 |
renault | 3521 | 121740 |
The mileage shown by brands on average do not have any significant difference on particular brand hence price.
We are going to do some data cleanings with regards to translating German words to English.
autos.head(10)
date_crawled | name | seller | offer_type | price | ab_test | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | date_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | 5000 | control | bus | 2004 | manuell | 158 | andere | 150000 | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | 8500 | control | limousine | 1997 | automatik | 286 | 7er | 150000 | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | 8990 | test | limousine | 2009 | manuell | 102 | golf | 70000 | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | 4350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70000 | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | 1350 | test | kombi | 2003 | manuell | 0 | focus | 150000 | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
5 | 2016-03-21 13:47:45 | Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto... | privat | Angebot | 7900 | test | bus | 2006 | automatik | 150 | voyager | 150000 | 4 | diesel | chrysler | NaN | 2016-03-21 00:00:00 | 0 | 22962 | 2016-04-06 09:45:21 |
7 | 2016-03-16 18:55:19 | Golf_IV_1.9_TDI_90PS | privat | Angebot | 1990 | control | limousine | 1998 | manuell | 90 | golf | 150000 | 12 | diesel | volkswagen | nein | 2016-03-16 00:00:00 | 0 | 53474 | 2016-04-07 03:17:32 |
12 | 2016-03-31 19:48:22 | Smart_smart_fortwo_coupe_softouch_pure_MHD_Pan... | privat | Angebot | 5299 | control | kleinwagen | 2010 | automatik | 71 | fortwo | 50000 | 9 | benzin | smart | nein | 2016-03-31 00:00:00 | 0 | 34590 | 2016-04-06 14:17:52 |
13 | 2016-03-23 10:48:32 | Audi_A3_1.6_tuning | privat | Angebot | 1350 | control | limousine | 1999 | manuell | 101 | a3 | 150000 | 11 | benzin | audi | nein | 2016-03-23 00:00:00 | 0 | 12043 | 2016-04-01 14:17:13 |
14 | 2016-03-23 11:50:46 | Renault_Clio_3__Dynamique_1.2__16_V;_viele_Ver... | privat | Angebot | 3999 | test | kleinwagen | 2007 | manuell | 75 | clio | 150000 | 9 | benzin | renault | NaN | 2016-03-23 00:00:00 | 0 | 81737 | 2016-04-01 15:46:47 |
# German words appear in these columns: 'vehicle_type', 'gearbox','model', 'fuel_type' and 'unrepaired_damage'.
# Let's make the changes.
# We will also drop the column 'seller', 'offer_type'and 'nr_of_pictures' because not much to be analyzed.
columns= ['vehicle_type', 'gearbox', 'fuel_type', 'model', 'unrepaired_damage', 'brand' ]
change_words={'kleinwagen':'super mini', 'kombi': 'station wagon', 'cabrio': 'convertible', 'limousine': 'sedan', 'andere': 'other','manuell': 'manual', 'automatik': 'automatic', 'benzin': 'gas', 'elektro': 'electric', 'sonstige_autos': 'other','nein':'no', 'ja': 'yes'}
autos[columns]= autos[columns].replace(change_words)
autos= autos.drop(['seller', 'offer_type', 'nr_of_pictures'],axis=1)
autos.head(20)
date_crawled | name | price | ab_test | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | date_created | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | 5000 | control | bus | 2004 | manual | 158 | other | 150000 | 3 | lpg | peugeot | no | 2016-03-26 00:00:00 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | 8500 | control | sedan | 1997 | automatic | 286 | 7er | 150000 | 6 | gas | bmw | no | 2016-04-04 00:00:00 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | 8990 | test | sedan | 2009 | manual | 102 | golf | 70000 | 7 | gas | volkswagen | no | 2016-03-26 00:00:00 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | 4350 | control | super mini | 2007 | automatic | 71 | fortwo | 70000 | 6 | gas | smart | no | 2016-03-12 00:00:00 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | 1350 | test | station wagon | 2003 | manual | 0 | focus | 150000 | 7 | gas | ford | no | 2016-04-01 00:00:00 | 39218 | 2016-04-01 14:38:50 |
5 | 2016-03-21 13:47:45 | Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto... | 7900 | test | bus | 2006 | automatic | 150 | voyager | 150000 | 4 | diesel | chrysler | NaN | 2016-03-21 00:00:00 | 22962 | 2016-04-06 09:45:21 |
7 | 2016-03-16 18:55:19 | Golf_IV_1.9_TDI_90PS | 1990 | control | sedan | 1998 | manual | 90 | golf | 150000 | 12 | diesel | volkswagen | no | 2016-03-16 00:00:00 | 53474 | 2016-04-07 03:17:32 |
12 | 2016-03-31 19:48:22 | Smart_smart_fortwo_coupe_softouch_pure_MHD_Pan... | 5299 | control | super mini | 2010 | automatic | 71 | fortwo | 50000 | 9 | gas | smart | no | 2016-03-31 00:00:00 | 34590 | 2016-04-06 14:17:52 |
13 | 2016-03-23 10:48:32 | Audi_A3_1.6_tuning | 1350 | control | sedan | 1999 | manual | 101 | a3 | 150000 | 11 | gas | audi | no | 2016-03-23 00:00:00 | 12043 | 2016-04-01 14:17:13 |
14 | 2016-03-23 11:50:46 | Renault_Clio_3__Dynamique_1.2__16_V;_viele_Ver... | 3999 | test | super mini | 2007 | manual | 75 | clio | 150000 | 9 | gas | renault | NaN | 2016-03-23 00:00:00 | 81737 | 2016-04-01 15:46:47 |
15 | 2016-04-01 12:06:20 | Corvette_C3_Coupe_T_Top_Crossfire_Injection | 18900 | test | coupe | 1982 | automatic | 203 | NaN | 80000 | 6 | gas | other | no | 2016-04-01 00:00:00 | 61276 | 2016-04-02 21:10:48 |
17 | 2016-03-29 11:46:22 | Volkswagen_Scirocco_2_G60 | 5500 | test | coupe | 1990 | manual | 205 | scirocco | 150000 | 6 | gas | volkswagen | no | 2016-03-29 00:00:00 | 74821 | 2016-04-05 20:46:26 |
19 | 2016-03-17 13:36:21 | mazda_tribute_2.0_mit_gas_und_tuev_neu_2018 | 4150 | control | suv | 2004 | manual | 124 | other | 150000 | 2 | lpg | mazda | no | 2016-03-17 00:00:00 | 40878 | 2016-03-17 14:45:58 |
20 | 2016-03-05 19:57:31 | Audi_A4_Avant_1.9_TDI_*6_Gang*AHK*Klimatronik*... | 3500 | test | station wagon | 2003 | manual | 131 | a4 | 150000 | 5 | diesel | audi | NaN | 2016-03-05 00:00:00 | 53913 | 2016-03-07 05:46:46 |
21 | 2016-03-06 19:07:10 | Porsche_911_Carrera_4S_Cabrio | 41500 | test | convertible | 2004 | manual | 320 | 911 | 150000 | 4 | gas | porsche | no | 2016-03-06 00:00:00 | 65428 | 2016-04-05 23:46:19 |
22 | 2016-03-28 20:50:54 | MINI_Cooper_S_Cabrio | 25450 | control | convertible | 2015 | manual | 184 | cooper | 10000 | 1 | gas | mini | no | 2016-03-28 00:00:00 | 44789 | 2016-04-01 06:45:30 |
23 | 2016-03-10 19:55:34 | Peugeot_Boxer_2_2_HDi_120_Ps_9_Sitzer_inkl_Klima | 7999 | control | bus | 2010 | manual | 120 | NaN | 150000 | 2 | diesel | peugeot | no | 2016-03-10 00:00:00 | 30900 | 2016-03-17 08:45:17 |
24 | 2016-04-03 11:57:02 | BMW_535i_xDrive_Sport_Aut. | 48500 | control | sedan | 2014 | automatic | 306 | 5er | 30000 | 12 | gas | bmw | no | 2016-04-03 00:00:00 | 22547 | 2016-04-07 13:16:50 |
28 | 2016-03-19 21:56:19 | MINI_Cooper_D | 5250 | control | super mini | 2007 | manual | 110 | cooper | 150000 | 7 | diesel | mini | yes | 2016-03-19 00:00:00 | 15745 | 2016-04-07 14:58:48 |
29 | 2016-04-02 12:45:44 | Mercedes_Benz_E_320_T_CDI_Avantgarde_DPF7_Sitz... | 4999 | test | station wagon | 2004 | automatic | 204 | e_klasse | 150000 | 10 | diesel | mercedes_benz | no | 2016-04-02 00:00:00 | 47638 | 2016-04-02 12:45:44 |
# Find out the age of vehicle
# Append the result into a new column called 'age'in the DataFrame.
autos['age']= 2016-autos['registration_year']
autos.head(5)
date_crawled | name | price | ab_test | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | date_created | postal_code | last_seen | age | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | 5000 | control | bus | 2004 | manual | 158 | other | 150000 | 3 | lpg | peugeot | no | 2016-03-26 00:00:00 | 79588 | 2016-04-06 06:45:54 | 12 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | 8500 | control | sedan | 1997 | automatic | 286 | 7er | 150000 | 6 | gas | bmw | no | 2016-04-04 00:00:00 | 71034 | 2016-04-06 14:45:08 | 19 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | 8990 | test | sedan | 2009 | manual | 102 | golf | 70000 | 7 | gas | volkswagen | no | 2016-03-26 00:00:00 | 35394 | 2016-04-06 20:15:37 | 7 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | 4350 | control | super mini | 2007 | automatic | 71 | fortwo | 70000 | 6 | gas | smart | no | 2016-03-12 00:00:00 | 33729 | 2016-03-15 03:16:28 | 9 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | 1350 | test | station wagon | 2003 | manual | 0 | focus | 150000 | 7 | gas | ford | no | 2016-04-01 00:00:00 | 39218 | 2016-04-01 14:38:50 | 13 |
Having the age column we can now find out how many days lapse to sell the car. We make use the 'date_created' and 'last_seen' date. The difference would be the time lapse to sell the car. To work with date and time, we need to convert the strings into datetime datatype.
# We pick up first 10 string characters representing yyyy-mm-dd and convert to datetime object
crawling= pd.to_datetime(autos['date_crawled'].str[:10])
posting = pd.to_datetime(autos['date_created'].str[:10])
sold = pd.to_datetime(autos['last_seen'].str[:10])
time_lapse = sold - posting # how many days it took to sell
autos['time_lapse'] = time_lapse # we append new column in the autos DataFrame
autos['time_lapse'].value_counts(bins=10). sort_index()
(-1 days +16:47:59.999999, 30 days 00:00:00] 37005 (30 days 00:00:00, 60 days 00:00:00] 603 (60 days 00:00:00, 90 days 00:00:00] 5 (90 days 00:00:00, 120 days 00:00:00] 2 (120 days 00:00:00, 150 days 00:00:00] 1 (150 days 00:00:00, 180 days 00:00:00] 0 (180 days 00:00:00, 210 days 00:00:00] 1 (210 days 00:00:00, 240 days 00:00:00] 1 (240 days 00:00:00, 270 days 00:00:00] 0 (270 days 00:00:00, 300 days 00:00:00] 1 Name: time_lapse, dtype: int64
Clearly most cars ( more than 98%) sold within the first 30 days after posting. We are going to use time_lapse even more with other values such as brand, price, odometer and age. To make it easier for comparison we will split it and put the integer in separate column.
# Convert the time_lapse column into string and split it taking the integer into new column
autos['time_lapse_days']= autos['time_lapse']. astype(str).str.split().str[0].astype(int)
autos.head(5)
date_crawled | name | price | ab_test | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | date_created | postal_code | last_seen | age | time_lapse | time_lapse_days | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | 5000 | control | bus | 2004 | manual | 158 | other | 150000 | 3 | lpg | peugeot | no | 2016-03-26 00:00:00 | 79588 | 2016-04-06 06:45:54 | 12 | 11 days | 11 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | 8500 | control | sedan | 1997 | automatic | 286 | 7er | 150000 | 6 | gas | bmw | no | 2016-04-04 00:00:00 | 71034 | 2016-04-06 14:45:08 | 19 | 2 days | 2 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | 8990 | test | sedan | 2009 | manual | 102 | golf | 70000 | 7 | gas | volkswagen | no | 2016-03-26 00:00:00 | 35394 | 2016-04-06 20:15:37 | 7 | 11 days | 11 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | 4350 | control | super mini | 2007 | automatic | 71 | fortwo | 70000 | 6 | gas | smart | no | 2016-03-12 00:00:00 | 33729 | 2016-03-15 03:16:28 | 9 | 3 days | 3 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | 1350 | test | station wagon | 2003 | manual | 0 | focus | 150000 | 7 | gas | ford | no | 2016-04-01 00:00:00 | 39218 | 2016-04-01 14:38:50 | 13 | 0 days | 0 |
There are now 3 new columns: 'age', 'time_lapse' and the same values without string 'time_lapse-days'. We will use this new column more intensively in the following analysis.
We are going to dig deeper how the mean values of brand price, brand odometer, brand age affecting the time-lapse to sell. We want to see whether the turn around between posting a car listing until it got sold have anything to do with pricing, odometer and age of car.
# Create 4 dictionaries, brand_price_dict, brand_odometer_dict, brand_age_dict and brand_sold_time_dict:
brand_price_dict={}
brand_odometer_dict={}
brand_age_dict={}
brand_sold_time_dict={}
# Select unique values of brand names
brand_names=autos['brand'].unique()
for brand_name in brand_names:
mean_price = autos.loc[autos['brand']==brand_name, 'price'].mean()
mean_odometer= autos.loc[autos['brand']==brand_name, 'odometer_km'].mean()
mean_age= autos.loc[autos['brand']==brand_name, 'age'].mean()
mean_sold_time= autos.loc[autos['brand']==brand_name, 'time_lapse_days'].mean()
brand_price_dict[brand_name]=int(mean_price)
brand_odometer_dict[brand_name]=int(mean_odometer)
brand_age_dict[brand_name]=int(mean_age)
brand_sold_time_dict[brand_name]= int(mean_sold_time)
# Convert dictionaries into pandas series for easier analysis
brand_mean_price_series= pd.Series(brand_price_dict)
brand_mean_odometer_series= pd.Series(brand_odometer_dict)
brand_mean_age_series=pd.Series(brand_age_dict)
brand_mean_sold_time_series= pd.Series(brand_sold_time_dict)
# Create a new DataFrame containing all new mean values
mean_values = pd.DataFrame(brand_mean_price_series, columns=['mean_price'])
# Append the other mean values as well
mean_values['mean_odometer']= brand_mean_odometer_series
mean_values['mean_age']= brand_mean_age_series
mean_values['mean_sold_time']= brand_mean_sold_time_series
mean_values['total_count']= autos['brand'].value_counts(normalize=True)*100
mean_values.sort_values('total_count', ascending=False)
mean_price | mean_odometer | mean_age | mean_sold_time | total_count | |
---|---|---|---|---|---|
volkswagen | 6584 | 125927 | 12 | 8 | 21.074457 |
bmw | 9084 | 132052 | 12 | 9 | 12.451155 |
mercedes_benz | 9243 | 130205 | 13 | 9 | 11.116723 |
audi | 10253 | 127621 | 11 | 9 | 9.723810 |
opel | 4173 | 124271 | 12 | 8 | 8.939632 |
ford | 5253 | 119867 | 11 | 9 | 5.911906 |
renault | 3521 | 121740 | 11 | 9 | 3.787979 |
peugeot | 3924 | 122526 | 10 | 9 | 2.788485 |
fiat | 3944 | 108701 | 11 | 9 | 2.129243 |
skoda | 6795 | 110308 | 8 | 9 | 1.897977 |
seat | 5539 | 116697 | 9 | 9 | 1.746458 |
smart | 3771 | 97943 | 10 | 9 | 1.648103 |
toyota | 5531 | 113852 | 11 | 8 | 1.459369 |
mazda | 5253 | 120186 | 11 | 9 | 1.427470 |
citroen | 4546 | 114715 | 10 | 9 | 1.400888 |
nissan | 6323 | 110967 | 10 | 9 | 1.374306 |
mini | 10715 | 88679 | 8 | 10 | 1.076584 |
hyundai | 6143 | 101997 | 7 | 9 | 1.071267 |
other | 14385 | 89910 | 22 | 10 | 1.039368 |
volvo | 6105 | 137344 | 14 | 9 | 0.895824 |
kia | 6749 | 108148 | 8 | 9 | 0.768229 |
porsche | 46955 | 98812 | 14 | 9 | 0.738988 |
honda | 5266 | 116236 | 13 | 9 | 0.731014 |
mitsubishi | 4725 | 120381 | 12 | 9 | 0.696457 |
chevrolet | 7127 | 98891 | 12 | 10 | 0.659241 |
alfa_romeo | 5237 | 129549 | 14 | 9 | 0.619368 |
suzuki | 5092 | 104282 | 10 | 9 | 0.574178 |
dacia | 5959 | 84877 | 5 | 8 | 0.324304 |
chrysler | 4443 | 129098 | 14 | 10 | 0.324304 |
jeep | 11965 | 127524 | 12 | 10 | 0.273798 |
land_rover | 19108 | 118010 | 12 | 9 | 0.260507 |
jaguar | 12295 | 125072 | 14 | 9 | 0.183418 |
subaru | 5572 | 125000 | 14 | 9 | 0.180760 |
daihatsu | 2499 | 104846 | 11 | 9 | 0.172785 |
saab | 4395 | 141886 | 15 | 8 | 0.140886 |
daewoo | 1524 | 118142 | 12 | 8 | 0.093038 |
trabant | 3006 | 52058 | 33 | 10 | 0.090380 |
rover | 2734 | 133103 | 16 | 9 | 0.077089 |
lancia | 6135 | 110800 | 16 | 11 | 0.066456 |
lada | 2947 | 82500 | 16 | 7 | 0.063798 |
Notice that, taking out the mean_price, mean_odometer, mean_age and mean_sold_time are more or less about the same, yet the most dominant count is Volkswagen (21%). It's quite consistent if we look into published car market share in general in Germany such as from this site.
When we look back at 'time-lapse' to sell we could see there were some cars took very long to sell. Some of them were taking 90 days and some up to 150 days. Let's take a deeper look to find out why that's the case.
autos[autos['time_lapse_days']>60]. sort_values('time_lapse_days', ascending=False)[['brand','age', 'price','odometer_km','time_lapse']]
brand | age | price | odometer_km | time_lapse | |
---|---|---|---|---|---|
22781 | mercedes_benz | 2 | 47900 | 20000 | 300 days |
20649 | peugeot | 10 | 5200 | 80000 | 239 days |
34883 | opel | 3 | 9500 | 50000 | 209 days |
2243 | peugeot | 7 | 5499 | 70000 | 149 days |
36993 | smart | 12 | 2400 | 90000 | 98 days |
2232 | volkswagen | 10 | 14499 | 90000 | 95 days |
24851 | mercedes_benz | 8 | 15490 | 150000 | 88 days |
39270 | opel | 19 | 1099 | 150000 | 68 days |
15081 | audi | 11 | 5800 | 150000 | 63 days |
49743 | bmw | 9 | 7399 | 100000 | 62 days |
36429 | mercedes_benz | 11 | 9900 | 150000 | 61 days |
Narrowing our attention to Mercedes Benz with 2 years old and listed price of 47,900 (more than 5 times mean price) taking 300 days time lapse to get sold (price extreme) and then Opel with 19 years old (almost double mean age) and listed price of 1099 (almost a quarter of mean price), taking 68 days time lapse to get sold (age extreme), we might conclude that these two parameters (list price and age relative to mean price and mean age) affect the time lapse to get sold.
Now, let us take a look at fastest selling cars.
autos.sort_values('time_lapse_days')[['brand', 'age','price', 'odometer_km', 'time_lapse', 'date_created','last_seen', 'date_crawled']].head(20)
brand | age | price | odometer_km | time_lapse | date_created | last_seen | date_crawled | |
---|---|---|---|---|---|---|---|---|
21056 | mercedes_benz | 14 | 1100 | 150000 | 0 days | 2016-03-09 00:00:00 | 2016-03-09 17:44:48 | 2016-03-09 17:44:48 |
47283 | ford | 10 | 10299 | 125000 | 0 days | 2016-03-12 00:00:00 | 2016-03-12 19:52:59 | 2016-03-12 19:52:59 |
13875 | bmw | 10 | 11900 | 150000 | 0 days | 2016-03-09 00:00:00 | 2016-03-09 23:44:30 | 2016-03-09 14:39:10 |
13870 | fiat | 11 | 1700 | 150000 | 0 days | 2016-03-09 00:00:00 | 2016-03-09 18:16:07 | 2016-03-09 08:52:53 |
13867 | volkswagen | 7 | 19800 | 150000 | 0 days | 2016-04-05 00:00:00 | 2016-04-05 18:06:19 | 2016-04-05 18:06:19 |
47286 | citroen | 12 | 2700 | 90000 | 0 days | 2016-03-20 00:00:00 | 2016-03-20 22:39:51 | 2016-03-20 22:39:51 |
42888 | peugeot | 8 | 5000 | 150000 | 0 days | 2016-03-30 00:00:00 | 2016-03-30 10:42:11 | 2016-03-30 08:54:31 |
37049 | volkswagen | 12 | 3500 | 150000 | 0 days | 2016-03-12 00:00:00 | 2016-03-12 18:44:19 | 2016-03-12 18:44:19 |
47295 | volkswagen | 9 | 3200 | 150000 | 0 days | 2016-03-28 00:00:00 | 2016-03-28 19:53:23 | 2016-03-28 19:53:23 |
13839 | bmw | 10 | 3500 | 150000 | 0 days | 2016-03-17 00:00:00 | 2016-03-17 14:06:22 | 2016-03-17 13:58:30 |
37062 | renault | 18 | 1250 | 150000 | 0 days | 2016-03-15 00:00:00 | 2016-03-15 09:49:26 | 2016-03-15 09:49:26 |
13821 | volvo | 25 | 3800 | 150000 | 0 days | 2016-04-05 00:00:00 | 2016-04-05 21:36:20 | 2016-04-05 21:36:20 |
13820 | mercedes_benz | 5 | 12350 | 150000 | 0 days | 2016-03-31 00:00:00 | 2016-03-31 19:41:52 | 2016-03-31 18:36:23 |
13899 | mercedes_benz | 13 | 2000 | 150000 | 0 days | 2016-03-27 00:00:00 | 2016-03-27 21:47:26 | 2016-03-27 21:47:26 |
13807 | bmw | 31 | 4450 | 150000 | 0 days | 2016-04-02 00:00:00 | 2016-04-02 18:58:37 | 2016-04-02 18:58:37 |
13783 | trabant | 30 | 1200 | 60000 | 0 days | 2016-03-26 00:00:00 | 2016-03-26 13:40:06 | 2016-03-26 12:36:25 |
13768 | toyota | 9 | 2100 | 150000 | 0 days | 2016-03-18 00:00:00 | 2016-03-18 19:50:36 | 2016-03-18 19:50:36 |
13745 | bmw | 5 | 31999 | 80000 | 0 days | 2016-03-21 00:00:00 | 2016-03-21 08:46:09 | 2016-03-21 02:01:33 |
42866 | opel | 10 | 2450 | 150000 | 0 days | 2016-04-04 00:00:00 | 2016-04-04 09:52:16 | 2016-04-04 09:52:16 |
13742 | volkswagen | 13 | 2400 | 150000 | 0 days | 2016-03-23 00:00:00 | 2016-03-23 18:42:30 | 2016-03-23 18:42:30 |
The DataFrame shows listings that gone within 0 days. Reasons could be:
We won't know if the reason is due to deletion. Therefore, let's examine if the rice was really the reason. To do that we will see the difference between listed price and mean price.
for i in range(20):
name_of_brand=autos.sort_values('time_lapse_days').iloc[i]['brand']
list_price = autos.sort_values('time_lapse_days').iloc[i]['price']
mean_price = mean_values.loc[name_of_brand]['mean_price']
difference = mean_price - list_price
print(difference, ':', name_of_brand)
8143.0 : mercedes_benz -5046.0 : ford -2816.0 : bmw 2244.0 : fiat -13216.0 : volkswagen 1846.0 : citroen -1076.0 : peugeot 3084.0 : volkswagen 3384.0 : volkswagen 5584.0 : bmw 2271.0 : renault 2305.0 : volvo -3107.0 : mercedes_benz 7243.0 : mercedes_benz 4634.0 : bmw 1806.0 : trabant 3431.0 : toyota -22915.0 : bmw 1723.0 : opel 4184.0 : volkswagen
We can see that 70% of the above data shows the difference between list price and mean price is more that 1700. It's a strong indication that this is the main cause for quick time_lapse to sell.
German Ebay used cars has been analyzed using dataset crawled in 2016. After cleaning from several anomalies and inconsistencies, several conclusions can be drawn:
got mean age and if anyone want to achieve quicker time_lapse to sell, the age of the car should not be too much older than its mean age.