In this project, we'll work with a used cars dataset from eBay Kleinanzeigen, a classifieds section of the German eBay website.
The aim of this project is to clean the data and analyze the included used car listings.
The data dictionary provided with the data is as follows:
import numpy as np
import pandas as pd
autos = pd.read_csv('autos.csv', encoding="latin-1")
autos
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
49995 | 2016-03-27 14:38:19 | Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon | privat | Angebot | $24,900 | control | limousine | 2011 | automatik | 239 | q5 | 100,000km | 1 | diesel | audi | nein | 2016-03-27 00:00:00 | 0 | 82131 | 2016-04-01 13:47:40 |
49996 | 2016-03-28 10:50:25 | Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... | privat | Angebot | $1,980 | control | cabrio | 1996 | manuell | 75 | astra | 150,000km | 5 | benzin | opel | nein | 2016-03-28 00:00:00 | 0 | 44807 | 2016-04-02 14:18:02 |
49997 | 2016-04-02 14:44:48 | Fiat_500_C_1.2_Dualogic_Lounge | privat | Angebot | $13,200 | test | cabrio | 2014 | automatik | 69 | 500 | 5,000km | 11 | benzin | fiat | nein | 2016-04-02 00:00:00 | 0 | 73430 | 2016-04-04 11:47:27 |
49998 | 2016-03-08 19:25:42 | Audi_A3_2.0_TDI_Sportback_Ambition | privat | Angebot | $22,900 | control | kombi | 2013 | manuell | 150 | a3 | 40,000km | 11 | diesel | audi | nein | 2016-03-08 00:00:00 | 0 | 35683 | 2016-04-05 16:45:07 |
49999 | 2016-03-14 00:42:12 | Opel_Vectra_1.6_16V | privat | Angebot | $1,250 | control | limousine | 1996 | manuell | 101 | vectra | 150,000km | 1 | benzin | opel | nein | 2016-03-13 00:00:00 | 0 | 45897 | 2016-04-06 21:18:48 |
50000 rows × 20 columns
autos.info()
autos.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offerType            50000 non-null  object
 4   price                50000 non-null  object
 5   abtest               50000 non-null  object
 6   vehicleType          44905 non-null  object
 7   yearOfRegistration   50000 non-null  int64
 8   gearbox              47320 non-null  object
 9   powerPS              50000 non-null  int64
 10  model                47242 non-null  object
 11  odometer             50000 non-null  object
 12  monthOfRegistration  50000 non-null  int64
 13  fuelType             45518 non-null  object
 14  brand                50000 non-null  object
 15  notRepairedDamage    40171 non-null  object
 16  dateCreated          50000 non-null  object
 17  nrOfPictures         50000 non-null  int64
 18  postalCode           50000 non-null  int64
 19  lastSeen             50000 non-null  object
dtypes: int64(5), object(15)
memory usage: 7.6+ MB
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
From the output above, we can see that most columns are stored as text (object dtype) and that several columns (vehicleType, gearbox, model, fuelType, notRepairedDamage) contain null values. Let's look at the column names:
column_names = autos.columns # attribute that provides column names
print(column_names)
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model', 'odometer', 'monthOfRegistration', 'fuelType', 'brand', 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object')
We will convert the column names from camelcase to snakecase and rename a few columns to more meaningful names.
col_names = {'yearOfRegistration': 'registration_year',
             'monthOfRegistration': 'registration_month',
             'notRepairedDamage': 'unrepaired_damage',
             'dateCreated': 'ad_created'}
autos.rename(col_names,axis=1,inplace=True) # replace the column names using dictionary
autos.columns = autos.columns.str.lower()
autos.head()
datecrawled | name | seller | offertype | price | abtest | vehicletype | registration_year | gearbox | powerps | model | odometer | registration_month | fueltype | brand | unrepaired_damage | ad_created | nrofpictures | postalcode | lastseen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
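As an aside (not part of the cleaning steps above), the camelcase-to-snakecase conversion could also be done mechanically with a regex instead of a hand-written mapping. Note this is only a sketch: it would produce names like `year_of_registration` rather than the curated `registration_year` chosen above.

```python
import re

# Mechanically convert a camelCase name to snake_case:
# insert an underscore before each uppercase letter, then lowercase.
def camel_to_snake(name):
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()

columns = ["dateCrawled", "yearOfRegistration", "notRepairedDamage", "price"]
print([camel_to_snake(c) for c in columns])
# → ['date_crawled', 'year_of_registration', 'not_repaired_damage', 'price']
```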
autos.describe(include='all') # include all columns, not just the numeric ones
datecrawled | name | seller | offertype | price | abtest | vehicletype | registration_year | gearbox | powerps | model | odometer | registration_month | fueltype | brand | unrepaired_damage | ad_created | nrofpictures | postalcode | lastseen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-04-04 16:40:33 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
The 'seller' and 'offertype' columns are almost entirely a single value (49,999 of 50,000 rows each), so they carry no useful information and can be dropped.
autos.drop(['seller','offertype'], axis=1, inplace=True) #removing seller and offertype columns
autos.columns
Index(['datecrawled', 'name', 'price', 'abtest', 'vehicletype', 'registration_year', 'gearbox', 'powerps', 'model', 'odometer', 'registration_month', 'fueltype', 'brand', 'unrepaired_damage', 'ad_created', 'nrofpictures', 'postalcode', 'lastseen'], dtype='object')
The 'nrofpictures' column appears to contain no useful values. Let's check the frequency of values in this column.
autos.loc[:,'nrofpictures'].value_counts(dropna=False) # by default value_counts() method doesn't include NaN
0 50000 Name: nrofpictures, dtype: int64
Since every row in 'nrofpictures' is 0, let's drop this column as well.
autos.drop('nrofpictures',axis=1, inplace=True) # removing nrofpictures column
autos.columns
Index(['datecrawled', 'name', 'price', 'abtest', 'vehicletype', 'registration_year', 'gearbox', 'powerps', 'model', 'odometer', 'registration_month', 'fueltype', 'brand', 'unrepaired_damage', 'ad_created', 'postalcode', 'lastseen'], dtype='object')
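A sketch of how such uninformative columns could be found programmatically instead of by eye. The toy DataFrame here is illustrative; note that in the real data 'seller' has a second value in a single row, so a strict single-unique-value check would miss it and a frequency threshold would be needed.

```python
import pandas as pd

# A column with a single unique value (ignoring NaN) carries no
# information for the analysis, so it can be dropped.
df = pd.DataFrame({
    "seller": ["privat"] * 4,        # constant -> droppable
    "nrofpictures": [0, 0, 0, 0],    # constant -> droppable
    "price": [5000, 8500, 8990, 4350],
})
constant_cols = [c for c in df.columns if df[c].nunique(dropna=True) <= 1]
print(constant_cols)   # ['seller', 'nrofpictures']
df = df.drop(columns=constant_cols)
```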
autos.loc[:, 'price'].value_counts(dropna=False)
$0 1421 $500 781 $1,500 734 $2,500 643 $1,000 639 ... $5,749 1 $31,600 1 $46,990 1 $5,120 1 $13,560 1 Name: price, Length: 2357, dtype: int64
autos.loc[:, 'odometer'].value_counts(dropna=False)
150,000km 32424 125,000km 5170 100,000km 2169 90,000km 1757 80,000km 1436 70,000km 1230 60,000km 1164 50,000km 1027 5,000km 967 40,000km 819 30,000km 789 20,000km 784 10,000km 264 Name: odometer, dtype: int64
The 'price' and 'odometer' columns are numeric values stored as text. We'll remove the extra characters (',', 'km', '$'), convert them to the int datatype, and rename 'odometer' to 'odometer_km'.
autos.loc[:,'price'] = autos.loc[:,'price'].str.replace("$","",regex=False).str.replace(",","",regex=False).astype(int) # regex=False treats "$" literally ("$" is a regex metacharacter)
autos.loc[:,'price'].head()
0 5000 1 8500 2 8990 3 4350 4 1350 Name: price, dtype: int32
autos.loc[:,'odometer'] = autos.loc[:,'odometer'].str.replace("km","",regex=False).str.replace(",","",regex=False).astype(int)
autos.rename({'odometer':'odometer_km'},axis=1,inplace=True)
autos.loc[:,'odometer_km'].head()
0 150000 1 150000 2 70000 3 70000 4 150000 Name: odometer_km, dtype: int32
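The chained replacements above can also be generalized: strip every non-digit character with one regex and convert with `pd.to_numeric`, which additionally offers `errors="coerce"` if some values don't parse. A sketch with illustrative values:

```python
import pandas as pd

# Strip everything that isn't a digit, then convert to numbers.
s = pd.Series(["$5,000", "$8,500", "150,000km", "70,000km"])
cleaned = pd.to_numeric(s.str.replace(r"[^\d]", "", regex=True))
print(cleaned.tolist())   # [5000, 8500, 150000, 70000]
```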
print(autos.loc[:,'price'].unique().shape[0])
print(autos.loc[:,'price'].describe())
print(autos.loc[:,'price'].value_counts(ascending=False).head(10))
2357 count 5.000000e+04 mean 9.840044e+03 std 4.811044e+05 min 0.000000e+00 25% 1.100000e+03 50% 2.950000e+03 75% 7.200000e+03 max 1.000000e+08 Name: price, dtype: float64 0 1421 500 781 1500 734 2500 643 1000 639 1200 639 600 531 800 498 3500 498 2000 460 Name: price, dtype: int64
print(autos.loc[:,'price'].value_counts(ascending=False).sort_index(ascending=False).head(20))
99999999 1 27322222 1 12345678 3 11111111 2 10000000 1 3890000 1 1300000 1 1234566 1 999999 2 999990 1 350000 1 345000 1 299000 1 295000 1 265000 1 259000 1 250000 1 220000 1 198000 1 197000 1 Name: price, dtype: int64
The prices jump suddenly from 350,000 dollars to 999,990 dollars and beyond, so anything above 350,000 looks like an outlier. We'll remove rows with a price of 0 or above 350,000.
Let's also analyze the 'odometer_km' column further.
print(autos.loc[:,'odometer_km'].unique().shape)
print(autos.loc[:,'odometer_km'].describe())
print(autos.loc[:,'odometer_km'].value_counts(ascending=False).head(10))
(13,) count 50000.000000 mean 125732.700000 std 40042.211706 min 5000.000000 25% 125000.000000 50% 150000.000000 75% 150000.000000 max 150000.000000 Name: odometer_km, dtype: float64 150000 32424 125000 5170 100000 2169 90000 1757 80000 1436 70000 1230 60000 1164 50000 1027 5000 967 40000 819 Name: odometer_km, dtype: int64
autos = autos[autos.loc[:, "price"].between(1, 350000)] # keep prices from $1 up to $350,000 inclusive
print(autos.loc[:,"price"].describe())
count 48565.000000 mean 5888.935591 std 9059.854754 min 1.000000 25% 1200.000000 50% 3000.000000 75% 7490.000000 max 350000.000000 Name: price, dtype: float64
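The $350,000 cutoff was chosen by inspecting the sorted prices. As a sketch of a more mechanical alternative, an IQR-based rule flags values far above the third quartile; the numbers below are toy values for illustration, and on this dataset such a rule would be much stricter than the manual cutoff.

```python
import pandas as pd

# Flag values above Q3 + 1.5*IQR as outliers (Tukey's rule).
prices = pd.Series([1200, 3000, 7490, 5000, 99999999, 250, 15000])
q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1
upper = q3 + 1.5 * iqr
outliers = prices[prices > upper]
print(outliers.tolist())   # [99999999]
```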
Let's analyze the date columns: 'datecrawled', 'ad_created', and 'lastseen'.
autos[['datecrawled','ad_created','lastseen']][0:5]
datecrawled | ad_created | lastseen | |
---|---|---|---|
0 | 2016-03-26 17:47:46 | 2016-03-26 00:00:00 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | 2016-04-04 00:00:00 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | 2016-03-26 00:00:00 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | 2016-03-12 00:00:00 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | 2016-04-01 00:00:00 | 2016-04-01 14:38:50 |
Let's find the distribution of values in each of these columns by extracting just the date portion of each string.
autos.loc[:,'datecrawled'].str[:10].value_counts(normalize=True, dropna=False).sort_index()
2016-03-05 0.025327 2016-03-06 0.014043 2016-03-07 0.036014 2016-03-08 0.033296 2016-03-09 0.033090 2016-03-10 0.032184 2016-03-11 0.032575 2016-03-12 0.036920 2016-03-13 0.015670 2016-03-14 0.036549 2016-03-15 0.034284 2016-03-16 0.029610 2016-03-17 0.031628 2016-03-18 0.012911 2016-03-19 0.034778 2016-03-20 0.037887 2016-03-21 0.037373 2016-03-22 0.032987 2016-03-23 0.032225 2016-03-24 0.029342 2016-03-25 0.031607 2016-03-26 0.032204 2016-03-27 0.031092 2016-03-28 0.034860 2016-03-29 0.034099 2016-03-30 0.033687 2016-03-31 0.031834 2016-04-01 0.033687 2016-04-02 0.035478 2016-04-03 0.038608 2016-04-04 0.036487 2016-04-05 0.013096 2016-04-06 0.003171 2016-04-07 0.001400 Name: datecrawled, dtype: float64
autos.loc[:,'datecrawled'].str[:10].value_counts(normalize=True, dropna=False).sort_values()
2016-04-07 0.001400 2016-04-06 0.003171 2016-03-18 0.012911 2016-04-05 0.013096 2016-03-06 0.014043 2016-03-13 0.015670 2016-03-05 0.025327 2016-03-24 0.029342 2016-03-16 0.029610 2016-03-27 0.031092 2016-03-25 0.031607 2016-03-17 0.031628 2016-03-31 0.031834 2016-03-10 0.032184 2016-03-26 0.032204 2016-03-23 0.032225 2016-03-11 0.032575 2016-03-22 0.032987 2016-03-09 0.033090 2016-03-08 0.033296 2016-04-01 0.033687 2016-03-30 0.033687 2016-03-29 0.034099 2016-03-15 0.034284 2016-03-19 0.034778 2016-03-28 0.034860 2016-04-02 0.035478 2016-03-07 0.036014 2016-04-04 0.036487 2016-03-14 0.036549 2016-03-12 0.036920 2016-03-21 0.037373 2016-03-20 0.037887 2016-04-03 0.038608 Name: datecrawled, dtype: float64
The listings were crawled between early March and early April 2016, at a roughly uniform rate each day; the rate drops sharply over the final few days of the crawl (April 5-7).
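The `str[:10]` slicing works because the dates are ISO-formatted strings. As a sketch, the same distribution can be computed via proper datetime parsing with `pd.to_datetime`, which also unlocks the `.dt` accessor (values below are illustrative):

```python
import pandas as pd

# Parse timestamp strings, then group by calendar date.
s = pd.Series(["2016-03-26 17:47:46", "2016-04-04 13:38:56"])
dates = pd.to_datetime(s)
print(dates.dt.date.value_counts(normalize=True).sort_index())
```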
autos.loc[:,'ad_created'].str[:10].value_counts(normalize=True, dropna=False).sort_index()
2015-06-11 0.000021 2015-08-10 0.000021 2015-09-09 0.000021 2015-11-10 0.000021 2015-12-05 0.000021 ... 2016-04-03 0.038855 2016-04-04 0.036858 2016-04-05 0.011819 2016-04-06 0.003253 2016-04-07 0.001256 Name: ad_created, Length: 76, dtype: float64
autos.loc[:,'ad_created'].str[:10].value_counts(normalize=True).sort_index().tail(40)
2016-02-28 0.000206 2016-02-29 0.000165 2016-03-01 0.000103 2016-03-02 0.000103 2016-03-03 0.000865 2016-03-04 0.001483 2016-03-05 0.022897 2016-03-06 0.015320 2016-03-07 0.034737 2016-03-08 0.033316 2016-03-09 0.033151 2016-03-10 0.031895 2016-03-11 0.032904 2016-03-12 0.036755 2016-03-13 0.017008 2016-03-14 0.035190 2016-03-15 0.034016 2016-03-16 0.030125 2016-03-17 0.031278 2016-03-18 0.013590 2016-03-19 0.033687 2016-03-20 0.037949 2016-03-21 0.037579 2016-03-22 0.032801 2016-03-23 0.032060 2016-03-24 0.029280 2016-03-25 0.031751 2016-03-26 0.032266 2016-03-27 0.030989 2016-03-28 0.034984 2016-03-29 0.034037 2016-03-30 0.033501 2016-03-31 0.031875 2016-04-01 0.033687 2016-04-02 0.035149 2016-04-03 0.038855 2016-04-04 0.036858 2016-04-05 0.011819 2016-04-06 0.003253 2016-04-07 0.001256 Name: ad_created, dtype: float64
autos.loc[:,'ad_created'].str[:10].value_counts(normalize=True, dropna=False).sort_values().tail(40)
2016-03-01 0.000103 2016-02-27 0.000124 2016-02-29 0.000165 2016-02-28 0.000206 2016-03-03 0.000865 2016-04-07 0.001256 2016-03-04 0.001483 2016-04-06 0.003253 2016-04-05 0.011819 2016-03-18 0.013590 2016-03-06 0.015320 2016-03-13 0.017008 2016-03-05 0.022897 2016-03-24 0.029280 2016-03-16 0.030125 2016-03-27 0.030989 2016-03-17 0.031278 2016-03-25 0.031751 2016-03-31 0.031875 2016-03-10 0.031895 2016-03-23 0.032060 2016-03-26 0.032266 2016-03-22 0.032801 2016-03-11 0.032904 2016-03-09 0.033151 2016-03-08 0.033316 2016-03-30 0.033501 2016-04-01 0.033687 2016-03-19 0.033687 2016-03-15 0.034016 2016-03-29 0.034037 2016-03-07 0.034737 2016-03-28 0.034984 2016-04-02 0.035149 2016-03-14 0.035190 2016-03-12 0.036755 2016-04-04 0.036858 2016-03-21 0.037579 2016-03-20 0.037949 2016-04-03 0.038855 Name: ad_created, dtype: float64
Most of the ads were created in March and April 2016, which matches the crawl window and explains the high crawling rate in those months. A handful of ads date back to 2015, but they account for a tiny fraction of listings.
autos.loc[:,'lastseen'].str[:10].value_counts(normalize=True, dropna=False).sort_values()
2016-03-05 0.001071 2016-03-06 0.004324 2016-03-07 0.005395 2016-03-18 0.007351 2016-03-08 0.007413 2016-03-13 0.008895 2016-03-09 0.009595 2016-03-10 0.010666 2016-03-11 0.012375 2016-03-14 0.012602 2016-03-27 0.015649 2016-03-19 0.015834 2016-03-15 0.015876 2016-03-16 0.016452 2016-03-26 0.016802 2016-03-23 0.018532 2016-03-25 0.019211 2016-03-24 0.019767 2016-03-21 0.020632 2016-03-20 0.020653 2016-03-28 0.020859 2016-03-22 0.021373 2016-03-29 0.022341 2016-04-01 0.022794 2016-03-31 0.023783 2016-03-12 0.023783 2016-04-04 0.024483 2016-03-30 0.024771 2016-04-02 0.024915 2016-04-03 0.025203 2016-03-17 0.028086 2016-04-05 0.124761 2016-04-07 0.131947 2016-04-06 0.221806 Name: lastseen, dtype: float64
autos.loc[:,'lastseen'].str[:10].value_counts(normalize=True, dropna=False).sort_index()
2016-03-05 0.001071 2016-03-06 0.004324 2016-03-07 0.005395 2016-03-08 0.007413 2016-03-09 0.009595 2016-03-10 0.010666 2016-03-11 0.012375 2016-03-12 0.023783 2016-03-13 0.008895 2016-03-14 0.012602 2016-03-15 0.015876 2016-03-16 0.016452 2016-03-17 0.028086 2016-03-18 0.007351 2016-03-19 0.015834 2016-03-20 0.020653 2016-03-21 0.020632 2016-03-22 0.021373 2016-03-23 0.018532 2016-03-24 0.019767 2016-03-25 0.019211 2016-03-26 0.016802 2016-03-27 0.015649 2016-03-28 0.020859 2016-03-29 0.022341 2016-03-30 0.024771 2016-03-31 0.023783 2016-04-01 0.022794 2016-04-02 0.024915 2016-04-03 0.025203 2016-04-04 0.024483 2016-04-05 0.124761 2016-04-06 0.221806 2016-04-07 0.131947 Name: lastseen, dtype: float64
The 'lastseen' values spike abruptly in the last three days (April 5-7 account for nearly half of all values), even though the crawling rate was low on those days. Since the crawler stopped on April 7, these values most likely just mark the end of the crawl rather than a sudden surge of cars being sold.
We also have 'registration_year', which indicates the year the car was first registered. Let's check the distribution of this column.
autos.loc[:, 'registration_year'].describe()
count 48565.000000 mean 2004.755421 std 88.643887 min 1000.000000 25% 1999.000000 50% 2004.000000 75% 2008.000000 max 9999.000000 Name: registration_year, dtype: float64
The minimum year is 1000 and the maximum is 9999; neither is realistic.
A car can't be first registered after its listing was seen, so any vehicle with a registration year above 2016 is definitely inaccurate. Let's count the listings that fall outside the 1900-2016 interval and see whether it's safe to remove those rows entirely, or whether we need more custom logic.
bool_map = (autos.loc[:, 'registration_year'] < 1900) | (autos.loc[:, 'registration_year'] > 2016)
print(autos.loc[bool_map, 'registration_year'].value_counts())
print(autos.loc[bool_map, 'registration_year'].describe())
print('Percentage :', (autos.loc[bool_map, 'registration_year'].shape[0]/autos.shape[0]) * 100)
2017 1392 2018 470 5000 4 9999 3 2019 2 1800 2 2800 1 6200 1 4500 1 8888 1 4800 1 1000 1 9000 1 1001 1 1111 1 5911 1 4100 1 Name: registration_year, dtype: int64 count 1884.000000 mean 2050.461783 std 446.321293 min 1000.000000 25% 2017.000000 50% 2017.000000 75% 2018.000000 max 9999.000000 Name: registration_year, dtype: float64 Percentage : 3.8793369710697
Many listings have 2017 or 2018 as the registration year; the other invalid years appear only once or twice each. Together these rows account for under 4% of all entries, so, being both unrealistic and few, they can safely be removed.
autos = autos.loc[~bool_map]
print(autos.loc[:, 'registration_year'].value_counts(normalize=True).sort_values(ascending=False).head(22))
print(autos.loc[:, 'registration_year'].value_counts(normalize=True).sort_index(ascending=False).head(22))
2000 0.067608 2005 0.062895 1999 0.062060 2004 0.057904 2003 0.057818 2006 0.057197 2001 0.056468 2002 0.053255 1998 0.050620 2007 0.048778 2008 0.047450 2009 0.044665 1997 0.041794 2011 0.034768 2010 0.034040 1996 0.029412 2012 0.028063 1995 0.026285 2016 0.026135 2013 0.017202 2014 0.014203 1994 0.013474 Name: registration_year, dtype: float64 2016 0.026135 2015 0.008397 2014 0.014203 2013 0.017202 2012 0.028063 2011 0.034768 2010 0.034040 2009 0.044665 2008 0.047450 2007 0.048778 2006 0.057197 2005 0.062895 2004 0.057904 2003 0.057818 2002 0.053255 2001 0.056468 2000 0.067608 1999 0.062060 1998 0.050620 1997 0.041794 1996 0.029412 1995 0.026285 Name: registration_year, dtype: float64
Most of the registrations fall between 1995 and 2016, i.e., roughly the 20 years leading up to the crawl.
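A sketch of how the "most registrations fall in 1995-2016" claim could be quantified, by summing the normalized counts inside that window (toy years for illustration):

```python
import pandas as pd

# Share of listings whose registration year lies in [1995, 2016].
years = pd.Series([1994, 1996, 2000, 2004, 2008, 2016, 1980, 2001])
dist = years.value_counts(normalize=True)
share = dist[(dist.index >= 1995) & (dist.index <= 2016)].sum()
print(round(share, 3))   # 0.75
```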
# Create a function to retrieve the top brands, i.e. those with more than 5% of listings
def top_brands_calc(dataf):
    brands = dataf.loc[:, 'brand'].value_counts(normalize=True)
    top_brands = brands[brands > 0.05].index
    print(top_brands)
    return top_brands
top_brands = top_brands_calc(autos)
Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford'], dtype='object')
# Let's create a function to calculate mean prices and save in dictionary
mean_prices = {}
def mean_prices_calc(top_brands, dataf):
    for brand in top_brands:
        brand_rows = dataf[dataf['brand'] == brand]
        mean_price = brand_rows['price'].mean()
        mean_prices[brand] = int(mean_price)
    print(mean_prices)
    return mean_prices
mean_prices = mean_prices_calc(top_brands, autos)
{'volkswagen': 5402, 'bmw': 8332, 'opel': 2975, 'mercedes_benz': 8628, 'audi': 9336, 'ford': 3749}
Audi, BMW, and Mercedes-Benz are the most expensive; Ford and Opel are the least expensive; Volkswagen sits in between.
# Let's create a function that returns the mean_prices dictionary as a Series sorted in descending order
def series_sorted(mean_prices):
    mean_price_series = pd.Series(mean_prices)
    mean_price_sorted = mean_price_series.sort_values(ascending=False)
    return mean_price_sorted
mean_price_series_sorted = series_sorted(mean_prices)
print(mean_price_series_sorted)
audi 9336 mercedes_benz 8628 bmw 8332 volkswagen 5402 ford 3749 opel 2975 dtype: int64
# Let's create a function to calculate mean mileage and save in dictionary
mean_mileage = {}
def mean_mileage_calc(top_brands, dataf):
    for brand in top_brands:
        brand_rows = dataf[dataf['brand'] == brand]
        mileage = brand_rows['odometer_km'].mean()
        mean_mileage[brand] = int(mileage)
    print(mean_mileage)
    return mean_mileage
mean_mileage = mean_mileage_calc(top_brands, autos)
{'volkswagen': 128707, 'bmw': 132572, 'opel': 129310, 'mercedes_benz': 130788, 'audi': 129157, 'ford': 124266}
# create series object for the mean_mileage
mean_mileage_series = pd.Series(mean_mileage)
print(mean_mileage_series)
volkswagen 128707 bmw 132572 opel 129310 mercedes_benz 130788 audi 129157 ford 124266 dtype: int64
# create a dataframe that includes mean_price and mean_mileage series objects
df1 = pd.DataFrame(mean_price_series_sorted, columns=['mean_price'])
df1.loc[:, 'mean_mileage'] = mean_mileage_series
df1
mean_price | mean_mileage | |
---|---|---|
audi | 9336 | 129157 |
mercedes_benz | 8628 | 130788 |
bmw | 8332 | 132572 |
volkswagen | 5402 | 128707 |
ford | 3749 | 124266 |
opel | 2975 | 129310 |
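As a sketch, the loop-based helper functions used above can be collapsed into a single groupby aggregation; the toy rows below are illustrative:

```python
import pandas as pd

# Per-brand mean price and mileage in one groupby, sorted by price.
df = pd.DataFrame({
    "brand":       ["audi", "audi", "opel", "opel"],
    "price":       [9000, 11000, 2500, 3500],
    "odometer_km": [125000, 130000, 150000, 125000],
})
summary = (df.groupby("brand")
             .agg(mean_price=("price", "mean"),
                  mean_mileage=("odometer_km", "mean"))
             .sort_values("mean_price", ascending=False))
print(summary)
```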
Car prices vary a lot between brands, but mileage varies only slightly. There is another column, 'unrepaired_damage', which specifies whether a car has damage that hasn't been repaired. Let's check its distribution.
autos.loc[:, 'unrepaired_damage'].value_counts(dropna=False)
nein 33834 NaN 8307 ja 4540 Name: unrepaired_damage, dtype: int64
~18% of the cars don't have damage info (NaN), ~10% are damaged, and the rest are undamaged. Let's remove the rows without damage info and recalculate the mean prices and mileage.
# Let's remove all rows where 'unrepaired_damage' is NaN
print(autos.shape)
autos = autos[autos.loc[:, 'unrepaired_damage'].notnull()]
autos.loc[:, 'unrepaired_damage'].value_counts(dropna=False)
print(autos.shape)
(38374, 17) (38374, 17)
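A sketch of the same filtering written with `dropna(subset=...)` instead of a notnull mask (toy data for illustration):

```python
import pandas as pd

# Drop rows whose 'unrepaired_damage' value is missing.
df = pd.DataFrame({"unrepaired_damage": ["nein", None, "ja", "nein"],
                   "price": [5000, 8500, 1200, 3000]})
df = df.dropna(subset=["unrepaired_damage"])
print(df.shape)   # (3, 2)
```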
# Calculate the mean price and mileage table after removing the NaN values
top_brands = top_brands_calc(autos)
mean_prices = mean_prices_calc(top_brands,autos)
mean_mileage = mean_mileage_calc(top_brands,autos)
mean_price_series_sorted = series_sorted(mean_prices)
mean_mileage_series = pd.Series(mean_mileage)
df2 = pd.DataFrame(mean_price_series_sorted, columns=['mean_price'])
df2.loc[:, 'mean_mileage'] = mean_mileage_series
df2
Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford'], dtype='object') {'volkswagen': 6009, 'bmw': 8888, 'opel': 3320, 'mercedes_benz': 9269, 'audi': 10170, 'ford': 4137} {'volkswagen': 127004, 'bmw': 132051, 'opel': 128172, 'mercedes_benz': 130334, 'audi': 127240, 'ford': 123469}
mean_price | mean_mileage | |
---|---|---|
audi | 10170 | 127240 |
mercedes_benz | 9269 | 130334 |
bmw | 8888 | 132051 |
volkswagen | 6009 | 127004 |
ford | 4137 | 123469 |
opel | 3320 | 128172 |
print('Values before removing NaN rows')
print('Mileage: ', 'min: ', df1['mean_mileage'].min(), 'max:', df1['mean_mileage'].max(), 'mean: ',df1['mean_mileage'].mean())
print('Price : '' min:', df1['mean_price'].min(), ' max: ', df1['mean_price'].max(), ' mean:', df1['mean_price'].mean())
print('Values after removing NaN rows')
print('Mileage: ', 'min: ', df2['mean_mileage'].min(), 'max:', df2['mean_mileage'].max(), 'mean: ',df2['mean_mileage'].mean())
print('Price : '' min:', df2['mean_price'].min(), ' max: ', df2['mean_price'].max(), 'mean:', df2['mean_price'].mean())
Values before removing NaN rows Mileage: min: 124266 max: 132572 mean: 129133.33333333333 Price : min: 2975 max: 9336 mean: 6403.666666666667 Values after removing NaN rows Mileage: min: 123469 max: 132051 mean: 128045.0 Price : min: 3320 max: 10170 mean: 6965.5
After dropping the rows without damage info, the mean prices increased and the mean mileage decreased slightly. About 12% of the remaining cars are damaged.
# List of cars that have damage
autos.loc[autos.loc[:, 'unrepaired_damage'] == 'ja', 'brand'].value_counts(normalize=True).head(10)
volkswagen 0.186784 opel 0.130176 ford 0.091189 bmw 0.088767 mercedes_benz 0.075551 audi 0.073348 renault 0.062115 peugeot 0.041189 fiat 0.029956 nissan 0.020485 Name: brand, dtype: float64
Volkswagen tops the list, accounting for ~19% of damaged cars. Audi, the most expensive brand so far, accounts for ~7% of damaged cars, while the low-priced Opel accounts for ~13%.
# List of cars that don't have damage
autos.loc[autos.loc[:, 'unrepaired_damage'] == 'nein', 'brand'].value_counts(normalize=True).head(10)
volkswagen 0.208755 bmw 0.116687 mercedes_benz 0.102412 opel 0.100254 audi 0.090560 ford 0.064698 renault 0.043093 peugeot 0.028935 fiat 0.023438 seat 0.018620 Name: brand, dtype: float64
Volkswagen again leads with ~21% of undamaged cars; Audi accounts for ~9% and Opel for ~10%.
Consider Opel: it is low priced, but its share of damaged cars is about 3 percentage points higher than its share of undamaged cars. Audi, the highest-priced brand, shows the opposite pattern, with an undamaged share about 2 points higher than its damaged share. To decide which brand offers a moderate price, lower mileage, and no damage, let's drop all damaged cars (roughly 12% of the remaining data) and recompute the mean price and mileage.
# Let's consider the rows that contain 'nein' in column i.e non damaged cars
autos_nein = autos.loc[autos.loc[:, 'unrepaired_damage'] == 'nein']
# Calculate the mean price and mileage table for the non damaged cars
top_brands = top_brands_calc(autos_nein)
mean_prices = mean_prices_calc(top_brands,autos_nein)
mean_mileage = mean_mileage_calc(top_brands,autos_nein)
mean_price_series_sorted = series_sorted(mean_prices)
mean_mileage_series = pd.Series(mean_mileage)
df3 = pd.DataFrame(mean_price_series_sorted, columns=['mean_price'])
df3.loc[:, 'mean_mileage'] = mean_mileage_series
df3
Index(['volkswagen', 'bmw', 'mercedes_benz', 'opel', 'audi', 'ford'], dtype='object') {'volkswagen': 6469, 'bmw': 9437, 'opel': 3660, 'mercedes_benz': 9798, 'audi': 10914, 'ford': 4660} {'volkswagen': 125718, 'bmw': 131115, 'opel': 127022, 'mercedes_benz': 129389, 'audi': 125505, 'ford': 121343}
mean_price | mean_mileage | |
---|---|---|
audi | 10914 | 125505 |
mercedes_benz | 9798 | 129389 |
bmw | 9437 | 131115 |
volkswagen | 6469 | 125718 |
ford | 4660 | 121343 |
opel | 3660 | 127022 |
print('[1] Values before removing NaN rows')
print('Mileage: ''max :', df1['mean_mileage'].max(), 'min: ', df1['mean_mileage'].min(), 'mean: ',df1['mean_mileage'].mean())
print('Price : ''max :', df1['mean_price'].max(), 'min: ', df1['mean_price'].min(), 'mean:', df1['mean_price'].mean())
print('[2] Values after removing NaN rows')
print('Mileage: ''max :', df2['mean_mileage'].max(), 'min: ', df2['mean_mileage'].min(), 'mean: ',df2['mean_mileage'].mean())
print('Price : ''max :', df2['mean_price'].max(), 'min: ', df2['mean_price'].min(), 'mean:', df2['mean_price'].mean())
print("[3] Values of non damaged cars")
print('Mileage: ''max :', df3['mean_mileage'].max(), 'min: ', df3['mean_mileage'].min(), 'mean: ',df3['mean_mileage'].mean())
print('Price : ''max :', df3['mean_price'].max(), 'min: ', df3['mean_price'].min(), 'mean:', df3['mean_price'].mean())
[1] Values before removing NaN rows Mileage: max : 132572 min: 124266 mean: 129133.33333333333 Price : max : 9336 min: 2975 mean: 6403.666666666667 [2] Values after removing NaN rows Mileage: max : 132051 min: 123469 mean: 128045.0 Price : max : 10170 min: 3320 mean: 6965.5 [3] Values of non damaged cars Mileage: max : 131115 min: 121343 mean: 126682.0 Price : max : 10914 min: 3660 mean: 7489.666666666667
Car prices have further increased and mileage has decreased after removing the damaged cars.
Let's compare Audi, the highest-priced brand in all three tables, with Opel and Volkswagen.
Comparing Audi with Opel (the lowest-priced brand): Opel's mean price is 66.47% lower than Audi's, while its mean mileage is only 1.21% higher.
Comparing Audi with Volkswagen (the brand closest to the overall mean price and mileage): VW's mean price is 40.73% lower than Audi's, with mean mileage only 0.17% higher.
In both cases, the price difference dwarfs the mileage difference. It's better to buy a Volkswagen than an Opel, for the following reasons:
Opel is an older brand than Audi and Volkswagen.
Volkswagen is priced 40.73% below Audi, and its mileage is only 0.17% higher, which is negligible.
So, Volkswagen is the winner :)
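The percentage figures quoted above can be reproduced from the df3 means (values copied from the non-damaged-cars table; the helper function names here are just for illustration):

```python
# Mean values from the non-damaged-cars table (df3) above.
mean_price = {"audi": 10914, "volkswagen": 6469, "opel": 3660}
mean_mileage = {"audi": 125505, "volkswagen": 125718, "opel": 127022}

def pct_below(base, other):
    # how much lower `other` is than `base`, in percent
    return round((base - other) / base * 100, 2)

def pct_above(base, other):
    # how much higher `other` is than `base`, in percent
    return round((other - base) / base * 100, 2)

print(pct_below(mean_price["audi"], mean_price["opel"]))            # 66.47
print(pct_below(mean_price["audi"], mean_price["volkswagen"]))      # 40.73
print(pct_above(mean_mileage["audi"], mean_mileage["opel"]))        # 1.21
print(pct_above(mean_mileage["audi"], mean_mileage["volkswagen"]))  # 0.17
```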