In this project we will work with a dataset of used cars from eBay Kleinanzeigen, a classifieds section of the German eBay website. The data has been modified since first upload.
The aim of this project is to clean the data and analyze the included used car listings.
The data dictionary provided with data is as follows:
import pandas as pd
import numpy as np
autos = pd.read_csv("autos.csv", encoding="Latin-1")
autos
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
49995 | 2016-03-27 14:38:19 | Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon | privat | Angebot | $24,900 | control | limousine | 2011 | automatik | 239 | q5 | 100,000km | 1 | diesel | audi | nein | 2016-03-27 00:00:00 | 0 | 82131 | 2016-04-01 13:47:40 |
49996 | 2016-03-28 10:50:25 | Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... | privat | Angebot | $1,980 | control | cabrio | 1996 | manuell | 75 | astra | 150,000km | 5 | benzin | opel | nein | 2016-03-28 00:00:00 | 0 | 44807 | 2016-04-02 14:18:02 |
49997 | 2016-04-02 14:44:48 | Fiat_500_C_1.2_Dualogic_Lounge | privat | Angebot | $13,200 | test | cabrio | 2014 | automatik | 69 | 500 | 5,000km | 11 | benzin | fiat | nein | 2016-04-02 00:00:00 | 0 | 73430 | 2016-04-04 11:47:27 |
49998 | 2016-03-08 19:25:42 | Audi_A3_2.0_TDI_Sportback_Ambition | privat | Angebot | $22,900 | control | kombi | 2013 | manuell | 150 | a3 | 40,000km | 11 | diesel | audi | nein | 2016-03-08 00:00:00 | 0 | 35683 | 2016-04-05 16:45:07 |
49999 | 2016-03-14 00:42:12 | Opel_Vectra_1.6_16V | privat | Angebot | $1,250 | control | limousine | 1996 | manuell | 101 | vectra | 150,000km | 1 | benzin | opel | nein | 2016-03-13 00:00:00 | 0 | 45897 | 2016-04-06 21:18:48 |
50000 rows × 20 columns
autos.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 dateCrawled 50000 non-null object 1 name 50000 non-null object 2 seller 50000 non-null object 3 offerType 50000 non-null object 4 price 50000 non-null object 5 abtest 50000 non-null object 6 vehicleType 44905 non-null object 7 yearOfRegistration 50000 non-null int64 8 gearbox 47320 non-null object 9 powerPS 50000 non-null int64 10 model 47242 non-null object 11 odometer 50000 non-null object 12 monthOfRegistration 50000 non-null int64 13 fuelType 45518 non-null object 14 brand 50000 non-null object 15 notRepairedDamage 40171 non-null object 16 dateCreated 50000 non-null object 17 nrOfPictures 50000 non-null int64 18 postalCode 50000 non-null int64 19 lastSeen 50000 non-null object dtypes: int64(5), object(15) memory usage: 7.6+ MB
autos.head()
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
Our dataset contains 20 columns, most of which are stored as strings. There are some columns with null values (including gearbox, model, fuel type and not repaired damage), but no columns have more than ~20% null values. There are some columns that contain dates stored as strings including date crawled, date created and last seen.
We'll start by cleaning the column names to make the data easier to work with.
autos.columns
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model', 'odometer', 'monthOfRegistration', 'fuelType', 'brand', 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object')
Firstly we will change all camelcase to snakecase.
Secondly we will change some names.
new_columns = []
for a in autos.columns:
import re
a = a.replace("yearOfRegistration","registration_year")
a = a.replace("monthOfRegistration","registration_month")
a = a.replace("notRepairedDamage","unrepaired_damage")
a = a.replace("dateCreated","ad_created")
if a not in ["ad_created","unrepaired_damage","registration_month","registration_year"]:
a = re.sub('(.)([A-Z][a-z]+)', r'\1_\2', a)
a = re.sub('([a-z0-9])([A-Z])', r'\1_\2', a).lower()
new_columns.append(a)
autos.columns = new_columns
autos.head()
# print(new_columns)
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
autos.columns
Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest', 'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model', 'odometer', 'registration_month', 'fuel_type', 'brand', 'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code', 'last_seen'], dtype='object')
Various columns have specific German words. We will map translations of these German words by their English meanings.
autos.columns
Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest', 'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model', 'odometer', 'registration_month', 'fuel_type', 'brand', 'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code', 'last_seen'], dtype='object')
autos["gearbox"].unique()
array(['manuell', 'automatik', nan], dtype=object)
mapping_dict = {
'manuell': 'manual',
'automatik': 'automatic',
}
autos["gearbox"] = autos["gearbox"].map(mapping_dict)
autos["gearbox"].unique()
array(['manual', 'automatic', nan], dtype=object)
autos["fuel_type"].unique()
array(['lpg', 'benzin', 'diesel', nan, 'cng', 'hybrid', 'elektro', 'andere'], dtype=object)
mapping_dict = {
'benzin': 'petrol',
'elektro': 'electric',
'andere' : 'others'
}
autos["fuel_type"] = autos["fuel_type"].map(mapping_dict)
autos["fuel_type"].unique()
array([nan, 'petrol', 'electric', 'others'], dtype=object)
autos["unrepaired_damage"].unique()
array(['nein', nan, 'ja'], dtype=object)
mapping_dict = {
'nein': 'no',
'ja': 'yes'
}
autos["unrepaired_damage"] = autos["unrepaired_damage"].map(mapping_dict)
autos["unrepaired_damage"].unique()
array(['no', nan, 'yes'], dtype=object)
We will look for text columns where the values are the same, in case these can be dropped as not useful for analysis.
We will also look for examples of numeric data stored as text to be cleaned and converted.
autos.describe(include="all")
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 30148 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 3 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-03-19 17:36:18 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manual | NaN | golf | 150,000km | NaN | petrol | volkswagen | no | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
Some columns demonstrate issues:
seller
, gearbox
, unrepaired_damage
and offer_type
only each have 2 unique data pointsNaN
unique datanr_of_pictures
seems to have NaN
and 0
onlyThese will be investigated further by running value_counts()
and head()
functions.
autos["nr_of_pictures"].value_counts()
0 50000 Name: nr_of_pictures, dtype: int64
autos["seller"].value_counts()
privat 49999 gewerblich 1 Name: seller, dtype: int64
autos["offer_type"].value_counts()
Angebot 49999 Gesuch 1 Name: offer_type, dtype: int64
nr_of_pictures
, seller
and offer_type
have no values of relevance and will be dropped.
autos = autos.drop(["nr_of_pictures", "seller","offer_type"], axis=1)
autos.columns
Index(['date_crawled', 'name', 'price', 'abtest', 'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model', 'odometer', 'registration_month', 'fuel_type', 'brand', 'unrepaired_damage', 'ad_created', 'postal_code', 'last_seen'], dtype='object')
autos["unrepaired_damage"].value_counts()
no 35232 yes 4939 Name: unrepaired_damage, dtype: int64
autos["gearbox"].value_counts()
manual 36993 automatic 10327 Name: gearbox, dtype: int64
unrepaired_damage
and gearbox
will be retained even though they each have only two unique items in case these items are relevant to analysis.
price
, odometer
to int
type after deleting non-int characters¶price and
odometercolumns
data will be amended to delete string characters.
After that the data type of each of these columns will be changed to integer. In addition odometer
column name will be changed to odometer_km
.
For price
we will also check that there are no integers below 0.
autos["price"].value_counts(dropna=False)
$0 1421 $500 781 $1,500 734 $2,500 643 $1,000 639 ... $1,933 1 $4,840 1 $5,997 1 $4,510 1 $3,780 1 Name: price, Length: 2357, dtype: int64
autos.dtypes
date_crawled object name object price object abtest object vehicle_type object registration_year int64 gearbox object power_ps int64 model object odometer object registration_month int64 fuel_type object brand object unrepaired_damage object ad_created object postal_code int64 last_seen object dtype: object
autos["price"].head()
0 $5,000 1 $8,500 2 $8,990 3 $4,350 4 $1,350 Name: price, dtype: object
autos["odometer"].head()
0 150,000km 1 150,000km 2 70,000km 3 70,000km 4 150,000km Name: odometer, dtype: object
new_col = []
for a in autos["price"]:
a = a.replace("$","")
a = a.replace(",","")
new_col.append(a)
# autos["price"] = autos["price"].as_numeric() # for some reason this fucntion does not work
# to be further investigated
autos["price"] = new_col
autos["price"] = autos["price"].astype(int)
autos["price"] = autos["price"]
autos["price"].head()
0 5000 1 8500 2 8990 3 4350 4 1350 Name: price, dtype: int64
autos["price"].value_counts()
0 1421 500 781 1500 734 2500 643 1000 639 ... 20790 1 8970 1 846 1 2895 1 33980 1 Name: price, Length: 2357, dtype: int64
autos_below_zero = autos.loc[autos["price"] < 0]
print(autos_below_zero)
Empty DataFrame Columns: [date_crawled, name, price, abtest, vehicle_type, registration_year, gearbox, power_ps, model, odometer, registration_month, fuel_type, brand, unrepaired_damage, ad_created, postal_code, last_seen] Index: []
No price
data that is below zero.
new_col_2 = []
for a in autos["odometer"]:
a = a.replace("km","")
a = a.replace(",","")
new_col_2.append(a)
autos["odometer"] = new_col_2
autos["odometer"] = autos["odometer"].astype(int)
autos["odometer"].head(100)
0 150000 1 150000 2 70000 3 70000 4 150000 ... 95 40000 96 150000 97 125000 98 125000 99 60000 Name: odometer, Length: 100, dtype: int64
autos["odometer"].value_counts(dropna=False, bins=20)
(142750.0, 150000.0] 32424 (121000.0, 128250.0] 5170 (99250.0, 106500.0] 2169 (84750.0, 92000.0] 1757 (77500.0, 84750.0] 1436 (4854.999, 12250.0] 1231 (63000.0, 70250.0] 1230 (55750.0, 63000.0] 1164 (48500.0, 55750.0] 1027 (34000.0, 41250.0] 819 (26750.0, 34000.0] 789 (19500.0, 26750.0] 784 (106500.0, 113750.0] 0 (92000.0, 99250.0] 0 (113750.0, 121000.0] 0 (128250.0, 135500.0] 0 (135500.0, 142750.0] 0 (41250.0, 48500.0] 0 (12250.0, 19500.0] 0 (70250.0, 77500.0] 0 Name: odometer, dtype: int64
This test includes NaN in case there are any of such data and also aggregating in bins to identify concentrations. The majority (c.64%) of dataset has odometer
readings above 142,750, i.e. the cars are long into their useful lives. The remainder are spread somewhat evenly between 19,500 - 128,250.
autos.rename({"odometer": "odometer_km"}, axis=1, inplace=True)
autos.isnull().sum()
date_crawled 0 name 0 price 0 abtest 0 vehicle_type 5095 registration_year 0 gearbox 2680 power_ps 0 model 2758 odometer_km 0 registration_month 0 fuel_type 4482 brand 0 unrepaired_damage 9829 ad_created 0 postal_code 0 last_seen 0 dtype: int64
This is a check to claculate the sum of null values in each of the columns.
autos.dtypes
date_crawled object name object price int64 abtest object vehicle_type object registration_year int64 gearbox object power_ps int64 model object odometer_km int64 registration_month int64 fuel_type object brand object unrepaired_damage object ad_created object postal_code int64 last_seen object dtype: object
This is a check for which columns are int
.
price
, odometer_km
data further¶We will check in each of price
and odometer_km
for rows that could be dropped as anomolous data.
autos["odometer_km"].unique().shape
(13,)
autos["price"].unique().shape
(2357,)
autos["odometer_km"].describe()
count 50000.000000 mean 125732.700000 std 40042.211706 min 5000.000000 25% 125000.000000 50% 150000.000000 75% 150000.000000 max 150000.000000 Name: odometer_km, dtype: float64
autos["price"].describe()
count 5.000000e+04 mean 9.840044e+03 std 4.811044e+05 min 0.000000e+00 25% 1.100000e+03 50% 2.950000e+03 75% 7.200000e+03 max 1.000000e+08 Name: price, dtype: float64
std
and max
show that there are significantly large data points.
autos["odometer_km"].value_counts().sort_index(ascending=False).head(20)
150000 32424 125000 5170 100000 2169 90000 1757 80000 1436 70000 1230 60000 1164 50000 1027 40000 819 30000 789 20000 784 10000 264 5000 967 Name: odometer_km, dtype: int64
autos["price"].value_counts().sort_index(ascending=True).head(20)
$0 1391 $1 155 $1,000 630 $1,039 1 $1,040 1 $1,049 6 $1,050 94 $1,059 1 $1,070 1 $1,080 6 $1,090 4 $1,095 2 $1,098 1 $1,099 43 $1,100 373 $1,111 38 $1,112 1 $1,119 1 $1,120 2 $1,149 10 Name: price, dtype: int64
Approximately 2.8% of the dataset have zero price
. These could be deleted from the dataset on the basis that they are unlikely to have been sold. The price
of $
1 will be retained as it is possible for an Ebay site to have sold used cars at $
1.
autos["price"].value_counts().sort_index(ascending=False).head(20)
99999999 1 27322222 1 12345678 3 11111111 2 10000000 1 3890000 1 1300000 1 1234566 1 999999 2 999990 1 350000 1 345000 1 299000 1 295000 1 265000 1 259000 1 250000 1 220000 1 198000 1 197000 1 Name: price, dtype: int64
The top price
data shows anomolies and it's seems unlikely that prices of between approximately 999,990 and 99,000,000 were achieved. These will also be deleted from the dataset.
autos = autos[autos["price"].between(1,351000)]
autos["price"].describe()
count 48565.000000 mean 5888.935591 std 9059.854754 min 1.000000 25% 1200.000000 50% 3000.000000 75% 7490.000000 max 350000.000000 Name: price, dtype: float64
autos["price"].value_counts().sort_index().tail()
265000 1 295000 1 299000 1 345000 1 350000 1 Name: price, dtype: int64
Checking that the price
outliers are not included in the dataset.
autos[['date_crawled','ad_created','last_seen']][0:5]
date_crawled | ad_created | last_seen | |
---|---|---|---|
0 | 2016-03-26 17:47:46 | 2016-03-26 00:00:00 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | 2016-04-04 00:00:00 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | 2016-03-26 00:00:00 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | 2016-03-12 00:00:00 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | 2016-04-01 00:00:00 | 2016-04-01 14:38:50 |
(autos["date_crawled"]
.str[:10]
.value_counts(normalize=True,dropna=False)
.sort_index(ascending=True)
)
2016-03-05 0.025327 2016-03-06 0.014043 2016-03-07 0.036014 2016-03-08 0.033296 2016-03-09 0.033090 2016-03-10 0.032184 2016-03-11 0.032575 2016-03-12 0.036920 2016-03-13 0.015670 2016-03-14 0.036549 2016-03-15 0.034284 2016-03-16 0.029610 2016-03-17 0.031628 2016-03-18 0.012911 2016-03-19 0.034778 2016-03-20 0.037887 2016-03-21 0.037373 2016-03-22 0.032987 2016-03-23 0.032225 2016-03-24 0.029342 2016-03-25 0.031607 2016-03-26 0.032204 2016-03-27 0.031092 2016-03-28 0.034860 2016-03-29 0.034099 2016-03-30 0.033687 2016-03-31 0.031834 2016-04-01 0.033687 2016-04-02 0.035478 2016-04-03 0.038608 2016-04-04 0.036487 2016-04-05 0.013096 2016-04-06 0.003171 2016-04-07 0.001400 Name: date_crawled, dtype: float64
The crawled_date
data is well distributed across all dates. The period covered is approximately a year.
print(autos["ad_created"].str[:10].unique().shape)
(autos["ad_created"]
.str[:10]
.value_counts(normalize=True,dropna=False)
.sort_index(ascending=True)
)
(76,)
2015-06-11 0.000021 2015-08-10 0.000021 2015-09-09 0.000021 2015-11-10 0.000021 2015-12-05 0.000021 ... 2016-04-03 0.038855 2016-04-04 0.036858 2016-04-05 0.011819 2016-04-06 0.003253 2016-04-07 0.001256 Name: ad_created, Length: 76, dtype: float64
The ad_created
data spans a year more than the date_crawled
data and has almost double the number of data points, with very small percentages of data on most dates.
(autos["last_seen"]
.str[:10]
.value_counts(normalize=True,dropna=False)
.sort_index(ascending=True)
)
2016-03-05 0.001071 2016-03-06 0.004324 2016-03-07 0.005395 2016-03-08 0.007413 2016-03-09 0.009595 2016-03-10 0.010666 2016-03-11 0.012375 2016-03-12 0.023783 2016-03-13 0.008895 2016-03-14 0.012602 2016-03-15 0.015876 2016-03-16 0.016452 2016-03-17 0.028086 2016-03-18 0.007351 2016-03-19 0.015834 2016-03-20 0.020653 2016-03-21 0.020632 2016-03-22 0.021373 2016-03-23 0.018532 2016-03-24 0.019767 2016-03-25 0.019211 2016-03-26 0.016802 2016-03-27 0.015649 2016-03-28 0.020859 2016-03-29 0.022341 2016-03-30 0.024771 2016-03-31 0.023783 2016-04-01 0.022794 2016-04-02 0.024915 2016-04-03 0.025203 2016-04-04 0.024483 2016-04-05 0.124761 2016-04-06 0.221806 2016-04-07 0.131947 Name: last_seen, dtype: float64
last_seen
data spans three months with the majority of data concentrated over last three dates. This may be because the crawlers only saw the data at that point. last_seen
data relates to when the cars would have been removed from the site, if sold or otherwise taken off. It is not likely that there was a spike in sales in last three days, rather that crawlers only saw the removed cars at that point.
registration_year
data¶autos["registration_year"].describe()
count 48565.000000 mean 2004.755421 std 88.643887 min 1000.000000 25% 1999.000000 50% 2004.000000 75% 2008.000000 max 9999.000000 Name: registration_year, dtype: float64
The year that the car was first registered will likely indicate the age of the car. Looking at this column, we note some odd values. The minimum value is 1000, long before cars were invented and the maximum is 9999, many years into the future.
autos["registration_year"].value_counts().sort_index(ascending=True).head(20)
1000 1 1001 1 1111 1 1800 2 1910 5 1927 1 1929 1 1931 1 1934 2 1937 4 1938 1 1939 1 1941 2 1943 1 1948 1 1950 3 1951 2 1952 1 1953 1 1954 2 Name: registration_year, dtype: int64
registration_year
prior to 1950 is a possible outlier and would not necessarily affect the dataset given that the number of cars in the dataset that has registration_year
before 1950 amounts to approximately 0.05%.
autos["registration_year"].value_counts().sort_index(ascending=False).head(20)
9999 3 9000 1 8888 1 6200 1 5911 1 5000 4 4800 1 4500 1 4100 1 2800 1 2019 2 2018 470 2017 1392 2016 1220 2015 392 2014 663 2013 803 2012 1310 2011 1623 2010 1589 Name: registration_year, dtype: int64
registration_year
after 2016 is a possible outlier given the ad_created
dates. Cars in the dataset with registration_year
after 2016 amounts to 3.8% of the dataset.
Outliers will be removed from both sides of the spectrum.
autos = autos[autos["registration_year"].between(1949,2017)]
autos["registration_year"].value_counts(normalize=True).sort_index().head(50)
1950 0.000061 1951 0.000040 1952 0.000020 1953 0.000020 1954 0.000040 1955 0.000040 1956 0.000101 1957 0.000040 1958 0.000081 1959 0.000142 1960 0.000687 1961 0.000121 1962 0.000081 1963 0.000182 1964 0.000243 1965 0.000344 1966 0.000445 1967 0.000546 1968 0.000526 1969 0.000384 1970 0.000910 1971 0.000546 1972 0.000708 1973 0.000526 1974 0.000485 1975 0.000384 1976 0.000546 1977 0.000445 1978 0.000950 1979 0.000708 1980 0.001961 1981 0.000627 1982 0.000869 1983 0.001072 1984 0.001072 1985 0.002123 1986 0.001537 1987 0.001516 1988 0.002871 1989 0.003660 1990 0.007987 1991 0.007198 1992 0.007906 1993 0.008998 1994 0.013345 1995 0.026548 1996 0.029197 1997 0.041005 1998 0.049599 1999 0.060659 Name: registration_year, dtype: float64
autos["brand"].value_counts(normalize=True,ascending=False)
volkswagen 0.212661 bmw 0.108838 opel 0.108360 mercedes_benz 0.095894 audi 0.086134 ford 0.069798 renault 0.047677 peugeot 0.029613 fiat 0.025846 seat 0.018625 skoda 0.016170 nissan 0.015316 mazda 0.015254 smart 0.014318 citroen 0.014109 toyota 0.012674 hyundai 0.009989 sonstige_autos 0.009469 volvo 0.009094 mini 0.008699 mitsubishi 0.008158 honda 0.007908 kia 0.007117 alfa_romeo 0.006639 porsche 0.005952 suzuki 0.005869 chevrolet 0.005640 chrysler 0.003517 dacia 0.002664 daihatsu 0.002497 jeep 0.002227 subaru 0.002102 land_rover 0.002060 saab 0.001623 daewoo 0.001561 jaguar 0.001519 trabant 0.001373 rover 0.001353 lancia 0.001124 lada 0.000562 Name: brand, dtype: float64
auto_price = autos["brand"].value_counts(normalize=True)
brand_select = auto_price[auto_price > 0.025].index
price_dict = {}
for a in brand_select:
brand_rows = autos[autos["brand"] == a]
mean_price = brand_rows["price"].mean()
price_dict[a] = int(mean_price)
price_dict
{'volkswagen': 5347, 'bmw': 8283, 'opel': 2953, 'mercedes_benz': 8525, 'audi': 9239, 'ford': 3720, 'renault': 2443, 'peugeot': 3069, 'fiat': 2791}
mileage_dict = {}
for i in brand_select:
brand_rows = autos[autos["brand"] == i]
mileage_rows = brand_rows["odometer_km"].mean()
mileage_dict[i] = int(mileage_rows)
mileage_dict
{'volkswagen': 128930, 'bmw': 132691, 'opel': 129441, 'mercedes_benz': 131004, 'audi': 129406, 'ford': 124360, 'renault': 128149, 'peugeot': 127315, 'fiat': 117367}
bmp_series = pd.Series(price_dict)
print(bmp_series)
volkswagen 5347 bmw 8283 opel 2953 mercedes_benz 8525 audi 9239 ford 3720 renault 2443 peugeot 3069 fiat 2791 dtype: int64
df = pd.DataFrame(bmp_series, columns=['mean_price'])
df
mean_price | |
---|---|
volkswagen | 5347 |
bmw | 8283 |
opel | 2953 |
mercedes_benz | 8525 |
audi | 9239 |
ford | 3720 |
renault | 2443 |
peugeot | 3069 |
fiat | 2791 |
mileage_series = pd.Series(mileage_dict)
df["mean_mileage"] = mileage_series
df
mean_price | mean_mileage | |
---|---|---|
volkswagen | 5347 | 128930 |
bmw | 8283 | 132691 |
opel | 2953 | 129441 |
mercedes_benz | 8525 | 131004 |
audi | 9239 | 129406 |
ford | 3720 | 124360 |
renault | 2443 | 128149 |
peugeot | 3069 | 127315 |
fiat | 2791 | 117367 |
# alternative way to code the mean price
# autos_select = autos[(autos["brand"] == "ford")|(autos["brand"] == "audi")|(autos["brand"] == "mercedes_benz")|(autos["brand"] == "bmw")|(autos["brand"] == "volkswagen")|(autos["brand"] == "opel")]
# autos_select[["brand","price"]].mean()
# select_index = autos_select["brand"].value_counts(normalize=True).index
# # auto_select_index = autos_select[autos_select].index
# auto_dict = {}
# for a in select_index:
# row_selection = autos[autos["brand"] == a]
# mean_price = row_selection["price"].mean()
# auto_dict[a] = int(mean_price)
# auto_dict
The range of car mileages does not vary as much as the prices do, measured by brand, instead all falling within 10% for the top brands. There is a slight trend to the more expensive vehicles having higher mileage, with the less expensive vehicles having lower mileage.
We will do the following additional analysis:
autos.columns
Index(['date_crawled', 'name', 'price', 'abtest', 'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model', 'odometer_km', 'registration_month', 'fuel_type', 'brand', 'unrepaired_damage', 'ad_created', 'postal_code', 'last_seen'], dtype='object')
odometer_km
into groups and aggregate mean prices¶We will use value_counts()
function to identify each of the sub-categories in odometer_km
column and create an index of these data.
We will then create a dictionary of keys of these data points and mean values of corresponding odometer_km
data. This dictionary will be a source of data for a new dataframe.
odo_groups = autos["odometer_km"].value_counts()
odo_index = odo_groups[odo_groups > 0].index
odo_groups
150000 31087 125000 5012 100000 2092 90000 1714 80000 1400 70000 1207 60000 1147 50000 1005 40000 811 5000 802 30000 775 20000 756 10000 245 Name: odometer_km, dtype: int64
odo_index
Int64Index([150000, 125000, 100000, 90000, 80000, 70000, 60000, 50000, 40000, 5000, 30000, 20000, 10000], dtype='int64')
odo_dict = {}
for i in odo_index:
odo_rows = autos[autos["odometer_km"] == i]
price_rows = odo_rows["price"].mean()
odo_dict[i] = int(price_rows)
# sorted(odo_dict)
sorted_odo_dict = {k: v for k, v in sorted(odo_dict.items(), key=lambda item: item[0])}
sorted_odo_dict
{5000: 8689, 10000: 20363, 20000: 18191, 30000: 16447, 40000: 15505, 50000: 13754, 60000: 12251, 70000: 10874, 80000: 9642, 90000: 8395, 100000: 8042, 125000: 6163, 150000: 3734}
odo_series = pd.Series(sorted_odo_dict)
print(odo_series)
print(type(odo_series))
df = pd.DataFrame(odo_series, columns=['mean_price'])
df
5000 8689 10000 20363 20000 18191 30000 16447 40000 15505 50000 13754 60000 12251 70000 10874 80000 9642 90000 8395 100000 8042 125000 6163 150000 3734 dtype: int64 <class 'pandas.core.series.Series'>
mean_price | |
---|---|
5000 | 8689 |
10000 | 20363 |
20000 | 18191 |
30000 | 16447 |
40000 | 15505 |
50000 | 13754 |
60000 | 12251 |
70000 | 10874 |
80000 | 9642 |
90000 | 8395 |
100000 | 8042 |
125000 | 6163 |
150000 | 3734 |
The dataframe has been sorted by keys. These data points show that cars with lowest mileage in the dataset have relatively low mean prices. This could be anomolous and based on a low count of data (1.7%) at the 5,000 odometer_km
level. The rest of the data shows an expected curve of reduction in mean_price
as odometer_km
increases.
autos.groupby('odometer_km')['brand'].value_counts(normalize=True).head(39)
odometer_km brand 5000 volkswagen 0.194514 opel 0.128429 bmw 0.099751 mercedes_benz 0.071072 ford 0.067332 sonstige_autos 0.056110 renault 0.054863 audi 0.046135 fiat 0.036160 porsche 0.018703 seat 0.017456 skoda 0.017456 nissan 0.016209 chevrolet 0.014963 mazda 0.014963 peugeot 0.014963 trabant 0.013716 mini 0.011222 mitsubishi 0.011222 alfa_romeo 0.009975 citroen 0.009975 honda 0.009975 chrysler 0.007481 toyota 0.007481 daihatsu 0.006234 hyundai 0.006234 smart 0.006234 suzuki 0.006234 volvo 0.004988 kia 0.003741 land_rover 0.003741 dacia 0.002494 daewoo 0.002494 lada 0.002494 jaguar 0.001247 jeep 0.001247 lancia 0.001247 subaru 0.001247 10000 volkswagen 0.208163 Name: brand, dtype: float64
A check with groupby brand
in the 5,000 odometer_km
sub-category does not necessarily show anything that could influence the mean_price e.g. a higher frequency for lower priced cars.
autos["brand"].value_counts(normalize=True)
volkswagen 0.212661 bmw 0.108838 opel 0.108360 mercedes_benz 0.095894 audi 0.086134 ford 0.069798 renault 0.047677 peugeot 0.029613 fiat 0.025846 seat 0.018625 skoda 0.016170 nissan 0.015316 mazda 0.015254 smart 0.014318 citroen 0.014109 toyota 0.012674 hyundai 0.009989 sonstige_autos 0.009469 volvo 0.009094 mini 0.008699 mitsubishi 0.008158 honda 0.007908 kia 0.007117 alfa_romeo 0.006639 porsche 0.005952 suzuki 0.005869 chevrolet 0.005640 chrysler 0.003517 dacia 0.002664 daihatsu 0.002497 jeep 0.002227 subaru 0.002102 land_rover 0.002060 saab 0.001623 daewoo 0.001561 jaguar 0.001519 trabant 0.001373 rover 0.001353 lancia 0.001124 lada 0.000562 Name: brand, dtype: float64
Checking the percentages across all brands in the whole dataset shows generally similarly well- distributed as for the 5,000 odometer_km
subcategory.
brand_list = []
for x in autos["brand"]:
if x not in brand_list:
brand_list.append(x)
brand_odo_5000_dict = {}
for a in brand_list:
if a != "saab" and a != "rover":
selected_rows = autos.loc[(autos["brand"] == a) & (autos["odometer_km"] == 5000)]
# odo_mean_price = selected_rows["price"].value_counts()
odo_mean_price = selected_rows["price"].mean()
brand_odo_5000_dict[a] = int(odo_mean_price)
# brand_odo_5000_dict
We have set a condition for the two cars brands that appear as nan in the dictionary and do not permit a conversion to integers.
brand_5000_price_series = pd.Series(brand_odo_5000_dict)
# print(model_series)
brand_5000_price_df = pd.DataFrame(brand_5000_price_series, columns=['5000 odo mean price'])
brand_5000_price_df
5000 odo mean price | |
---|---|
peugeot | 3070 |
bmw | 9696 |
volkswagen | 4374 |
smart | 1175 |
ford | 7524 |
chrysler | 1136 |
seat | 6213 |
renault | 4201 |
mercedes_benz | 10451 |
audi | 14992 |
sonstige_autos | 15372 |
opel | 3883 |
mazda | 5736 |
porsche | 91342 |
mini | 10921 |
toyota | 6631 |
dacia | 10949 |
nissan | 3679 |
jeep | 45500 |
volvo | 2148 |
mitsubishi | 3390 |
jaguar | 53000 |
fiat | 4850 |
skoda | 7924 |
subaru | 31800 |
kia | 1533 |
citroen | 6573 |
chevrolet | 6301 |
hyundai | 7818 |
honda | 6447 |
daewoo | 995 |
suzuki | 5062 |
trabant | 1277 |
land_rover | 41163 |
alfa_romeo | 8562 |
lada | 3750 |
daihatsu | 1160 |
lancia | 3000 |
brand_odo_10000_dict = {}
for a in brand_list:
if a != "chrysler" and a != "jeep" and a != "saab" and a != "rover" and a != "volvo" and a != "mitsubishi" and a != "chevrolet" and a != "honda" and a != "daewoo" and a != "suzuki" and a != "land_rover" and a != "lada" and a != "daihatsu" and a != "lancia":
selected_rows = autos.loc[(autos["brand"] == a) & (autos["odometer_km"] == 10000)]
# odo_mean_price = selected_rows["price"].value_counts()
odo_mean_price = selected_rows["price"].mean()
odo_mean_price = int(odo_mean_price)
brand_odo_10000_dict[a] = odo_mean_price
brand_odo_10000_dict
{'peugeot': 11922, 'bmw': 31102, 'volkswagen': 19049, 'smart': 13800, 'ford': 14971, 'seat': 19950, 'renault': 7925, 'mercedes_benz': 29478, 'audi': 28339, 'sonstige_autos': 12546, 'opel': 12485, 'mazda': 18956, 'porsche': 114676, 'mini': 25045, 'toyota': 15958, 'dacia': 5450, 'nissan': 3058, 'jaguar': 70, 'fiat': 9718, 'skoda': 12535, 'subaru': 2600, 'kia': 8538, 'citroen': 18055, 'hyundai': 12749, 'trabant': 2847, 'alfa_romeo': 20125}
The list of cars brands with nan values to be excluded from the dictionary is much longer than for the 5000 odometer_km
sub-category.
brand_10000_price_series = pd.Series(brand_odo_10000_dict)
print(brand_10000_price_series)
brand_10000_price_df = pd.DataFrame(brand_10000_price_series, columns=['10000 odo mean price'])
brand_10000_price_df
peugeot 11922 bmw 31102 volkswagen 19049 smart 13800 ford 14971 seat 19950 renault 7925 mercedes_benz 29478 audi 28339 sonstige_autos 12546 opel 12485 mazda 18956 porsche 114676 mini 25045 toyota 15958 dacia 5450 nissan 3058 jaguar 70 fiat 9718 skoda 12535 subaru 2600 kia 8538 citroen 18055 hyundai 12749 trabant 2847 alfa_romeo 20125 dtype: int64
10000 odo mean price | |
---|---|
peugeot | 11922 |
bmw | 31102 |
volkswagen | 19049 |
smart | 13800 |
ford | 14971 |
seat | 19950 |
renault | 7925 |
mercedes_benz | 29478 |
audi | 28339 |
sonstige_autos | 12546 |
opel | 12485 |
mazda | 18956 |
porsche | 114676 |
mini | 25045 |
toyota | 15958 |
dacia | 5450 |
nissan | 3058 |
jaguar | 70 |
fiat | 9718 |
skoda | 12535 |
subaru | 2600 |
kia | 8538 |
citroen | 18055 |
hyundai | 12749 |
trabant | 2847 |
alfa_romeo | 20125 |
brand_5000_price_df["10000 odo mean price"] = brand_10000_price_series
brand_5000_price_df_sorted = brand_5000_price_df.sort_values(by="10000 odo mean price", ascending=False)
brand_5000_price_df_sorted
# brand_5000_price_df
5000 odo mean price | 10000 odo mean price | |
---|---|---|
porsche | 91342 | 114676.0 |
bmw | 9696 | 31102.0 |
mercedes_benz | 10451 | 29478.0 |
audi | 14992 | 28339.0 |
mini | 10921 | 25045.0 |
alfa_romeo | 8562 | 20125.0 |
seat | 6213 | 19950.0 |
volkswagen | 4374 | 19049.0 |
mazda | 5736 | 18956.0 |
citroen | 6573 | 18055.0 |
toyota | 6631 | 15958.0 |
ford | 7524 | 14971.0 |
smart | 1175 | 13800.0 |
hyundai | 7818 | 12749.0 |
sonstige_autos | 15372 | 12546.0 |
skoda | 7924 | 12535.0 |
opel | 3883 | 12485.0 |
peugeot | 3070 | 11922.0 |
fiat | 4850 | 9718.0 |
kia | 1533 | 8538.0 |
renault | 4201 | 7925.0 |
dacia | 10949 | 5450.0 |
nissan | 3679 | 3058.0 |
trabant | 1277 | 2847.0 |
subaru | 31800 | 2600.0 |
jaguar | 53000 | 70.0 |
chrysler | 1136 | NaN |
jeep | 45500 | NaN |
volvo | 2148 | NaN |
mitsubishi | 3390 | NaN |
chevrolet | 6301 | NaN |
honda | 6447 | NaN |
daewoo | 995 | NaN |
suzuki | 5062 | NaN |
land_rover | 41163 | NaN |
lada | 3750 | NaN |
daihatsu | 1160 | NaN |
lancia | 3000 | NaN |
There is no obvious reason why the mean prices for the odo 5000 category are mostly lower than the more used odo 10000 category. Only a few examples (Dacia, Subaru and Jaguar) are at a higher mean price in the odo 5000 category. In case of Jaguar the mean price for the odo 10000 category is extremely low and does not make sense unless the Jaguar cars that fall into that category are damaged. These results would be useful to explore further to explain the anomolies.
# a = autos.groupby("brand").mean().round(0).astype(int)
# a = a.sort_values("odometer_km")
# a = a.sort_values("odometer_km")
# a["odometer_km"]
# b = pd.DataFrame(a, columns=['mean_price'])
# b
We will find the most common brand/combination and sorted in reverse order according to model volume amounts.
brand_list = []
for x in autos["brand"]:
if x not in brand_list:
brand_list.append(x)
# brand_list
brand_model_dict = {}
for a in brand_list:
selected_rows = autos[autos["brand"] == a]
model_select = selected_rows["model"].value_counts(dropna=False)
model_select = str(model_select)
model_select_split = model_select.split()
model_name = model_select_split[0]
brand_model_dict[a] = model_name
model_series = pd.Series(brand_model_dict)
# print(model_series)
brand_model_df = pd.DataFrame(model_series, columns=['model_type'])
# brand_model_df
brand_model_dict_volumes = {}
for a in brand_list:
selected_rows = autos[autos["brand"] == a]
model_select = selected_rows["model"].value_counts(dropna=False)
model_select = str(model_select)
model_select_split = model_select.split()
model_volumes = model_select_split[1]
model_volumes = int(model_volumes)
brand_model_dict_volumes[a] = model_volumes
model_series_vol = pd.Series(brand_model_dict_volumes)
# print(model_series_vol)
brand_model_vol_df = pd.DataFrame(model_series_vol, columns=['model_volumes'])
# brand_model_vol_df
brand_model_df["model_volumes"] = model_series_vol
brand_model_df_vol_sorted = brand_model_df.sort_values("model_volumes")
# brand_model_df_vol_sorted
brand_model_dict_vol_pct = {}
for a in brand_list:
selected_rows = autos[autos["brand"] == a]
model_select = selected_rows["model"].value_counts(normalize=True,dropna=False)
model_select = str(model_select)
# pcts = selected_rows.value_counts('model', normalize=True).head(1)
model_select_split = model_select.split()
model_vol_pct = model_select_split[1]
model_vol_pct = int(float(model_vol_pct)*100)
# model_vol_pct = round(float(model_vol_pct), 2)
brand_model_dict_vol_pct[a] = model_vol_pct
model_series_vol_pct = pd.Series(brand_model_dict_vol_pct)
# print(model_series_vol_pct)
brand_model_vol_pct_df = pd.DataFrame(model_series_vol_pct, columns=['model_vol_%'])
# brand_model_vol_pct_df
brand_model_df["model_vol_%"] = model_series_vol_pct
brand_model_df_vol_pct_sorted = brand_model_df.sort_values(by="model_volumes", ascending=False)
brand_model_df_vol_pct_sorted
model_type | model_volumes | model_vol_% | |
---|---|---|---|
volkswagen | golf | 4024 | 37 |
bmw | 3er | 2761 | 50 |
opel | corsa | 1735 | 31 |
audi | a4 | 1291 | 30 |
mercedes_benz | c_klasse | 1172 | 24 |
ford | focus | 811 | 23 |
renault | twingo | 677 | 28 |
peugeot | 2_reihe | 621 | 42 |
smart | fortwo | 574 | 81 |
sonstige_autos | NaN | 546 | 100 |
fiat | punto | 458 | 35 |
seat | ibiza | 356 | 37 |
skoda | octavia | 309 | 39 |
mini | cooper | 279 | 65 |
nissan | micra | 257 | 34 |
citroen | andere | 238 | 33 |
honda | civic | 186 | 46 |
mazda | 3_reihe | 182 | 24 |
hyundai | i_reihe | 179 | 36 |
chevrolet | andere | 154 | 54 |
porsche | 911 | 143 | 48 |
toyota | yaris | 129 | 20 |
suzuki | andere | 128 | 43 |
mitsubishi | colt | 120 | 29 |
volvo | v70 | 97 | 21 |
alfa_romeo | 156 | 93 | 28 |
kia | andere | 69 | 19 |
daihatsu | cuore | 69 | 53 |
rover | andere | 59 | 85 |
trabant | 601 | 56 | 71 |
saab | andere | 55 | 68 |
chrysler | andere | 52 | 28 |
dacia | sandero | 47 | 36 |
jeep | grand | 43 | 39 |
jaguar | andere | 39 | 50 |
subaru | legacy | 32 | 29 |
land_rover | freelander | 32 | 32 |
daewoo | matiz | 25 | 31 |
lancia | ypsilon | 25 | 43 |
lada | niva | 15 | 48 |
Most popular car model by volume is the Volkswagen Golf that is also a reasonable percentage of all Volkswagen models in this dataset, confirming what we already are aware of that Volkswagen make up more than 20% of the entire dataset.
autos.columns
Index(['date_crawled', 'name', 'price', 'abtest', 'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model', 'odometer', 'registration_month', 'fuel_type', 'brand', 'unrepaired_damage', 'ad_created', 'postal_code', 'last_seen'], dtype='object')
brand_damage_dict = {}
for a in brand_list:
selected_rows = autos.loc[(autos["brand"] == a) & ((autos["unrepaired_damage"] == "ja") | (autos["unrepaired_damage"] == "yes") )]
damage_mean_price = selected_rows["price"].mean()
brand_damage_dict[a] = int(damage_mean_price)
# brand_damage_dict
brand_damage_price_series = pd.Series(brand_damage_dict)
# print(brand_damage_price_series)
brand_damage_price_df = pd.DataFrame(brand_damage_price_series, columns=['mean_price_damaged'])
brand_damage_dict_no_repair = {}
for a in brand_list:
selected_rows = autos.loc[(autos["brand"] == a) & ((autos["unrepaired_damage"] == "nein") | (autos["unrepaired_damage"] == "no") )]
damage_mean_price = selected_rows["price"].mean()
brand_damage_dict_no_repair[a] = int(damage_mean_price)
brand_no_damage_price_series = pd.Series(brand_damage_dict_no_repair)
print(brand_no_damage_price_series)
brand_no_damage_price_df = pd.DataFrame(brand_no_damage_price_series, columns=['mean_price_not_damaged'])
brand_no_damage_price_df
peugeot 3602 bmw 9266 volkswagen 6307 smart 3875 ford 9925 chrysler 3902 seat 5045 renault 2999 mercedes_benz 9635 audi 10690 sonstige_autos 62159 opel 7013 mazda 5012 porsche 50174 mini 11032 toyota 5528 dacia 6194 nissan 5621 jeep 12537 saab 3830 volvo 42309 mitsubishi 4089 jaguar 13618 fiat 3406 skoda 7045 subaru 5264 kia 6780 citroen 4202 chevrolet 7258 hyundai 6214 honda 4756 daewoo 1177 suzuki 4724 trabant 2698 land_rover 21438 alfa_romeo 4709 lada 3099 rover 1848 daihatsu 1723 lancia 3042 dtype: int64
mean_price_not_damaged | |
---|---|
peugeot | 3602 |
bmw | 9266 |
volkswagen | 6307 |
smart | 3875 |
ford | 9925 |
chrysler | 3902 |
seat | 5045 |
renault | 2999 |
mercedes_benz | 9635 |
audi | 10690 |
sonstige_autos | 62159 |
opel | 7013 |
mazda | 5012 |
porsche | 50174 |
mini | 11032 |
toyota | 5528 |
dacia | 6194 |
nissan | 5621 |
jeep | 12537 |
saab | 3830 |
volvo | 42309 |
mitsubishi | 4089 |
jaguar | 13618 |
fiat | 3406 |
skoda | 7045 |
subaru | 5264 |
kia | 6780 |
citroen | 4202 |
chevrolet | 7258 |
hyundai | 6214 |
honda | 4756 |
daewoo | 1177 |
suzuki | 4724 |
trabant | 2698 |
land_rover | 21438 |
alfa_romeo | 4709 |
lada | 3099 |
rover | 1848 |
daihatsu | 1723 |
lancia | 3042 |
brand_damage_price_df["mean_price_no_damage"] = brand_no_damage_price_series
brand_damage_price_df_sorted = brand_damage_price_df.sort_values(by="mean_price_no_damage", ascending=False)
brand_damage_price_df_sorted
mean_price_damaged | mean_price_no_damage | |
---|---|---|
sonstige_autos | 6187 | 62159 |
porsche | 13454 | 50174 |
volvo | 1699 | 42309 |
land_rover | 5223 | 21438 |
jaguar | 3999 | 13618 |
jeep | 2549 | 12537 |
mini | 4645 | 11032 |
audi | 3131 | 10690 |
ford | 1303 | 9925 |
mercedes_benz | 3801 | 9635 |
bmw | 3287 | 9266 |
chevrolet | 3133 | 7258 |
skoda | 3477 | 7045 |
opel | 1291 | 7013 |
kia | 1939 | 6780 |
volkswagen | 3081 | 6307 |
hyundai | 2375 | 6214 |
dacia | 4459 | 6194 |
nissan | 1901 | 5621 |
toyota | 3388 | 5528 |
subaru | 2040 | 5264 |
seat | 1587 | 5045 |
mazda | 1311 | 5012 |
honda | 1495 | 4756 |
suzuki | 1326 | 4724 |
alfa_romeo | 1703 | 4709 |
citroen | 1914 | 4202 |
mitsubishi | 1265 | 4089 |
chrysler | 1838 | 3902 |
smart | 1410 | 3875 |
saab | 594 | 3830 |
peugeot | 1319 | 3602 |
fiat | 1105 | 3406 |
lada | 1400 | 3099 |
lancia | 2436 | 3042 |
renault | 1063 | 2999 |
trabant | 496 | 2698 |
rover | 710 | 1848 |
daihatsu | 659 | 1723 |
daewoo | 516 | 1177 |
brand_damage_price_df["pct_diff"] = ((brand_damage_price_df["mean_price_damaged"] / brand_damage_price_df["mean_price_no_damage"])*100).astype(int)
brand_damage_price_final_df_sorted = brand_damage_price_df.sort_values(by="pct_diff", ascending=False)
brand_damage_price_final_df_sorted
mean_price_damaged | mean_price_no_damage | pct_diff | |
---|---|---|---|
lancia | 2436 | 3042 | 80 |
dacia | 4459 | 6194 | 71 |
toyota | 3388 | 5528 | 61 |
skoda | 3477 | 7045 | 49 |
volkswagen | 3081 | 6307 | 48 |
chrysler | 1838 | 3902 | 47 |
lada | 1400 | 3099 | 45 |
citroen | 1914 | 4202 | 45 |
daewoo | 516 | 1177 | 43 |
chevrolet | 3133 | 7258 | 43 |
mini | 4645 | 11032 | 42 |
mercedes_benz | 3801 | 9635 | 39 |
daihatsu | 659 | 1723 | 38 |
rover | 710 | 1848 | 38 |
hyundai | 2375 | 6214 | 38 |
subaru | 2040 | 5264 | 38 |
alfa_romeo | 1703 | 4709 | 36 |
peugeot | 1319 | 3602 | 36 |
smart | 1410 | 3875 | 36 |
bmw | 3287 | 9266 | 35 |
renault | 1063 | 2999 | 35 |
nissan | 1901 | 5621 | 33 |
fiat | 1105 | 3406 | 32 |
seat | 1587 | 5045 | 31 |
honda | 1495 | 4756 | 31 |
mitsubishi | 1265 | 4089 | 30 |
jaguar | 3999 | 13618 | 29 |
audi | 3131 | 10690 | 29 |
kia | 1939 | 6780 | 28 |
suzuki | 1326 | 4724 | 28 |
porsche | 13454 | 50174 | 26 |
mazda | 1311 | 5012 | 26 |
land_rover | 5223 | 21438 | 24 |
jeep | 2549 | 12537 | 20 |
opel | 1291 | 7013 | 18 |
trabant | 496 | 2698 | 18 |
saab | 594 | 3830 | 15 |
ford | 1303 | 9925 | 13 |
sonstige_autos | 6187 | 62159 | 9 |
volvo | 1699 | 42309 | 4 |
The vast majority of cars brands are less expensive if damaged. Lancia is the outlier. However the extent of price difference for damaged versus non-damaged varies quite substantially. These aspects may be because of the age of the cars in each cohort (damaged and non-damaged), as well as variations in price for each model of the cars. It could be useful to explore these anomolies to gain a more granular insight.