Ebay Kleinanzeigen (Germany) Used Cars Dataset¶

In this project we will work with a dataset of used cars from eBay Kleinanzeigen, a classifieds section of the German eBay website. The data has been modified since first upload.

The aim of this project is to clean the data and analyze the included used car listings.

The data dictionary provided with data is as follows:

dateCrawled - When this ad was first crawled. All field-values are taken from this date.
name - Name of the car.
seller - Whether the seller is private or a dealer.
offerType - The type of listing
price - The price on the ad to sell the car.
abtest - Whether the listing is included in an A/B test.
vehicleType - The vehicle Type.
yearOfRegistration - The year in which which year the car was first registered.
gearbox - The transmission type.
powerPS - The power of the car in PS.
model - The car model name.
kilometer - How many kilometers the car has driven.
monthOfRegistration - The month in which which year the car was first registered.
fuelType - What type of fuel the car uses.
brand - The brand of the car.
notRepairedDamage - If the car has a damage which is not yet repaired.
dateCreated - The date on which the eBay listing was created.
nrOfPictures - The number of pictures in the ad.
postalCode - The postal code for the location of the vehicle.
lastSeenOnline - When the crawler saw this ad last online.

In [243]:

import pandas as pd
import numpy as np

autos = pd.read_csv("autos.csv", encoding="Latin-1")

In [244]:

autos

Out[244]:

	dateCrawled	name	seller	offerType	price	abtest	vehicleType	yearOfRegistration	gearbox	powerPS	model	odometer	monthOfRegistration	fuelType	brand	notRepairedDamage	dateCreated	nrOfPictures	postalCode	lastSeen
0	2016-03-26 17:47:46	Peugeot_807_160_NAVTECH_ON_BOARD	privat	Angebot	$5,000	control	bus	2004	manuell	158	andere	150,000km	3	lpg	peugeot	nein	2016-03-26 00:00:00	0	79588	2016-04-06 06:45:54
1	2016-04-04 13:38:56	BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik	privat	Angebot	$8,500	control	limousine	1997	automatik	286	7er	150,000km	6	benzin	bmw	nein	2016-04-04 00:00:00	0	71034	2016-04-06 14:45:08
2	2016-03-26 18:57:24	Volkswagen_Golf_1.6_United	privat	Angebot	$8,990	test	limousine	2009	manuell	102	golf	70,000km	7	benzin	volkswagen	nein	2016-03-26 00:00:00	0	35394	2016-04-06 20:15:37
3	2016-03-12 16:58:10	Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...	privat	Angebot	$4,350	control	kleinwagen	2007	automatik	71	fortwo	70,000km	6	benzin	smart	nein	2016-03-12 00:00:00	0	33729	2016-03-15 03:16:28
4	2016-04-01 14:38:50	Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...	privat	Angebot	$1,350	test	kombi	2003	manuell	0	focus	150,000km	7	benzin	ford	nein	2016-04-01 00:00:00	0	39218	2016-04-01 14:38:50
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
49995	2016-03-27 14:38:19	Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon	privat	Angebot	$24,900	control	limousine	2011	automatik	239	q5	100,000km	1	diesel	audi	nein	2016-03-27 00:00:00	0	82131	2016-04-01 13:47:40
49996	2016-03-28 10:50:25	Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+...	privat	Angebot	$1,980	control	cabrio	1996	manuell	75	astra	150,000km	5	benzin	opel	nein	2016-03-28 00:00:00	0	44807	2016-04-02 14:18:02
49997	2016-04-02 14:44:48	Fiat_500_C_1.2_Dualogic_Lounge	privat	Angebot	$13,200	test	cabrio	2014	automatik	69	500	5,000km	11	benzin	fiat	nein	2016-04-02 00:00:00	0	73430	2016-04-04 11:47:27
49998	2016-03-08 19:25:42	Audi_A3_2.0_TDI_Sportback_Ambition	privat	Angebot	$22,900	control	kombi	2013	manuell	150	a3	40,000km	11	diesel	audi	nein	2016-03-08 00:00:00	0	35683	2016-04-05 16:45:07
49999	2016-03-14 00:42:12	Opel_Vectra_1.6_16V	privat	Angebot	$1,250	control	limousine	1996	manuell	101	vectra	150,000km	1	benzin	opel	nein	2016-03-13 00:00:00	0	45897	2016-04-06 21:18:48

50000 rows × 20 columns

In [245]:

autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offerType            50000 non-null  object
 4   price                50000 non-null  object
 5   abtest               50000 non-null  object
 6   vehicleType          44905 non-null  object
 7   yearOfRegistration   50000 non-null  int64 
 8   gearbox              47320 non-null  object
 9   powerPS              50000 non-null  int64 
 10  model                47242 non-null  object
 11  odometer             50000 non-null  object
 12  monthOfRegistration  50000 non-null  int64 
 13  fuelType             45518 non-null  object
 14  brand                50000 non-null  object
 15  notRepairedDamage    40171 non-null  object
 16  dateCreated          50000 non-null  object
 17  nrOfPictures         50000 non-null  int64 
 18  postalCode           50000 non-null  int64 
 19  lastSeen             50000 non-null  object
dtypes: int64(5), object(15)
memory usage: 7.6+ MB

In [246]:

autos.head()

Out[246]:

	dateCrawled	name	seller	offerType	price	abtest	vehicleType	yearOfRegistration	gearbox	powerPS	model	odometer	monthOfRegistration	fuelType	brand	notRepairedDamage	dateCreated	postalCode	lastSeen
0	2016-03-26 17:47:46	Peugeot_807_160_NAVTECH_ON_BOARD	privat	Angebot	$5,000	control	bus	2004	manuell	158	andere	150,000km	3	lpg	peugeot	nein	2016-03-26 00:00:00	79588	2016-04-06 06:45:54
1	2016-04-04 13:38:56	BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik	privat	Angebot	$8,500	control	limousine	1997	automatik	286	7er	150,000km	6	benzin	bmw	nein	2016-04-04 00:00:00	71034	2016-04-06 14:45:08
2	2016-03-26 18:57:24	Volkswagen_Golf_1.6_United	privat	Angebot	$8,990	test	limousine	2009	manuell	102	golf	70,000km	7	benzin	volkswagen	nein	2016-03-26 00:00:00	35394	2016-04-06 20:15:37
3	2016-03-12 16:58:10	Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...	privat	Angebot	$4,350	control	kleinwagen	2007	automatik	71	fortwo	70,000km	6	benzin	smart	nein	2016-03-12 00:00:00	33729	2016-03-15 03:16:28
4	2016-04-01 14:38:50	Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...	privat	Angebot	$1,350	test	kombi	2003	manuell	0	focus	150,000km	7	benzin	ford	nein	2016-04-01 00:00:00	39218	2016-04-01 14:38:50

Our dataset contains 20 columns, most of which are stored as strings. There are some columns with null values (including gearbox, model, fuel type and not repaired damage), but no columns have more than ~20% null values. There are some columns that contain dates stored as strings including date crawled, date created and last seen.

We'll start by cleaning the column names to make the data easier to work with.

Cleaning data¶

In [247]:

autos.columns

Out[247]:

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')

Firstly we will change all camelcase to snakecase.

Secondly we will change some names.

In [248]:

new_columns = []
for a in autos.columns:
    import re
    a = a.replace("yearOfRegistration","registration_year")
    a = a.replace("monthOfRegistration","registration_month")
    a = a.replace("notRepairedDamage","unrepaired_damage")
    a = a.replace("dateCreated","ad_created")
    if a not in ["ad_created","unrepaired_damage","registration_month","registration_year"]:
        a = re.sub('(.)([A-Z][a-z]+)', r'\1_\2', a)
        a = re.sub('([a-z0-9])([A-Z])', r'\1_\2', a).lower()
    new_columns.append(a)
autos.columns = new_columns

In [249]:

autos.head()
# print(new_columns)

Out[249]:

	date_crawled	name	seller	offer_type	price	abtest	vehicle_type	registration_year	gearbox	power_ps	model	odometer	registration_month	fuel_type	brand	unrepaired_damage	ad_created	postal_code	last_seen
0	2016-03-26 17:47:46	Peugeot_807_160_NAVTECH_ON_BOARD	privat	Angebot	$5,000	control	bus	2004	manuell	158	andere	150,000km	3	lpg	peugeot	nein	2016-03-26 00:00:00	79588	2016-04-06 06:45:54
1	2016-04-04 13:38:56	BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik	privat	Angebot	$8,500	control	limousine	1997	automatik	286	7er	150,000km	6	benzin	bmw	nein	2016-04-04 00:00:00	71034	2016-04-06 14:45:08
2	2016-03-26 18:57:24	Volkswagen_Golf_1.6_United	privat	Angebot	$8,990	test	limousine	2009	manuell	102	golf	70,000km	7	benzin	volkswagen	nein	2016-03-26 00:00:00	35394	2016-04-06 20:15:37
3	2016-03-12 16:58:10	Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...	privat	Angebot	$4,350	control	kleinwagen	2007	automatik	71	fortwo	70,000km	6	benzin	smart	nein	2016-03-12 00:00:00	33729	2016-03-15 03:16:28
4	2016-04-01 14:38:50	Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...	privat	Angebot	$1,350	test	kombi	2003	manuell	0	focus	150,000km	7	benzin	ford	nein	2016-04-01 00:00:00	39218	2016-04-01 14:38:50

In [250]:

autos.columns

Out[250]:

Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code',
       'last_seen'],
      dtype='object')

Translation of German words¶

Various columns have specific German words. We will map translations of these German words by their English meanings.

In [251]:

autos.columns

Out[251]:

Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code',
       'last_seen'],
      dtype='object')

In [252]:

autos["gearbox"].unique()

Out[252]:

array(['manuell', 'automatik', nan], dtype=object)

In [253]:

mapping_dict = {
    'manuell': 'manual',
    'automatik': 'automatic',
    
}
autos["gearbox"] = autos["gearbox"].map(mapping_dict)
autos["gearbox"].unique()

Out[253]:

array(['manual', 'automatic', nan], dtype=object)

In [254]:

autos["fuel_type"].unique()

Out[254]:

array(['lpg', 'benzin', 'diesel', nan, 'cng', 'hybrid', 'elektro',
       'andere'], dtype=object)

In [255]:

mapping_dict = {
    'benzin': 'petrol',
    'elektro': 'electric',
    'andere' : 'others'
    
}
autos["fuel_type"] = autos["fuel_type"].map(mapping_dict)
autos["fuel_type"].unique()

Out[255]:

array([nan, 'petrol', 'electric', 'others'], dtype=object)

In [256]:

autos["unrepaired_damage"].unique()

Out[256]:

array(['nein', nan, 'ja'], dtype=object)

In [257]:

mapping_dict = {
    'nein': 'no',
    'ja': 'yes'
    
}
autos["unrepaired_damage"] = autos["unrepaired_damage"].map(mapping_dict)
autos["unrepaired_damage"].unique()

Out[257]:

array(['no', nan, 'yes'], dtype=object)

Basic data exploration and further cleaning¶

We will look for text columns where the values are the same, in case these can be dropped as not useful for analysis.

We will also look for examples of numeric data stored as text to be cleaned and converted.

In [258]:

autos.describe(include="all")

Out[258]:

	date_crawled	name	seller	offer_type	price	abtest	vehicle_type	registration_year	gearbox	power_ps	model	odometer	registration_month	fuel_type	brand	unrepaired_damage	ad_created	nr_of_pictures	postal_code	last_seen
count	50000	50000	50000	50000	50000	50000	44905	50000.000000	47320	50000.000000	47242	50000	50000.000000	30148	50000	40171	50000	50000.0	50000.000000	50000
unique	48213	38754	2	2	2357	2	8	NaN	2	NaN	245	13	NaN	3	40	2	76	NaN	NaN	39481
top	2016-03-19 17:36:18	Ford_Fiesta	privat	Angebot	$0	test	limousine	NaN	manual	NaN	golf	150,000km	NaN	petrol	volkswagen	no	2016-04-03 00:00:00	NaN	NaN	2016-04-07 06:17:27
freq	3	78	49999	49999	1421	25756	12859	NaN	36993	NaN	4024	32424	NaN	30107	10687	35232	1946	NaN	NaN	8
mean	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2005.073280	NaN	116.355920	NaN	NaN	5.723360	NaN	NaN	NaN	NaN	0.0	50813.627300	NaN
std	NaN	NaN	NaN	NaN	NaN	NaN	NaN	105.712813	NaN	209.216627	NaN	NaN	3.711984	NaN	NaN	NaN	NaN	0.0	25779.747957	NaN
min	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1000.000000	NaN	0.000000	NaN	NaN	0.000000	NaN	NaN	NaN	NaN	0.0	1067.000000	NaN
25%	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1999.000000	NaN	70.000000	NaN	NaN	3.000000	NaN	NaN	NaN	NaN	0.0	30451.000000	NaN
50%	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2003.000000	NaN	105.000000	NaN	NaN	6.000000	NaN	NaN	NaN	NaN	0.0	49577.000000	NaN
75%	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2008.000000	NaN	150.000000	NaN	NaN	9.000000	NaN	NaN	NaN	NaN	0.0	71540.000000	NaN
max	NaN	NaN	NaN	NaN	NaN	NaN	NaN	9999.000000	NaN	17700.000000	NaN	NaN	12.000000	NaN	NaN	NaN	NaN	0.0	99998.000000	NaN

Some columns demonstrate issues:

Columns seller, gearbox, unrepaired_damage and offer_type only each have 2 unique data points
Some columns have NaN unique data
nr_of_pictures seems to have NaN and 0 only

These will be investigated further by running value_counts() and head() functions.

In [259]:

autos["nr_of_pictures"].value_counts()

Out[259]:

0    50000
Name: nr_of_pictures, dtype: int64

In [260]:

autos["seller"].value_counts()

Out[260]:

privat        49999
gewerblich        1
Name: seller, dtype: int64

In [261]:

autos["offer_type"].value_counts()

Out[261]:

Angebot    49999
Gesuch         1
Name: offer_type, dtype: int64

Dropping columns¶

nr_of_pictures, seller and offer_type have no values of relevance and will be dropped.

In [262]:

autos = autos.drop(["nr_of_pictures", "seller","offer_type"], axis=1)
autos.columns

Out[262]:

Index(['date_crawled', 'name', 'price', 'abtest', 'vehicle_type',
       'registration_year', 'gearbox', 'power_ps', 'model', 'odometer',
       'registration_month', 'fuel_type', 'brand', 'unrepaired_damage',
       'ad_created', 'postal_code', 'last_seen'],
      dtype='object')

In [263]:

autos["unrepaired_damage"].value_counts()

Out[263]:

no     35232
yes     4939
Name: unrepaired_damage, dtype: int64

In [264]:

autos["gearbox"].value_counts()

Out[264]:

manual       36993
automatic    10327
Name: gearbox, dtype: int64

unrepaired_damage and gearbox will be retained even though they each have only two unique items in case these items are relevant to analysis.

Convert `price`, `odometer` to `int` type after deleting non-int characters¶

price andodometercolumns data will be amended to delete string characters.

After that the data type of each of these columns will be changed to integer. In addition odometer column name will be changed to odometer_km.

For price we will also check that there are no integers below 0.

In [265]:

autos["price"].value_counts(dropna=False)

Out[265]:

$0        1421
$500       781
$1,500     734
$2,500     643
$1,000     639
          ... 
$1,933       1
$4,840       1
$5,997       1
$4,510       1
$3,780       1
Name: price, Length: 2357, dtype: int64

In [266]:

autos.dtypes

Out[266]:

date_crawled          object
name                  object
price                 object
abtest                object
vehicle_type          object
registration_year      int64
gearbox               object
power_ps               int64
model                 object
odometer              object
registration_month     int64
fuel_type             object
brand                 object
unrepaired_damage     object
ad_created            object
postal_code            int64
last_seen             object
dtype: object

In [267]:

autos["price"].head()

Out[267]:

0    $5,000
1    $8,500
2    $8,990
3    $4,350
4    $1,350
Name: price, dtype: object

In [268]:

autos["odometer"].head()

Out[268]:

0    150,000km
1    150,000km
2     70,000km
3     70,000km
4    150,000km
Name: odometer, dtype: object

In [269]:

new_col = []
for a in autos["price"]:
    a = a.replace("$","")
    a = a.replace(",","")
    new_col.append(a)

# autos["price"] = autos["price"].as_numeric() # for some reason this fucntion does not work 
# to be further investigated

In [271]:

autos["price"] = new_col
autos["price"] = autos["price"].astype(int)
autos["price"] = autos["price"]

In [273]:

autos["price"].head()

Out[273]:

0    5000
1    8500
2    8990
3    4350
4    1350
Name: price, dtype: int64

In [274]:

autos["price"].value_counts()

Out[274]:

0        1421
500       781
1500      734
2500      643
1000      639
         ... 
20790       1
8970        1
846         1
2895        1
33980       1
Name: price, Length: 2357, dtype: int64

In [275]:

autos_below_zero = autos.loc[autos["price"] < 0]
print(autos_below_zero)

Empty DataFrame
Columns: [date_crawled, name, price, abtest, vehicle_type, registration_year, gearbox, power_ps, model, odometer, registration_month, fuel_type, brand, unrepaired_damage, ad_created, postal_code, last_seen]
Index: []

No price data that is below zero.

In [24]:

new_col_2 = []
for a in autos["odometer"]:
    a = a.replace("km","")
    a = a.replace(",","")
    new_col_2.append(a)
autos["odometer"] = new_col_2
autos["odometer"] = autos["odometer"].astype(int)

In [25]:

autos["odometer"].head(100)

Out[25]:

0     150000
1     150000
2      70000
3      70000
4     150000
       ...  
95     40000
96    150000
97    125000
98    125000
99     60000
Name: odometer, Length: 100, dtype: int64

In [26]:

autos["odometer"].value_counts(dropna=False, bins=20)

Out[26]:

(142750.0, 150000.0]    32424
(121000.0, 128250.0]     5170
(99250.0, 106500.0]      2169
(84750.0, 92000.0]       1757
(77500.0, 84750.0]       1436
(4854.999, 12250.0]      1231
(63000.0, 70250.0]       1230
(55750.0, 63000.0]       1164
(48500.0, 55750.0]       1027
(34000.0, 41250.0]        819
(26750.0, 34000.0]        789
(19500.0, 26750.0]        784
(106500.0, 113750.0]        0
(92000.0, 99250.0]          0
(113750.0, 121000.0]        0
(128250.0, 135500.0]        0
(135500.0, 142750.0]        0
(41250.0, 48500.0]          0
(12250.0, 19500.0]          0
(70250.0, 77500.0]          0
Name: odometer, dtype: int64

This test includes NaN in case there are any of such data and also aggregating in bins to identify concentrations. The majority (c.64%) of dataset has odometer readings above 142,750, i.e. the cars are long into their useful lives. The remainder are spread somewhat evenly between 19,500 - 128,250.

In [27]:

autos.rename({"odometer": "odometer_km"}, axis=1, inplace=True)

In [28]:

autos.isnull().sum()

Out[28]:

date_crawled             0
name                     0
price                    0
abtest                   0
vehicle_type          5095
registration_year        0
gearbox               2680
power_ps                 0
model                 2758
odometer_km              0
registration_month       0
fuel_type             4482
brand                    0
unrepaired_damage     9829
ad_created               0
postal_code              0
last_seen                0
dtype: int64

This is a check to claculate the sum of null values in each of the columns.

In [29]:

autos.dtypes

Out[29]:

date_crawled          object
name                  object
price                  int64
abtest                object
vehicle_type          object
registration_year      int64
gearbox               object
power_ps               int64
model                 object
odometer_km            int64
registration_month     int64
fuel_type             object
brand                 object
unrepaired_damage     object
ad_created            object
postal_code            int64
last_seen             object
dtype: object

This is a check for which columns are int.

Exploring `price`, `odometer_km` data further¶

We will check in each of price and odometer_km for rows that could be dropped as anomolous data.

In [30]:

autos["odometer_km"].unique().shape

Out[30]:

(13,)

In [31]:

autos["price"].unique().shape

Out[31]:

(2357,)

In [32]:

autos["odometer_km"].describe()

Out[32]:

count     50000.000000
mean     125732.700000
std       40042.211706
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64

In [33]:

autos["price"].describe()

Out[33]:

count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64

std and max show that there are significantly large data points.

In [34]:

autos["odometer_km"].value_counts().sort_index(ascending=False).head(20)

Out[34]:

150000    32424
125000     5170
100000     2169
90000      1757
80000      1436
70000      1230
60000      1164
50000      1027
40000       819
30000       789
20000       784
10000       264
5000        967
Name: odometer_km, dtype: int64

In [164]:

autos["price"].value_counts().sort_index(ascending=True).head(20)

Out[164]:

$0        1391
$1         155
$1,000     630
$1,039       1
$1,040       1
$1,049       6
$1,050      94
$1,059       1
$1,070       1
$1,080       6
$1,090       4
$1,095       2
$1,098       1
$1,099      43
$1,100     373
$1,111      38
$1,112       1
$1,119       1
$1,120       2
$1,149      10
Name: price, dtype: int64

Approximately 2.8% of the dataset have zero price. These could be deleted from the dataset on the basis that they are unlikely to have been sold. The price of $1 will be retained as it is possible for an Ebay site to have sold used cars at $1.

In [36]:

autos["price"].value_counts().sort_index(ascending=False).head(20)

Out[36]:

99999999    1
27322222    1
12345678    3
11111111    2
10000000    1
3890000     1
1300000     1
1234566     1
999999      2
999990      1
350000      1
345000      1
299000      1
295000      1
265000      1
259000      1
250000      1
220000      1
198000      1
197000      1
Name: price, dtype: int64

The top price data shows anomolies and it's seems unlikely that prices of between approximately 999,990 and 99,000,000 were achieved. These will also be deleted from the dataset.

In [37]:

autos = autos[autos["price"].between(1,351000)]
autos["price"].describe()

Out[37]:

count     48565.000000
mean       5888.935591
std        9059.854754
min           1.000000
25%        1200.000000
50%        3000.000000
75%        7490.000000
max      350000.000000
Name: price, dtype: float64

In [38]:

autos["price"].value_counts().sort_index().tail()

Out[38]:

265000    1
295000    1
299000    1
345000    1
350000    1
Name: price, dtype: int64

Checking that the price outliers are not included in the dataset.

Exploring date format data¶

In [39]:

autos[['date_crawled','ad_created','last_seen']][0:5]

Out[39]:

	date_crawled	ad_created	last_seen
0	2016-03-26 17:47:46	2016-03-26 00:00:00	2016-04-06 06:45:54
1	2016-04-04 13:38:56	2016-04-04 00:00:00	2016-04-06 14:45:08
2	2016-03-26 18:57:24	2016-03-26 00:00:00	2016-04-06 20:15:37
3	2016-03-12 16:58:10	2016-03-12 00:00:00	2016-03-15 03:16:28
4	2016-04-01 14:38:50	2016-04-01 00:00:00	2016-04-01 14:38:50

In [40]:

(autos["date_crawled"]
        .str[:10]
        .value_counts(normalize=True,dropna=False)
        .sort_index(ascending=True)
        )

Out[40]:

2016-03-05    0.025327
2016-03-06    0.014043
2016-03-07    0.036014
2016-03-08    0.033296
2016-03-09    0.033090
2016-03-10    0.032184
2016-03-11    0.032575
2016-03-12    0.036920
2016-03-13    0.015670
2016-03-14    0.036549
2016-03-15    0.034284
2016-03-16    0.029610
2016-03-17    0.031628
2016-03-18    0.012911
2016-03-19    0.034778
2016-03-20    0.037887
2016-03-21    0.037373
2016-03-22    0.032987
2016-03-23    0.032225
2016-03-24    0.029342
2016-03-25    0.031607
2016-03-26    0.032204
2016-03-27    0.031092
2016-03-28    0.034860
2016-03-29    0.034099
2016-03-30    0.033687
2016-03-31    0.031834
2016-04-01    0.033687
2016-04-02    0.035478
2016-04-03    0.038608
2016-04-04    0.036487
2016-04-05    0.013096
2016-04-06    0.003171
2016-04-07    0.001400
Name: date_crawled, dtype: float64

The crawled_date data is well distributed across all dates. The period covered is approximately a year.

In [41]:

print(autos["ad_created"].str[:10].unique().shape)
(autos["ad_created"]
        .str[:10]
        .value_counts(normalize=True,dropna=False)
        .sort_index(ascending=True)
        )

(76,)

Out[41]:

2015-06-11    0.000021
2015-08-10    0.000021
2015-09-09    0.000021
2015-11-10    0.000021
2015-12-05    0.000021
                ...   
2016-04-03    0.038855
2016-04-04    0.036858
2016-04-05    0.011819
2016-04-06    0.003253
2016-04-07    0.001256
Name: ad_created, Length: 76, dtype: float64

The ad_created data spans a year more than the date_crawled data and has almost double the number of data points, with very small percentages of data on most dates.

In [42]:

(autos["last_seen"]
        .str[:10]
        .value_counts(normalize=True,dropna=False)
        .sort_index(ascending=True)
        )

Out[42]:

2016-03-05    0.001071
2016-03-06    0.004324
2016-03-07    0.005395
2016-03-08    0.007413
2016-03-09    0.009595
2016-03-10    0.010666
2016-03-11    0.012375
2016-03-12    0.023783
2016-03-13    0.008895
2016-03-14    0.012602
2016-03-15    0.015876
2016-03-16    0.016452
2016-03-17    0.028086
2016-03-18    0.007351
2016-03-19    0.015834
2016-03-20    0.020653
2016-03-21    0.020632
2016-03-22    0.021373
2016-03-23    0.018532
2016-03-24    0.019767
2016-03-25    0.019211
2016-03-26    0.016802
2016-03-27    0.015649
2016-03-28    0.020859
2016-03-29    0.022341
2016-03-30    0.024771
2016-03-31    0.023783
2016-04-01    0.022794
2016-04-02    0.024915
2016-04-03    0.025203
2016-04-04    0.024483
2016-04-05    0.124761
2016-04-06    0.221806
2016-04-07    0.131947
Name: last_seen, dtype: float64

last_seen data spans three months with the majority of data concentrated over last three dates. This may be because the crawlers only saw the data at that point. last_seen data relates to when the cars would have been removed from the site, if sold or otherwise taken off. It is not likely that there was a spike in sales in last three days, rather that crawlers only saw the removed cars at that point.

Exploring `registration_year` data¶

In [43]:

autos["registration_year"].describe()

Out[43]:

count    48565.000000
mean      2004.755421
std         88.643887
min       1000.000000
25%       1999.000000
50%       2004.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

The year that the car was first registered will likely indicate the age of the car. Looking at this column, we note some odd values. The minimum value is 1000, long before cars were invented and the maximum is 9999, many years into the future.

In [44]:

autos["registration_year"].value_counts().sort_index(ascending=True).head(20)

Out[44]:

1000    1
1001    1
1111    1
1800    2
1910    5
1927    1
1929    1
1931    1
1934    2
1937    4
1938    1
1939    1
1941    2
1943    1
1948    1
1950    3
1951    2
1952    1
1953    1
1954    2
Name: registration_year, dtype: int64

registration_year prior to 1950 is a possible outlier and would not necessarily affect the dataset given that the number of cars in the dataset that has registration_year before 1950 amounts to approximately 0.05%.

In [45]:

autos["registration_year"].value_counts().sort_index(ascending=False).head(20)

Out[45]:

9999       3
9000       1
8888       1
6200       1
5911       1
5000       4
4800       1
4500       1
4100       1
2800       1
2019       2
2018     470
2017    1392
2016    1220
2015     392
2014     663
2013     803
2012    1310
2011    1623
2010    1589
Name: registration_year, dtype: int64

registration_year after 2016 is a possible outlier given the ad_created dates. Cars in the dataset with registration_year after 2016 amounts to 3.8% of the dataset.

Outliers will be removed from both sides of the spectrum.

In [162]:

autos = autos[autos["registration_year"].between(1949,2017)]
autos["registration_year"].value_counts(normalize=True).sort_index().head(50)

Out[162]:

1950    0.000061
1951    0.000040
1952    0.000020
1953    0.000020
1954    0.000040
1955    0.000040
1956    0.000101
1957    0.000040
1958    0.000081
1959    0.000142
1960    0.000687
1961    0.000121
1962    0.000081
1963    0.000182
1964    0.000243
1965    0.000344
1966    0.000445
1967    0.000546
1968    0.000526
1969    0.000384
1970    0.000910
1971    0.000546
1972    0.000708
1973    0.000526
1974    0.000485
1975    0.000384
1976    0.000546
1977    0.000445
1978    0.000950
1979    0.000708
1980    0.001961
1981    0.000627
1982    0.000869
1983    0.001072
1984    0.001072
1985    0.002123
1986    0.001537
1987    0.001516
1988    0.002871
1989    0.003660
1990    0.007987
1991    0.007198
1992    0.007906
1993    0.008998
1994    0.013345
1995    0.026548
1996    0.029197
1997    0.041005
1998    0.049599
1999    0.060659
Name: registration_year, dtype: float64

Mileage and price exploration of car brands¶

In [47]:

autos["brand"].value_counts(normalize=True,ascending=False)

Out[47]:

volkswagen        0.212661
bmw               0.108838
opel              0.108360
mercedes_benz     0.095894
audi              0.086134
ford              0.069798
renault           0.047677
peugeot           0.029613
fiat              0.025846
seat              0.018625
skoda             0.016170
nissan            0.015316
mazda             0.015254
smart             0.014318
citroen           0.014109
toyota            0.012674
hyundai           0.009989
sonstige_autos    0.009469
volvo             0.009094
mini              0.008699
mitsubishi        0.008158
honda             0.007908
kia               0.007117
alfa_romeo        0.006639
porsche           0.005952
suzuki            0.005869
chevrolet         0.005640
chrysler          0.003517
dacia             0.002664
daihatsu          0.002497
jeep              0.002227
subaru            0.002102
land_rover        0.002060
saab              0.001623
daewoo            0.001561
jaguar            0.001519
trabant           0.001373
rover             0.001353
lancia            0.001124
lada              0.000562
Name: brand, dtype: float64

In [48]:

auto_price = autos["brand"].value_counts(normalize=True)
brand_select = auto_price[auto_price > 0.025].index

price_dict = {}

for a in brand_select:
    brand_rows = autos[autos["brand"] == a]
    mean_price = brand_rows["price"].mean()
    price_dict[a] = int(mean_price)

price_dict

Out[48]:

{'volkswagen': 5347,
 'bmw': 8283,
 'opel': 2953,
 'mercedes_benz': 8525,
 'audi': 9239,
 'ford': 3720,
 'renault': 2443,
 'peugeot': 3069,
 'fiat': 2791}

In [49]:

mileage_dict = {}

for i in brand_select:
    brand_rows = autos[autos["brand"] == i]
    mileage_rows = brand_rows["odometer_km"].mean()
    mileage_dict[i] = int(mileage_rows)
mileage_dict

Out[49]:

{'volkswagen': 128930,
 'bmw': 132691,
 'opel': 129441,
 'mercedes_benz': 131004,
 'audi': 129406,
 'ford': 124360,
 'renault': 128149,
 'peugeot': 127315,
 'fiat': 117367}

In [50]:

bmp_series = pd.Series(price_dict)
print(bmp_series)

volkswagen       5347
bmw              8283
opel             2953
mercedes_benz    8525
audi             9239
ford             3720
renault          2443
peugeot          3069
fiat             2791
dtype: int64

In [51]:

df = pd.DataFrame(bmp_series, columns=['mean_price'])
df

Out[51]:

	mean_price
volkswagen	5347
bmw	8283
opel	2953
mercedes_benz	8525
audi	9239
ford	3720
renault	2443
peugeot	3069
fiat	2791

In [52]:

mileage_series = pd.Series(mileage_dict)

In [53]:

df["mean_mileage"] = mileage_series

In [54]:

df

Out[54]:

	mean_price	mean_mileage
volkswagen	5347	128930
bmw	8283	132691
opel	2953	129441
mercedes_benz	8525	131004
audi	9239	129406
ford	3720	124360
renault	2443	128149
peugeot	3069	127315
fiat	2791	117367

In [55]:

# alternative way to code the mean price
# autos_select = autos[(autos["brand"] == "ford")|(autos["brand"] == "audi")|(autos["brand"] == "mercedes_benz")|(autos["brand"] == "bmw")|(autos["brand"] == "volkswagen")|(autos["brand"] == "opel")]
# autos_select[["brand","price"]].mean()
# select_index = autos_select["brand"].value_counts(normalize=True).index
# # auto_select_index = autos_select[autos_select].index
# auto_dict = {}
# for a in select_index:
#     row_selection = autos[autos["brand"] == a]
#     mean_price = row_selection["price"].mean()
#     auto_dict[a] = int(mean_price)
# auto_dict
    

The range of car mileages does not vary as much as the prices do, measured by brand, instead all falling within 10% for the top brands. There is a slight trend to the more expensive vehicles having higher mileage, with the less expensive vehicles having lower mileage.

Additional exploration¶

We will do the following additional analysis:

Find the most common brand/model combinations
Split the odometer_km into groups, and use aggregation to see if average prices follows any patterns based on the mileage
Check if cars with damage are cheaper than their non-damaged counterparts

In [56]:

autos.columns

Out[56]:

Index(['date_crawled', 'name', 'price', 'abtest', 'vehicle_type',
       'registration_year', 'gearbox', 'power_ps', 'model', 'odometer_km',
       'registration_month', 'fuel_type', 'brand', 'unrepaired_damage',
       'ad_created', 'postal_code', 'last_seen'],
      dtype='object')

Split `odometer_km` into groups and aggregate mean prices¶

We will use value_counts() function to identify each of the sub-categories in odometer_km column and create an index of these data.

We will then create a dictionary of keys of these data points and mean values of corresponding odometer_km data. This dictionary will be a source of data for a new dataframe.

In [57]:

odo_groups = autos["odometer_km"].value_counts()
odo_index = odo_groups[odo_groups > 0].index

In [58]:

odo_groups

Out[58]:

150000    31087
125000     5012
100000     2092
90000      1714
80000      1400
70000      1207
60000      1147
50000      1005
40000       811
5000        802
30000       775
20000       756
10000       245
Name: odometer_km, dtype: int64

In [59]:

odo_index

Out[59]:

Int64Index([150000, 125000, 100000,  90000,  80000,  70000,  60000,  50000,
             40000,   5000,  30000,  20000,  10000],
           dtype='int64')

In [60]:

odo_dict = {}

for i in odo_index:
    odo_rows = autos[autos["odometer_km"] == i]
    price_rows = odo_rows["price"].mean()
    odo_dict[i] = int(price_rows)
# sorted(odo_dict)
sorted_odo_dict = {k: v for k, v in sorted(odo_dict.items(), key=lambda item: item[0])}
sorted_odo_dict

Out[60]:

{5000: 8689,
 10000: 20363,
 20000: 18191,
 30000: 16447,
 40000: 15505,
 50000: 13754,
 60000: 12251,
 70000: 10874,
 80000: 9642,
 90000: 8395,
 100000: 8042,
 125000: 6163,
 150000: 3734}

In [61]:

odo_series = pd.Series(sorted_odo_dict)
print(odo_series)
print(type(odo_series))
df = pd.DataFrame(odo_series, columns=['mean_price'])
df

5000       8689
10000     20363
20000     18191
30000     16447
40000     15505
50000     13754
60000     12251
70000     10874
80000      9642
90000      8395
100000     8042
125000     6163
150000     3734
dtype: int64
<class 'pandas.core.series.Series'>

Out[61]:

	mean_price
5000	8689
10000	20363
20000	18191
30000	16447
40000	15505
50000	13754
60000	12251
70000	10874
80000	9642
90000	8395
100000	8042
125000	6163
150000	3734

The dataframe has been sorted by keys. These data points show that cars with lowest mileage in the dataset have relatively low mean prices. This could be anomolous and based on a low count of data (1.7%) at the 5,000 odometer_km level. The rest of the data shows an expected curve of reduction in mean_price as odometer_km increases.

In [62]:

autos.groupby('odometer_km')['brand'].value_counts(normalize=True).head(39)

Out[62]:

odometer_km  brand         
5000         volkswagen        0.194514
             opel              0.128429
             bmw               0.099751
             mercedes_benz     0.071072
             ford              0.067332
             sonstige_autos    0.056110
             renault           0.054863
             audi              0.046135
             fiat              0.036160
             porsche           0.018703
             seat              0.017456
             skoda             0.017456
             nissan            0.016209
             chevrolet         0.014963
             mazda             0.014963
             peugeot           0.014963
             trabant           0.013716
             mini              0.011222
             mitsubishi        0.011222
             alfa_romeo        0.009975
             citroen           0.009975
             honda             0.009975
             chrysler          0.007481
             toyota            0.007481
             daihatsu          0.006234
             hyundai           0.006234
             smart             0.006234
             suzuki            0.006234
             volvo             0.004988
             kia               0.003741
             land_rover        0.003741
             dacia             0.002494
             daewoo            0.002494
             lada              0.002494
             jaguar            0.001247
             jeep              0.001247
             lancia            0.001247
             subaru            0.001247
10000        volkswagen        0.208163
Name: brand, dtype: float64

A check with groupby brand in the 5,000 odometer_km sub-category does not necessarily show anything that could influence the mean_price e.g. a higher frequency for lower priced cars.

In [63]:

autos["brand"].value_counts(normalize=True)

Out[63]:

volkswagen        0.212661
bmw               0.108838
opel              0.108360
mercedes_benz     0.095894
audi              0.086134
ford              0.069798
renault           0.047677
peugeot           0.029613
fiat              0.025846
seat              0.018625
skoda             0.016170
nissan            0.015316
mazda             0.015254
smart             0.014318
citroen           0.014109
toyota            0.012674
hyundai           0.009989
sonstige_autos    0.009469
volvo             0.009094
mini              0.008699
mitsubishi        0.008158
honda             0.007908
kia               0.007117
alfa_romeo        0.006639
porsche           0.005952
suzuki            0.005869
chevrolet         0.005640
chrysler          0.003517
dacia             0.002664
daihatsu          0.002497
jeep              0.002227
subaru            0.002102
land_rover        0.002060
saab              0.001623
daewoo            0.001561
jaguar            0.001519
trabant           0.001373
rover             0.001353
lancia            0.001124
lada              0.000562
Name: brand, dtype: float64

Checking the percentages across all brands in the whole dataset shows generally similarly well- distributed as for the 5,000 odometer_km subcategory.

In [64]:

brand_list = []

for x in autos["brand"]:
    if x not in brand_list:
        brand_list.append(x)

brand_odo_5000_dict = {}

for a in brand_list:
    if a != "saab" and a != "rover":
        selected_rows = autos.loc[(autos["brand"] == a) & (autos["odometer_km"] == 5000)]
#     odo_mean_price = selected_rows["price"].value_counts()
        odo_mean_price = selected_rows["price"].mean()
        brand_odo_5000_dict[a] = int(odo_mean_price)
# brand_odo_5000_dict

We have set a condition for the two cars brands that appear as nan in the dictionary and do not permit a conversion to integers.

In [65]:

brand_5000_price_series = pd.Series(brand_odo_5000_dict)
# print(model_series)
brand_5000_price_df = pd.DataFrame(brand_5000_price_series, columns=['5000 odo mean price'])
brand_5000_price_df

Out[65]:

	5000 odo mean price
peugeot	3070
bmw	9696
volkswagen	4374
smart	1175
ford	7524
chrysler	1136
seat	6213
renault	4201
mercedes_benz	10451
audi	14992
sonstige_autos	15372
opel	3883
mazda	5736
porsche	91342
mini	10921
toyota	6631
dacia	10949
nissan	3679
jeep	45500
volvo	2148
mitsubishi	3390
jaguar	53000
fiat	4850
skoda	7924
subaru	31800
kia	1533
citroen	6573
chevrolet	6301
hyundai	7818
honda	6447
daewoo	995
suzuki	5062
trabant	1277
land_rover	41163
alfa_romeo	8562
lada	3750
daihatsu	1160
lancia	3000

In [66]:

brand_odo_10000_dict = {}

for a in brand_list:
    if a != "chrysler" and a != "jeep" and a != "saab" and a != "rover" and a != "volvo" and a != "mitsubishi" and a != "chevrolet" and a != "honda" and a != "daewoo" and a != "suzuki" and a != "land_rover" and a != "lada" and a != "daihatsu" and a != "lancia":
        selected_rows = autos.loc[(autos["brand"] == a) & (autos["odometer_km"] == 10000)]
#     odo_mean_price = selected_rows["price"].value_counts()
        odo_mean_price = selected_rows["price"].mean()
        odo_mean_price = int(odo_mean_price)
        brand_odo_10000_dict[a] = odo_mean_price
brand_odo_10000_dict

Out[66]:

{'peugeot': 11922,
 'bmw': 31102,
 'volkswagen': 19049,
 'smart': 13800,
 'ford': 14971,
 'seat': 19950,
 'renault': 7925,
 'mercedes_benz': 29478,
 'audi': 28339,
 'sonstige_autos': 12546,
 'opel': 12485,
 'mazda': 18956,
 'porsche': 114676,
 'mini': 25045,
 'toyota': 15958,
 'dacia': 5450,
 'nissan': 3058,
 'jaguar': 70,
 'fiat': 9718,
 'skoda': 12535,
 'subaru': 2600,
 'kia': 8538,
 'citroen': 18055,
 'hyundai': 12749,
 'trabant': 2847,
 'alfa_romeo': 20125}

The list of cars brands with nan values to be excluded from the dictionary is much longer than for the 5000 odometer_km sub-category.

In [67]:

brand_10000_price_series = pd.Series(brand_odo_10000_dict)
print(brand_10000_price_series)
brand_10000_price_df = pd.DataFrame(brand_10000_price_series, columns=['10000 odo mean price'])
brand_10000_price_df

peugeot            11922
bmw                31102
volkswagen         19049
smart              13800
ford               14971
seat               19950
renault             7925
mercedes_benz      29478
audi               28339
sonstige_autos     12546
opel               12485
mazda              18956
porsche           114676
mini               25045
toyota             15958
dacia               5450
nissan              3058
jaguar                70
fiat                9718
skoda              12535
subaru              2600
kia                 8538
citroen            18055
hyundai            12749
trabant             2847
alfa_romeo         20125
dtype: int64

Out[67]:

	10000 odo mean price
peugeot	11922
bmw	31102
volkswagen	19049
smart	13800
ford	14971
seat	19950
renault	7925
mercedes_benz	29478
audi	28339
sonstige_autos	12546
opel	12485
mazda	18956
porsche	114676
mini	25045
toyota	15958
dacia	5450
nissan	3058
jaguar	70
fiat	9718
skoda	12535
subaru	2600
kia	8538
citroen	18055
hyundai	12749
trabant	2847
alfa_romeo	20125

In [68]:

brand_5000_price_df["10000 odo mean price"] = brand_10000_price_series
brand_5000_price_df_sorted = brand_5000_price_df.sort_values(by="10000 odo mean price", ascending=False)
brand_5000_price_df_sorted 
# brand_5000_price_df

Out[68]:

	5000 odo mean price	10000 odo mean price
porsche	91342	114676.0
bmw	9696	31102.0
mercedes_benz	10451	29478.0
audi	14992	28339.0
mini	10921	25045.0
alfa_romeo	8562	20125.0
seat	6213	19950.0
volkswagen	4374	19049.0
mazda	5736	18956.0
citroen	6573	18055.0
toyota	6631	15958.0
ford	7524	14971.0
smart	1175	13800.0
hyundai	7818	12749.0
sonstige_autos	15372	12546.0
skoda	7924	12535.0
opel	3883	12485.0
peugeot	3070	11922.0
fiat	4850	9718.0
kia	1533	8538.0
renault	4201	7925.0
dacia	10949	5450.0
nissan	3679	3058.0
trabant	1277	2847.0
subaru	31800	2600.0
jaguar	53000	70.0
chrysler	1136	NaN
jeep	45500	NaN
volvo	2148	NaN
mitsubishi	3390	NaN
chevrolet	6301	NaN
honda	6447	NaN
daewoo	995	NaN
suzuki	5062	NaN
land_rover	41163	NaN
lada	3750	NaN
daihatsu	1160	NaN
lancia	3000	NaN

There is no obvious reason why the mean prices for the odo 5000 category are mostly lower than the more used odo 10000 category. Only a few examples (Dacia, Subaru and Jaguar) are at a higher mean price in the odo 5000 category. In case of Jaguar the mean price for the odo 10000 category is extremely low and does not make sense unless the Jaguar cars that fall into that category are damaged. These results would be useful to explore further to explain the anomolies.

In [278]:

# a = autos.groupby("brand").mean().round(0).astype(int)
# a = a.sort_values("odometer_km")
# a = a.sort_values("odometer_km")
# a["odometer_km"]
# b = pd.DataFrame(a, columns=['mean_price'])
# b

Most common brand/model combination¶

We will find the most common brand/combination and sorted in reverse order according to model volume amounts.

In [70]:

brand_list = []

for x in autos["brand"]:
    if x not in brand_list:
        brand_list.append(x)
# brand_list

In [71]:

brand_model_dict = {}

for a in brand_list:
    selected_rows = autos[autos["brand"] == a]
    model_select = selected_rows["model"].value_counts(dropna=False)
    model_select = str(model_select)
    model_select_split = model_select.split()
    model_name = model_select_split[0]
    brand_model_dict[a] = model_name

In [72]:

model_series = pd.Series(brand_model_dict)
# print(model_series)
brand_model_df = pd.DataFrame(model_series, columns=['model_type'])
# brand_model_df

In [279]:

brand_model_dict_volumes = {}

for a in brand_list:
    selected_rows = autos[autos["brand"] == a]
    model_select = selected_rows["model"].value_counts(dropna=False)
    model_select = str(model_select)
    model_select_split = model_select.split()
    model_volumes = model_select_split[1]
    model_volumes = int(model_volumes)
    brand_model_dict_volumes[a] = model_volumes

In [280]:

model_series_vol = pd.Series(brand_model_dict_volumes)
# print(model_series_vol)
brand_model_vol_df = pd.DataFrame(model_series_vol, columns=['model_volumes'])
# brand_model_vol_df

In [281]:

brand_model_df["model_volumes"] = model_series_vol
brand_model_df_vol_sorted = brand_model_df.sort_values("model_volumes")
# brand_model_df_vol_sorted

In [282]:

brand_model_dict_vol_pct = {}

for a in brand_list:
    selected_rows = autos[autos["brand"] == a]
    model_select = selected_rows["model"].value_counts(normalize=True,dropna=False)
    model_select = str(model_select)
#     pcts = selected_rows.value_counts('model', normalize=True).head(1)
    model_select_split = model_select.split()
    model_vol_pct = model_select_split[1]
    model_vol_pct = int(float(model_vol_pct)*100)
#     model_vol_pct = round(float(model_vol_pct), 2)
    brand_model_dict_vol_pct[a] = model_vol_pct

In [283]:

model_series_vol_pct = pd.Series(brand_model_dict_vol_pct)
# print(model_series_vol_pct)
brand_model_vol_pct_df = pd.DataFrame(model_series_vol_pct, columns=['model_vol_%'])
# brand_model_vol_pct_df

In [284]:

brand_model_df["model_vol_%"] = model_series_vol_pct
brand_model_df_vol_pct_sorted = brand_model_df.sort_values(by="model_volumes", ascending=False)
brand_model_df_vol_pct_sorted

Out[284]:

	model_type	model_volumes	model_vol_%
volkswagen	golf	4024	37
bmw	3er	2761	50
opel	corsa	1735	31
audi	a4	1291	30
mercedes_benz	c_klasse	1172	24
ford	focus	811	23
renault	twingo	677	28
peugeot	2_reihe	621	42
smart	fortwo	574	81
sonstige_autos	NaN	546	100
fiat	punto	458	35
seat	ibiza	356	37
skoda	octavia	309	39
mini	cooper	279	65
nissan	micra	257	34
citroen	andere	238	33
honda	civic	186	46
mazda	3_reihe	182	24
hyundai	i_reihe	179	36
chevrolet	andere	154	54
porsche	911	143	48
toyota	yaris	129	20
suzuki	andere	128	43
mitsubishi	colt	120	29
volvo	v70	97	21
alfa_romeo	156	93	28
kia	andere	69	19
daihatsu	cuore	69	53
rover	andere	59	85
trabant	601	56	71
saab	andere	55	68
chrysler	andere	52	28
dacia	sandero	47	36
jeep	grand	43	39
jaguar	andere	39	50
subaru	legacy	32	29
land_rover	freelander	32	32
daewoo	matiz	25	31
lancia	ypsilon	25	43
lada	niva	15	48

Most popular car model by volume is the Volkswagen Golf that is also a reasonable percentage of all Volkswagen models in this dataset, confirming what we already are aware of that Volkswagen make up more than 20% of the entire dataset.

How much cheaper are damaged versus non-damaged cars?¶

In [285]:

autos.columns

Out[285]:

Index(['date_crawled', 'name', 'price', 'abtest', 'vehicle_type',
       'registration_year', 'gearbox', 'power_ps', 'model', 'odometer',
       'registration_month', 'fuel_type', 'brand', 'unrepaired_damage',
       'ad_created', 'postal_code', 'last_seen'],
      dtype='object')

In [288]:

brand_damage_dict = {}

for a in brand_list:
    selected_rows = autos.loc[(autos["brand"] == a) & ((autos["unrepaired_damage"] == "ja") | (autos["unrepaired_damage"] == "yes") )]
    damage_mean_price = selected_rows["price"].mean()
    brand_damage_dict[a] = int(damage_mean_price)
# brand_damage_dict

In [289]:

brand_damage_price_series = pd.Series(brand_damage_dict)
# print(brand_damage_price_series)
brand_damage_price_df = pd.DataFrame(brand_damage_price_series, columns=['mean_price_damaged'])

In [290]:

brand_damage_dict_no_repair = {}

for a in brand_list:
    selected_rows = autos.loc[(autos["brand"] == a) & ((autos["unrepaired_damage"] == "nein") | (autos["unrepaired_damage"] == "no") )]
    damage_mean_price = selected_rows["price"].mean()
    brand_damage_dict_no_repair[a] = int(damage_mean_price)

In [291]:

brand_no_damage_price_series = pd.Series(brand_damage_dict_no_repair)
print(brand_no_damage_price_series)
brand_no_damage_price_df = pd.DataFrame(brand_no_damage_price_series, columns=['mean_price_not_damaged'])
brand_no_damage_price_df

peugeot            3602
bmw                9266
volkswagen         6307
smart              3875
ford               9925
chrysler           3902
seat               5045
renault            2999
mercedes_benz      9635
audi              10690
sonstige_autos    62159
opel               7013
mazda              5012
porsche           50174
mini              11032
toyota             5528
dacia              6194
nissan             5621
jeep              12537
saab               3830
volvo             42309
mitsubishi         4089
jaguar            13618
fiat               3406
skoda              7045
subaru             5264
kia                6780
citroen            4202
chevrolet          7258
hyundai            6214
honda              4756
daewoo             1177
suzuki             4724
trabant            2698
land_rover        21438
alfa_romeo         4709
lada               3099
rover              1848
daihatsu           1723
lancia             3042
dtype: int64

Out[291]:

	mean_price_not_damaged
peugeot	3602
bmw	9266
volkswagen	6307
smart	3875
ford	9925
chrysler	3902
seat	5045
renault	2999
mercedes_benz	9635
audi	10690
sonstige_autos	62159
opel	7013
mazda	5012
porsche	50174
mini	11032
toyota	5528
dacia	6194
nissan	5621
jeep	12537
saab	3830
volvo	42309
mitsubishi	4089
jaguar	13618
fiat	3406
skoda	7045
subaru	5264
kia	6780
citroen	4202
chevrolet	7258
hyundai	6214
honda	4756
daewoo	1177
suzuki	4724
trabant	2698
land_rover	21438
alfa_romeo	4709
lada	3099
rover	1848
daihatsu	1723
lancia	3042

In [292]:

brand_damage_price_df["mean_price_no_damage"] = brand_no_damage_price_series
brand_damage_price_df_sorted = brand_damage_price_df.sort_values(by="mean_price_no_damage", ascending=False)
brand_damage_price_df_sorted

Out[292]:

	mean_price_damaged	mean_price_no_damage
sonstige_autos	6187	62159
porsche	13454	50174
volvo	1699	42309
land_rover	5223	21438
jaguar	3999	13618
jeep	2549	12537
mini	4645	11032
audi	3131	10690
ford	1303	9925
mercedes_benz	3801	9635
bmw	3287	9266
chevrolet	3133	7258
skoda	3477	7045
opel	1291	7013
kia	1939	6780
volkswagen	3081	6307
hyundai	2375	6214
dacia	4459	6194
nissan	1901	5621
toyota	3388	5528
subaru	2040	5264
seat	1587	5045
mazda	1311	5012
honda	1495	4756
suzuki	1326	4724
alfa_romeo	1703	4709
citroen	1914	4202
mitsubishi	1265	4089
chrysler	1838	3902
smart	1410	3875
saab	594	3830
peugeot	1319	3602
fiat	1105	3406
lada	1400	3099
lancia	2436	3042
renault	1063	2999
trabant	496	2698
rover	710	1848
daihatsu	659	1723
daewoo	516	1177

In [293]:

brand_damage_price_df["pct_diff"] = ((brand_damage_price_df["mean_price_damaged"] / brand_damage_price_df["mean_price_no_damage"])*100).astype(int)
brand_damage_price_final_df_sorted = brand_damage_price_df.sort_values(by="pct_diff", ascending=False)
brand_damage_price_final_df_sorted

Out[293]:

	mean_price_damaged	mean_price_no_damage	pct_diff
lancia	2436	3042	80
dacia	4459	6194	71
toyota	3388	5528	61
skoda	3477	7045	49
volkswagen	3081	6307	48
chrysler	1838	3902	47
lada	1400	3099	45
citroen	1914	4202	45
daewoo	516	1177	43
chevrolet	3133	7258	43
mini	4645	11032	42
mercedes_benz	3801	9635	39
daihatsu	659	1723	38
rover	710	1848	38
hyundai	2375	6214	38
subaru	2040	5264	38
alfa_romeo	1703	4709	36
peugeot	1319	3602	36
smart	1410	3875	36
bmw	3287	9266	35
renault	1063	2999	35
nissan	1901	5621	33
fiat	1105	3406	32
seat	1587	5045	31
honda	1495	4756	31
mitsubishi	1265	4089	30
jaguar	3999	13618	29
audi	3131	10690	29
kia	1939	6780	28
suzuki	1326	4724	28
porsche	13454	50174	26
mazda	1311	5012	26
land_rover	5223	21438	24
jeep	2549	12537	20
opel	1291	7013	18
trabant	496	2698	18
saab	594	3830	15
ford	1303	9925	13
sonstige_autos	6187	62159	9
volvo	1699	42309	4

The vast majority of cars brands are less expensive if damaged. Lancia is the outlier. However the extent of price difference for damaged versus non-damaged varies quite substantially. These aspects may be because of the age of the cars in each cohort (damaged and non-damaged), as well as variations in price for each model of the cars. It could be useful to explore these anomolies to gain a more granular insight.