Exploring Ebay Car Sales Data¶

The aim of this project is to clean the data and analyze the included used car listings.

1. Introduction to the Data Set¶

The data set consists of 50,000 data points, organized in the following fields:

Column Name	Col Description
dateCrawled	When this ad was first crawled. All field-values are taken from this date.
name	Name of the car.
seller	Whether the seller is private or a dealer.
offerType	The type of listing
price	The price on the ad to sell the car.
abtest	Whether the listing is included in an A/B test.
vehicleType	The vehicle Type.
yearOfRegistration	The year in which the car was first registered.
gearbox	The transmission type.
powerPS	The power of the car in PS.
model	The car model name.
kilometer	How many kilometers the car has driven.
monthOfRegistration	The month in which the car was first registered.
fuelType	What type of fuel the car uses.
brand	The brand of the car.
notRepairedDamage	If the car has a damage which is not yet repaired.
dateCreated	The date on which the eBay listing was created.
nrOfPictures	The number of pictures in the ad.
postalCode	The postal code for the location of the vehicle.
lastSeenOnline	When the crawler saw this ad last online.

In [1]:

import pandas as pd
import numpy as np
autos = pd.read_csv('autos.csv', encoding = 'Latin-1')

In [2]:

autos.head(3)

Out[2]:

	dateCrawled	name	seller	offerType	price	abtest	vehicleType	yearOfRegistration	gearbox	powerPS	model	odometer	monthOfRegistration	fuelType	brand	notRepairedDamage	dateCreated	postalCode	lastSeen
0	2016-03-26 17:47:46	Peugeot_807_160_NAVTECH_ON_BOARD	privat	Angebot	$5,000	control	bus	2004	manuell	158	andere	150,000km	3	lpg	peugeot	nein	2016-03-26 00:00:00	79588	2016-04-06 06:45:54
1	2016-04-04 13:38:56	BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik	privat	Angebot	$8,500	control	limousine	1997	automatik	286	7er	150,000km	6	benzin	bmw	nein	2016-04-04 00:00:00	71034	2016-04-06 14:45:08
2	2016-03-26 18:57:24	Volkswagen_Golf_1.6_United	privat	Angebot	$8,990	test	limousine	2009	manuell	102	golf	70,000km	7	benzin	volkswagen	nein	2016-03-26 00:00:00	35394	2016-04-06 20:15:37

In [3]:

for i in range(len(autos.columns)):
    print(i, autos.columns[i])

0 dateCrawled
1 name
2 seller
3 offerType
4 price
5 abtest
6 vehicleType
7 yearOfRegistration
8 gearbox
9 powerPS
10 model
11 odometer
12 monthOfRegistration
13 fuelType
14 brand
15 notRepairedDamage
16 dateCreated
17 nrOfPictures
18 postalCode
19 lastSeen

2. Cleaning Column Names¶

In [4]:

#Populate a dictionary with mapping criteria for renaming columns.
c = autos.columns.copy()
m = {
    'yearOfRegistration' : 'registration_year',
    'monthOfRegistration' : 'registration_month',
    'notRepairedDamage' : 'unrepaired_damage',
    'dateCreated' : 'ad_created'
    }
for i in c:
    if i not in m:
        split_points = []
        start = 0
        for j in range(len(i)):
            if i[j] in 'ABCDEFGHIJKLMNOPQRSTUVWXYZ':
                split_points.append(i[start:j].lower())
                start = j
        split_points.append(i[start:].lower())
        if len(split_points) > 0:
            m[i]='_'.join(split_points)
        else:
            m[i] = i
m

Out[4]:

{'abtest': 'abtest',
 'brand': 'brand',
 'dateCrawled': 'date_crawled',
 'dateCreated': 'ad_created',
 'fuelType': 'fuel_type',
 'gearbox': 'gearbox',
 'lastSeen': 'last_seen',
 'model': 'model',
 'monthOfRegistration': 'registration_month',
 'name': 'name',
 'notRepairedDamage': 'unrepaired_damage',
 'nrOfPictures': 'nr_of_pictures',
 'odometer': 'odometer',
 'offerType': 'offer_type',
 'postalCode': 'postal_code',
 'powerPS': 'power_p_s',
 'price': 'price',
 'seller': 'seller',
 'vehicleType': 'vehicle_type',
 'yearOfRegistration': 'registration_year'}

In [5]:

#Use the mapping dic to change column names
autos.rename(columns = m, inplace = True)
autos.head(2)

Out[5]:

	date_crawled	name	seller	offer_type	price	abtest	vehicle_type	registration_year	gearbox	power_p_s	model	odometer	registration_month	fuel_type	brand	unrepaired_damage	ad_created	nr_of_pictures	postal_code	last_seen
0	2016-03-26 17:47:46	Peugeot_807_160_NAVTECH_ON_BOARD	privat	Angebot	$5,000	control	bus	2004	manuell	158	andere	150,000km	3	lpg	peugeot	nein	2016-03-26 00:00:00	0	79588	2016-04-06 06:45:54
1	2016-04-04 13:38:56	BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik	privat	Angebot	$8,500	control	limousine	1997	automatik	286	7er	150,000km	6	benzin	bmw	nein	2016-04-04 00:00:00	0	71034	2016-04-06 14:45:08

New Column Names and Indexes¶

In [6]:

for i in range(len(autos.columns)):
    print(i, autos.columns[i])

0 date_crawled
1 name
2 seller
3 offer_type
4 price
5 abtest
6 vehicle_type
7 registration_year
8 gearbox
9 power_p_s
10 model
11 odometer
12 registration_month
13 fuel_type
14 brand
15 unrepaired_damage
16 ad_created
17 nr_of_pictures
18 postal_code
19 last_seen

Thus far we have reviewed the dataset and changed the column headings. For the sake of simplicity and clarity we converted 'camelcase' to 'snakecase' and shortened some of the names. All headings are lower case now.

3. Initial Exploration and Cleaning.¶

In [7]:

autos.describe(include = 'all')

Out[7]:

	date_crawled	name	seller	offer_type	price	abtest	vehicle_type	registration_year	gearbox	power_p_s	model	odometer	registration_month	fuel_type	brand	unrepaired_damage	ad_created	nr_of_pictures	postal_code	last_seen
count	50000	50000	50000	50000	50000	50000	44905	50000.000000	47320	50000.000000	47242	50000	50000.000000	45518	50000	40171	50000	50000.0	50000.000000	50000
unique	48213	38754	2	2	2357	2	8	NaN	2	NaN	245	13	NaN	7	40	2	76	NaN	NaN	39481
top	2016-03-09 11:54:38	Ford_Fiesta	privat	Angebot	$0	test	limousine	NaN	manuell	NaN	golf	150,000km	NaN	benzin	volkswagen	nein	2016-04-03 00:00:00	NaN	NaN	2016-04-07 06:17:27
freq	3	78	49999	49999	1421	25756	12859	NaN	36993	NaN	4024	32424	NaN	30107	10687	35232	1946	NaN	NaN	8
mean	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2005.073280	NaN	116.355920	NaN	NaN	5.723360	NaN	NaN	NaN	NaN	0.0	50813.627300	NaN
std	NaN	NaN	NaN	NaN	NaN	NaN	NaN	105.712813	NaN	209.216627	NaN	NaN	3.711984	NaN	NaN	NaN	NaN	0.0	25779.747957	NaN
min	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1000.000000	NaN	0.000000	NaN	NaN	0.000000	NaN	NaN	NaN	NaN	0.0	1067.000000	NaN
25%	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1999.000000	NaN	70.000000	NaN	NaN	3.000000	NaN	NaN	NaN	NaN	0.0	30451.000000	NaN
50%	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2003.000000	NaN	105.000000	NaN	NaN	6.000000	NaN	NaN	NaN	NaN	0.0	49577.000000	NaN
75%	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2008.000000	NaN	150.000000	NaN	NaN	9.000000	NaN	NaN	NaN	NaN	0.0	71540.000000	NaN
max	NaN	NaN	NaN	NaN	NaN	NaN	NaN	9999.000000	NaN	17700.000000	NaN	NaN	12.000000	NaN	NaN	NaN	NaN	0.0	99998.000000	NaN

In [8]:

autos['seller' ].value_counts()

Out[8]:

privat        49999
gewerblich        1
Name: seller, dtype: int64

In [9]:

autos[ 'offer_type'].value_counts()

Out[9]:

Angebot    49999
Gesuch         1
Name: offer_type, dtype: int64

In [10]:

autos.loc[(autos['registration_year']<1900) | (autos['registration_year']>2020),'registration_year'].sort_values(ascending = False)

Out[10]:

14341    9999
38076    9999
33950    9999
8012     9999
6308     9996
49910    9000
13559    9000
25003    8888
8360     6200
27618    5911
49153    5000
4164     5000
24519    5000
22799    5000
42079    4800
453      4500
4549     4100
27578    2800
10556    1800
32585    1800
35238    1500
24511    1111
49283    1001
22316    1000
Name: registration_year, dtype: int64

In [11]:

autos['nr_of_pictures'].value_counts()

Out[11]:

0    50000
Name: nr_of_pictures, dtype: int64

In [12]:

autos[['price', 'registration_year', 'power_p_s', 'odometer','registration_month', 'nr_of_pictures']].head()

Out[12]:

	price	registration_year	power_p_s	odometer	registration_month
0	$5,000	2004	158	150,000km	3
1	$8,500	1997	286	150,000km	6
2	$8,990	2009	102	70,000km	7
3	$4,350	2007	71	70,000km	6
4	$1,350	2003	0	150,000km	7

In [13]:

autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
date_crawled          50000 non-null object
name                  50000 non-null object
seller                50000 non-null object
offer_type            50000 non-null object
price                 50000 non-null object
abtest                50000 non-null object
vehicle_type          44905 non-null object
registration_year     50000 non-null int64
gearbox               47320 non-null object
power_p_s             50000 non-null int64
model                 47242 non-null object
odometer              50000 non-null object
registration_month    50000 non-null int64
fuel_type             45518 non-null object
brand                 50000 non-null object
unrepaired_damage     40171 non-null object
ad_created            50000 non-null object
nr_of_pictures        50000 non-null int64
postal_code           50000 non-null int64
last_seen             50000 non-null object
dtypes: int64(5), object(15)
memory usage: 7.6+ MB

The examination above reveals the following issues:

colums with only one value that would be of no use:

'seller', 'offer_type', 'nr_of_pictures'

columns, containing numeric data, stored as text: 'price', 'odometer';
other issues: column 'registration_year' contains years later than current year and earlier than 1900 that need to be fixed.

In [14]:

# replaceing non-numeric characters and converting the columns to 'int' type.
autos['price'] = autos['price'].str.replace("$","") 
autos['price'] = autos['price'].str.replace(",","").astype(int)
autos['odometer'] = autos['odometer'].str.replace(",","")
autos['odometer'] = autos['odometer'].str.replace("km","").astype(int)

autos[['price', 'odometer']].head()

Out[14]:

	price	odometer
0	5000	150000
1	8500	150000
2	8990	70000
3	4350	70000
4	1350	150000

In [15]:

# Rename the 'odometer' column to keep the information
autos.rename(columns = {'odometer': 'odometer_km'}, inplace = True)
autos[['price', 'odometer_km']].head()

Out[15]:

	price	odometer_km
0	5000	150000
1	8500	150000
2	8990	70000
3	4350	70000
4	1350	150000

4. Exploring the 'Odometer' and 'Price' Columns¶

In [16]:

print('Odometer minimum = ',autos['odometer_km'].min())
print('Odometer maximum = ',autos['odometer_km'].max())
print('Price minimum = ',autos['price'].min())
print('Price maximum = ',autos['price'].max())

Odometer minimum =  5000
Odometer maximum =  150000
Price minimum =  0
Price maximum =  99999999

While the 'odometer' data seems okay, 'price' data has some erroneous min and max values. Let's investigate it further.

In [17]:

print('Unique values in "odometer_km" column: ',autos['odometer_km'].unique().shape)
print('Unique values in "price" column: ',autos['price'].unique().shape)

Unique values in "odometer_km" column:  (13,)
Unique values in "price" column:  (2357,)

In [18]:

autos['odometer_km'].describe()

Out[18]:

count     50000.000000
mean     125732.700000
std       40042.211706
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64

In [19]:

autos['price'].describe()

Out[19]:

count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64

In [20]:

autos['odometer_km'].value_counts()

Out[20]:

150000    32424
125000     5170
100000     2169
90000      1757
80000      1436
70000      1230
60000      1164
50000      1027
5000        967
40000       819
30000       789
20000       784
10000       264
Name: odometer_km, dtype: int64

Inadequately high or low values in the 'price' column.

In [21]:

autos.loc[autos['price']<501,'price'].value_counts().sort_index(ascending = True)

Out[21]:

0      1421
1       156
2         3
3         1
5         2
8         1
9         1
10        7
11        2
12        3
13        2
14        1
15        2
17        3
18        1
20        4
25        5
29        1
30        7
35        1
40        6
45        4
47        1
49        4
50       49
55        2
59        1
60        9
65        5
66        1
       ... 
375       7
378       1
379       1
380      29
385       1
388       1
390      26
395       2
399      71
400     321
410       1
414       1
420       9
425       4
430      13
435       1
440       1
444       8
449      12
450     265
459       1
460       5
470       7
475       4
480      25
485       2
490      34
495       4
499      92
500     781
Name: price, Length: 140, dtype: int64

In [22]:

autos.loc[autos['price']>100000,'price'].value_counts().sort_index(ascending = True)

Out[22]:

104900      1
105000      2
109999      1
114400      1
115000      1
115991      1
116000      1
119500      1
119900      1
120000      2
128000      1
129000      1
130000      1
135000      1
137999      1
139997      1
145000      1
151990      1
155000      1
163500      1
163991      1
169000      1
169999      1
175000      1
180000      1
190000      1
194000      1
197000      1
198000      1
220000      1
250000      1
259000      1
265000      1
295000      1
299000      1
345000      1
350000      1
999990      1
999999      2
1234566     1
1300000     1
3890000     1
10000000    1
11111111    2
12345678    3
27322222    1
99999999    1
Name: price, dtype: int64

In [23]:

autos.loc[autos['price']>100000,['price', 'name', 'registration_year', 'vehicle_type']].sort_values('price')

Out[23]:

	price	name	registration_year	vehicle_type
29286	104900	Porsche_911_Turbo_PDK__Sport_Chrono_Garantie_D...	2011	coupe
17540	105000	Porsche_930_Turbo_classic_analytics_Gutachten_2+	1988	coupe
16964	105000	Bentley_Continental_Supersports	2010	coupe
49391	109999	Lamborghini_Gallardo_LP560_4_E_Gear_"Callisto_...	2008	coupe
22060	114400	Tesla_Model_X90D_Autopilot_Leder_AHK_Kaltwette...	2016	suv
7402	115000	Porsche_911_Carrera_4S_Cabrio_PDK__BOSE__NEU__...	2016	cabrio
21783	115991	Porsche_911_Carrera_4S_Cabriolet_PDK	2013	cabrio
33884	116000	Porsche_911_Carrera_4S_Cabriolet_PDK	2013	cabrio
38814	119500	Porsche_Porsche_964_TURBO_S_X33_WLS_ab_Werk_35...	1992	coupe
43282	119900	Porsche_911_Carrera_4S_Cabriolet_PDK	2014	cabrio
2751	120000	Porsche_911___993_4S	1998	coupe
44406	120000	Mercedes_Benz_170_S_Cabriolet_A__Startklar_fue...	1951	cabrio
8232	128000	Porsche_993_S_Schalter_BRD_neuwertig	1997	coupe
1878	129000	Porsche_911_Turbo	1995	coupe
49815	130000	SUCHE_TIPPS___Ford_Mustang_Shelby_GT_350_500_K...	1968	coupe
14268	135000	Porsche_997_GT3_Approved_Lift_Unfallfrei_Mj.2011	2010	coupe
2454	137999	Porsche_911_GT3	2010	coupe
32185	139997	Porsche_997_GT3_MKII_EIN_SAMMLERSTÜCK_IN_VOLLA...	2010	coupe
49668	145000	Porsche_911__993__Carrera_S_"SONDERMODELL"	1998	coupe
33638	151990	Porsche_911_GT3_RS	2007	coupe
10500	155000	Porsche_991	2013	coupe
45387	163500	Porsche_911_GT3__Clubsp._Lift_nur_bis_25.03._v...	2015	coupe
22673	163991	Porsche_911_991_GT3_LED_Licht_Chrono_Navi_Club...	2014	coupe
18509	169000	Porsche_991_GT3	2013	coupe
32840	169999	Mercedes_Benz_G_500_7G_TRONIC_Cabrio_Desingno_...	2012	cabrio
11433	175000	Audi_R8_V10_plus_5.2_FSI_quattro_S_tronic_Akra...	2016	coupe
17140	180000	Mercedes_Benz_SLS_AMG_Roadster_AMG_SPEEDSHIFT_DCT	2012	cabrio
20351	190000	BMW_Z8_roadster	2000	cabrio
28090	194000	Tesla_Model_X_P90D_Signature_Sondermodel__Neuw...	2016	suv
43668	197000	Porsche_993/911_Turbo_WLS_II_Exclusive_S_deuts...	1998	coupe
40918	198000	Porsche_911_991_GT3_RS	2015	coupe
37840	220000	Porsche_997	2008	coupe
38299	250000	Glas_BMW_mit_Wasser	2015	NaN
47337	259000	BMW_Z8_roadster	2001	cabrio
12682	265000	Porsche_GT3_RS__PCCB__Lift___grosser_Exklusiv_...	2016	coupe
35923	295000	Porsche_911_Targa_Exclusive_Edition__1_von_15_...	2015	cabrio
34723	299000	Porsche_Porsche_911/930_Turbo_3.0__deutsche_Au...	1977	coupe
14715	345000	Rolls_Royce_Phantom_Drophead_Coupe	2012	cabrio
36818	350000	Porsche_991	2016	coupe
37585	999990	Volkswagen_Jetta_GT	1985	limousine
43049	999999	2_VW_Busse_T3	1981	bus
514	999999	Ford_Focus_Turnier_1.6_16V_Style	2009	kombi
22947	1234566	Bmw_530d_zum_ausschlachten	1999	kombi
7814	1300000	Ferrari_F40	1992	coupe
47634	3890000	Ferrari_FXX	2006	coupe
11137	10000000	suche_maserati_3200_gt_Zustand_unwichtig_laufe...	1960	coupe
2897	11111111	Escort_MK_1_Hundeknochen_zum_umbauen_auf_RS_2000	1973	limousine
24384	11111111	Schlachte_Golf_3_gt_tdi	1995	NaN
27371	12345678	Fiat_Punto	2017	NaN
47598	12345678	Opel_Vectra_B_1_6i_16V_Facelift_Tuning_Showcar...	2001	limousine
39377	12345678	Tausche_volvo_v40_gegen_van	2018	NaN
42221	27322222	Leasinguebernahme	2014	limousine
39705	99999999	Tausch_gegen_gleichwertiges	1999	limousine

We can conclude that data enrties containing prices below 399 and above 350000 can be removed

In [24]:

autos = autos[autos['price'].between(399,351000)]
autos.sort_values('price').head()

Out[24]:

	date_crawled	name	seller	offer_type	price	abtest	vehicle_type	registration_year	gearbox	power_p_s	model	odometer_km	registration_month	fuel_type	brand	unrepaired_damage	ad_created	postal_code	last_seen
5179	2016-03-14 00:48:15	Fiat_Punto_60_Selecta_SX	privat	Angebot	399	control	kleinwagen	1997	manuell	60	punto	150000	5	benzin	fiat	nein	2016-03-13 00:00:00	53117	2016-04-07 03:16:34
6263	2016-03-30 11:52:01	BMW_316i_LEDER_UND_KLIMA_BITTE_KEINE_MAILS	privat	Angebot	399	test	kleinwagen	1998	manuell	102	3er	150000	12	benzin	bmw	nein	2016-03-30 00:00:00	10827	2016-04-03 01:21:43
45964	2016-04-05 11:57:35	Corsa_opel	privat	Angebot	399	test	NaN	2000	NaN	0	corsa	150000	3	NaN	opel	NaN	2016-04-05 00:00:00	33334	2016-04-05 12:39:37
12965	2016-03-29 01:55:57	Opel_Astra_Kombi_mit_AHK	privat	Angebot	399	control	kombi	1998	automatik	101	astra	150000	3	benzin	opel	ja	2016-03-29 00:00:00	36304	2016-04-05 17:26:08
42442	2016-03-30 15:50:26	Fiat_Punto	privat	Angebot	399	test	kleinwagen	1997	manuell	50	punto	150000	11	benzin	fiat	NaN	2016-03-30 00:00:00	48369	2016-04-07 06:44:43

In [25]:

autos.sort_values('price', ascending = False).head()

Out[25]:

	date_crawled	name	seller	offer_type	price	abtest	vehicle_type	registration_year	gearbox	power_p_s	model	odometer_km	registration_month	fuel_type	brand	unrepaired_damage	ad_created	postal_code	last_seen
36818	2016-03-27 18:37:37	Porsche_991	privat	Angebot	350000	control	coupe	2016	manuell	500	911	5000	3	benzin	porsche	nein	2016-03-27 00:00:00	70499	2016-03-27 18:37:37
14715	2016-03-30 08:37:24	Rolls_Royce_Phantom_Drophead_Coupe	privat	Angebot	345000	control	cabrio	2012	automatik	460	NaN	20000	8	benzin	sonstige_autos	nein	2016-03-30 00:00:00	73525	2016-04-07 00:16:26
34723	2016-03-23 16:37:29	Porsche_Porsche_911/930_Turbo_3.0__deutsche_Au...	privat	Angebot	299000	test	coupe	1977	manuell	260	911	100000	7	benzin	porsche	nein	2016-03-23 00:00:00	61462	2016-04-06 16:44:50
35923	2016-04-03 07:56:23	Porsche_911_Targa_Exclusive_Edition__1_von_15_...	privat	Angebot	295000	test	cabrio	2015	automatik	400	911	5000	6	benzin	porsche	nein	2016-04-03 00:00:00	74078	2016-04-03 08:56:20
12682	2016-03-28 22:48:01	Porsche_GT3_RS__PCCB__Lift___grosser_Exklusiv_...	privat	Angebot	265000	control	coupe	2016	automatik	500	911	5000	3	benzin	porsche	nein	2016-03-28 00:00:00	70193	2016-04-05 03:44:51

Remaining data entries containg only prices between 399 and 350000.

5. Exploring the Date Columns¶

In [26]:

autos['date_crawled'].str[:10].value_counts(normalize = True, dropna = False).sort_index()

Out[26]:

2016-03-05    0.025534
2016-03-06    0.014137
2016-03-07    0.036235
2016-03-08    0.033081
2016-03-09    0.032864
2016-03-10    0.032559
2016-03-11    0.032820
2016-03-12    0.037105
2016-03-13    0.015573
2016-03-14    0.036235
2016-03-15    0.034234
2016-03-16    0.029514
2016-03-17    0.031319
2016-03-18    0.012854
2016-03-19    0.034495
2016-03-20    0.038105
2016-03-21    0.037627
2016-03-22    0.032994
2016-03-23    0.032429
2016-03-24    0.028992
2016-03-25    0.031211
2016-03-26    0.032385
2016-03-27    0.031080
2016-03-28    0.034865
2016-03-29    0.033668
2016-03-30    0.033560
2016-03-31    0.031950
2016-04-01    0.033908
2016-04-02    0.035691
2016-04-03    0.038779
2016-04-04    0.036539
2016-04-05    0.013137
2016-04-06    0.003132
2016-04-07    0.001392
Name: date_crawled, dtype: float64

In [27]:

autos['ad_created'].str[:10].value_counts(normalize = True, dropna = False).sort_index()

Out[27]:

2015-06-11    0.000022
2015-08-10    0.000022
2015-09-09    0.000022
2015-11-10    0.000022
2015-12-05    0.000022
2015-12-30    0.000022
2016-01-03    0.000022
2016-01-07    0.000022
2016-01-10    0.000043
2016-01-13    0.000022
2016-01-14    0.000022
2016-01-16    0.000022
2016-01-22    0.000022
2016-01-27    0.000065
2016-01-29    0.000022
2016-02-01    0.000022
2016-02-02    0.000043
2016-02-05    0.000043
2016-02-07    0.000022
2016-02-08    0.000022
2016-02-09    0.000022
2016-02-11    0.000022
2016-02-12    0.000043
2016-02-14    0.000043
2016-02-16    0.000022
2016-02-17    0.000022
2016-02-18    0.000043
2016-02-19    0.000065
2016-02-20    0.000043
2016-02-21    0.000043
                ...   
2016-03-09    0.032951
2016-03-10    0.032298
2016-03-11    0.033125
2016-03-12    0.036887
2016-03-13    0.017052
2016-03-14    0.034799
2016-03-15    0.033995
2016-03-16    0.030014
2016-03-17    0.030971
2016-03-18    0.013463
2016-03-19    0.033407
2016-03-20    0.038214
2016-03-21    0.037844
2016-03-22    0.032777
2016-03-23    0.032255
2016-03-24    0.028949
2016-03-25    0.031363
2016-03-26    0.032407
2016-03-27    0.031015
2016-03-28    0.034908
2016-03-29    0.033690
2016-03-30    0.033386
2016-03-31    0.032015
2016-04-01    0.033864
2016-04-02    0.035386
2016-04-03    0.039019
2016-04-04    0.036909
2016-04-05    0.011853
2016-04-06    0.003219
2016-04-07    0.001240
Name: ad_created, Length: 76, dtype: float64

In [28]:

autos['last_seen'].str[:10].value_counts(normalize = True, dropna = False).sort_index()

Out[28]:

2016-03-05    0.001087
2016-03-06    0.004219
2016-03-07    0.005263
2016-03-08    0.007134
2016-03-09    0.009483
2016-03-10    0.010353
2016-03-11    0.012114
2016-03-12    0.023859
2016-03-13    0.008809
2016-03-14    0.012354
2016-03-15    0.015703
2016-03-16    0.016225
2016-03-17    0.027839
2016-03-18    0.007308
2016-03-19    0.015507
2016-03-20    0.020532
2016-03-21    0.020553
2016-03-22    0.021293
2016-03-23    0.018509
2016-03-24    0.019553
2016-03-25    0.018813
2016-03-26    0.016399
2016-03-27    0.015594
2016-03-28    0.020749
2016-03-29    0.021728
2016-03-30    0.024142
2016-03-31    0.023576
2016-04-01    0.022881
2016-04-02    0.024860
2016-04-03    0.024947
2016-04-04    0.024359
2016-04-05    0.126343
2016-04-06    0.224325
2016-04-07    0.133586
Name: last_seen, dtype: float64

We confirm that there are no missing values in the three columns, shown in the distributions above.

In [29]:

autos['registration_year'].describe()

Out[29]:

count    45978.000000
mean      2004.937427
std         88.905588
min       1000.000000
25%       1999.000000
50%       2004.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

As we saw earlier the 'registration_year' column contains some inacurate entries lower than 1900 and higher than current year.

6.Dealing with Incorrect Registration Year Data¶

We'll remove all data enries containing a registration year earlier than 1900 and later than 2020.

In [30]:

autos = autos[autos['registration_year'].between(1900, 2016)]
autos.sort_values('registration_year').head(3)

Out[30]:

	date_crawled	name	seller	offer_type	price	abtest	vehicle_type	registration_year	gearbox	power_p_s	model	odometer_km	registration_month	fuel_type	brand	unrepaired_damage	ad_created	postal_code	last_seen
28693	2016-03-22 17:48:41	Renault_Twingo	privat	Angebot	599	control	kleinwagen	1910	manuell	0	NaN	5000	0	benzin	renault	NaN	2016-03-22 00:00:00	70376	2016-04-06 09:16:59
22659	2016-03-14 08:51:18	Opel_Corsa_B	privat	Angebot	500	test	NaN	1910	NaN	0	corsa	150000	0	NaN	opel	NaN	2016-03-14 00:00:00	52393	2016-04-03 07:53:55
21416	2016-03-12 08:36:21	Essex_super_six__Ford_A	privat	Angebot	16500	control	cabrio	1927	manuell	40	andere	5000	5	benzin	ford	NaN	2016-03-12 00:00:00	74821	2016-03-15 12:45:12

In [31]:

autos.sort_values('registration_year', ascending = False).head(3)

Out[31]:

	date_crawled	name	seller	offer_type	price	abtest	vehicle_type	registration_year	gearbox	power_p_s	model	odometer_km	registration_month	fuel_type	brand	unrepaired_damage	ad_created	postal_code	last_seen
19492	2016-03-09 19:43:08	Mercedes_Benz_C_180	privat	Angebot	1700	test	NaN	2016	manuell	115	c_klasse	150000	6	benzin	mercedes_benz	nein	2016-03-09 00:00:00	6217	2016-04-06 04:46:05
2906	2016-03-09 17:57:44	Golf_4_sehr_sparsam	privat	Angebot	2200	test	NaN	2016	manuell	0	golf	150000	6	diesel	volkswagen	NaN	2016-03-09 00:00:00	27283	2016-03-14 10:44:23
13898	2016-03-16 16:57:08	Audi_A4_B5_mit_TÜV!!!	privat	Angebot	750	control	NaN	2016	manuell	174	a4	150000	12	NaN	audi	NaN	2016-03-16 00:00:00	26169	2016-03-19 15:18:16

Here is the description of our data set after removing erroneous car registration years.

In [32]:

autos.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 44161 entries, 0 to 49999
Data columns (total 20 columns):
date_crawled          44161 non-null object
name                  44161 non-null object
seller                44161 non-null object
offer_type            44161 non-null object
price                 44161 non-null int64
abtest                44161 non-null object
vehicle_type          42122 non-null object
registration_year     44161 non-null int64
gearbox               42484 non-null object
power_p_s             44161 non-null int64
model                 42267 non-null object
odometer_km           44161 non-null int64
registration_month    44161 non-null int64
fuel_type             41476 non-null object
brand                 44161 non-null object
unrepaired_damage     37037 non-null object
ad_created            44161 non-null object
nr_of_pictures        44161 non-null int64
postal_code           44161 non-null int64
last_seen             44161 non-null object
dtypes: int64(7), object(13)
memory usage: 7.1+ MB

In [33]:

autos['registration_year'].value_counts(normalize = True).sort_index()

Out[33]:

1910    0.000045
1927    0.000023
1929    0.000023
1931    0.000023
1934    0.000045
1937    0.000091
1938    0.000023
1939    0.000023
1941    0.000045
1943    0.000023
1948    0.000023
1950    0.000023
1951    0.000045
1952    0.000023
1953    0.000023
1954    0.000045
1955    0.000045
1956    0.000091
1957    0.000045
1958    0.000091
1959    0.000136
1960    0.000430
1961    0.000136
1962    0.000091
1963    0.000181
1964    0.000272
1965    0.000385
1966    0.000476
1967    0.000589
1968    0.000589
          ...   
1987    0.001562
1988    0.002966
1989    0.003533
1990    0.006454
1991    0.006839
1992    0.007292
1993    0.008333
1994    0.011956
1995    0.021512
1996    0.025950
1997    0.036661
1998    0.047599
1999    0.060823
2000    0.063495
2001    0.058174
2002    0.055230
2003    0.060415
2004    0.060868
2005    0.065239
2006    0.060393
2007    0.051448
2008    0.049954
2009    0.047055
2010    0.035937
2011    0.036616
2012    0.029596
2013    0.018048
2014    0.014809
2015    0.008265
2016    0.023188
Name: registration_year, Length: 78, dtype: float64

7. Exploring Price by Brand¶

First we are going to summarize the mean price for all brands and store the data in a dictionary called "brand_pivot"

In [34]:

brand = autos['brand'].unique()
brand_pivot = {}
for i in brand:
    brand_pivot[i] = round(autos.loc[autos['brand'] == i, 'price'].mean(),0)
brand_pivot

Out[34]:

{'alfa_romeo': 4269.0,
 'audi': 9516.0,
 'bmw': 8521.0,
 'chevrolet': 6810.0,
 'chrysler': 3567.0,
 'citroen': 3955.0,
 'dacia': 5916.0,
 'daewoo': 1207.0,
 'daihatsu': 1860.0,
 'fiat': 3152.0,
 'ford': 4167.0,
 'honda': 4296.0,
 'hyundai': 5571.0,
 'jaguar': 12130.0,
 'jeep': 11650.0,
 'kia': 6090.0,
 'lada': 2780.0,
 'lancia': 3726.0,
 'land_rover': 19108.0,
 'mazda': 4321.0,
 'mercedes_benz': 8731.0,
 'mini': 10691.0,
 'mitsubishi': 3721.0,
 'nissan': 5083.0,
 'opel': 3281.0,
 'peugeot': 3300.0,
 'porsche': 46788.0,
 'renault': 2723.0,
 'rover': 1720.0,
 'saab': 3416.0,
 'seat': 4679.0,
 'skoda': 6542.0,
 'smart': 3601.0,
 'sonstige_autos': 13317.0,
 'subaru': 4415.0,
 'suzuki': 4354.0,
 'toyota': 5250.0,
 'trabant': 2222.0,
 'volkswagen': 5688.0,
 'volvo': 5143.0}

Next we see how brands are distributed by the number of adds for each brand, sorted in descending order.

In [35]:

br_pr = autos['brand']
top_br = br_pr.value_counts(ascending = False)
top_br

Out[35]:

volkswagen        9347
bmw               5021
opel              4519
mercedes_benz     4449
audi              3963
ford              2917
renault           1982
peugeot           1299
fiat              1058
seat               799
skoda              745
mazda              673
nissan             663
smart              657
citroen            623
toyota             583
hyundai            450
sonstige_autos     424
volvo              410
mini               406
honda              349
mitsubishi         348
kia                324
alfa_romeo         296
porsche            279
chevrolet          261
suzuki             260
chrysler           159
dacia              123
jeep               106
daihatsu           101
land_rover          98
subaru              89
saab                72
jaguar              70
daewoo              59
rover               57
trabant             51
lancia              45
lada                26
Name: brand, dtype: int64

Finally we can choose what portion of the brands we want to examine more closely. Lets say we want to see the mean prices for each brand that has more than 1000 adds:

In [36]:

popular_brands = []
for i in brand_pivot:
    if top_br[i] > 1000:
        popular_brands.append((brand_pivot[i], i))
sorted(popular_brands, reverse = True)

Out[36]:

[(9516.0, 'audi'),
 (8731.0, 'mercedes_benz'),
 (8521.0, 'bmw'),
 (5688.0, 'volkswagen'),
 (4167.0, 'ford'),
 (3300.0, 'peugeot'),
 (3281.0, 'opel'),
 (3152.0, 'fiat'),
 (2723.0, 'renault')]

No surprises there - the top three most expensive brands (at least as far as mass production cars go) are Audi, Mercedes and BMW. The fourth - Volkswagen is far behind.

In [37]:

mpr_series = pd.Series(brand_pivot).sort_values(ascending = False)
mpr_series

Out[37]:

porsche           46788.0
land_rover        19108.0
sonstige_autos    13317.0
jaguar            12130.0
jeep              11650.0
mini              10691.0
audi               9516.0
mercedes_benz      8731.0
bmw                8521.0
chevrolet          6810.0
skoda              6542.0
kia                6090.0
dacia              5916.0
volkswagen         5688.0
hyundai            5571.0
toyota             5250.0
volvo              5143.0
nissan             5083.0
seat               4679.0
subaru             4415.0
suzuki             4354.0
mazda              4321.0
honda              4296.0
alfa_romeo         4269.0
ford               4167.0
citroen            3955.0
lancia             3726.0
mitsubishi         3721.0
smart              3601.0
chrysler           3567.0
saab               3416.0
peugeot            3300.0
opel               3281.0
fiat               3152.0
lada               2780.0
renault            2723.0
trabant            2222.0
daihatsu           1860.0
rover              1720.0
daewoo             1207.0
dtype: float64

In [38]:

df_mean = pd.DataFrame(mpr_series, columns = ['mean_price'])
df_mean

Out[38]:

	mean_price
porsche	46788.0
land_rover	19108.0
sonstige_autos	13317.0
jaguar	12130.0
jeep	11650.0
mini	10691.0
audi	9516.0
mercedes_benz	8731.0
bmw	8521.0
chevrolet	6810.0
skoda	6542.0
kia	6090.0
dacia	5916.0
volkswagen	5688.0
hyundai	5571.0
toyota	5250.0
volvo	5143.0
nissan	5083.0
seat	4679.0
subaru	4415.0
suzuki	4354.0
mazda	4321.0
honda	4296.0
alfa_romeo	4269.0
ford	4167.0
citroen	3955.0
lancia	3726.0
mitsubishi	3721.0
smart	3601.0
chrysler	3567.0
saab	3416.0
peugeot	3300.0
opel	3281.0
fiat	3152.0
lada	2780.0
renault	2723.0
trabant	2222.0
daihatsu	1860.0
rover	1720.0
daewoo	1207.0

In [39]:

miles_pivot = {}
for i in brand:
    miles_pivot[i] = round(autos.loc[autos['brand'] == i, 'odometer_km'].mean(),0)
miles_pivot

Out[39]:

{'alfa_romeo': 131436.0,
 'audi': 129045.0,
 'bmw': 132881.0,
 'chevrolet': 100441.0,
 'chrysler': 132736.0,
 'citroen': 118965.0,
 'dacia': 84268.0,
 'daewoo': 118220.0,
 'daihatsu': 115198.0,
 'fiat': 115369.0,
 'ford': 123615.0,
 'honda': 122135.0,
 'hyundai': 105656.0,
 'jaguar': 125429.0,
 'jeep': 127123.0,
 'kia': 111836.0,
 'lada': 84808.0,
 'lancia': 119889.0,
 'land_rover': 118010.0,
 'mazda': 124264.0,
 'mercedes_benz': 131068.0,
 'mini': 88682.0,
 'mitsubishi': 125690.0,
 'nissan': 117526.0,
 'opel': 128522.0,
 'peugeot': 126532.0,
 'porsche': 98477.0,
 'renault': 126975.0,
 'rover': 136579.0,
 'saab': 144028.0,
 'seat': 120776.0,
 'skoda': 110792.0,
 'smart': 99658.0,
 'sonstige_autos': 90401.0,
 'subaru': 125955.0,
 'suzuki': 106654.0,
 'toyota': 115669.0,
 'trabant': 52647.0,
 'volkswagen': 128417.0,
 'volvo': 138573.0}

In [40]:

miles_series = pd.Series(miles_pivot, name = 'mean_km').sort_values(ascending = False)
count_series = pd.Series(top_br)

In [41]:

import pprint
df_mean ['mean_km'] = miles_series
df_mean['num_adds'] = count_series
top_6 = df_mean[df_mean['num_adds']>2000]
pprint.pprint(top_6.sort_values('mean_price', ascending = False))

               mean_price   mean_km  num_adds
audi               9516.0  129045.0      3963
mercedes_benz      8731.0  131068.0      4449
bmw                8521.0  132881.0      5021
volkswagen         5688.0  128417.0      9347
ford               4167.0  123615.0      2917
opel               3281.0  128522.0      4519

This is the summary data for the top 6 most popular brands, collected in a single data frame. As the milage column contains only 13 unique milage numbers (probably ranges and not real values) we cannot see if milage can explain the wide differene in mean prices.

9. Next Steps¶

Translating categorical data from German to English¶

Let's review the content of these columns and translate into English.

In [42]:

autos['abtest'].value_counts()

Out[42]:

test       22785
control    21376
Name: abtest, dtype: int64

No need for translation.

In [43]:

print(autos['vehicle_type'].value_counts(dropna = False))

limousine     12224
kleinwagen     9558
kombi          8672
bus            3973
cabrio         2983
coupe          2395
NaN            2039
suv            1959
andere          358
Name: vehicle_type, dtype: int64

In [44]:

map_dict_1 ={#'limousine':'limousine'
              'kleinwagen': 'small car', 
              'kombi': 'station wagon',
              'bus': 'van',
            # 'coupe':'coupe',
#             'suv':'suv',
              'cabrio':'convertible',
              'andere':'other'}

              
eng_type = autos['vehicle_type'].replace(map_dict_1, inplace = True)
autos['vehicle_type'].value_counts(dropna = False)

Out[44]:

limousine        12224
small car         9558
station wagon     8672
van               3973
convertible       2983
coupe             2395
NaN               2039
suv               1959
other              358
Name: vehicle_type, dtype: int64

In [45]:

print(autos['offer_type'].value_counts())

Angebot    44161
Name: offer_type, dtype: int64

In [46]:

autos['offer_type'].replace({'Angebot':'Offer'}, inplace = True)
autos['offer_type'].value_counts()

Out[46]:

Offer    44161
Name: offer_type, dtype: int64

In [47]:

print(autos['seller'].value_counts())

privat    44161
Name: seller, dtype: int64

In [48]:

autos['seller'].replace({'privat':'private'}, inplace = True)
autos['seller'].value_counts()

Out[48]:

private    44161
Name: seller, dtype: int64

In [49]:

print(autos['gearbox'].value_counts())

manuell      32771
automatik     9713
Name: gearbox, dtype: int64

In [50]:

autos['gearbox'].replace({'manuell':'manual', 'automatik':'automatic'}, inplace = True)
autos['gearbox'].value_counts()

Out[50]:

manual       32771
automatic     9713
Name: gearbox, dtype: int64

In [51]:

print(autos['fuel_type'].value_counts())

benzin     26774
diesel     13932
lpg          637
cng           67
hybrid        37
elektro       18
andere        11
Name: fuel_type, dtype: int64

In [52]:

autos['fuel_type'].replace({'benzin':'gasoline', 
                            'elektro':'electric',
                            'andere':'other'}, inplace = True)
autos['fuel_type'].value_counts()

Out[52]:

gasoline    26774
diesel      13932
lpg           637
cng            67
hybrid         37
electric       18
other          11
Name: fuel_type, dtype: int64

In [53]:

print(autos['unrepaired_damage'].value_counts())

nein    33251
ja       3786
Name: unrepaired_damage, dtype: int64

In [54]:

autos['unrepaired_damage'].replace({'nein':'no', 
                            'ja':'yes'}, inplace = True)
autos['unrepaired_damage'].value_counts()

Out[54]:

no     33251
yes     3786
Name: unrepaired_damage, dtype: int64

Converting date columns to integer format - yyyymmdd.¶

The tree columns, containing dates are: date_crawled; ad_created; last_seen.

In [55]:

autos['date_crawled'] = autos['date_crawled'].str[:10].str.replace('-','').astype(int)
autos['ad_created']=autos['ad_created'].str[:10].str.replace('-','').astype(int)
autos['last_seen']=autos['last_seen'].str[:10].str.replace('-','').astype(int)
autos[['date_crawled', 'ad_created', 'last_seen']].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 44161 entries, 0 to 49999
Data columns (total 3 columns):
date_crawled    44161 non-null int64
ad_created      44161 non-null int64
last_seen       44161 non-null int64
dtypes: int64(3)
memory usage: 1.3 MB

In [56]:

autos[['date_crawled', 'ad_created','last_seen']].head()

Out[56]:

	date_crawled	ad_created	last_seen
0	20160326	20160326	20160406
1	20160404	20160404	20160406
2	20160326	20160326	20160406
3	20160312	20160312	20160315
4	20160401	20160401	20160401

Exploring column 'name' for useful information¶

In [57]:

autos.loc[autos['name'].str.find('Klima')> 0,'AC'] = 'Yes'
autos.sort_values('AC', ascending = False).head()

Out[57]:

	date_crawled	name	seller	offer_type	price	abtest	vehicle_type	registration_year	gearbox	power_p_s	...	odometer_km	registration_month	fuel_type	brand	unrepaired_damage	ad_created	postal_code	last_seen	AC
3	20160312	Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...	private	Offer	4350	control	small car	2007	automatic	71	...	70000	6	gasoline	smart	no	20160312	33729	20160315	Yes
31743	20160320	BMW_116d_Klimautomatik_/_PDC_/_LM_/_Sitzheizung	private	Offer	15290	test	limousine	2013	manual	116	...	100000	11	diesel	bmw	no	20160320	28325	20160407	Yes
32412	20160305	Ford_Fusion_1_4_Klima	private	Offer	2590	control	limousine	2006	manual	80	...	125000	3	gasoline	ford	yes	20160305	61267	20160311	Yes
32371	20160326	Audi_A3_Ambition_1.6_FSI_S_Line__Xenon/SSD/Kli...	private	Offer	4990	test	small car	2006	manual	115	...	150000	11	gasoline	audi	no	20160326	41334	20160326	Yes
32350	20160306	Mercedes_Benz_SLK_230_Kompressor__Automatik__K...	private	Offer	6400	control	convertible	1999	automatic	193	...	150000	5	gasoline	mercedes_benz	no	20160306	76846	20160406	Yes

5 rows × 21 columns

We explored the text in column 'name' and if we found a string containing 'Klima' we put a value of 'Yes' in a new column called 'AC'

Finding the most common 'brand' - 'model' combinations¶

In [58]:

brand_mod = autos['brand']+'|'+autos['model']
brand_mod.value_counts().sort_values(ascending = False).head(10)

Out[58]:

volkswagen|golf           3504
bmw|3er                   2546
volkswagen|polo           1407
opel|corsa                1343
volkswagen|passat         1324
opel|astra                1260
audi|a4                   1210
mercedes_benz|c_klasse    1128
bmw|5er                   1120
mercedes_benz|e_klasse     949
dtype: int64

Checking for Patterns of Relation between Mileage and Prices¶

In [59]:

autos['odometer_km'].value_counts()

Out[59]:

150000    28133
125000     4709
100000     1994
90000      1634
80000      1356
70000      1165
60000      1110
50000       980
40000       791
30000       748
20000       713
5000        601
10000       227
Name: odometer_km, dtype: int64

It seems that mileage data has already been aggregated but lets group it further in the following buckets and then ckeck what's the mean price for each bucket:

0 - 20k;
21 - 50k;
51 - 70k;
71 - 90k;
91 - 100k;
101 - 125k
> 125k.

In [67]:

buckets = (20000, 50000, 70000, 90000, 100000, 125000)
miles_prices = {}
lim1 = 0
r = '{0}k-{1}k km'
r_final = 'above {0}k km'
for i in buckets:
    miles_prices[r.format(lim1/1000, i/1000)] = round(\
    autos.loc[autos['odometer_km'].between(lim1, i),'price'].mean(), 0)
    lim1 = i + 1000
miles_prices[r_final.format(lim1/1000)] = round(\
autos.loc[autos['odometer_km']>lim1,'price'].mean(),0)
m_p_df = pd.Series(miles_prices)
m_p_df.sort_values()
pd.DataFrame(m_p_df, columns = ['Mean price']).sort_values('Mean price')

Out[67]:

	Mean price
above 126.0k km	4013.0
101.0k-125.0k km	6403.0
91.0k-100.0k km	8388.0
71.0k-90.0k km	9203.0
51.0k-70.0k km	11838.0
21.0k-50.0k km	15357.0
0.0k-20.0k km	16594.0

Of course there is no surprise here - prices are perfectly negatively correlated to mileage - the higher the mileage the lower the mean price of vehicles.

Check how cheaper damaged cars are in comparison to non_damaged ones¶

In [61]:

autos['unrepaired_damage'].value_counts()

Out[61]:

no     33251
yes     3786
Name: unrepaired_damage, dtype: int64

In [62]:

damage = {}
damage['Damaged cars'] = round(\
autos.loc[autos['unrepaired_damage'] == 'yes','price'].mean())
damage['Cars with no defects'] = round(\
autos.loc[autos['unrepaired_damage'] == 'no', 'price'].mean())
pd.Series(damage)

Out[62]:

Cars with no defects    7285.0
Damaged cars            2642.0
dtype: float64

Again - a predictable outcome! Damaged cars are cheaper on average than cars with no defects.