In this project, we'll work with a dataset of used cars from eBay Kleinanzeigen, a classifieds section of the German eBay website. The dataset can be found here: data.
Let's start by importing the libraries we need and reading the dataset into pandas.
import numpy as np
import pandas as pd
autos = pd.read_csv('autos_dq.csv', encoding='Windows-1252')
autos.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 dateCrawled 50000 non-null object 1 name 50000 non-null object 2 seller 50000 non-null object 3 offerType 50000 non-null object 4 price 50000 non-null object 5 abtest 50000 non-null object 6 vehicleType 44905 non-null object 7 yearOfRegistration 50000 non-null int64 8 gearbox 47320 non-null object 9 powerPS 50000 non-null int64 10 model 47242 non-null object 11 odometer 50000 non-null object 12 monthOfRegistration 50000 non-null int64 13 fuelType 45518 non-null object 14 brand 50000 non-null object 15 notRepairedDamage 40171 non-null object 16 dateCreated 50000 non-null object 17 nrOfPictures 50000 non-null int64 18 postalCode 50000 non-null int64 19 lastSeen 50000 non-null object dtypes: int64(5), object(15) memory usage: 4.8+ MB
autos.head()
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
A few initial observations: the column names use camelCase rather than Python's preferred snake_case, and several columns (such as price and odometer) store numeric data as text.
We will now convert column names from camelcase to snakecase and reword some of the column names to be more descriptive.
autos.columns
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model', 'odometer', 'monthOfRegistration', 'fuelType', 'brand', 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object')
autos.rename({
'dateCrawled':'date_crawled',
'offerType':'offer_type',
'vehicleType':'vehicle_type',
'yearOfRegistration':'registration_year',
'powerPS':'power_ps',
'monthOfRegistration':'registration_month',
'fuelType':'fuel_type',
'notRepairedDamage':'unrepaired_damage',
'dateCreated':'ad_created',
'nrOfPictures':'nr_of_pictures',
'postalCode':'postal_code',
'lastSeen':'last_seen'
}, axis=1, inplace=True)
autos.columns
Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest', 'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model', 'odometer', 'registration_month', 'fuel_type', 'brand', 'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code', 'last_seen'], dtype='object')
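As a side note, the renaming could also be done programmatically with a regular expression instead of an explicit mapping. A minimal sketch (the function name and demo frame are illustrative, not part of the project): note that a naive regex turns consecutive capitals like powerPS into power_p_s rather than power_ps, which is one reason an explicit map is safer here.

```python
import re

import pandas as pd

def camel_to_snake(name: str) -> str:
    # Insert an underscore before each uppercase letter (except at the
    # start of the string), then lowercase everything.
    return re.sub(r'(?<!^)(?=[A-Z])', '_', name).lower()

# Hypothetical mini-frame with a subset of the original column names
demo = pd.DataFrame(columns=['dateCrawled', 'offerType', 'powerPS'])
demo.columns = [camel_to_snake(c) for c in demo.columns]
```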
autos.head()
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
autos.describe(include='all')
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-04-02 11:37:04 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
After this initial review, a few columns contain mostly a single value and can likely be dropped from further analysis: seller and offer_type. (The nr_of_pictures column is also constant, all zeros, and could be dropped as well.)
autos = autos.drop(columns = ['seller', 'offer_type'])
We also investigate a few more columns.
autos.model.value_counts()
golf 4024 andere 3528 3er 2761 polo 1757 corsa 1735 ... kalina 2 i3 1 b_max 1 200 1 rangerover 1 Name: model, Length: 245, dtype: int64
autos.brand.value_counts()
volkswagen 10687 opel 5461 bmw 5429 mercedes_benz 4734 audi 4283 ford 3479 renault 2404 peugeot 1456 fiat 1308 seat 941 skoda 786 mazda 757 nissan 754 citroen 701 smart 701 toyota 617 sonstige_autos 546 hyundai 488 volvo 457 mini 424 mitsubishi 406 honda 399 kia 356 alfa_romeo 329 porsche 294 suzuki 293 chevrolet 283 chrysler 181 dacia 129 daihatsu 128 jeep 110 subaru 109 land_rover 99 saab 80 daewoo 79 trabant 78 jaguar 77 rover 69 lancia 57 lada 31 Name: brand, dtype: int64
The data in the brand and model columns seems clean. Now let's explore the price column.
autos.price.describe()
count 50000 unique 2357 top $0 freq 1421 Name: price, dtype: object
autos.price.value_counts()
$0 1421 $500 781 $1,500 734 $2,500 643 $1,000 639 ... $3,425 1 $3,129 1 $1,960 1 $369 1 $14,790 1 Name: price, Length: 2357, dtype: int64
The price column stores its values as text. Let's remove the non-numeric characters (the $ sign and the thousands separator), convert the result to integers, and rename the column to reflect the currency.
autos["price"] = autos["price"].str.replace('$', '', regex=False)
autos["price"] = autos["price"].str.replace(',', '', regex=False).astype(int)
autos.rename({"price":"price_usd"}, axis=1, inplace=True)
The same applies to the odometer column. Let's quickly repeat the operation for this column and also change its name.
autos.odometer.value_counts()
150,000km 32424 125,000km 5170 100,000km 2169 90,000km 1757 80,000km 1436 70,000km 1230 60,000km 1164 50,000km 1027 5,000km 967 40,000km 819 30,000km 789 20,000km 784 10,000km 264 Name: odometer, dtype: int64
autos["odometer"] = autos["odometer"].str.replace('km', '', regex=False)
autos["odometer"] = autos["odometer"].str.replace(',', '', regex=False).astype(int)
autos.rename({"odometer":"odometer_km"}, axis=1, inplace=True)
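Since price and odometer needed the same two-step cleanup, it could be factored into a small helper. A sketch (the function name and demo series are illustrative assumptions); note regex=False, so the $ sign is treated literally rather than as a regex end-of-string anchor:

```python
import pandas as pd

def strip_to_int(series: pd.Series, chars: str) -> pd.Series:
    # Remove each of the given characters literally, then cast to int.
    for ch in chars:
        series = series.str.replace(ch, '', regex=False)
    return series.astype(int)

# Hypothetical sample values in the dataset's price format
demo = pd.Series(['$5,000', '$8,500'])
cleaned = strip_to_int(demo, '$,')
```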
Let's continue exploring the data, specifically looking for data that doesn't look right. We'll analyze the odometer_km and price_usd columns and look for the outliers.
autos.price_usd.value_counts().sort_index(ascending=False).head(20)
99999999 1 27322222 1 12345678 3 11111111 2 10000000 1 3890000 1 1300000 1 1234566 1 999999 2 999990 1 350000 1 345000 1 299000 1 295000 1 265000 1 259000 1 250000 1 220000 1 198000 1 197000 1 Name: price_usd, dtype: int64
autos.price_usd.value_counts().sort_index(ascending=True).head(30)
0 1421 1 156 2 3 3 1 5 2 8 1 9 1 10 7 11 2 12 3 13 2 14 1 15 2 17 3 18 1 20 4 25 5 29 1 30 7 35 1 40 6 45 4 47 1 49 4 50 49 55 2 59 1 60 9 65 5 66 1 Name: price_usd, dtype: int64
autos.loc[autos['price_usd'] == 350000]
date_crawled | name | price_usd | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
36818 | 2016-03-27 18:37:37 | Porsche_991 | 350000 | control | coupe | 2016 | manuell | 500 | 911 | 5000 | 3 | benzin | porsche | nein | 2016-03-27 00:00:00 | 0 | 70499 | 2016-03-27 18:37:37 |
It seems the price column includes many outliers that should be removed. Entries priced in the tens of millions of dollars are clearly unrealistic, so we'll keep only prices below $350,000 (we checked the $350,000 and $345,000 listings, a Porsche and a Rolls-Royce, which were the last ones that still looked rational).
The same applies to any car listed for $50 or less. Let's remove both groups!
autos = autos[(autos["price_usd"] > 50) & (autos["price_usd"] < 350000)]
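Instead of hand-picked bounds, outliers could also be trimmed with a quantile-based rule. A sketch with hypothetical price values (the cutoffs 1% and 99% are an assumption, not what the project used):

```python
import pandas as pd

# Hypothetical prices with implausible values at both ends
prices = pd.Series([0, 1, 50, 500, 1500, 2500, 5000, 99999999])

# Keep only values between the 1st and 99th percentiles (inclusive)
low, high = prices.quantile([0.01, 0.99])
trimmed = prices[prices.between(low, high)]
```

A fixed-bound filter is easier to justify here, though, because the sanity limits ($50 and $350,000) come from inspecting actual listings rather than from the distribution itself.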
Now let's do the same exercise with the odometer_km column.
autos.odometer_km.value_counts().sort_index(ascending=False).head(20)
150000 31257 125000 5045 100000 2103 90000 1733 80000 1413 70000 1214 60000 1153 50000 1011 40000 815 30000 777 20000 757 10000 248 5000 771 Name: odometer_km, dtype: int64
autos.odometer_km.value_counts().sort_index(ascending=True).head(30)
5000 771 10000 248 20000 757 30000 777 40000 815 50000 1011 60000 1153 70000 1214 80000 1413 90000 1733 100000 2103 125000 5045 150000 31257 Name: odometer_km, dtype: int64
It seems the odometer_km column does not need further cleaning: the values look plausible and contain no obvious outliers.
autos.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 48297 entries, 0 to 49999 Data columns (total 18 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 date_crawled 48297 non-null object 1 name 48297 non-null object 2 price_usd 48297 non-null int32 3 abtest 48297 non-null object 4 vehicle_type 43842 non-null object 5 registration_year 48297 non-null int64 6 gearbox 46061 non-null object 7 power_ps 48297 non-null int64 8 model 45889 non-null object 9 odometer_km 48297 non-null int32 10 registration_month 48297 non-null int64 11 fuel_type 44390 non-null object 12 brand 48297 non-null object 13 unrepaired_damage 39373 non-null object 14 ad_created 48297 non-null object 15 nr_of_pictures 48297 non-null int64 16 postal_code 48297 non-null int64 17 last_seen 48297 non-null object dtypes: int32(2), int64(5), object(11) memory usage: 4.6+ MB
There are 5 columns that should represent date values. Some of these columns were created by the crawler, some came from the website itself. We can differentiate by referring to the data dictionary:

- date_crawled: added by the crawler
- last_seen: added by the crawler
- ad_created: from the website
- registration_month: from the website
- registration_year: from the website

Right now, the date_crawled, last_seen, and ad_created columns are all identified as string values by pandas. Because these three columns are represented as strings, we need to convert the data into a numerical representation so we can understand it quantitatively. The other two columns are represented as numeric values, so we can use methods like Series.describe() to understand the distribution without any extra data processing.
autos[['date_crawled','ad_created','last_seen']][0:5]
date_crawled | ad_created | last_seen | |
---|---|---|---|
0 | 2016-03-26 17:47:46 | 2016-03-26 00:00:00 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | 2016-04-04 00:00:00 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | 2016-03-26 00:00:00 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | 2016-03-12 00:00:00 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | 2016-04-01 00:00:00 | 2016-04-01 14:38:50 |
autos['date_crawled'].str[:10].value_counts(normalize=True, dropna=False).sort_index()
2016-03-05 0.025364 2016-03-06 0.014080 2016-03-07 0.036006 2016-03-08 0.033253 2016-03-09 0.033046 2016-03-10 0.032238 2016-03-11 0.032569 2016-03-12 0.036980 2016-03-13 0.015674 2016-03-14 0.036648 2016-03-15 0.034288 2016-03-16 0.029464 2016-03-17 0.031534 2016-03-18 0.012879 2016-03-19 0.034723 2016-03-20 0.037787 2016-03-21 0.037269 2016-03-22 0.032921 2016-03-23 0.032300 2016-03-24 0.029464 2016-03-25 0.031534 2016-03-26 0.032279 2016-03-27 0.031079 2016-03-28 0.034930 2016-03-29 0.034122 2016-03-30 0.033687 2016-03-31 0.031865 2016-04-01 0.033750 2016-04-02 0.035572 2016-04-03 0.038553 2016-04-04 0.036524 2016-04-05 0.013065 2016-04-06 0.003168 2016-04-07 0.001387 Name: date_crawled, dtype: float64
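The string-slicing approach works because the timestamps share a fixed format. Alternatively, the columns could be parsed into real datetimes first; a sketch with a hypothetical two-row sample in the dataset's format:

```python
import pandas as pd

# Hypothetical sample of crawl timestamps in the dataset's format
crawled = pd.Series(['2016-03-26 17:47:46', '2016-04-04 13:38:56'])

# Parse to datetimes, keep only the calendar date, then compute shares
dates = pd.to_datetime(crawled).dt.date
share = dates.value_counts(normalize=True).sort_index()
```

Parsed datetimes would also allow arithmetic, such as computing how long each ad stayed online from date_crawled to last_seen.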
The most popular crawl date was 2016-04-03, with a 3.86% share.
autos['ad_created'].str[:10].value_counts(normalize=True, dropna=False).sort_index()
2015-06-11 0.000021 2015-08-10 0.000021 2015-09-09 0.000021 2015-11-10 0.000021 2015-12-05 0.000021 ... 2016-04-03 0.038802 2016-04-04 0.036897 2016-04-05 0.011781 2016-04-06 0.003251 2016-04-07 0.001242 Name: ad_created, Length: 76, dtype: float64
The most popular ad-creation date was 2016-04-03, with a 3.88% share.
autos['last_seen'].str[:10].value_counts(normalize=True, dropna=False).sort_index()
2016-03-05 0.001077 2016-03-06 0.004327 2016-03-07 0.005425 2016-03-08 0.007309 2016-03-09 0.009607 2016-03-10 0.010622 2016-03-11 0.012402 2016-03-12 0.023749 2016-03-13 0.008883 2016-03-14 0.012651 2016-03-15 0.015881 2016-03-16 0.016440 2016-03-17 0.028076 2016-03-18 0.007330 2016-03-19 0.015777 2016-03-20 0.020643 2016-03-21 0.020581 2016-03-22 0.021368 2016-03-23 0.018593 2016-03-24 0.019794 2016-03-25 0.019173 2016-03-26 0.016688 2016-03-27 0.015529 2016-03-28 0.020829 2016-03-29 0.022320 2016-03-30 0.024722 2016-03-31 0.023811 2016-04-01 0.022879 2016-04-02 0.024908 2016-04-03 0.025115 2016-04-04 0.024515 2016-04-05 0.125018 2016-04-06 0.221794 2016-04-07 0.132161 Name: last_seen, dtype: float64
The most common last-seen date was 2016-04-06, with a 22.2% share. The last three days of the crawl account for nearly half of all last-seen values, which likely reflects the end of the crawling period rather than a spike in sales.
After exploring ad crawl, ad creation and last seen columns, let's explore registration_year column.
autos.registration_year.describe()
count 48297.000000 mean 2004.724807 std 87.831495 min 1000.000000 25% 1999.000000 50% 2004.000000 75% 2008.000000 max 9999.000000 Name: registration_year, dtype: float64
The registration_year column includes some odd values, with minimum value at 1000 (before cars were invented) and maximum value 9999 (many years into the future).
Because a car can't be first registered after the listing was seen, any vehicle with a registration year above 2016 is definitely inaccurate. Determining the earliest valid year is more difficult. Realistically, it could be somewhere in the first few decades of the 1900s.
autos = autos[autos['registration_year'].between(1900,2016)]
autos.registration_year.describe()
count 46423.000000 mean 2002.935592 std 7.127037 min 1910.000000 25% 1999.000000 50% 2003.000000 75% 2008.000000 max 2016.000000 Name: registration_year, dtype: float64
autos.registration_year.value_counts(normalize=True)
2000 0.067079 2005 0.062835 1999 0.062060 2004 0.058139 2003 0.058053 ... 1938 0.000022 1939 0.000022 1953 0.000022 1943 0.000022 1952 0.000022 Name: registration_year, Length: 78, dtype: float64
After further cleaning and removing any cars claimed to be registered before 1900 or after 2016, we have 46423 remaining cars.
The most common registration year is 2000, covering 6.7% of the cars.
Now let's explore variations across different car brands using aggregation.
autos['brand'].describe()
count 46423 unique 40 top volkswagen freq 9809 Name: brand, dtype: object
autos['brand'].value_counts()
volkswagen 9809 bmw 5109 opel 4981 mercedes_benz 4484 audi 4024 ford 3244 renault 2187 peugeot 1388 fiat 1194 seat 848 skoda 763 nissan 712 mazda 709 smart 658 citroen 653 toyota 593 hyundai 465 sonstige_autos 443 volvo 423 mini 408 mitsubishi 380 honda 365 kia 329 alfa_romeo 309 porsche 280 suzuki 275 chevrolet 264 chrysler 164 dacia 123 daihatsu 117 jeep 106 subaru 98 land_rover 98 saab 77 jaguar 72 daewoo 69 trabant 64 rover 62 lancia 49 lada 27 Name: brand, dtype: int64
We will select the 10 most common brands for further analysis.
top_brands = autos['brand'].value_counts().index[:10]
top_brands
Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford', 'renault', 'peugeot', 'fiat', 'seat'], dtype='object')
Now we will aggregate the data using a loop and a dictionary. The process: loop over the selected brands, subset the dataframe to each brand's rows, compute the mean of the column of interest, and store the result in a dictionary keyed by brand.
We will use this process to calculate the mean price for the top 10 brands in the dataset.
mean_price_brand = {}
for b in top_brands:
brand_row = autos[autos['brand']==b]
mean = brand_row['price_usd'].mean()
mean_price_brand[b]=mean
mean_price_brand
{'volkswagen': 5431.494036089306, 'bmw': 8378.434918770796, 'opel': 2999.6338084721942, 'mercedes_benz': 8664.980151650312, 'audi': 9376.088469184891, 'ford': 3771.270961775586, 'renault': 2490.5665294924556, 'peugeot': 3105.0886167146973, 'fiat': 2820.7747068676717, 'seat': 4423.116745283019}
dict(sorted(mean_price_brand.items(), key=lambda item: item[1])) #sorting the dictionary by value
{'renault': 2490.5665294924556, 'fiat': 2820.7747068676717, 'opel': 2999.6338084721942, 'peugeot': 3105.0886167146973, 'ford': 3771.270961775586, 'seat': 4423.116745283019, 'volkswagen': 5431.494036089306, 'bmw': 8378.434918770796, 'mercedes_benz': 8664.980151650312, 'audi': 9376.088469184891}
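The same aggregation can be done without an explicit loop using pandas groupby. A sketch with a hypothetical mini-dataset (the demo values are made up for illustration):

```python
import pandas as pd

# Hypothetical mini-dataset with the same column names
demo = pd.DataFrame({
    'brand': ['bmw', 'bmw', 'opel', 'opel'],
    'price_usd': [8000, 9000, 2500, 3500],
})

# Group rows by brand, average the price, and sort descending
mean_price = demo.groupby('brand')['price_usd'].mean().sort_values(ascending=False)
```

On the real dataset, `autos.groupby('brand')['price_usd'].mean()` would produce the same numbers as the dictionary loop, already as a Series with the brand as its index.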
We aggregated across brands to understand the mean price of the cars. Among the top 10 (most frequent) brands, audi has the highest average price (9,376 USD), followed by mercedes_benz (8,664 USD) and bmw (8,378 USD).
ford and seat were less expensive, with volkswagen in between.
For the top 10 brands, let's use aggregation to compute the average mileage and check whether there is any visible link with the mean price. To do so, we'll convert both dictionaries into series objects, combine them into a single dataframe (with a shared index), and display the dataframe directly.
autos.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 46423 entries, 0 to 49999 Data columns (total 18 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 date_crawled 46423 non-null object 1 name 46423 non-null object 2 price_usd 46423 non-null int32 3 abtest 46423 non-null object 4 vehicle_type 43840 non-null object 5 registration_year 46423 non-null int64 6 gearbox 44418 non-null object 7 power_ps 46423 non-null int64 8 model 44279 non-null object 9 odometer_km 46423 non-null int32 10 registration_month 46423 non-null int64 11 fuel_type 43224 non-null object 12 brand 46423 non-null object 13 unrepaired_damage 38287 non-null object 14 ad_created 46423 non-null object 15 nr_of_pictures 46423 non-null int64 16 postal_code 46423 non-null int64 17 last_seen 46423 non-null object dtypes: int32(2), int64(5), object(11) memory usage: 4.4+ MB
mean_mileage_brand = {}
for b in top_brands:
brand_row = autos[autos['brand']==b]
mean = brand_row['odometer_km'].mean()
mean_mileage_brand[b]=mean
mean_mileage_brand
{'volkswagen': 128781.73106330921, 'bmw': 132673.71305539244, 'opel': 129415.77996386267, 'mercedes_benz': 130951.15967885817, 'audi': 129249.50298210735, 'ford': 124257.09001233046, 'renault': 128331.04709647919, 'peugeot': 127193.80403458214, 'fiat': 117081.23953098827, 'seat': 121303.06603773584}
We calculated the mean mileage for the top 10 brands and stored the results in a dictionary. Now we will convert both dictionaries into series objects, using the series constructor.
mpb_series = pd.Series(mean_price_brand)
print(mpb_series)
volkswagen 5431.494036 bmw 8378.434919 opel 2999.633808 mercedes_benz 8664.980152 audi 9376.088469 ford 3771.270962 renault 2490.566529 peugeot 3105.088617 fiat 2820.774707 seat 4423.116745 dtype: float64
mmb_series = pd.Series(mean_mileage_brand)
print(mmb_series)
volkswagen 128781.731063 bmw 132673.713055 opel 129415.779964 mercedes_benz 130951.159679 audi 129249.502982 ford 124257.090012 renault 128331.047096 peugeot 127193.804035 fiat 117081.239531 seat 121303.066038 dtype: float64
Now we will create a dataframe from the first series object using the dataframe constructor. Then we will assign the other series as a new column in this dataframe.
price_miles_df = pd.DataFrame(mpb_series, columns=['mean_price'])
price_miles_df['mean_mileage'] = mmb_series
price_miles_df
mean_price | mean_mileage | |
---|---|---|
volkswagen | 5431.494036 | 128781.731063 |
bmw | 8378.434919 | 132673.713055 |
opel | 2999.633808 | 129415.779964 |
mercedes_benz | 8664.980152 | 130951.159679 |
audi | 9376.088469 | 129249.502982 |
ford | 3771.270962 | 124257.090012 |
renault | 2490.566529 | 128331.047096 |
peugeot | 3105.088617 | 127193.804035 |
fiat | 2820.774707 | 117081.239531 |
seat | 4423.116745 | 121303.066038 |
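The two-step construction above (DataFrame from one series, then assigning the other) can also be collapsed into a single constructor call. A sketch with a two-brand excerpt (the rounded values are taken from the outputs above for illustration only):

```python
import pandas as pd

# Two-brand excerpt of the aggregated results (rounded, illustrative)
mean_price = pd.Series({'bmw': 8378.4, 'opel': 2999.6})
mean_mileage = pd.Series({'bmw': 132673.7, 'opel': 129415.8})

# Both series share the brand index, so the constructor aligns them by label
df = pd.DataFrame({'mean_price': mean_price, 'mean_mileage': mean_mileage})
```

Because alignment is done by index label, the series don't even need to be in the same order for the rows to match up correctly.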
price_miles_df.describe()
mean_price | mean_mileage | |
---|---|---|
count | 10.000000 | 10.000000 |
mean | 5146.144894 | 126923.813346 |
std | 2676.018160 | 4737.382400 |
min | 2490.566529 | 117081.239531 |
25% | 3025.997511 | 124991.268518 |
50% | 4097.193854 | 128556.389080 |
75% | 7641.699698 | 129374.210718 |
max | 9376.088469 | 132673.713055 |
price_miles_df['mean_price'].corr(price_miles_df['mean_mileage'])
0.5618561436226072
There are no huge differences in average mileage across the top 10 brands, with the standard deviation below 5,000 km. The correlation between mean mileage and mean price is positive but only moderate (about 0.56), so mileage differences do not clearly explain the price differences.
In this project, we mainly performed data cleaning steps on the eBay cars dataset. As for the analysis itself, among the top 10 brands, three stood out with the highest average prices (audi, mercedes_benz, and bmw). Mean mileage varied little across brands and did not clearly drive the average price of the cars.