The main goal of this project is to clean and analyze used car listings scraped from the classifieds section of the German eBay website and uploaded to Kaggle. The dataset we will use was modified by the Dataquest team, since this project is part of the requirements for the Dataquest Data Scientist certificate.
In this sense, our project is structured as a sequence of steps: first, we import the libraries needed for data analysis, pandas and NumPy, and open the data file. Next, we analyze each column of the dataset in depth, looking for patterns. Moving forward, we drop some inaccurate records and some outliers to make the analysis easier and clearer. After this quick cleaning process, we move on to the data analysis itself and make some observations about the dataset.
As mentioned above, we first need to import two important libraries for data analysis: pandas and NumPy. Next, we read the dataset and display it.
import pandas as pd
import numpy as np
autos = pd.read_csv('autos.csv', encoding = 'Latin-1')
# Verifying the dataset
autos
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
49995 | 2016-03-27 14:38:19 | Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon | privat | Angebot | $24,900 | control | limousine | 2011 | automatik | 239 | q5 | 100,000km | 1 | diesel | audi | nein | 2016-03-27 00:00:00 | 0 | 82131 | 2016-04-01 13:47:40 |
49996 | 2016-03-28 10:50:25 | Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... | privat | Angebot | $1,980 | control | cabrio | 1996 | manuell | 75 | astra | 150,000km | 5 | benzin | opel | nein | 2016-03-28 00:00:00 | 0 | 44807 | 2016-04-02 14:18:02 |
49997 | 2016-04-02 14:44:48 | Fiat_500_C_1.2_Dualogic_Lounge | privat | Angebot | $13,200 | test | cabrio | 2014 | automatik | 69 | 500 | 5,000km | 11 | benzin | fiat | nein | 2016-04-02 00:00:00 | 0 | 73430 | 2016-04-04 11:47:27 |
49998 | 2016-03-08 19:25:42 | Audi_A3_2.0_TDI_Sportback_Ambition | privat | Angebot | $22,900 | control | kombi | 2013 | manuell | 150 | a3 | 40,000km | 11 | diesel | audi | nein | 2016-03-08 00:00:00 | 0 | 35683 | 2016-04-05 16:45:07 |
49999 | 2016-03-14 00:42:12 | Opel_Vectra_1.6_16V | privat | Angebot | $1,250 | control | limousine | 1996 | manuell | 101 | vectra | 150,000km | 1 | benzin | opel | nein | 2016-03-13 00:00:00 | 0 | 45897 | 2016-04-06 21:18:48 |
50000 rows × 20 columns
# Checking information about each column
autos.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 dateCrawled 50000 non-null object 1 name 50000 non-null object 2 seller 50000 non-null object 3 offerType 50000 non-null object 4 price 50000 non-null object 5 abtest 50000 non-null object 6 vehicleType 44905 non-null object 7 yearOfRegistration 50000 non-null int64 8 gearbox 47320 non-null object 9 powerPS 50000 non-null int64 10 model 47242 non-null object 11 odometer 50000 non-null object 12 monthOfRegistration 50000 non-null int64 13 fuelType 45518 non-null object 14 brand 50000 non-null object 15 notRepairedDamage 40171 non-null object 16 dateCreated 50000 non-null object 17 nrOfPictures 50000 non-null int64 18 postalCode 50000 non-null int64 19 lastSeen 50000 non-null object dtypes: int64(5), object(15) memory usage: 7.6+ MB
# printing the first few rows of the dataset
autos.head()
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
According to the information above, there are null values in 5 different columns (but none with more than 20% nulls): vehicleType, gearbox, model, fuelType and notRepairedDamage. Besides that, the values in the price and odometer columns are stored as objects (strings). We will therefore convert these values to integers in order to make data manipulation easier.
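The "none with more than 20% nulls" observation can be checked directly with `DataFrame.isnull()`. A minimal sketch on a toy frame (hypothetical values, not the real listings):

```python
import pandas as pd

# Toy frame standing in for `autos` (hypothetical values, not the real listings)
df = pd.DataFrame({
    'vehicleType': ['bus', None, 'kombi', None, 'limousine'],
    'gearbox': ['manuell', 'automatik', None, 'manuell', 'manuell'],
})

# Fraction of missing values per column; anything above 0.20 would deserve a closer look
null_share = df.isnull().mean()
print(null_share)
```

On the real dataset the same one-liner reports the shares shown by `autos.info()` above.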
Another point worth observing is that the column names are in camelCase, which is not Python's preferred convention (snake_case). That is the first issue we are going to work on.
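Instead of typing the new names entirely by hand, the camelCase-to-snake_case conversion could also be sketched with a small regex helper. Note this is only an illustration: acronyms like `powerPS` and semantic renames like `notRepairedDamage` to `unrepaired_damage` still need manual treatment, which is why the notebook assigns the full list explicitly.

```python
import re

def camel_to_snake(name):
    # Insert an underscore before each capital letter, then lowercase everything
    return re.sub(r'(?<!^)(?=[A-Z])', '_', name).lower()

print(camel_to_snake('yearOfRegistration'))  # year_of_registration
print(camel_to_snake('dateCrawled'))         # date_crawled
```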
# checking the present column names
autos.columns
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model', 'odometer', 'monthOfRegistration', 'fuelType', 'brand', 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object')
# changing the column names
autos.columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
'odometer', 'registration_month', 'fuel_type', 'brand',
'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code',
'last_seen']
# checking the modified dataset
autos.head()
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
In this step, we take a quick look at the descriptive statistics of each column to identify any odd data.
autos.describe(include = 'all')
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-03-30 17:37:35 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
# checking on seller, offer_type and nr_of_pictures columns
autos['seller'].value_counts()
privat 49999 gewerblich 1 Name: seller, dtype: int64
autos['offer_type'].value_counts()
Angebot 49999 Gesuch 1 Name: offer_type, dtype: int64
autos['nr_of_pictures'].value_counts()
0 50000 Name: nr_of_pictures, dtype: int64
The output above shows that two columns, seller and offer_type, consist almost entirely of a single value (49,999 out of 50,000 rows, to be exact). It is also possible to see that the nr_of_pictures column contains only zeros. Since these three columns add no information to our analysis, we are going to drop them.
autos.drop(labels = ['seller', 'offer_type', 'nr_of_pictures'], axis = 1, inplace = True)
# checking the modified dataset
autos.shape
(50000, 17)
autos.head()
date_crawled | name | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 39218 | 2016-04-01 14:38:50 |
As mentioned before, the price and odometer columns are stored as objects but should hold numeric values instead. We are now going to remove the non-numeric characters and convert the strings in these two columns to integers.
autos['price'].unique()
array(['$5,000', '$8,500', '$8,990', ..., '$385', '$22,200', '$16,995'], dtype=object)
autos['price'] = autos['price'].str.replace('$', '', regex=False).str.replace(',', '', regex=False)  # regex=False so '$' is treated literally, not as a regex anchor
# checking the new values
autos['price'].unique()
array(['5000', '8500', '8990', ..., '385', '22200', '16995'], dtype=object)
autos['price'] = autos['price'].astype(int)
# checking the cleaned data
autos['price'].dtype
dtype('int32')
autos['price'].head()
0 5000 1 8500 2 8990 3 4350 4 1350 Name: price, dtype: int32
autos['odometer'].unique()
array(['150,000km', '70,000km', '50,000km', '80,000km', '10,000km', '30,000km', '125,000km', '90,000km', '20,000km', '60,000km', '5,000km', '100,000km', '40,000km'], dtype=object)
autos['odometer'] = autos['odometer'].str.replace(',', '', regex=False).str.replace('km', '', regex=False)
# checking the new values
autos['odometer'].unique()
array(['150000', '70000', '50000', '80000', '10000', '30000', '125000', '90000', '20000', '60000', '5000', '100000', '40000'], dtype=object)
autos['odometer'] = autos['odometer'].astype(int)
# checking the cleaned data
autos['odometer'].head()
0 150000 1 150000 2 70000 3 70000 4 150000 Name: odometer, dtype: int32
To finish this first cleaning pass, we rename the price and odometer columns to price_dollar and odometer_km, making their units explicit.
autos.rename({'price': 'price_dollar', 'odometer': 'odometer_km'}, axis = 1, inplace = True)
# checking our modified dataset
autos.head()
date_crawled | name | price_dollar | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | 5000 | control | bus | 2004 | manuell | 158 | andere | 150000 | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | 8500 | control | limousine | 1997 | automatik | 286 | 7er | 150000 | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | 8990 | test | limousine | 2009 | manuell | 102 | golf | 70000 | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | 4350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70000 | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | 1350 | test | kombi | 2003 | manuell | 0 | focus | 150000 | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 39218 | 2016-04-01 14:38:50 |
autos[['price_dollar', 'odometer_km']].info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 2 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 price_dollar 50000 non-null int32 1 odometer_km 50000 non-null int32 dtypes: int32(2) memory usage: 390.8 KB
In this section, we look for outliers in the price_dollar and odometer_km columns so we can drop them and keep the analysis clean. To do so, we will use the Series.unique(), Series.describe() and Series.value_counts() methods on each column.
autos['price_dollar'].unique().shape
(2357,)
According to the output above, the price_dollar column has 2,357 unique values.
autos[['price_dollar']].describe()
price_dollar | |
---|---|
count | 5.000000e+04 |
mean | 9.840044e+03 |
std | 4.811044e+05 |
min | 0.000000e+00 |
25% | 1.100000e+03 |
50% | 2.950000e+03 |
75% | 7.200000e+03 |
max | 1.000000e+08 |
We can see in the output above that the minimum price in our dataset is $0 and the maximum is $100,000,000, which is clearly unrealistic. After a quick look at the eBay website, we found that there is no significant number of cars costing more than one million dollars, and that most of the listings priced under $50 are actually donation or "make an offer" ads.
autos['price_dollar'].value_counts().sort_index(ascending = True).head(50)
0 1421 1 156 2 3 3 1 5 2 8 1 9 1 10 7 11 2 12 3 13 2 14 1 15 2 17 3 18 1 20 4 25 5 29 1 30 7 35 1 40 6 45 4 47 1 49 4 50 49 55 2 59 1 60 9 65 5 66 1 70 10 75 5 79 1 80 15 89 1 90 5 99 19 100 134 110 3 111 2 115 2 117 1 120 39 122 1 125 8 129 1 130 15 135 1 139 1 140 9 Name: price_dollar, dtype: int64
After this quick analysis, we are now going to remove the outliers (rows whose price is below $50 or above $1,000,000).
# before removal
autos['price_dollar'].shape
(50000,)
autos[['price_dollar']].info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 1 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 price_dollar 50000 non-null int32 dtypes: int32(1) memory usage: 195.4 KB
autos = autos[autos['price_dollar'].between(50, 1000000)] # code to remove the outliers
# after removal
autos['price_dollar'].shape
(48350,)
autos['odometer_km'].unique()
array([150000, 70000, 50000, 80000, 10000, 30000, 125000, 90000, 20000, 60000, 5000, 100000, 40000])
autos['odometer_km'].describe()
count 48350.000000 mean 125876.732161 std 39617.575951 min 5000.000000 25% 125000.000000 50% 150000.000000 75% 150000.000000 max 150000.000000 Name: odometer_km, dtype: float64
autos['odometer_km'].value_counts()
150000 31290 125000 5046 100000 2105 90000 1733 80000 1414 70000 1215 60000 1153 50000 1011 40000 815 5000 782 30000 778 20000 759 10000 249 Name: odometer_km, dtype: int64
As we can see from the output above, there are no apparent outliers in the odometer_km column, since it is entirely plausible to find cars that have been driven 150,000 km. Therefore, we will leave the column as it is.
autos[['date_crawled', 'last_seen', 'ad_created', 'registration_month', 'registration_year']].head()
date_crawled | last_seen | ad_created | registration_month | registration_year | |
---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | 2016-04-06 06:45:54 | 2016-03-26 00:00:00 | 3 | 2004 |
1 | 2016-04-04 13:38:56 | 2016-04-06 14:45:08 | 2016-04-04 00:00:00 | 6 | 1997 |
2 | 2016-03-26 18:57:24 | 2016-04-06 20:15:37 | 2016-03-26 00:00:00 | 7 | 2009 |
3 | 2016-03-12 16:58:10 | 2016-03-15 03:16:28 | 2016-03-12 00:00:00 | 6 | 2007 |
4 | 2016-04-01 14:38:50 | 2016-04-01 14:38:50 | 2016-04-01 00:00:00 | 7 | 2003 |
It is possible to see that the registration_month and registration_year columns are represented as numbers, whereas date_crawled, last_seen and ad_created are stored as strings, so pandas does not yet treat them as dates. To understand the date range of each of these three string columns, we will first isolate the date portion of each value (its first 10 characters).
autos['date_crawled'].str[:10].value_counts(normalize = True, dropna = False).sort_index(ascending = True)
2016-03-05 0.025357 2016-03-06 0.014064 2016-03-07 0.036029 2016-03-08 0.033257 2016-03-09 0.033051 2016-03-10 0.032223 2016-03-11 0.032554 2016-03-12 0.036980 2016-03-13 0.015657 2016-03-14 0.036670 2016-03-15 0.034292 2016-03-16 0.029493 2016-03-17 0.031603 2016-03-18 0.012906 2016-03-19 0.034747 2016-03-20 0.037787 2016-03-21 0.037311 2016-03-22 0.032947 2016-03-23 0.032265 2016-03-24 0.029431 2016-03-25 0.031499 2016-03-26 0.032265 2016-03-27 0.031086 2016-03-28 0.034891 2016-03-29 0.034126 2016-03-30 0.033713 2016-03-31 0.031830 2016-04-01 0.033733 2016-04-02 0.035533 2016-04-03 0.038573 2016-04-04 0.036525 2016-04-05 0.013051 2016-04-06 0.003164 2016-04-07 0.001386 Name: date_crawled, dtype: float64
autos['ad_created'].str[:10].value_counts(normalize = True, dropna = False).sort_index(ascending = True)
2015-06-11 0.000021 2015-08-10 0.000021 2015-09-09 0.000021 2015-11-10 0.000021 2015-12-05 0.000021 ... 2016-04-03 0.038821 2016-04-04 0.036898 2016-04-05 0.011768 2016-04-06 0.003247 2016-04-07 0.001241 Name: ad_created, Length: 76, dtype: float64
autos['last_seen'].str[:10].value_counts(normalize = True, dropna = False).sort_index(ascending = True)
2016-03-05 0.001075 2016-03-06 0.004323 2016-03-07 0.005419 2016-03-08 0.007342 2016-03-09 0.009638 2016-03-10 0.010610 2016-03-11 0.012389 2016-03-12 0.023744 2016-03-13 0.008873 2016-03-14 0.012637 2016-03-15 0.015884 2016-03-16 0.016463 2016-03-17 0.028087 2016-03-18 0.007322 2016-03-19 0.015801 2016-03-20 0.020620 2016-03-21 0.020600 2016-03-22 0.021406 2016-03-23 0.018594 2016-03-24 0.019772 2016-03-25 0.019173 2016-03-26 0.016774 2016-03-27 0.015574 2016-03-28 0.020869 2016-03-29 0.022337 2016-03-30 0.024736 2016-03-31 0.023826 2016-04-01 0.022875 2016-04-02 0.024881 2016-04-03 0.025171 2016-04-04 0.024509 2016-04-05 0.124964 2016-04-06 0.221634 2016-04-07 0.132079 Name: last_seen, dtype: float64
Checking the three columns together, we can conclude that ads created before March 2016 are outliers, since the date_crawled and last_seen columns only contain values from March 2016 onward.
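The notebook keeps these columns as strings; if we actually wanted to filter out ads created before March 2016, a sketch with `pd.to_datetime` on toy values (hypothetical stand-ins for the real column) might look like this:

```python
import pandas as pd

# Toy ad_created values (hypothetical stand-ins for the real column)
ad_created = pd.Series(['2015-06-11 00:00:00',
                        '2016-03-26 00:00:00',
                        '2016-04-04 00:00:00'])

created = pd.to_datetime(ad_created)
mask = created >= pd.Timestamp('2016-03-01')  # keep only ads created from March 2016 on
print(mask.tolist())  # [False, True, True]
```

On the real dataset, `autos = autos[mask]` would then drop the long-running ads.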
autos['registration_year'].value_counts().sort_index(ascending = True)
1000 1 1001 1 1111 1 1800 2 1910 2 .. 5911 1 6200 1 8888 1 9000 1 9999 3 Name: registration_year, Length: 95, dtype: int64
From the code cell above, it is possible to see that we have some outliers in the registration_year column. For example, some cars appear to have been registered before 1800 (impossible, since the automobile had not yet been invented) and others after 2016, which is definitely inaccurate since a car cannot be registered after its listing was last seen. Therefore, we will drop these outliers, keeping only years between 1900 and 2016, considering that car registration began in the early decades of the 1900s.
# before removal
autos['registration_year'].shape
(48350,)
autos = autos[autos['registration_year'].between(1900, 2016)]
# after removal
autos['registration_year'].shape
(46476,)
Now we are going to calculate the distribution of the remaining registration_year values.
autos['registration_year'].value_counts(normalize = True).sort_index(ascending = True)
1910 0.000043 1927 0.000022 1929 0.000022 1931 0.000022 1934 0.000043 ... 2012 0.028165 2013 0.017235 2014 0.014244 2015 0.008262 2016 0.025970 Name: registration_year, Length: 78, dtype: float64
According to the output from the code cell above, most of the cars were registered after 1995.
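This claim can be quantified with a simple boolean mean. A sketch on hypothetical years (stand-ins for autos['registration_year']):

```python
import pandas as pd

# Hypothetical registration years standing in for autos['registration_year']
years = pd.Series([1990, 1993, 1996, 1999, 2003, 2008, 2012, 2016])

# Fraction of cars registered after 1995: comparison gives booleans, mean() their share
share_after_1995 = (years > 1995).mean()
print(share_after_1995)  # 0.75
```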
In order to identify the top brands in the brand column, we will aggregate by brand, looping over its unique values. As we can see below, there are 40 different brands in the dataset.
autos['brand'].unique().shape[0]
40
autos['brand'].value_counts(normalize = True)
volkswagen 0.211356 bmw 0.109949 opel 0.107303 mercedes_benz 0.096501 audi 0.086604 ford 0.069993 renault 0.047186 peugeot 0.029908 fiat 0.025712 seat 0.018246 skoda 0.016417 nissan 0.015320 mazda 0.015255 smart 0.014179 citroen 0.014050 toyota 0.012759 hyundai 0.010048 sonstige_autos 0.009618 volvo 0.009101 mini 0.008779 mitsubishi 0.008176 honda 0.007854 kia 0.007100 alfa_romeo 0.006649 porsche 0.006046 suzuki 0.005939 chevrolet 0.005702 chrysler 0.003529 dacia 0.002647 daihatsu 0.002517 jeep 0.002281 subaru 0.002109 land_rover 0.002109 saab 0.001657 jaguar 0.001571 daewoo 0.001485 trabant 0.001377 rover 0.001334 lancia 0.001054 lada 0.000581 Name: brand, dtype: float64
Now we are going to loop over the brand column and add the brands that account for more than 5% of the cars in the dataset to a dictionary. The main purpose of this step is to calculate the mean price for each of these top brands.
top_brands = autos['brand'].value_counts(normalize = True, ascending = False)
common_brands = top_brands[top_brands > 0.05].index
brand_mean_price = {}
for b in common_brands:
mean_price = autos[autos['brand'] == b]['price_dollar'].mean()
brand_mean_price[b] = int(mean_price)
brand_mean_price
{'volkswagen': 5627, 'bmw': 8376, 'opel': 2996, 'mercedes_benz': 8663, 'audi': 9373, 'ford': 4068}
In order to compare the mean price with the average mileage of each brand, we will build another dictionary for the mean mileage, just as we did for the mean price. Then it will be necessary to convert both dictionaries into Series and combine them into a DataFrame, using the Series and DataFrame constructors.
First, we are going to transform our previous dictionary (brand_mean_price) into a Series.
bmp_series = pd.Series(data=brand_mean_price)
bmp_series
volkswagen 5627 bmw 8376 opel 2996 mercedes_benz 8663 audi 9373 ford 4068 dtype: int64
Now we can compute the mean mileage, looping over the top brands (stored earlier in common_brands).
brand_mean_milage = {}
for b in common_brands:
mean_milage = autos[autos['brand'] == b]['odometer_km'].mean()
brand_mean_milage[b] = int(mean_milage)
brand_mean_milage
{'volkswagen': 128775, 'bmw': 132651, 'opel': 129440, 'mercedes_benz': 130923, 'audi': 129254, 'ford': 124296}
Moving forward, we can now transform the brand_mean_milage dictionary into a Series, just like bmp_series.
bmm_series = pd.Series(data=brand_mean_milage)
bmm_series
volkswagen 128775 bmw 132651 opel 129440 mercedes_benz 130923 audi 129254 ford 124296 dtype: int64
In this final step, we are going to create a DataFrame containing both the brand mean price and the brand mean mileage Series. The main goal is to make the comparison between the two easier and to see whether they are related.
mean_price_milage = pd.DataFrame(data = bmp_series, columns = ['mean_price_dollar'])
mean_price_milage
mean_price_dollar | |
---|---|
volkswagen | 5627 |
bmw | 8376 |
opel | 2996 |
mercedes_benz | 8663 |
audi | 9373 |
ford | 4068 |
Now, we must add the bmm_series as a new column of our dataframe.
mean_price_milage['mean_milage_km'] = bmm_series
mean_price_milage.sort_values('mean_price_dollar', ascending = False)
mean_price_dollar | mean_milage_km | |
---|---|---|
audi | 9373 | 129254 |
mercedes_benz | 8663 | 130923 |
bmw | 8376 | 132651 |
volkswagen | 5627 | 128775 |
ford | 4068 | 124296 |
opel | 2996 | 129440 |
Apparently, there is no direct relation between the price and the mileage of the cars when we consider just the top 6 brands: the mean mileage varies little across brands (roughly 124,000 to 133,000 km), while the mean price ranges from about $3,000 to $9,400.
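As a rough numeric check (a sketch using the mean values computed above; six aggregated points are far too few for a firm conclusion), we can compute the Pearson correlation between the two series:

```python
import pandas as pd

# Mean prices and mileages of the six top brands, taken from the tables above
bmp = pd.Series({'volkswagen': 5627, 'bmw': 8376, 'opel': 2996,
                 'mercedes_benz': 8663, 'audi': 9373, 'ford': 4068})
bmm = pd.Series({'volkswagen': 128775, 'bmw': 132651, 'opel': 129440,
                 'mercedes_benz': 130923, 'audi': 129254, 'ford': 124296})

# Pearson correlation across the six brands (pandas aligns the series by label)
r = bmp.corr(bmm)
print(round(r, 2))
```

Whatever moderate value comes out here mostly reflects that ford sits low and bmw high on both axes; with only six points it does not overturn the observation above.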