This dataset was originally scraped as a compilation of data on used cars from eBay Kleinanzeigen, a classifieds section of the German eBay website.
With a few modifications, the dataset in use here contains 50,000 data points sampled from that originally web-scraped dataset.
The aim of this project is to clean the data and analyze the included used car listings.
Column name | Description |
---|---|
"dateCrawled" | When this ad was first crawled. All field-values are taken from this date. |
"name" | Name of the car |
"seller" | Whether the seller is private or a dealer. |
"offerType" | The type of listing |
"price" | The price on the ad to sell the car. |
"abtest" | Whether the listing is included in an A/B test. |
"vehicleType" | The vehicle type |
"yearOfRegistration" | The year in which the car was first registered |
"gearbox" | The transmission type |
"powerPS" | The power of the car in PS |
"model" | The car model name |
"kilometer" | How many kilometers the car has driven. |
"monthOfRegistration" | The month in which the car was first registered |
"fuelType" | What type of fuel the car uses |
"brand" | The brand of the car |
"notRepairedDamage" | Whether the car has damage that is not yet repaired |
"dateCreated" | The date on which the eBay listing was created |
"nrOfPictures" | The number of pictures in the ad |
"postalCode" | The postal code for the location of the vehicle |
"lastSeen" | When the crawler last saw this ad online. |
import numpy as np
import pandas as pd
autos = pd.read_csv('autos.csv')
# A Brief look at the dataset as loaded into pandas
autos.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offerType            50000 non-null  object
 4   price                50000 non-null  object
 5   abtest               50000 non-null  object
 6   vehicleType          44905 non-null  object
 7   yearOfRegistration   50000 non-null  int64 
 8   gearbox              47320 non-null  object
 9   powerPS              50000 non-null  int64 
 10  model                47242 non-null  object
 11  odometer             50000 non-null  object
 12  monthOfRegistration  50000 non-null  int64 
 13  fuelType             45518 non-null  object
 14  brand                50000 non-null  object
 15  notRepairedDamage    40171 non-null  object
 16  dateCreated          50000 non-null  object
 17  nrOfPictures         50000 non-null  int64 
 18  postalCode           50000 non-null  int64 
 19  lastSeen             50000 non-null  object
dtypes: int64(5), object(15)
memory usage: 7.6+ MB
autos.head()
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
Column name | Description |
---|---|
"dateCrawled" | Stored as an 'object' (string) type instead of a datetime type |
"name" | The names are hard to read as they are joined with underscores and mixed with extra listing details |
"seller" | Values are in German ('privat') |
"offerType" | NA |
"price" | Stored as an object (string) instead of a numeric type |
"abtest" | NA |
"vehicleType" | Contains null values & column should be renamed |
"yearOfRegistration" | Contains implausible years that need validation & column should be renamed |
"gearbox" | Values are in German ('manuell', 'automatik') & contains null values |
"powerPS" | NA |
"model" | Contains null values |
"kilometer" | Named 'odometer' in the dataset & values are stored as strings (e.g. '150,000km') instead of a numeric type |
"monthOfRegistration" | Column should be renamed |
"fuelType" | Contains null values |
"brand" | NA |
"notRepairedDamage" | Values are in German ('nein') & contains null values |
"dateCreated" | Stored as an object instead of a datetime type |
"nrOfPictures" | Column should be renamed |
"postalCode" | NA |
"lastSeen" | Stored as an object instead of a datetime type |
Renaming these columns helps create more structure and readability as we parse through the dataset.
auto_column_copy = list(autos.columns)
# First, some specific renaming of particular columns
auto_column_copy[7] = 'registrationYear'
auto_column_copy[12] = 'registrationMonth'
auto_column_copy[15] = 'unrepairedDamage'
auto_column_copy[-4] = 'adCreated'
auto_column_copy
['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'registrationYear', 'gearbox', 'powerPS', 'model', 'odometer', 'registrationMonth', 'fuelType', 'brand', 'unrepairedDamage', 'adCreated', 'nrOfPictures', 'postalCode', 'lastSeen']
Converting all column names to snake_case
# Defining a function that uses list comprehension to convert a single string entered into snake case
# and then merges this list of string characters thereafter using the ''.join() function.
# The .lstrip() function then strips away any leading underscores ('_') from the word
# in cases where the entered string may have begun with a capital letter.
def snake_case(col_name):
return ''.join(['_'+ i.lower() if i.isupper() else i for i in col_name]).lstrip('_')
new_column = [snake_case(name) for name in auto_column_copy]
# One adjustment for the column labelled 'power_p_s'
new_column[9] = 'power_ps'
new_column
['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest', 'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model', 'odometer', 'registration_month', 'fuel_type', 'brand', 'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code', 'last_seen']
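An equivalent way to get the same mapping is a single `re.sub` call — an alternative sketch, not the implementation used above:

```python
import re

# Insert '_' before every uppercase letter that is not at the start of the
# string, then lowercase everything
def snake_case(col_name):
    return re.sub(r'(?<!^)(?=[A-Z])', '_', col_name).lower()

print(snake_case('yearOfRegistration'))  # year_of_registration
print(snake_case('powerPS'))             # power_p_s
```

Note that this reproduces the same 'power_p_s' quirk as the list-comprehension version, so the manual adjustment to 'power_ps' is still needed.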
# Resetting the dataset column names to the new column names and checking the head
autos.columns = new_column
autos.head()
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
Initially we look for the following:

- Text columns where all or almost all values are the same. These can often be dropped as they don't contain useful information for analysis.
- Numeric data stored as text, which can be cleaned and converted to the correct datatype.
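The second case can be handled with pandas' vectorized string methods. A minimal sketch on toy values in this dataset's formats (passing `regex=False` so '$' and ',' are treated as literal characters):

```python
import pandas as pd

# Toy values shaped like the 'price' and 'odometer' columns
prices = pd.Series(['$5,000', '$8,500', '$1,350'])
odometer = pd.Series(['150,000km', '70,000km'])

# Strip the non-numeric characters, then convert to integers
clean_prices = (prices.str.replace('$', '', regex=False)
                      .str.replace(',', '', regex=False)
                      .astype(int))
clean_odometer = (odometer.str.replace('km', '', regex=False)
                          .str.replace(',', '', regex=False)
                          .astype(int))

print(clean_prices.tolist())    # [5000, 8500, 1350]
print(clean_odometer.tolist())  # [150000, 70000]
```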
# Looking at initial descriptive statistics for each column in the dataset
autos.describe(include ='all')
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-04-02 15:49:30 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
# Taking a quick look at individual columns
autos['odometer'].value_counts(dropna = False)
150,000km    32424
125,000km     5170
100,000km     2169
90,000km      1757
80,000km      1436
70,000km      1230
60,000km      1164
50,000km      1027
5,000km        967
40,000km       819
30,000km       789
20,000km       784
10,000km       264
Name: odometer, dtype: int64
autos['price'].head()
0    $5,000
1    $8,500
2    $8,990
3    $4,350
4    $1,350
Name: price, dtype: object
From this brief exploration we can see that the 'price' and 'odometer' columns hold numeric values stored as text. For each column we will:

- remove any non-numeric characters,
- convert the column to a numeric dtype.
# autos['price'] = autos['price'].str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype(int)
autos['price'].head()
0    5000
1    8500
2    8990
3    4350
4    1350
Name: price, dtype: int32
# autos['odometer'] = autos['odometer'].str.replace(',', '', regex=False).str.replace('km', '', regex=False).astype(int)
# autos.rename({'odometer': "odometer_km"}, axis = 1, inplace = True)
autos['odometer_km'].value_counts()
150000    32424
125000     5170
100000     2169
90000      1757
80000      1436
70000      1230
60000      1164
50000      1027
5000        967
40000       819
30000       789
20000       784
10000       264
Name: odometer_km, dtype: int64
autos['price'].nunique()
2357
autos['price'].describe()
count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64
Here we can see that the price column has a positive (right) skew: its maximum value (100,000,000) is roughly 10^4 times greater than the 75th-percentile value of 7,200.
autos['odometer_km'].nunique()
13
autos['odometer_km'].value_counts()
150000    32424
125000     5170
100000     2169
90000      1757
80000      1436
70000      1230
60000      1164
50000      1027
5000        967
40000       819
30000       789
20000       784
10000       264
Name: odometer_km, dtype: int64
autos['odometer_km'].describe()
count     50000.000000
mean     125732.700000
std       40042.211706
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64
Here we can see that the 'odometer_km' column has a negative (left) skew: its minimum value (5,000 km) sits far below the 25th-percentile value of 125,000 km, so three quarters of the cars have at least that mileage.
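These skew directions can be quantified with `Series.skew()`. A sketch on small synthetic samples shaped like the two columns (not the dataset itself); a positive result indicates right skew, a negative result left skew:

```python
import pandas as pd

# Toy samples mimicking the shape of the two distributions
price_like = pd.Series([0, 1100, 2950, 7200, 100_000_000])        # long right tail
odometer_like = pd.Series([5000, 125000, 150000, 150000, 150000])  # long left tail

print(price_like.skew() > 0)     # True: right/positive skew, like 'price'
print(odometer_like.skew() < 0)  # True: left/negative skew, like 'odometer_km'
```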
# Filtering out extreme outliers in the price column and checking the new statistical distribution
autos.loc[autos['price'].between(0,30000), 'price'].describe()
count    49206.000000
mean      5025.773483
std       5679.154441
min          0.000000
25%       1100.000000
50%       2850.000000
75%       6900.000000
max      30000.000000
Name: price, dtype: float64
autos.loc[autos['price'].between(0,30000), 'price'].nunique()
1985
# Filtering out outliers in the odometer_km column and checking its statistical distribution
autos.loc[autos['odometer_km'].between(10000,150000), 'odometer_km'].describe()
count     49033.000000
mean     128113.719332
std       36631.381785
min       10000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64
autos.loc[autos['odometer_km'].between(10000,150000), 'odometer_km'].nunique()
12
# Keeping only the values inside the accepted ranges; through index alignment,
# values outside them become NaN
# autos['odometer_km'] = autos[autos['odometer_km'].between(10000,150000)]['odometer_km']
# autos['price'] = autos[autos['price'].between(0,30000)]['price']
# The new DataFrame
autos.describe()
price | registration_year | power_ps | odometer_km | registration_month | nr_of_pictures | postal_code | |
---|---|---|---|---|---|---|---|
count | 49206.000000 | 50000.000000 | 50000.000000 | 49033.000000 | 50000.000000 | 50000.0 | 50000.000000 |
mean | 5025.773483 | 2005.073280 | 116.355920 | 128113.719332 | 5.723360 | 0.0 | 50813.627300 |
std | 5679.154441 | 105.712813 | 209.216627 | 36631.381785 | 3.711984 | 0.0 | 25779.747957 |
min | 0.000000 | 1000.000000 | 0.000000 | 10000.000000 | 0.000000 | 0.0 | 1067.000000 |
25% | 1100.000000 | 1999.000000 | 70.000000 | 125000.000000 | 3.000000 | 0.0 | 30451.000000 |
50% | 2850.000000 | 2003.000000 | 105.000000 | 150000.000000 | 6.000000 | 0.0 | 49577.000000 |
75% | 6900.000000 | 2008.000000 | 150.000000 | 150000.000000 | 9.000000 | 0.0 | 71540.000000 |
max | 30000.000000 | 9999.000000 | 17700.000000 | 150000.000000 | 12.000000 | 0.0 | 99998.000000 |
autos.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   date_crawled        50000 non-null  object 
 1   name                50000 non-null  object 
 2   seller              50000 non-null  object 
 3   offer_type          50000 non-null  object 
 4   price               49206 non-null  float64
 5   abtest              50000 non-null  object 
 6   vehicle_type        44905 non-null  object 
 7   registration_year   50000 non-null  int64  
 8   gearbox             47320 non-null  object 
 9   power_ps            50000 non-null  int64  
 10  model               47242 non-null  object 
 11  odometer_km         49033 non-null  float64
 12  registration_month  50000 non-null  int64  
 13  fuel_type           45518 non-null  object 
 14  brand               50000 non-null  object 
 15  unrepaired_damage   40171 non-null  object 
 16  ad_created          50000 non-null  object 
 17  nr_of_pictures      50000 non-null  int64  
 18  postal_code         50000 non-null  int64  
 19  last_seen           50000 non-null  object 
dtypes: float64(2), int64(5), object(13)
memory usage: 7.6+ MB
On the initial read-in of the data, the columns 'date_crawled', 'last_seen' and 'ad_created' are all identified as string values by pandas.
Because these three columns are represented as strings, we need to convert the data into a numerical representation so we can understand it quantitatively.
The other two time columns, 'registration_month' and 'registration_year', are already in a numerical format, so their distributions can be understood using the Series.describe() method.
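An alternative to slicing date strings is to parse them with `pd.to_datetime`, after which numeric components are available through the `.dt` accessor. A sketch on sample values in this dataset's format:

```python
import pandas as pd

# Sample timestamps in the 'YYYY-MM-DD HH:MM:SS' format used by this dataset
dates = pd.Series(['2016-03-26 17:47:46', '2016-04-04 13:38:56'])

# Parse the strings into datetime64 values
parsed = pd.to_datetime(dates)

print(parsed.dt.year.tolist())   # [2016, 2016]
print(parsed.dt.month.tolist())  # [3, 4]
print(parsed.dt.date.min())      # earliest date in the sample
```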
autos.loc[0:5, ['date_crawled', 'ad_created', 'last_seen']]
date_crawled | ad_created | last_seen | |
---|---|---|---|
0 | 2016-03-26 17:47:46 | 2016-03-26 00:00:00 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | 2016-04-04 00:00:00 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | 2016-03-26 00:00:00 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | 2016-03-12 00:00:00 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | 2016-04-01 00:00:00 | 2016-04-01 14:38:50 |
5 | 2016-03-21 13:47:45 | 2016-03-21 00:00:00 | 2016-04-06 09:45:21 |
# INVESTIGATING THE DATE_CRAWLED COLUMN
# To include missing values in the distribution and to use percentages instead of counts we include the
# 'normalize = True' and 'dropna = False' keyword parameters
date_crawled = autos['date_crawled'].str[:10].value_counts(normalize= True, dropna = False)
date_crawled.sort_index()
2016-03-05    0.02538
2016-03-06    0.01394
2016-03-07    0.03596
2016-03-08    0.03330
2016-03-09    0.03322
2016-03-10    0.03212
2016-03-11    0.03248
2016-03-12    0.03678
2016-03-13    0.01556
2016-03-14    0.03662
2016-03-15    0.03398
2016-03-16    0.02950
2016-03-17    0.03152
2016-03-18    0.01306
2016-03-19    0.03490
2016-03-20    0.03782
2016-03-21    0.03752
2016-03-22    0.03294
2016-03-23    0.03238
2016-03-24    0.02910
2016-03-25    0.03174
2016-03-26    0.03248
2016-03-27    0.03104
2016-03-28    0.03484
2016-03-29    0.03418
2016-03-30    0.03362
2016-03-31    0.03192
2016-04-01    0.03380
2016-04-02    0.03540
2016-04-03    0.03868
2016-04-04    0.03652
2016-04-05    0.01310
2016-04-06    0.00318
2016-04-07    0.00142
Name: date_crawled, dtype: float64
# INVESTIGATING THE AD_CREATED COLUMN
# To include missing values in the distribution and to use percentages instead of counts we include the
# 'normalize = True' and 'dropna = False' keyword parameters.
ad_created = autos['ad_created'].str[:10].value_counts(normalize= True, dropna = False)
ad_created.sort_index()
2015-06-11    0.00002
2015-08-10    0.00002
2015-09-09    0.00002
2015-11-10    0.00002
2015-12-05    0.00002
               ...
2016-04-03    0.03892
2016-04-04    0.03688
2016-04-05    0.01184
2016-04-06    0.00326
2016-04-07    0.00128
Name: ad_created, Length: 76, dtype: float64
# INVESTIGATING THE LAST_SEEN COLUMN
# To include missing values in the distribution and to use percentages instead of counts we include the
# 'normalize = True' and 'dropna = False' keyword parameters
last_seen = autos['last_seen'].str[:10].value_counts(normalize= True, dropna = False)
last_seen.sort_index()
2016-03-05    0.00108
2016-03-06    0.00442
2016-03-07    0.00536
2016-03-08    0.00760
2016-03-09    0.00986
2016-03-10    0.01076
2016-03-11    0.01252
2016-03-12    0.02382
2016-03-13    0.00898
2016-03-14    0.01280
2016-03-15    0.01588
2016-03-16    0.01644
2016-03-17    0.02792
2016-03-18    0.00742
2016-03-19    0.01574
2016-03-20    0.02070
2016-03-21    0.02074
2016-03-22    0.02158
2016-03-23    0.01858
2016-03-24    0.01956
2016-03-25    0.01920
2016-03-26    0.01696
2016-03-27    0.01602
2016-03-28    0.02086
2016-03-29    0.02234
2016-03-30    0.02484
2016-03-31    0.02384
2016-04-01    0.02310
2016-04-02    0.02490
2016-04-03    0.02536
2016-04-04    0.02462
2016-04-05    0.12428
2016-04-06    0.22100
2016-04-07    0.13092
Name: last_seen, dtype: float64
sorted(autos['registration_year'].unique())
[1000, 1001, 1111, 1500, 1800, 1910, 1927, 1929, 1931, 1934, 1937, 1938, 1939, 1941, 1943, 1948, 1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959, 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2800, 4100, 4500, 4800, 5000, 5911, 6200, 8888, 9000, 9996, 9999]
autos['registration_year'].describe()
count    50000.000000
mean      2005.073280
std        105.712813
min       1000.000000
25%       1999.000000
50%       2003.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64
From this exploration of the registration_year column, some odd values are evident: years such as 1000 or 1111 predate the automobile entirely, while values like 9999 lie far in the future.
Our Solution
Because a car cannot be first registered after its listing was seen, any vehicle with a registration year above 2016 can be deduced as inaccurate: the ad_created column, which represents the date of listing, only contains dates up to 2016.
Determining the earliest valid year is more difficult. Realistically, it could be somewhere in the first few decades of the 1900s. In an attempt to validate this we can count the number of listings that fall outside the 1900-2016 interval and see if it is safe to remove those rows entirely or if they require custom logic.
# Counting the listings whose registration year falls outside the 1900-2016 interval
autos.loc[~(autos['registration_year'].between(1900, 2016))].describe()
price | registration_year | power_ps | odometer_km | registration_month | nr_of_pictures | postal_code | |
---|---|---|---|---|---|---|---|
count | 1963.000000 | 1972.000000 | 1972.000000 | 1916.000000 | 1972.000000 | 1972.0 | 1972.000000 |
mean | 3367.427407 | 2060.353448 | 98.954361 | 133987.473904 | 4.641988 | 0.0 | 47836.475152 |
std | 3955.896136 | 528.204087 | 426.692546 | 32032.014822 | 3.912620 | 0.0 | 25302.249083 |
min | 0.000000 | 1000.000000 | 0.000000 | 10000.000000 | 0.000000 | 0.0 | 1067.000000 |
25% | 900.000000 | 2017.000000 | 0.000000 | 125000.000000 | 1.000000 | 0.0 | 27577.500000 |
50% | 1900.000000 | 2017.000000 | 75.000000 | 150000.000000 | 4.000000 | 0.0 | 46006.500000 |
75% | 4390.000000 | 2018.000000 | 120.000000 | 150000.000000 | 8.000000 | 0.0 | 66127.750000 |
max | 29699.000000 | 9999.000000 | 16011.000000 | 150000.000000 | 12.000000 | 0.0 | 99974.000000 |
indices_to_remove = autos.loc[~(autos['registration_year'].between(1900, 2016))].index
indices_to_remove
Int64Index([ 10, 55, 65, 68, 84, 113, 164, 197, 253, 348, ... 49662, 49689, 49696, 49731, 49770, 49796, 49841, 49880, 49910, 49935], dtype='int64', length=1972)
Removing rows that fall outside the acceptable range of values for the 'registration_year' column
autos_19_2016 = autos.drop(indices_to_remove)
autos_19_2016.describe(include= 'all')
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 48028 | 48028 | 48028 | 48028 | 47243.000000 | 48028 | 44903 | 48028.00000 | 45604 | 48028.000000 | 45560 | 47117.000000 | 48028.000000 | 44301 | 48028 | 39040 | 48028 | 48028.0 | 48028.000000 | 48028 |
unique | 46375 | 37019 | 2 | 2 | NaN | 2 | 8 | NaN | 2 | NaN | 244 | NaN | NaN | 7 | 40 | 2 | 74 | NaN | NaN | 38102 |
top | 2016-04-02 11:37:04 | Ford_Fiesta | privat | Angebot | NaN | test | limousine | NaN | manuell | NaN | golf | NaN | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 76 | 48027 | 48027 | NaN | 24756 | 12859 | NaN | 35560 | NaN | 3815 | NaN | NaN | 29248 | 10188 | 34255 | 1878 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | 5094.679635 | NaN | NaN | 2002.80351 | NaN | 117.070417 | NaN | 127874.864699 | 5.767760 | NaN | NaN | NaN | NaN | 0.0 | 50935.867327 | NaN |
std | NaN | NaN | NaN | NaN | 5729.225244 | NaN | NaN | 7.31085 | NaN | 195.151278 | NaN | 36786.714694 | 3.696802 | NaN | NaN | NaN | NaN | 0.0 | 25792.079828 | NaN |
min | NaN | NaN | NaN | NaN | 0.000000 | NaN | NaN | 1910.00000 | NaN | 0.000000 | NaN | 10000.000000 | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | 1100.000000 | NaN | NaN | 1999.00000 | NaN | 71.000000 | NaN | 125000.000000 | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30459.000000 | NaN |
50% | NaN | NaN | NaN | NaN | 2900.000000 | NaN | NaN | 2003.00000 | NaN | 107.000000 | NaN | 150000.000000 | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49696.000000 | NaN |
75% | NaN | NaN | NaN | NaN | 6999.000000 | NaN | NaN | 2008.00000 | NaN | 150.000000 | NaN | 150000.000000 | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71665.000000 | NaN |
max | NaN | NaN | NaN | NaN | 30000.000000 | NaN | NaN | 2016.00000 | NaN | 17700.000000 | NaN | 150000.000000 | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
autos_19_2016['registration_year'].value_counts(normalize = True).sort_index()
1910    0.000187
1927    0.000021
1929    0.000021
1931    0.000021
1934    0.000042
          ...
2012    0.027546
2013    0.016782
2014    0.013867
2015    0.008308
2016    0.027401
Name: registration_year, Length: 78, dtype: float64
# autos_19_2016['ad_created'].str[:10].value_counts(normalize = True).sort_index()
autos_19_2016['ad_created'].str[:10].unique()
array(['2016-03-26', '2016-04-04', '2016-03-12', '2016-04-01', '2016-03-21', '2016-03-20', '2016-03-16', '2016-03-22', '2016-03-31', '2016-03-23', '2016-03-29', '2016-03-17', '2016-03-05', '2016-03-06', '2016-03-28', '2016-03-10', '2016-04-03', '2016-03-27', '2016-03-19', '2016-04-02', '2016-03-14', '2016-03-15', '2016-04-05', '2016-03-11', '2016-03-07', '2016-03-08', '2016-03-09', '2016-03-25', '2016-03-18', '2016-03-30', '2016-03-24', '2016-03-13', '2016-04-06', '2016-03-04', '2016-04-07', '2016-02-24', '2016-03-02', '2016-02-29', '2016-01-03', '2015-11-10', '2016-03-01', '2016-02-28', '2016-02-19', '2016-03-03', '2016-02-25', '2016-02-23', '2016-02-14', '2016-02-12', '2016-01-27', '2016-02-05', '2016-02-01', '2016-02-02', '2016-02-17', '2016-02-21', '2016-02-09', '2015-08-10', '2015-06-11', '2016-02-11', '2016-01-14', '2016-01-10', '2016-02-08', '2015-12-05', '2016-02-27', '2016-02-22', '2016-01-13', '2015-09-09', '2016-02-20', '2016-01-16', '2015-12-30', '2016-02-07', '2016-01-07', '2016-02-26', '2016-02-18', '2016-02-16'], dtype=object)
autos_19_2016['last_seen'].str[:10].value_counts(normalize = True).sort_index()
2016-03-05    0.001083
2016-03-06    0.004206
2016-03-07    0.005351
2016-03-08    0.007662
2016-03-09    0.009994
2016-03-10    0.010785
2016-03-11    0.012514
2016-03-12    0.023757
2016-03-13    0.008766
2016-03-14    0.012826
2016-03-15    0.016011
2016-03-16    0.016241
2016-03-17    0.027942
2016-03-18    0.007308
2016-03-19    0.015553
2016-03-20    0.020634
2016-03-21    0.020696
2016-03-22    0.021071
2016-03-23    0.018406
2016-03-24    0.019489
2016-03-25    0.018926
2016-03-26    0.016948
2016-03-27    0.016032
2016-03-28    0.020717
2016-03-29    0.022112
2016-03-30    0.024715
2016-03-31    0.023715
2016-04-01    0.023257
2016-04-02    0.024652
2016-04-03    0.025319
2016-04-04    0.024236
2016-04-05    0.124885
2016-04-06    0.222433
2016-04-07    0.131756
Name: last_seen, dtype: float64
autos_19_2016['date_crawled'].str[:10].value_counts(normalize = True).sort_index()
2016-03-05    0.025256
2016-03-06    0.014075
2016-03-07    0.036146
2016-03-08    0.033522
2016-03-09    0.033356
2016-03-10    0.032190
2016-03-11    0.032356
2016-03-12    0.036708
2016-03-13    0.015741
2016-03-14    0.036395
2016-03-15    0.034022
2016-03-16    0.029379
2016-03-17    0.031711
2016-03-18    0.012972
2016-03-19    0.034792
2016-03-20    0.037957
2016-03-21    0.037499
2016-03-22    0.032814
2016-03-23    0.032335
2016-03-24    0.029254
2016-03-25    0.031607
2016-03-26    0.032294
2016-03-27    0.030753
2016-03-28    0.034605
2016-03-29    0.034188
2016-03-30    0.033751
2016-03-31    0.031877
2016-04-01    0.033918
2016-04-02    0.035479
2016-04-03    0.038873
2016-04-04    0.036645
2016-04-05    0.012992
2016-04-06    0.003102
2016-04-07    0.001437
Name: date_crawled, dtype: float64
autos_19_2016.loc[(autos_19_2016['price'].notnull()) & (autos_19_2016['odometer_km'].notnull())].isnull().sum()
date_crawled             0
name                     0
seller                   0
offer_type               0
price                    0
abtest                   0
vehicle_type          2840
registration_year        0
gearbox               2138
power_ps                 0
model                 2261
odometer_km              0
registration_month       0
fuel_type             3412
brand                    0
unrepaired_damage     8506
ad_created               0
nr_of_pictures           0
postal_code              0
last_seen                0
dtype: int64
Observations after removing rows outside of registration_year 1900-2016

With the exception of the registration columns, the other three date columns (date_crawled, last_seen and ad_created) still mainly comprise car listings from 2015-2016.
RECOMMENDATION: Consider trimming the dataset to only include registration_years 2000-2016 instead
# CLEANING UP FURTHER BY MINIMISING NULL VALUES
# Removing rows with null values for the 'price' and 'odometer_km' columns
# Also removing the 'unrepaired_damage' column as there is an abundance of null values
autos_19_2016_cleaner = autos_19_2016.loc[(autos_19_2016['price'].notnull()) & (autos_19_2016['odometer_km'].notnull())].copy()
autos_19_2016_cleaner.drop(['unrepaired_damage'], axis = 1, inplace = True)
autos_19_2016_cleaner.info()
When working on data with cars, it's natural to explore variations across different car brands. We can use aggregation to understand the brand column.
Reminder on the process of aggregation:

1. Identify the unique values we want to aggregate by.
2. Create an empty dictionary to store our aggregate data.
3. Loop over the unique values, and for each one:
   - subset the dataframe by that unique value,
   - calculate the mean of whichever column we're interested in,
   - assign the unique value and its mean to the dictionary as a key/value pair.
# Step 1 - Identifying the number of unique brand values
autos_19_2016_cleaner['brand'].nunique()
40
# Step 2 - Exploring the unique brand values
autos_19_2016_cleaner['brand'].value_counts(normalize = True)
volkswagen        0.214218
opel              0.109535
bmw               0.109125
mercedes_benz     0.094528
audi              0.085040
ford              0.070615
renault           0.048105
peugeot           0.030273
fiat              0.026047
seat              0.018522
skoda             0.016279
mazda             0.015352
nissan            0.015287
smart             0.014296
citroen           0.014209
toyota            0.012722
hyundai           0.010091
volvo             0.009487
sonstige_autos    0.009013
mini              0.008711
mitsubishi        0.008194
honda             0.007935
kia               0.007310
alfa_romeo        0.006684
suzuki            0.006016
chevrolet         0.005520
chrysler          0.003666
porsche           0.003019
dacia             0.002609
daihatsu          0.002523
subaru            0.002221
jeep              0.002135
saab              0.001660
land_rover        0.001617
jaguar            0.001531
daewoo            0.001509
rover             0.001380
trabant           0.001337
lancia            0.001100
lada              0.000582
Name: brand, dtype: float64
# Step 3 - Selecting the top 20 brands
auto_brands_20 = autos_19_2016_cleaner['brand'].value_counts().head(20).index
auto_brands_20
Index(['volkswagen', 'opel', 'bmw', 'mercedes_benz', 'audi', 'ford', 'renault', 'peugeot', 'fiat', 'seat', 'skoda', 'mazda', 'nissan', 'smart', 'citroen', 'toyota', 'hyundai', 'volvo', 'sonstige_autos', 'mini'], dtype='object')
# Step 4 - Finding the aggregate mean_selling_price for cars in each of the top 20 brands
brand_mean_prices = {}
for brand in auto_brands_20:
mean_price = round(autos_19_2016_cleaner.loc[autos_19_2016_cleaner['brand'] == brand, 'price'].mean())
# Creating a new key_value pair with this mean price
brand_mean_prices[brand] = mean_price
brand_mean_prices
{'volkswagen': 5038, 'opel': 2859, 'bmw': 7247, 'mercedes_benz': 7357, 'audi': 7773, 'ford': 3407, 'renault': 2319, 'peugeot': 3044, 'fiat': 2674, 'seat': 4265, 'skoda': 6271, 'mazda': 3815, 'nissan': 4607, 'smart': 3561, 'citroen': 3621, 'toyota': 4989, 'hyundai': 5282, 'volvo': 4781, 'sonstige_autos': 7417, 'mini': 10334}
# Step 5 - Sorting the values in the brand dictionary according to the highest mean price
brand_mean_prices = dict(sorted(brand_mean_prices.items(), key = lambda data: data[1], reverse = True ))
brand_mean_prices
{'mini': 10334, 'audi': 7773, 'sonstige_autos': 7417, 'mercedes_benz': 7357, 'bmw': 7247, 'skoda': 6271, 'hyundai': 5282, 'volkswagen': 5038, 'toyota': 4989, 'volvo': 4781, 'nissan': 4607, 'seat': 4265, 'mazda': 3815, 'citroen': 3621, 'smart': 3561, 'ford': 3407, 'peugeot': 3044, 'opel': 2859, 'fiat': 2674, 'renault': 2319}
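As an aside, the loop-and-dictionary pattern in Steps 4-5 can be expressed more concisely with `groupby`. A sketch on toy data (column names match the notebook's; the values are made up):

```python
import pandas as pd

# Toy listings with the notebook's 'brand' and 'price' column names
df = pd.DataFrame({
    'brand': ['bmw', 'bmw', 'audi', 'opel'],
    'price': [7000, 7500, 7800, 2800],
})

# Group by brand, take the mean price, round it, and sort descending
mean_prices = (df.groupby('brand')['price']
                 .mean()
                 .round()
                 .sort_values(ascending=False))

print(mean_prices.to_dict())  # {'audi': 7800.0, 'bmw': 7250.0, 'opel': 2800.0}
```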
Observation From Results

From the exploration and brief statistical aggregation conducted, it can be seen that the brand mini has the highest mean selling price among the top 20 brands in our dataset, followed by audi, sonstige_autos, mercedes_benz and bmw.
For the top 20 brands, we would like to use another aggregation to understand the average mileage for those cars and whether there is any visible link with the mean price.
Using these two aggregated Series objects, we can then combine them into a single dataframe (with a shared index).
# Step 1 - Calculating the mean mileage of the top 20 car brands
brand_mean_mileages = {}
for brand in auto_brands_20:
mean_mileage = round(autos_19_2016_cleaner.loc[autos_19_2016_cleaner['brand'] == brand, 'odometer_km'].mean())
# Creating a new key_value pair with this mean mileage
brand_mean_mileages[brand] = mean_mileage
# Step 2 - Sorting the values in the brand dictionary according to the highest mean mileage
brand_mean_mileages = dict(sorted(brand_mean_mileages.items(), key = lambda data: data[1], reverse = True ))
brand_mean_mileages
{'volvo': 139568, 'bmw': 136564, 'mercedes_benz': 134396, 'audi': 133746, 'opel': 131987, 'volkswagen': 131497, 'renault': 130547, 'peugeot': 128355, 'mazda': 127170, 'ford': 126649, 'seat': 123463, 'nissan': 121051, 'citroen': 121002, 'fiat': 119694, 'toyota': 117314, 'skoda': 113040, 'hyundai': 107596, 'sonstige_autos': 103481, 'smart': 100309, 'mini': 90829}
# Step 3 - Converting both aggregate dictionaries into Series objects using the pd.Series() constructor
mean_prices_top20 = pd.Series(brand_mean_prices)
mean_milieage_top20 = pd.Series(brand_mean_mileages)
# Step 4 - Creating a dataframe using the mean_prices series
top_20_brands = pd.DataFrame(mean_prices_top20, columns=['mean_price'])
top_20_brands
mean_price | |
---|---|
mini | 10334 |
audi | 7773 |
sonstige_autos | 7417 |
mercedes_benz | 7357 |
bmw | 7247 |
skoda | 6271 |
hyundai | 5282 |
volkswagen | 5038 |
toyota | 4989 |
volvo | 4781 |
nissan | 4607 |
seat | 4265 |
mazda | 3815 |
citroen | 3621 |
smart | 3561 |
ford | 3407 |
peugeot | 3044 |
opel | 2859 |
fiat | 2674 |
renault | 2319 |
# Step 5 - Appending the mean mileage data from the other series as a column of this new dataframe
top_20_brands['mean_milieage'] = mean_milieage_top20
top_20_brands.sort_values('mean_milieage', ascending = False )
mean_price | mean_milieage | |
---|---|---|
volvo | 4781 | 139568 |
bmw | 7247 | 136564 |
mercedes_benz | 7357 | 134396 |
audi | 7773 | 133746 |
opel | 2859 | 131987 |
volkswagen | 5038 | 131497 |
renault | 2319 | 130547 |
peugeot | 3044 | 128355 |
mazda | 3815 | 127170 |
ford | 3407 | 126649 |
seat | 4265 | 123463 |
nissan | 4607 | 121051 |
citroen | 3621 | 121002 |
fiat | 2674 | 119694 |
toyota | 4989 | 117314 |
skoda | 6271 | 113040 |
hyundai | 5282 | 107596 |
sonstige_autos | 7417 | 103481 |
smart | 3561 | 100309 |
mini | 10334 | 90829 |
# Exploring the two aggregates together to determine the relationship, if any, between price and mileage
print(top_20_brands['mean_price'].idxmax())
print(top_20_brands['mean_price'].idxmin())
mini
renault
print(top_20_brands['mean_milieage'].idxmax())
print(top_20_brands['mean_milieage'].idxmin())
volvo
mini
Conclusion

With the exception of the 'mini' brand, mean price and mean mileage appear to be mildly correlated: brands such as bmw, mercedes_benz, audi, opel and volkswagen (the top 5 by listing count) tend to be higher in both mean price and mean mileage.
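To put a number on this claimed correlation, we could compute the Pearson correlation between the two aggregate columns (values transcribed from the top-20 table above; `mean_mileage` is spelled conventionally here):

```python
import pandas as pd

# The aggregated top-20 values from the table above
top_20 = pd.DataFrame({
    'mean_price': [10334, 7773, 7417, 7357, 7247, 6271, 5282, 5038, 4989, 4781,
                   4607, 4265, 3815, 3621, 3561, 3407, 3044, 2859, 2674, 2319],
    'mean_mileage': [90829, 133746, 103481, 134396, 136564, 113040, 107596, 131497,
                     117314, 139568, 121051, 123463, 127170, 121002, 100309, 126649,
                     128355, 131987, 119694, 130547],
}, index=['mini', 'audi', 'sonstige_autos', 'mercedes_benz', 'bmw', 'skoda',
          'hyundai', 'volkswagen', 'toyota', 'volvo', 'nissan', 'seat', 'mazda',
          'citroen', 'smart', 'ford', 'peugeot', 'opel', 'fiat', 'renault'])

# Pearson correlation, with and without the 'mini' outlier
r_all = top_20['mean_price'].corr(top_20['mean_mileage'])
r_no_mini = top_20.drop('mini')['mean_price'].corr(top_20.drop('mini')['mean_mileage'])
print(round(r_all, 3), round(r_no_mini, 3))
```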