Exploring eBay Car Sales Data¶

Introduction¶

In this project we are going to work with a data set containing the details of used cars from eBay Klenanzeigen a classified section of German eBay website. The link to the dataset is here. We are going to work on a dataset with only 50000 data points for ease of analysis.

The objective of this project is :-

Clean the data
Analyze the included used car listings

In [296]:

#import pandas and numpy libraries
import numpy as np
import pandas as pd

In [297]:

#read the csv file, with encoding equal to Latin-1 & Windows-1252 seperately
autos = pd.read_csv('autos.csv', encoding = 'Latin-1')
autos = pd.read_csv('autos.csv', encoding = 'Windows-1252')

In [298]:

#run the autos.info() method to check the information about the dataframe
autos.info()

#run the autos.head() method to check the first five rows of the dataframe
autos.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offerType            50000 non-null  object
 4   price                50000 non-null  object
 5   abtest               50000 non-null  object
 6   vehicleType          44905 non-null  object
 7   yearOfRegistration   50000 non-null  int64 
 8   gearbox              47320 non-null  object
 9   powerPS              50000 non-null  int64 
 10  model                47242 non-null  object
 11  odometer             50000 non-null  object
 12  monthOfRegistration  50000 non-null  int64 
 13  fuelType             45518 non-null  object
 14  brand                50000 non-null  object
 15  notRepairedDamage    40171 non-null  object
 16  dateCreated          50000 non-null  object
 17  nrOfPictures         50000 non-null  int64 
 18  postalCode           50000 non-null  int64 
 19  lastSeen             50000 non-null  object
dtypes: int64(5), object(15)
memory usage: 7.6+ MB

Out[298]:

	dateCrawled	name	seller	offerType	price	abtest	vehicleType	yearOfRegistration	gearbox	powerPS	model	odometer	monthOfRegistration	fuelType	brand	notRepairedDamage	dateCreated	postalCode	lastSeen
0	2016-03-26 17:47:46	Peugeot_807_160_NAVTECH_ON_BOARD	privat	Angebot	$5,000	control	bus	2004	manuell	158	andere	150,000km	3	lpg	peugeot	nein	2016-03-26 00:00:00	79588	2016-04-06 06:45:54
1	2016-04-04 13:38:56	BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik	privat	Angebot	$8,500	control	limousine	1997	automatik	286	7er	150,000km	6	benzin	bmw	nein	2016-04-04 00:00:00	71034	2016-04-06 14:45:08
2	2016-03-26 18:57:24	Volkswagen_Golf_1.6_United	privat	Angebot	$8,990	test	limousine	2009	manuell	102	golf	70,000km	7	benzin	volkswagen	nein	2016-03-26 00:00:00	35394	2016-04-06 20:15:37
3	2016-03-12 16:58:10	Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...	privat	Angebot	$4,350	control	kleinwagen	2007	automatik	71	fortwo	70,000km	6	benzin	smart	nein	2016-03-12 00:00:00	33729	2016-03-15 03:16:28
4	2016-04-01 14:38:50	Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...	privat	Angebot	$1,350	test	kombi	2003	manuell	0	focus	150,000km	7	benzin	ford	nein	2016-04-01 00:00:00	39218	2016-04-01 14:38:50

Here we observe that autos.csv file gives an error message while reading it with default encoding UTF-8. But can be read with both Latin-1 and Windows-1252 encodings. This data set has 50000 rows and 20 columns. Among the columns, 13 are string data-types and 7 are integer data-types. There are 5 columns with null values and all these have < 20% null values. The column names are in camelcase.

Cleaning Column Names for Ease of Analysis¶

First we will look into the column names in the dataframe.

In [299]:

#apply df.columns attribute to look at the column names
autos.columns

Out[299]:

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')

We observe that the column names are in camelcase instead of snakecase, which is preffered by python. Here we will have to change the column names to snakecase and also reword some of the column names based on the data dictionary (below) to be more descriptive.

In [300]:

#using df.rename() method change the column names of autos dataset as in the dictionary below
autos.rename(columns = {'dateCrawled':'date_crawled', 'offerType':'offer_type', 'abtest': 'ab_test', 'vehicleType': 'vehicle_type', 'yearOfRegistration': 'registration_year', 'gearbox': 'gear_box', 'powerPS':'power_ps', 'monthOfRegistration':'registration_month', 'fuelType':'fuel_type', 'notRepairedDamage':'unrepaired_damage','dateCreated':'date_created', 'nrOfPictures':'nr_of_pictures', 'postalCode':'postal_code', 'lastSeen':'last_seen'}, inplace=True)  

In [301]:

#check the few rows of the autos dataset using df.head() method
autos.head()

Out[301]:

	date_crawled	name	seller	offer_type	price	ab_test	vehicle_type	registration_year	gear_box	power_ps	model	odometer	registration_month	fuel_type	brand	unrepaired_damage	date_created	postal_code	last_seen
0	2016-03-26 17:47:46	Peugeot_807_160_NAVTECH_ON_BOARD	privat	Angebot	$5,000	control	bus	2004	manuell	158	andere	150,000km	3	lpg	peugeot	nein	2016-03-26 00:00:00	79588	2016-04-06 06:45:54
1	2016-04-04 13:38:56	BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik	privat	Angebot	$8,500	control	limousine	1997	automatik	286	7er	150,000km	6	benzin	bmw	nein	2016-04-04 00:00:00	71034	2016-04-06 14:45:08
2	2016-03-26 18:57:24	Volkswagen_Golf_1.6_United	privat	Angebot	$8,990	test	limousine	2009	manuell	102	golf	70,000km	7	benzin	volkswagen	nein	2016-03-26 00:00:00	35394	2016-04-06 20:15:37
3	2016-03-12 16:58:10	Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...	privat	Angebot	$4,350	control	kleinwagen	2007	automatik	71	fortwo	70,000km	6	benzin	smart	nein	2016-03-12 00:00:00	33729	2016-03-15 03:16:28
4	2016-04-01 14:38:50	Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...	privat	Angebot	$1,350	test	kombi	2003	manuell	0	focus	150,000km	7	benzin	ford	nein	2016-04-01 00:00:00	39218	2016-04-01 14:38:50

Column names have been changed from camelcase to snakecase and some of the column names have been re-written. Now the column names are more clear and easy to read.

Initial Exploration and Cleaning¶

Here we are going to investigate the columns more carefully. We will use DataFrame.decribe() method inorder to check if any columns have mostly one value and if there are any columns with numeric data stored as text.

In [302]:

#apply autos.describe() method for only string data types
autos.describe(include = 'object')

Out[302]:

	date_crawled	name	seller	offer_type	price	ab_test	vehicle_type	gear_box	model	odometer	fuel_type	brand	unrepaired_damage	date_created	last_seen
count	50000	50000	50000	50000	50000	50000	44905	47320	47242	50000	45518	50000	40171	50000	50000
unique	48213	38754	2	2	2357	2	8	2	245	13	7	40	2	76	39481
top	2016-04-02 15:49:30	Ford_Fiesta	privat	Angebot	$0	test	limousine	manuell	golf	150,000km	benzin	volkswagen	nein	2016-04-03 00:00:00	2016-04-07 06:17:27
freq	3	78	49999	49999	1421	25756	12859	36993	4024	32424	30107	10687	35232	1946	8

In [303]:

#apply autos.describe() method only for int data-types
autos.describe(include = 'int')

Out[303]:

	registration_year	power_ps	registration_month	nr_of_pictures	postal_code
count	50000.000000	50000.000000	50000.000000	50000.0	50000.000000
mean	2005.073280	116.355920	5.723360	0.0	50813.627300
std	105.712813	209.216627	3.711984	0.0	25779.747957
min	1000.000000	0.000000	0.000000	0.0	1067.000000
25%	1999.000000	70.000000	3.000000	0.0	30451.000000
50%	2003.000000	105.000000	6.000000	0.0	49577.000000
75%	2008.000000	150.000000	9.000000	0.0	71540.000000
max	9999.000000	17700.000000	12.000000	0.0	99998.000000

After looking at the descriptive statistics of autos data set above we make following observations :-

Among the Int data-types, nr_of_pictures column has only one value, zero(0). This column can be dropped.
Among the string data-types, seller column has name gewerblich one time, remaining all (4999) are privat. Likewise offer_type has value Gesuch just one time and remaining all values are Angebot. Both of these columns can also be dropped.
The date_crawled, date_created and last_seen listed as strings, can be converted into date-time class.
price and odometer listed as strings. Should be converted into Int class.

Below we rename the price and odometer columns and also convert the data-type to numeric (int) type. First we will remove the non-numeric texts in both the columns.

In [304]:

#remove the non-numeric texts using autos.replace() method
autos = autos.replace({'price':{'\$':'', ',':''}}, regex=True)

#convert the datatype to 'int' using Series.astype()
autos['price'] = autos['price'].astype(int)

In [305]:

#remove the non-numeric texts using autos.replace() method
autos = autos.replace({'odometer':{'km':'', ',':''}}, regex=True)

#convert the datatype to 'int' using Series.astype()
autos['odometer'] = autos['odometer'].astype(int)

In [306]:

#rename the column names using autos.rename() method
autos.rename(columns = {'price':'price_$', 'odometer':'odometer_km'}, inplace=True)

#check the first few columns of both the columns
autos[['price_$', 'odometer_km']].head()

Out[306]:

	price_$	odometer_km
0	5000	150000
1	8500	150000
2	8990	70000
3	4350	70000
4	1350	150000

We have performed needful changes to the required columns. Further we will explore these two columns.

Exploring the Odometer and Price Columns¶

We want to find out if there are any outliers (unrealistically high/low values) among these columns and get rid of these values. First we will see how many unique values are there in both the columns.

In [307]:

#Using Series.unique().shape find out the unique values
autos['price_$'].unique().shape

Out[307]:

(2357,)

In [308]:

#Using Series.unique().shape find out the unique values
autos['odometer_km'].unique().shape

Out[308]:

(13,)

We observe that there are 2357 unique values in the price column and 13 unique values in the Odometer column. Now we use Series.describe() method inorder to find the minimum, maximum, median and mean values of these columns.

In [309]:

#Apply Series.describe() method to calculate the statistics of Odometer column
autos['odometer_km'].describe()

Out[309]:

count     50000.000000
mean     125732.700000
std       40042.211706
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64

odometer_km column's statistical details show that the values are almost uniformly distributed as the standard deviation(std) is low. Even after removing lower boundary values (like 5000, 10000 etc.) there will not be much difference among these values.

In [310]:

#Apply Series.describe() method to calculate the statistics of price column
autos['price_$'].describe()

Out[310]:

count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price_$, dtype: float64

For the price column, statistical details show that the standard deviation value is very high. So we will first consider the values higher than 50000 and apply Series.value_counts() to check the number of times these values occurs.

In [311]:

#consider the values >= 50000
price_high = autos[autos['price_$'] >= 50000]

#apply Series.value_counts() on selected range
price_high['price_$'].value_counts().sort_values(ascending=False)

Out[311]:

52900     5
55000     5
56000     4
53000     4
60000     3
         ..
58900     1
71000     1
72900     1
116000    1
82987     1
Name: price_$, Length: 152, dtype: int64

We observe that above $50000, total number of cars for sale are just 152 and are in single digit, highest being 5 in number. These values look unrealistically high, so we can ignore the values above 50000. Now we will check values which are lower than $100.

In [312]:

#consider the values below 100 in the price column
price_low = autos[autos['price_$'] < 100]

#Apply Series.value_counts() on the selected range
price_low['price_$'].value_counts().head()

Out[312]:

0     1421
1      156
50      49
99      19
80      15
Name: price_$, dtype: int64

We observe that there are total of 1421 values which are equal to $0. We can discard these values. Further we will only consider the values between $100 and $50000 as below $100 looks unrealistically low.

In [313]:

#consider the values between 100 and 50000 and name it 'new_price'
new_price = autos[autos['price_$'].between(100, 50000)]

#check the statistics of these values using Series.describe()
new_price['price_$'].describe()

Out[313]:

count    48038.000000
mean      5611.942691
std       6723.459745
min        100.000000
25%       1250.000000
50%       3000.000000
75%       7400.000000
max      50000.000000
Name: price_$, dtype: float64

After removing the outliers on both the ends of the data, we observe that the standard deviation is comparatively low meaning the values are more uniform.

Exploring the date columns¶

There are five columns in total which represent date values. These are:-

date_crawled: added by the crawler
last_seen: added by the crawler
date_created: from the website
registration_month: from the website
registration_year: from the website

Among these date_crawled, last_seen, date_created are stored as string values. The other two date columns are represented as numeric values. Now we need to convert the first three date columns in the above list to numeric data-type in order to understand it quantitatively.

First we will look into the first few rows of these columns.

In [314]:

autos[['date_crawled', 'date_created', 'last_seen']][0:5]

Out[314]:

	date_crawled	date_created	last_seen
0	2016-03-26 17:47:46	2016-03-26 00:00:00	2016-04-06 06:45:54
1	2016-04-04 13:38:56	2016-04-04 00:00:00	2016-04-06 14:45:08
2	2016-03-26 18:57:24	2016-03-26 00:00:00	2016-04-06 20:15:37
3	2016-03-12 16:58:10	2016-03-12 00:00:00	2016-03-15 03:16:28
4	2016-04-01 14:38:50	2016-04-01 00:00:00	2016-04-01 14:38:50

Here we are interested only in the date and not the time among these columns. So we will extract just the date and apply value_counts() method.

In [315]:

#Apply Series.split() method and extract the values of first position in the date_crawled column
autos['date_crawled'] = autos['date_crawled'].str.split(r' ').str[0]

#Apply Series.value_counts()  method on the row
autos['date_crawled'].value_counts(normalize=True, dropna=False).sort_index()

Out[315]:

2016-03-05    0.02538
2016-03-06    0.01394
2016-03-07    0.03596
2016-03-08    0.03330
2016-03-09    0.03322
2016-03-10    0.03212
2016-03-11    0.03248
2016-03-12    0.03678
2016-03-13    0.01556
2016-03-14    0.03662
2016-03-15    0.03398
2016-03-16    0.02950
2016-03-17    0.03152
2016-03-18    0.01306
2016-03-19    0.03490
2016-03-20    0.03782
2016-03-21    0.03752
2016-03-22    0.03294
2016-03-23    0.03238
2016-03-24    0.02910
2016-03-25    0.03174
2016-03-26    0.03248
2016-03-27    0.03104
2016-03-28    0.03484
2016-03-29    0.03418
2016-03-30    0.03362
2016-03-31    0.03192
2016-04-01    0.03380
2016-04-02    0.03540
2016-04-03    0.03868
2016-04-04    0.03652
2016-04-05    0.01310
2016-04-06    0.00318
2016-04-07    0.00142
Name: date_crawled, dtype: float64

While applying Series.value_counts() method we have included missing values and percentages instead of counts in the distribution. We have also used Series.sort_index() in order to rank the date in ascending order (earliest date to latest date).

From the above analysis we make following observations. date_crawled column does not have any null values. All the data are crawled between March 2016 and April 2016. The highest number of data is crawled on 3rd April 2016 (~ 4%) and lowest on 7th April 2016 (~0.1%).

Below we perform the same analysis on the date_created column.

In [316]:

#Apply Series.split() method and extract the values of first position in the date_created column
autos['date_created'] = autos['date_created'].str.split(r' ').str[0]

#Apply Series.value_counts()  method on the row
autos['date_created'].value_counts(normalize=True, dropna=False).sort_index()

Out[316]:

2015-06-11    0.00002
2015-08-10    0.00002
2015-09-09    0.00002
2015-11-10    0.00002
2015-12-05    0.00002
               ...   
2016-04-03    0.03892
2016-04-04    0.03688
2016-04-05    0.01184
2016-04-06    0.00326
2016-04-07    0.00128
Name: date_created, Length: 76, dtype: float64

Here we make following observations. date_created column also does not have any null values. eBay listing 0f these data are created in the year June 2015 and April 2016. The highest number of listing was done on 3rd April 2016 (~ 4%).

Below we perform the same analysis on the last seen column.

In [317]:

#Apply Series.split() method and extract the values of first position in the last_seen column
autos['last_seen'] = autos['last_seen'].str.split(r' ').str[0]

#Apply Series.value_counts()  method on the row
autos['last_seen'].value_counts(normalize=True, dropna=False).sort_index()

Out[317]:

2016-03-05    0.00108
2016-03-06    0.00442
2016-03-07    0.00536
2016-03-08    0.00760
2016-03-09    0.00986
2016-03-10    0.01076
2016-03-11    0.01252
2016-03-12    0.02382
2016-03-13    0.00898
2016-03-14    0.01280
2016-03-15    0.01588
2016-03-16    0.01644
2016-03-17    0.02792
2016-03-18    0.00742
2016-03-19    0.01574
2016-03-20    0.02070
2016-03-21    0.02074
2016-03-22    0.02158
2016-03-23    0.01858
2016-03-24    0.01956
2016-03-25    0.01920
2016-03-26    0.01696
2016-03-27    0.01602
2016-03-28    0.02086
2016-03-29    0.02234
2016-03-30    0.02484
2016-03-31    0.02384
2016-04-01    0.02310
2016-04-02    0.02490
2016-04-03    0.02536
2016-04-04    0.02462
2016-04-05    0.12428
2016-04-06    0.22100
2016-04-07    0.13092
Name: last_seen, dtype: float64

From the above analysis we make following observations. last_seen column has no null values. Here most of the ad's are last seen by the crawler between the month of March and April of 2016. Most of the ad's are seen on 6th April 2016 (~ 2.2%) and least are on 5th March 2016 (~ 0.01%).

Further we study the registration_year column by applying Series.describe() method on it.

In [318]:

autos['registration_year'].describe()

Out[318]:

count    50000.000000
mean      2005.073280
std        105.712813
min       1000.000000
25%       1999.000000
50%       2003.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

registration_year (year in which the car was registered) needs to be further investigated as according to above analysis the minimum year of registration is 1000 and maximum is 9999, which is unrealistic.

In [319]:

bel_1900 = autos[autos['registration_year'] < 1900]
bel_1900['registration_year']

Out[319]:

10556    1800
22316    1000
24511    1111
32585    1800
35238    1500
49283    1001
Name: registration_year, dtype: int64

In [320]:

abo_2016 = autos[autos['registration_year'] > 2016]
abo_2016['registration_year']

Out[320]:

10       2017
55       2017
65       2017
68       2017
84       2018
         ... 
49796    2017
49841    2017
49880    2017
49910    9000
49935    2017
Name: registration_year, Length: 1966, dtype: int64

Above analysis shows that there are only six cars which are registered below the year 1900. There are total of 1966 cars which are registered above the year 2016.

Dealing with Incorrect Registration Year Data¶

The last year in the column registration_year should be 2016. Because this is the year when the ad was created on eBay website. There are no records of listing of ads after this year. So we have decided to ignore all 1966 data points. Besides it's < 4% of the total data.

The lowest year should be in the early 1900's or late 1800's. Because gas powered engine car was invented around that time. So we will limit it to 1900.

In [321]:

#consider only the year between 1900 and 2016
new_registration_year = autos[autos['registration_year'].between(1900,2016)]

#look into the statistics of corrected data column
new_registration_year['registration_year'].describe()

Out[321]:

count    48028.00000
mean      2002.80351
std          7.31085
min       1910.00000
25%       1999.00000
50%       2003.00000
75%       2008.00000
max       2016.00000
Name: registration_year, dtype: float64

Here we observe that the mean year of registration is 2002. Below we apply Series.value_counts() method.

In [322]:

#Apply Series.value_counts() method on the new registration year column
new_registration_year['registration_year'].value_counts(normalize=True)

Out[322]:

2000    0.069834
2005    0.062776
1999    0.062464
2004    0.056988
2003    0.056779
          ...   
1939    0.000021
1927    0.000021
1929    0.000021
1948    0.000021
1952    0.000021
Name: registration_year, Length: 78, dtype: float64

The normalized value_counts() performed on the edited data shows that, most of the registrations are done in the year 2000 (~ 7%) and the least is done in the year 1952 (~ 0.0021).

Exploring Price by Brand¶

In our car analysis we would like to explore variations across different car brands. First we will list all the unique names of top 20 car brands.

In [323]:

#select first 20 top brand names
unique_brand = autos['brand'].value_counts().index[0:20]
unique_brand

Out[323]:

Index(['volkswagen', 'opel', 'bmw', 'mercedes_benz', 'audi', 'ford', 'renault',
       'peugeot', 'fiat', 'seat', 'skoda', 'mazda', 'nissan', 'smart',
       'citroen', 'toyota', 'sonstige_autos', 'hyundai', 'volvo', 'mini'],
      dtype='object')

In [324]:

#apply Series.value_counts() method to check the number of each brand
autos['brand'].value_counts()

Out[324]:

volkswagen        10687
opel               5461
bmw                5429
mercedes_benz      4734
audi               4283
ford               3479
renault            2404
peugeot            1456
fiat               1308
seat                941
skoda               786
mazda               757
nissan              754
smart               701
citroen             701
toyota              617
sonstige_autos      546
hyundai             488
volvo               457
mini                424
mitsubishi          406
honda               399
kia                 356
alfa_romeo          329
porsche             294
suzuki              293
chevrolet           283
chrysler            181
dacia               129
daihatsu            128
jeep                110
subaru              109
land_rover           99
saab                 80
daewoo               79
trabant              78
jaguar               77
rover                69
lancia               57
lada                 31
Name: brand, dtype: int64

Here we chose first 20 unique brands as these are constituted of huge number of vehicles (top brand volkswagen alone has double the number of vehicles than the next highest i.e. opel).

Below we will construct a dictionary with unique brand name as key and mean value of price as value.

In [325]:

#create a dictionary with unique brand name as key and mean value of price as value
dict_unique_brand = {}
for key in unique_brand:
    selected_brand = new_price[new_price['brand'] == key]
    dict_unique_brand[key] = selected_brand['price_$'].mean()
    
dict_unique_brand 

Out[325]:

{'volkswagen': 5354.53715064758,
 'opel': 2974.764503159104,
 'bmw': 8076.7559762860965,
 'mercedes_benz': 8130.28503046127,
 'audi': 8955.433939393939,
 'ford': 3631.9298507462686,
 'renault': 2411.617787418655,
 'peugeot': 3086.930281690141,
 'fiat': 2815.635782747604,
 'seat': 4348.652792990142,
 'skoda': 6394.309677419355,
 'mazda': 4075.319293478261,
 'nissan': 4681.94046008119,
 'smart': 3538.344927536232,
 'citroen': 3783.6788856304984,
 'toyota': 5148.0032733224225,
 'sonstige_autos': 9637.480725623584,
 'hyundai': 5416.23382045929,
 'volvo': 4911.680459770115,
 'mini': 10566.824940047962}

We need to sort the average price in the above dictionary so that we can check which unique brands have higher average price and which have lower average price.

In [326]:

#swap the key and value in the above dictionary and sort the values
dict_unique_brand_swap = dict([(value, key) for key, value in dict_unique_brand.items()])
values = dict_unique_brand_swap.items()
sorted_values = sorted(values, reverse=True)
sorted_values

Out[326]:

[(10566.824940047962, 'mini'),
 (9637.480725623584, 'sonstige_autos'),
 (8955.433939393939, 'audi'),
 (8130.28503046127, 'mercedes_benz'),
 (8076.7559762860965, 'bmw'),
 (6394.309677419355, 'skoda'),
 (5416.23382045929, 'hyundai'),
 (5354.53715064758, 'volkswagen'),
 (5148.0032733224225, 'toyota'),
 (4911.680459770115, 'volvo'),
 (4681.94046008119, 'nissan'),
 (4348.652792990142, 'seat'),
 (4075.319293478261, 'mazda'),
 (3783.6788856304984, 'citroen'),
 (3631.9298507462686, 'ford'),
 (3538.344927536232, 'smart'),
 (3086.930281690141, 'peugeot'),
 (2974.764503159104, 'opel'),
 (2815.635782747604, 'fiat'),
 (2411.617787418655, 'renault')]

The average price for the selected brands ranges from around $10000 to $2000. Brand mini has highest average price i.e. ~$10566. Next four brands with highest average price are sonstige_autos, audi, mercedes_benz and bmw. At the bottom of the list there are smart, peugeot, opel, fiat and the least average price is for renault i.e. ~$2411.

Storing Aggregate Data in a DataFrame¶

Let us take a look at the top 6 brands in the unique brand list of 20 brands.

In [327]:

top_brands = unique_brand[:6]
top_brands

Out[327]:

Index(['volkswagen', 'opel', 'bmw', 'mercedes_benz', 'audi', 'ford'], dtype='object')

If we look at the average prices of these 6 brands, we observe that:-

Audi, BMW and Mercedes Benz are more expensive
Ford and Opel are less expensive
Volkswagen is in between

For these brands we would like to check if there is a link between average mileage and mean price. Below we will construct dictionaries for mean price and mean mileage for these 6 brands.

In [328]:

mean_price = {}
mean_mileage = {}
for key in top_brands:
    selected_top_brands = new_price[new_price['brand'] == key]
    mean_price[key] = selected_top_brands['price_$'].mean()
    mean_mileage[key] = selected_top_brands['odometer_km'].mean()
mean_price

Out[328]:

{'volkswagen': 5354.53715064758,
 'opel': 2974.764503159104,
 'bmw': 8076.7559762860965,
 'mercedes_benz': 8130.28503046127,
 'audi': 8955.433939393939,
 'ford': 3631.9298507462686}

In [329]:

mean_mileage

Out[329]:

{'volkswagen': 129031.55127081508,
 'opel': 129442.84893739231,
 'bmw': 133076.11397972843,
 'mercedes_benz': 131554.61270670147,
 'audi': 130124.84848484848,
 'ford': 124489.55223880598}

It is very difficult to compare these two dictionaries. We need to construct a pandas series for both these and join them using pandas dataframe constructor. This will enable us easy comparision between the two columns (mean_price and mean_mileage).

In [330]:

#construct a series for mean price
mean_price_series = pd.Series(data=mean_price)
mean_price_series

Out[330]:

volkswagen       5354.537151
opel             2974.764503
bmw              8076.755976
mercedes_benz    8130.285030
audi             8955.433939
ford             3631.929851
dtype: float64

In [331]:

#construct a series for mean mileage
mean_mileage_series = pd.Series(data=mean_mileage)
mean_mileage_series

Out[331]:

volkswagen       129031.551271
opel             129442.848937
bmw              133076.113980
mercedes_benz    131554.612707
audi             130124.848485
ford             124489.552239
dtype: float64

In [332]:

#construct a dataframe for mean price
top_brand_data = pd.DataFrame(data=mean_price_series, columns=['price_mean'])
top_brand_data

Out[332]:

	price_mean
volkswagen	5354.537151
opel	2974.764503
bmw	8076.755976
mercedes_benz	8130.285030
audi	8955.433939
ford	3631.929851

In [333]:

#add mean mileage series to the abovem dataframe
top_brand_data['mileage_mean']= mean_mileage_series[:]
top_brand_data.astype('int64') #convert the data-type to int-64

Out[333]:

	price_mean	mileage_mean
volkswagen	5354	129031
opel	2974	129442
bmw	8076	133076
mercedes_benz	8130	131554
audi	8955	130124
ford	3631	124489

The mean mileage of all the top 6 brands are in the range125000 km - 133000 km. Highest mean mileage is for bmw, which is ~ 133076 km and the lowest is for ford, which is ~ 124489 km. The mean price is lowest for opel i.e. ~$2974 and highest for audi at ~$8955. We observe that the relation between mean price and mean mileage does not follow any pattern.

Further Data Cleaning¶

Translating German Words of Categorical Data to English¶

In here our goal is to translate the German words to English and map the values to their English counterparts. First we will look at the dataset and check the categorical data among each column and translate them to English.

In [334]:

#look at the first few rows of the dataframe and list the categorical data
autos.head()

Out[334]:

	date_crawled	name	seller	offer_type	price_$	ab_test	vehicle_type	registration_year	gear_box	power_ps	model	odometer_km	registration_month	fuel_type	brand	unrepaired_damage	date_created	postal_code	last_seen
0	2016-03-26	Peugeot_807_160_NAVTECH_ON_BOARD	privat	Angebot	5000	control	bus	2004	manuell	158	andere	150000	3	lpg	peugeot	nein	2016-03-26	79588	2016-04-06
1	2016-04-04	BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik	privat	Angebot	8500	control	limousine	1997	automatik	286	7er	150000	6	benzin	bmw	nein	2016-04-04	71034	2016-04-06
2	2016-03-26	Volkswagen_Golf_1.6_United	privat	Angebot	8990	test	limousine	2009	manuell	102	golf	70000	7	benzin	volkswagen	nein	2016-03-26	35394	2016-04-06
3	2016-03-12	Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...	privat	Angebot	4350	control	kleinwagen	2007	automatik	71	fortwo	70000	6	benzin	smart	nein	2016-03-12	33729	2016-03-15
4	2016-04-01	Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...	privat	Angebot	1350	test	kombi	2003	manuell	0	focus	150000	7	benzin	ford	nein	2016-04-01	39218	2016-04-01

Converting words in every column can be painstaking. Columns with car names, models and brands are quite huge and going through each of these is time consuming and laborious. We assume that these are standard for all countries, irrespective of the language and do not make any changes to them. Below we will consider other categorical data and replace them with English counterparts.

Below we will apply Series.value_counts() on every categorical data column and look at the values. Then we will replace the German words with appropriate English words. We will use either Series.str.replace() method or Series.map() method depending on the number of values to be replaced.

In [335]:

#apply Series.value_counts() method to seller column
autos['seller'].value_counts()

Out[335]:

privat        49999
gewerblich        1
Name: seller, dtype: int64

In [336]:

#apply Series.str.replace() method to replace with appropriate English words
autos['seller'] = autos['seller'].str.replace('privat', 'private')
autos['seller'] = autos['seller'].str.replace('gewerblich', 'commercial')

In [337]:

#apply Series.value_counts() method to offer_type column
autos['offer_type'].value_counts()

Out[337]:

Angebot    49999
Gesuch         1
Name: offer_type, dtype: int64

In [338]:

#apply Series.str.replace() method to replace with appropriate English words
autos['offer_type'] = autos['offer_type'].str.replace('Angebot', 'offer')
autos['offer_type'] = autos['offer_type'].str.replace('Gesuch', 'Application')

In [339]:

#apply Series.value_counts() method to gear_box column
autos['gear_box'].value_counts()

Out[339]:

manuell      36993
automatik    10327
Name: gear_box, dtype: int64

In [340]:

#apply Series.str.replace() method to replace with appropriate English words
autos['gear_box'] = autos['gear_box'].str.replace('manuell', 'manual')
autos['gear_box'] = autos['gear_box'].str.replace('automatik', 'automatic')

In [341]:

#apply Series.value_counts() method to fuel_type column
autos['fuel_type'].value_counts()

Out[341]:

benzin     30107
diesel     14567
lpg          691
cng           75
hybrid        37
andere        22
elektro       19
Name: fuel_type, dtype: int64

In [342]:

#apply Series.str.replace() method to replace with appropriate English words
autos['fuel_type'] = autos['fuel_type'].str.replace('benzin', 'petrol')
autos['fuel_type'] = autos['fuel_type'].str.replace('andere', 'other')
autos['fuel_type'] = autos['fuel_type'].str.replace('elektro', 'electric')

In [343]:

#apply Series.value_counts() method to unrepaired_damage column
autos['unrepaired_damage'].value_counts()

Out[343]:

nein    35232
ja       4939
Name: unrepaired_damage, dtype: int64

In [344]:

#apply Series.str.replace() method to replace with appropriate English words
autos['unrepaired_damage'] = autos['unrepaired_damage'].str.replace('nein', 'no')
autos['unrepaired_damage'] = autos['unrepaired_damage'].str.replace('ja', 'yes')

In [345]:

#apply Series.value_counts() method to vehicle_type column
autos['vehicle_type'].value_counts()

Out[345]:

limousine     12859
kleinwagen    10822
kombi          9127
bus            4093
cabrio         3061
coupe          2537
suv            1986
andere          420
Name: vehicle_type, dtype: int64

In [346]:

#create a mapping dictionary and map it to the vehicle_type column
mapping_dict = {'limousine': 'sedan', 
                'kleinwagen':'small car', 
                'kombi':'station wagon',
                'cabrio':'convertible',
                'andere':'other',
                'bus': 'bus',
                'coupe': 'Coupe',
                'suv':'suv'}
autos['vehicle_type'] = autos['vehicle_type'].map(mapping_dict)

In [347]:

#apply Series.str.replace() method 
autos['model'] = autos['model'].str.replace('andere', 'other')

In [348]:

#print first few rows of the dataframe
autos.head()

Out[348]:

	date_crawled	name	seller	offer_type	price_$	ab_test	vehicle_type	registration_year	gear_box	power_ps	model	odometer_km	registration_month	fuel_type	brand	unrepaired_damage	date_created	postal_code	last_seen
0	2016-03-26	Peugeot_807_160_NAVTECH_ON_BOARD	private	offer	5000	control	bus	2004	manual	158	other	150000	3	lpg	peugeot	no	2016-03-26	79588	2016-04-06
1	2016-04-04	BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik	private	offer	8500	control	sedan	1997	automatic	286	7er	150000	6	petrol	bmw	no	2016-04-04	71034	2016-04-06
2	2016-03-26	Volkswagen_Golf_1.6_United	private	offer	8990	test	sedan	2009	manual	102	golf	70000	7	petrol	volkswagen	no	2016-03-26	35394	2016-04-06
3	2016-03-12	Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...	private	offer	4350	control	small car	2007	automatic	71	fortwo	70000	6	petrol	smart	no	2016-03-12	33729	2016-03-15
4	2016-04-01	Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...	private	offer	1350	test	station wagon	2003	manual	0	focus	150000	7	petrol	ford	no	2016-04-01	39218	2016-04-01

We printed the few rows of corrected dataset inorder to check if the changes have been implimented properly. It looks fine.

Converting the Dates into Uniform Numeric Data¶

The dataset has three date columns date_crawled, date_created and last_seen. Let us take a look at these columns again.

In [349]:

autos[['date_crawled', 'date_created', 'last_seen']][:5]

Out[349]:

	date_crawled	date_created	last_seen
0	2016-03-26	2016-03-26	2016-04-06
1	2016-04-04	2016-04-04	2016-04-06
2	2016-03-26	2016-03-26	2016-04-06
3	2016-03-12	2016-03-12	2016-03-15
4	2016-04-01	2016-04-01	2016-04-01

Our aim is to convert these dates into uniform numeric data which should be in an interger form. To impliment this we will use Series.str.replace() method and convert the datatype to integer.

In [350]:

#aplly Series.str.replace() and Series.astype() methods on all the three columns
autos['date_crawled'] = autos['date_crawled'].str.replace('-', '')
autos['date_crawled'].astype('int')

autos['date_created'] = autos['date_created'].str.replace('-', '')
autos['date_created'].astype('int')

autos['last_seen'] = autos['last_seen'].str.replace('-', '')
autos['last_seen'].astype('int')

#check the three columns
autos[['date_crawled', 'date_created', 'last_seen']][:5]

Out[350]:

	date_crawled	date_created	last_seen
0	20160326	20160326	20160406
1	20160404	20160404	20160406
2	20160326	20160326	20160406
3	20160312	20160312	20160315
4	20160401	20160401	20160401

We have succesfully converted the three date columns into unifrm numeric data in an integer form.

Checking for Particular Keywords in the Name Column¶

In here we are going to investigate if there are any keywords in the name column which can be extracted as a new column.

In [351]:

#look at the first few rows of the name column
autos['name'].head()

Out[351]:

0                     Peugeot_807_160_NAVTECH_ON_BOARD
1           BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik
2                           Volkswagen_Golf_1.6_United
3    Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...
4    Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...
Name: name, dtype: object

From the name column data, we observe that the zeroth position indicates the brand name of the car. First position indicates the model name. Second position onwards indicates Engine details. Since the Engine details of the cars are not given in the dataframe we can extract this as a new column.

In [352]:

#extract the engine details
autos['engine'] = autos['name'].str.replace('/','_').str.split('_').str[2:]
    
#using Series.str.join() method join the extracted values
autos['engine'] = autos['engine'].str.join(" ")
autos['engine']

Out[352]:

0                                     160 NAVTECH ON BOARD
1                        4 4 Liter HAMANN UMBAU Mega Optik
2                                               1.6 United
3                  fortwo coupe softouch F1 Klima Panorama
4        1 6 Benzin TÜV neu ist sehr gepflegt.mit Klima...
                               ...                        
49995             3.0 TDI qu. S tr.  Navi  Panorama  Xenon
49996     F Cabrio Bertone Edition   TÜV neu+Reifen neu !!
49997                                C 1.2 Dualogic Lounge
49998                           2.0 TDI Sportback Ambition
49999                                              1.6 16V
Name: engine, Length: 50000, dtype: object

Below we will extract the first two positions of the name column and construct a new column called brand_model. This column has all the brand and model names together.

In [353]:

#using Series.str.replace() method and slicing extract the first two position of the name column
autos['brand_model'] = autos['name'].str.replace('/','_').str.replace('.', '_').str.split('_').str[:2]
    
#using Series.str.join() method join the extracted values
autos['brand_model'] = autos['brand_model'].str.join("_")
autos['brand_model']

Out[353]:

0            Peugeot_807
1               BMW_740i
2        Volkswagen_Golf
3            Smart_smart
4             Ford_Focus
              ...       
49995            Audi_Q5
49996         Opel_Astra
49997           Fiat_500
49998            Audi_A3
49999        Opel_Vectra
Name: brand_model, Length: 50000, dtype: object

We have succesfully extracted a column from the name column. We have named it engine, it has engine details of the car. Let us have a look at it in the main dataset.

In [354]:

autos.head()

Out[354]:

	date_crawled	name	seller	offer_type	price_$	ab_test	vehicle_type	registration_year	gear_box	power_ps	...	registration_month	fuel_type	brand	unrepaired_damage	date_created	postal_code	last_seen	engine	brand_model
0	20160326	Peugeot_807_160_NAVTECH_ON_BOARD	private	offer	5000	control	bus	2004	manual	158	...	3	lpg	peugeot	no	20160326	79588	20160406	160 NAVTECH ON BOARD	Peugeot_807
1	20160404	BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik	private	offer	8500	control	sedan	1997	automatic	286	...	6	petrol	bmw	no	20160404	71034	20160406	4 4 Liter HAMANN UMBAU Mega Optik	BMW_740i
2	20160326	Volkswagen_Golf_1.6_United	private	offer	8990	test	sedan	2009	manual	102	...	7	petrol	volkswagen	no	20160326	35394	20160406	1.6 United	Volkswagen_Golf
3	20160312	Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...	private	offer	4350	control	small car	2007	automatic	71	...	6	petrol	smart	no	20160312	33729	20160315	fortwo coupe softouch F1 Klima Panorama	Smart_smart
4	20160401	Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...	private	offer	1350	test	station wagon	2003	manual	0	...	7	petrol	ford	no	20160401	39218	20160401	1 6 Benzin TÜV neu ist sehr gepflegt.mit Klima...	Ford_Focus

5 rows × 22 columns

Further Analysis on the Dataset¶

Finding the Most Common brand/model Combinations¶

We have already constructed a column called brand_model with brand names and corresponding model names. We just have to count all the unique values in this column.

In [355]:

#apply Series.value_counts() method to calculate the count of each value
autos['brand_model'].value_counts()

Out[355]:

Mercedes_Benz              3219
Volkswagen_Golf            1837
Opel_Corsa                 1265
Opel_Astra                 1123
Audi_A4                    1044
                           ... 
Leasing_Übernahme             1
DACIA_SANDERO                 1
Opel_Corca                    1
Reparaturanleitung_Vito       1
Smart_turbo                   1
Name: brand_model, Length: 7544, dtype: int64

The most common brand_model combination is Mercedes_Benz, it has a total of 3219 cars. Second highest is Volkswagen_Golf. This combinatin has a total 1837 cars. Third highest is Opel_Corsa, this has total of 1265 combinations.

Comparing Mean Price of Aggregated Odometer Groups¶

Here we will create an aggregated odometer groups and analyse the relationship between the mean price and the mileage. First let us group the odometer_km column.

In [356]:

#calculate the minimum
autos['odometer_km'].min()

Out[356]:

In [357]:

#calculate the maximum
autos['odometer_km'].max()

Out[357]:

Here we will segregate the column odometer_km into four groups using pd.cut() method. These four groups are:

Low (ranging from 5000 - 40000)
Mediam (ranging from 40000 - 80000)
High (ranging from 80000 - 120000)
Very High (ranging from 120000 - 150000)

Next we create an empty dictionary and take each group as it's key and average price of each group as it's value.

In [358]:

#group the odometer_km values using pd.cut() method
bins = [5000, 40000, 80000, 120000, 150000]
group_names = ['Low', 'Medium', 'High', 'Very High']
autos['odom_group'] = pd.cut(autos['odometer_km'], bins, labels=group_names)
odom_group = autos['odom_group'].unique() #make a list of unique values of the series

#create an empty dictionary
mileage_avg_price = {}

#loop through the odom_group, take each group as key and average price as value
for key in odom_group:
    mileage_avg_price[key] = autos.loc[autos['odom_group']==key, 'price_$'].mean()
    
mileage_avg_price

Out[358]:

{'Very High': 7994.325823269671,
 'Medium': 13847.910232653901,
 'Low': 27000.685993975905,
 'High': 10562.586856851758,
 nan: nan}

The average price for the group Low is highest i.e. ~ 27000. The average price for the group Medium is next highest i.e. ~ 13847. The average price for the group High is third highest i.e. ~ 10562 and the average price for the group Very High is the last of all i.e. ~ 7994. We observe that the car which has covered less distance has higher price and as the distance covered increases the price of the car reduces.

Relation Between Price and Damage/Non-Damage of Cars¶

In here we want to find out how much cheaper are the cars with damage (not repaired) than their non-damaged (repaired) counterparts. We will only consider the top six brands for this analysis as it will be easier to compare these.

In [359]:

#create two empty dictionaries, one for not repaired and the other for repaired cars
damage_data_no = {}
damage_data_yes = {}

#loop through the top_brands
for key in top_brands:
    select_brands = autos[autos['brand']==key]  #take brand as dictionary key
    #consider mean price as value
    damage_data_no[key] = select_brands.loc[select_brands['unrepaired_damage']=='no', 'price_$'].mean()    
    damage_data_yes[key] = select_brands.loc[select_brands['unrepaired_damage']=='yes', 'price_$'].mean()   

damage_data_no

Out[359]:

{'volkswagen': 6307.9194005670315,
 'opel': 7013.604521350823,
 'bmw': 9266.727139004657,
 'mercedes_benz': 9635.512065095398,
 'audi': 10690.226426994639,
 'ford': 9925.713718570798}

In [360]:

damage_data_yes

Out[360]:

{'volkswagen': 3081.8205128205127,
 'opel': 1291.3265625,
 'bmw': 3287.640625,
 'mercedes_benz': 3801.075630252101,
 'audi': 3131.905817174515,
 'ford': 1303.7013274336284}

Above analysis shows that in general the price of the unrepaired cars is lower than the price of the repaired cars. The average price of unrepaired volkswagen is ~ $3000 cheaper than the repaired volkswagen. Other brands also follow the same suit. opel, bmw and mercedes benz are ~ $6000 cheaper. ford is ~ $8000 cheaper. audi is ~ $7000 cheaper.

Conclusions¶

In conclusion, the project was to work with a data set containing the details of used cars from eBay Klenanzeigen a classified section of German eBay website. The main objectives of the project were Data Cleaning/Exploration and Data Analysis :-

Data Cleaning - We have performed data cleaning/exploration wherever it was required. We cleaned the column names from CamelCase to SnakeCase and changed the names of few columns for better clarity. We explored both the price and odometer column by examining the statistics of the column-values and dropping unrealistic values. We explored the date columns. We dropped all the incorrect registration year column values. We translated the German words of categorical data into English words wherever possible. Converted the dates of all the date columns into uniform numeric data. We extracted two new columns, one Engine with the engine details of cars and the other brand/model with the brand and model of a car together in one column from the names column.
Data Analysis - We have performed several analysis on the data-set. They are listed below.

Exploring Price by Brand:- We selected top 20 brands and calculated theaverage price of all these brands. The average price for the these brands ranges from around $10000 to $2000. Brand mini has highest average price i.e. ~$10566. Next four brands with highest average price are sonstige_autos, audi, mercedes_benz and bmw. At the bottom of the list there are smart, peugeot, opel, fiat and the least average price is for renault i.e. ~$2411.
Mean Price & Mean Mileage of Top Six Brands:- We calculated the mean mileage and mean price for six unique brands (volkswagen, opel, bmw, mercedes benz and ford). The mean mileage of all these brands are in the range 125000 km - 133000 km. Highest mean mileage is for bmw, which is ~ 133076 km and the lowest is for ford, which is ~ 124489 km. The mean price is lowest for opel i.e. ~$2974 and highest for audi at ~$8955.
Most Common brand/model Combinations:- We calculated the most common brand/model combinations for cars. And found that Mercedes_Benz tops the list with 3219 cars. Next hghest is Volkswagen_Golf with 1837 cars.
Mean Price of Aggregated Odometer Groups:- In here we segregated the odometer_km column into four groups depending on the distance covered by the cars. The average price for the group Low is highest i.e. ~ 27000. The average price for the group Medium is next highest i.e. ~ 13847. The average price for the group High is third highest i.e. ~ 10562 and the average price for the group Very High is the last of all i.e. ~ 7994. We observe that the cars which have covered less distance have higher prices and as the distance covered increases the price of these cars reduces.
Relation Between Price and Damage/Non-Damage of Cars:- From the analysis we observe that in general the price of the unrepaired cars is lower than the price of the repaired cars. The average price of unrepaired volkswagen is ~ $3000 cheaper than the repaired volkswagen. Other brands also follow the same suit. opel, bmw and mercedes benz are ~ $6000 cheaper. ford is ~ $8000 cheaper. audi is ~ $7000 cheaper.