Column attributes:
dateCrawled
- When this ad was first crawled. All field values are taken from this date.
name
- Name of the car.
seller
- Whether the seller is private or a dealer.
offerType
- The type of listing.
price
- The price on the ad to sell the car.
abtest
- Whether the listing is included in an A/B test.
vehicleType
- The vehicle type.
yearOfRegistration
- The year in which the car was first registered.
gearbox
- The transmission type.
powerPS
- The power of the car in PS.
model
- The car model name.
kilometer
- How many kilometers the car has driven.
monthOfRegistration
- The month in which the car was first registered.
fuelType
- What type of fuel the car uses.
brand
- The brand of the car.
notRepairedDamage
- Whether the car has damage which is not yet repaired.
dateCreated
- The date on which the eBay listing was created.
nrOfPictures
- The number of pictures in the ad.
postalCode
- The postal code for the location of the vehicle.
lastSeenOnline
- When the crawler last saw this ad online.
AIM: To use the pandas library to clean and analyze the dataset.
# importing the required libraries:
import numpy as np
import pandas as pd
# reading the file into a dataframe using pandas:
autos = pd.read_csv("autos.csv", encoding= 'Latin-1')
# peeking at the data:
autos.head()
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
# preliminary analysis using pandas:
autos.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): dateCrawled 50000 non-null object name 50000 non-null object seller 50000 non-null object offerType 50000 non-null object price 50000 non-null object abtest 50000 non-null object vehicleType 44905 non-null object yearOfRegistration 50000 non-null int64 gearbox 47320 non-null object powerPS 50000 non-null int64 model 47242 non-null object odometer 50000 non-null object monthOfRegistration 50000 non-null int64 fuelType 45518 non-null object brand 50000 non-null object notRepairedDamage 40171 non-null object dateCreated 50000 non-null object nrOfPictures 50000 non-null int64 postalCode 50000 non-null int64 lastSeen 50000 non-null object dtypes: int64(5), object(15) memory usage: 7.6+ MB
autos.head()
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
The column names use camelCase instead of snake_case. Since snake_case is easier to read and work with, let's convert the column names to snake_case.
# analysing the columns:
autos.columns
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model', 'odometer', 'monthOfRegistration', 'fuelType', 'brand', 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object')
# defining a function to change the column-names:
def col_clean(col):
    # special-case renames first:
    col = col.replace("yearOfRegistration","registration_year")
    col = col.replace("monthOfRegistration","registration_month")
    col = col.replace("notRepairedDamage","unrepaired_damage")
    col = col.replace("dateCreated","ad_created")
    # generic camelCase -> snake_case conversion:
    name = [col[0].lower()]
    for c in col[1:]:
        if c in 'ABCDEFGHIJKLMNOPQRSTUVWXYZ':
            name.append("_")
            name.append(c.lower())
        else:
            name.append(c)
    return ''.join(name)
## checking the function:
print(col_clean('fuelType'))
fuel_type
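As an aside, the character-by-character loop in col_clean can also be written as a one-line regular expression. This is only a sketch of the generic camelCase-to-snake_case step; the special-case renames at the top of col_clean would still be applied separately:

```python
import re

def camel_to_snake(col):
    # insert an underscore before every capital letter (except at the
    # start of the string), then lowercase the whole name
    return re.sub(r'(?<!^)(?=[A-Z])', '_', col).lower()

print(camel_to_snake('fuelType'))   # fuel_type
print(camel_to_snake('powerPS'))    # power_p_s
```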
# changing the col names using our function:
new_cols = []
for c in autos.columns:
    new_cols.append(col_clean(c))
# checking the new column names:
print(new_cols)
['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest', 'vehicle_type', 'registration_year', 'gearbox', 'power_p_s', 'model', 'odometer', 'registration_month', 'fuel_type', 'brand', 'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code', 'last_seen']
# reassigning the new col back to our dataset:
autos.columns = new_cols
autos.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): date_crawled 50000 non-null object name 50000 non-null object seller 50000 non-null object offer_type 50000 non-null object price 50000 non-null object abtest 50000 non-null object vehicle_type 44905 non-null object registration_year 50000 non-null int64 gearbox 47320 non-null object power_p_s 50000 non-null int64 model 47242 non-null object odometer 50000 non-null object registration_month 50000 non-null int64 fuel_type 45518 non-null object brand 50000 non-null object unrepaired_damage 40171 non-null object ad_created 50000 non-null object nr_of_pictures 50000 non-null int64 postal_code 50000 non-null int64 last_seen 50000 non-null object dtypes: int64(5), object(15) memory usage: 7.6+ MB
**Basic data exploration:**
Now let's do some basic data exploration to determine what other cleaning tasks need to be done. Initially, we'll look for text columns where almost all values are the same (candidates for dropping) and numeric data stored as text that needs to be cleaned and converted.
# basic dataset analysis:
autos.describe(include='all') # we use (include = 'all') to get the numerical and non numerical columns
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_p_s | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-03-23 19:38:20 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
Let's further analyse the two numeric-looking columns that are currently stored as text, 'price' and 'odometer':
## the price column :
autos['price'].value_counts()
$0 1421 $500 781 $1,500 734 $2,500 643 $1,200 639 $1,000 639 $600 531 $3,500 498 $800 498 $2,000 460 $999 434 $750 433 $900 420 $650 419 $850 410 $700 395 $4,500 394 $300 384 $2,200 382 $950 379 $1,100 376 $1,300 371 $3,000 365 $550 356 $1,800 355 $5,500 340 $1,250 335 $350 335 $1,600 327 $1,999 322 ... $9,577 1 $5,685 1 $19,970 1 $26,150 1 $10,498 1 $2,989 1 $7,825 1 $9,725 1 $554 1 $8,875 1 $20,799 1 $11,140 1 $295,000 1 $185 1 $33,777 1 $12,395 1 $6,155 1 $20,985 1 $12,898 1 $219 1 $251 1 $410 1 $2,004 1 $18,890 1 $25,850 1 $11,360 1 $2,986 1 $24,444 1 $3,620 1 $6,202 1 Name: price, Length: 2357, dtype: int64
## the odometer column:
autos['odometer'].value_counts()
150,000km 32424 125,000km 5170 100,000km 2169 90,000km 1757 80,000km 1436 70,000km 1230 60,000km 1164 50,000km 1027 5,000km 967 40,000km 819 30,000km 789 20,000km 784 10,000km 264 Name: odometer, dtype: int64
## removing the '$' from price values and 'km' from the odometer readings and converting them to numeric columns:
autos['price'] = (autos['price']
                  .str.replace("$","")
                  .str.replace(",","")
                  .astype(int)
                  )
autos['odometer'] = (autos['odometer']
                     .str.replace("km","")
                     .str.replace(",","")
                     .astype(int)
                     )
## changing the column name to 'odometer_km':
autos.rename(columns={"odometer":"odometer_km"},inplace = True)
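One caveat with the cleanup above: the default value of the regex parameter of Series.str.replace has changed across pandas versions (historically True, False since pandas 2.0), and with regex=True the pattern "$" is an end-of-string anchor rather than a literal dollar sign. Passing regex=False explicitly makes the literal replacement unambiguous; a self-contained sketch on demo data:

```python
import pandas as pd

demo = pd.DataFrame({"price": ["$5,000", "$8,500"],
                     "odometer": ["150,000km", "70,000km"]})

# literal (non-regex) replacement of the symbols, then cast to int
demo["price"] = (demo["price"]
                 .str.replace("$", "", regex=False)
                 .str.replace(",", "", regex=False)
                 .astype(int))
demo["odometer"] = (demo["odometer"]
                    .str.replace("km", "", regex=False)
                    .str.replace(",", "", regex=False)
                    .astype(int))

print(demo["price"].tolist())     # [5000, 8500]
print(demo["odometer"].tolist())  # [150000, 70000]
```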
Analysing some of the remaining columns, namely 'seller', 'offer_type' and 'nr_of_pictures':
autos['seller'].value_counts(dropna=False)
privat 49999 gewerblich 1 Name: seller, dtype: int64
autos['offer_type'].value_counts(dropna=False)
Angebot 49999 Gesuch 1 Name: offer_type, dtype: int64
autos['nr_of_pictures'].value_counts(dropna=False)
0 50000 Name: nr_of_pictures, dtype: int64
We see that each of the columns above is dominated by a single value (or contains only one value), so they carry almost no information; let's drop them before further analysis:
autos = autos.drop(['seller', 'offer_type','nr_of_pictures'], axis=1)
**Exploring the 'price' and 'odometer_km' columns for outliers:**
## exploring the price and odometer_km columns to further look for outliers in the data entries:
autos['price'].unique().shape
(2357,)
autos['price'].describe()
count 5.000000e+04 mean 9.840044e+03 std 4.811044e+05 min 0.000000e+00 25% 1.100000e+03 50% 2.950000e+03 75% 7.200000e+03 max 1.000000e+08 Name: price, dtype: float64
autos['price'].value_counts().sort_index().head(5)
0 1421 1 156 2 3 3 1 5 2 Name: price, dtype: int64
autos['price'].value_counts().sort_index().tail(10)
999990 1 999999 2 1234566 1 1300000 1 3890000 1 10000000 1 11111111 2 12345678 3 27322222 1 99999999 1 Name: price, dtype: int64
We see that some listings are priced far above any realistic market value, and 1,421 listings are priced at $0. These are outliers that we should remove before further analysis.
## Removing the outliers in the price column of the dataset:
autos = autos.loc[autos['price'].between(1,4500000),:]
Let's check if the outliers in the 'price' column have been removed:
autos['price'].value_counts().sort_index().tail(10)
265000 1 295000 1 299000 1 345000 1 350000 1 999990 1 999999 2 1234566 1 1300000 1 3890000 1 Name: price, dtype: int64
We see that the maximum vehicle price in the dataset has now been reduced to ~3.89 million, which is far more realistic for an open-market listing.
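The cutoff of 4,500,000 used above is a judgment call. A common, more mechanical alternative (shown here only as a sketch on toy data, not what was applied to autos) is the 1.5×IQR rule:

```python
import pandas as pd

# toy prices standing in for the real column; the last value is an outlier
prices = pd.Series([0, 500, 1200, 2950, 7200, 12000, 99999999])

q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# keep only values within [q1 - 1.5*IQR, q3 + 1.5*IQR]
kept = prices[prices.between(lower, upper)]
print(kept.tolist())  # [0, 500, 1200, 2950, 7200, 12000]
```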
## exploring the odometer_km column for outliers:
autos['odometer_km'].value_counts().sort_index().tail(10)
30000 780 40000 815 50000 1013 60000 1155 70000 1217 80000 1415 90000 1734 100000 2115 125000 5058 150000 31417 Name: odometer_km, dtype: int64
We see that the odometer column has reasonable values.
# basic information about the columns:
autos[['date_crawled','last_seen', 'ad_created', 'registration_month' , 'registration_year']].info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 48571 entries, 0 to 49999 Data columns (total 5 columns): date_crawled 48571 non-null object last_seen 48571 non-null object ad_created 48571 non-null object registration_month 48571 non-null int64 registration_year 48571 non-null int64 dtypes: int64(2), object(3) memory usage: 2.2+ MB
We see that the 'date_crawled', 'last_seen' and 'ad_created' columns are all stored as strings (object dtype) by pandas. To analyse them quantitatively, we'll extract their date components and look at their distributions. The other two columns are already numeric.
autos[['date_crawled','last_seen', 'ad_created', 'registration_month' , 'registration_year']].head(5)
date_crawled | last_seen | ad_created | registration_month | registration_year | |
---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | 2016-04-06 06:45:54 | 2016-03-26 00:00:00 | 3 | 2004 |
1 | 2016-04-04 13:38:56 | 2016-04-06 14:45:08 | 2016-04-04 00:00:00 | 6 | 1997 |
2 | 2016-03-26 18:57:24 | 2016-04-06 20:15:37 | 2016-03-26 00:00:00 | 7 | 2009 |
3 | 2016-03-12 16:58:10 | 2016-03-15 03:16:28 | 2016-03-12 00:00:00 | 6 | 2007 |
4 | 2016-04-01 14:38:50 | 2016-04-01 14:38:50 | 2016-04-01 00:00:00 | 7 | 2003 |
The 'date_crawled', 'last_seen' and 'ad_created' columns end with a time-of-day component, which we can strip before further analysis:
## selecting only the first 10 characters from the str-time columns:
autos['date_crawled'] = autos['date_crawled'].str[:10]
autos['last_seen'] = autos['last_seen'].str[:10]
autos['ad_created'] = autos['ad_created'].str[:10]
autos[['date_crawled','last_seen', 'ad_created', 'registration_month' , 'registration_year']].head(5)
date_crawled | last_seen | ad_created | registration_month | registration_year | |
---|---|---|---|---|---|
0 | 2016-03-26 | 2016-04-06 | 2016-03-26 | 3 | 2004 |
1 | 2016-04-04 | 2016-04-06 | 2016-04-04 | 6 | 1997 |
2 | 2016-03-26 | 2016-04-06 | 2016-03-26 | 7 | 2009 |
3 | 2016-03-12 | 2016-03-15 | 2016-03-12 | 6 | 2007 |
4 | 2016-04-01 | 2016-04-01 | 2016-04-01 | 7 | 2003 |
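Slicing off the first 10 characters keeps these columns as strings. An alternative sketch converts them to real datetime values with pd.to_datetime, which also enables the .dt accessors for later date arithmetic:

```python
import pandas as pd

demo = pd.DataFrame({"date_crawled": ["2016-03-26 17:47:46",
                                      "2016-04-04 13:38:56"]})

# parse the full timestamp, then keep only the calendar date
demo["date_crawled"] = pd.to_datetime(demo["date_crawled"]).dt.date

print([str(d) for d in demo["date_crawled"]])  # ['2016-03-26', '2016-04-04']
```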
# for 'date_crawled' column:
autos['date_crawled'].value_counts(normalize =True, dropna = False)
## To include missing values in the distribution and to use percentages instead of counts
2016-04-03 0.038603 2016-03-20 0.037883 2016-03-21 0.037389 2016-03-12 0.036915 2016-03-14 0.036544 2016-04-04 0.036524 2016-03-07 0.036009 2016-04-02 0.035474 2016-03-28 0.034856 2016-03-19 0.034774 2016-03-15 0.034280 2016-03-29 0.034115 2016-04-01 0.033683 2016-03-30 0.033683 2016-03-08 0.033291 2016-03-09 0.033086 2016-03-22 0.033003 2016-03-11 0.032571 2016-03-23 0.032221 2016-03-26 0.032200 2016-03-10 0.032180 2016-03-31 0.031830 2016-03-17 0.031644 2016-03-25 0.031603 2016-03-27 0.031089 2016-03-16 0.029606 2016-03-24 0.029338 2016-03-05 0.025324 2016-03-13 0.015668 2016-03-06 0.014041 2016-04-05 0.013094 2016-03-18 0.012909 2016-04-06 0.003171 2016-04-07 0.001400 Name: date_crawled, dtype: float64
The date_crawled column records the date on which the crawler first saw each ad. The distribution above shows the share of listings crawled on each day: the site was crawled roughly daily over about a month (early March to early April 2016), with each day accounting for a fairly even ~3% of the listings.
# for the 'ad_created' column:
autos['ad_created'].value_counts(normalize =True, dropna = False)
2016-04-03 0.038850 2016-03-20 0.037944 2016-03-21 0.037594 2016-04-04 0.036894 2016-03-12 0.036750 2016-03-14 0.035186 2016-04-02 0.035144 2016-03-28 0.034980 2016-03-07 0.034733 2016-03-29 0.034053 2016-03-15 0.034012 2016-03-19 0.033683 2016-04-01 0.033683 2016-03-30 0.033497 2016-03-08 0.033312 2016-03-09 0.033147 2016-03-11 0.032900 2016-03-22 0.032818 2016-03-26 0.032262 2016-03-23 0.032056 2016-03-10 0.031891 2016-03-31 0.031871 2016-03-25 0.031747 2016-03-17 0.031294 2016-03-27 0.030986 2016-03-16 0.030121 2016-03-24 0.029277 2016-03-05 0.022894 2016-03-13 0.017006 2016-03-06 0.015318 ... 2016-02-14 0.000041 2016-02-05 0.000041 2016-02-20 0.000041 2016-02-12 0.000041 2016-01-10 0.000041 2016-02-24 0.000041 2016-02-02 0.000041 2016-02-18 0.000041 2016-02-26 0.000041 2016-01-13 0.000021 2016-02-16 0.000021 2015-12-30 0.000021 2016-02-08 0.000021 2016-01-29 0.000021 2016-02-11 0.000021 2016-02-09 0.000021 2016-01-07 0.000021 2016-02-22 0.000021 2016-01-22 0.000021 2015-06-11 0.000021 2016-02-17 0.000021 2015-09-09 0.000021 2016-01-16 0.000021 2016-01-03 0.000021 2015-12-05 0.000021 2016-01-14 0.000021 2015-08-10 0.000021 2016-02-01 0.000021 2016-02-07 0.000021 2015-11-10 0.000021 Name: ad_created, Length: 76, dtype: float64
# for the 'last_seen' column:
autos['last_seen'].value_counts(normalize =True, dropna = False)
2016-04-06 0.221799 2016-04-07 0.131931 2016-04-05 0.124786 2016-03-17 0.028083 2016-04-03 0.025200 2016-04-02 0.024933 2016-03-30 0.024768 2016-04-04 0.024480 2016-03-12 0.023780 2016-03-31 0.023780 2016-04-01 0.022791 2016-03-29 0.022359 2016-03-22 0.021371 2016-03-28 0.020877 2016-03-20 0.020650 2016-03-21 0.020630 2016-03-24 0.019765 2016-03-25 0.019209 2016-03-23 0.018530 2016-03-26 0.016800 2016-03-16 0.016450 2016-03-15 0.015874 2016-03-19 0.015832 2016-03-27 0.015647 2016-03-14 0.012600 2016-03-11 0.012374 2016-03-10 0.010665 2016-03-09 0.009594 2016-03-13 0.008894 2016-03-08 0.007412 2016-03-18 0.007350 2016-03-07 0.005394 2016-03-06 0.004324 2016-03-05 0.001071 Name: last_seen, dtype: float64
The last_seen column records the date on which the crawler last saw a listing online, which can serve as a rough proxy for when a car was sold or the ad removed. The last three crawl days (2016-04-05 to 2016-04-07) account for a disproportionate share (~48%) of the values, but this reflects the end of the crawl period rather than a spike in sales. Note that date_crawled, ad_created and last_seen all fall within March-April 2016 because that is when the crawl ran; the data covers a single crawl window, not seasonal patterns in car sales.
# exploring the registration month and year columns:
autos[['registration_month', 'registration_year']].describe()
registration_month | registration_year | |
---|---|---|
count | 48571.000000 | 48571.000000 |
mean | 5.782319 | 2004.754257 |
std | 3.685700 | 88.638548 |
min | 0.000000 | 1000.000000 |
25% | 3.000000 | 1999.000000 |
50% | 6.000000 | 2004.000000 |
75% | 9.000000 | 2008.000000 |
max | 12.000000 | 9999.000000 |
We need to limit the registration year to the range 1900-2016, since the data was crawled in 2016 and a car cannot be registered after it was listed. The minimum of '1000' and maximum of '9999' seen above are clearly invalid and need to be removed.
# limiting the registration year in our data:
autos = autos.loc[autos['registration_year'].between(1900,2016),:]
[autos['registration_year'].value_counts(dropna = False) > 1900]
[2000 True 2005 True 1999 True 2004 True 2003 True 2006 True 2001 True 2002 True 1998 True 2007 True 2008 True 2009 True 1997 True 2011 False 2010 False 1996 False 2012 False 1995 False 2016 False 2013 False 2014 False 1994 False 1993 False 2015 False 1992 False 1990 False 1991 False 1989 False 1988 False 1985 False ... 1966 False 1976 False 1969 False 1975 False 1965 False 1964 False 1963 False 1959 False 1961 False 1910 False 1956 False 1958 False 1937 False 1962 False 1950 False 1954 False 1941 False 1951 False 1934 False 1957 False 1955 False 1953 False 1943 False 1929 False 1939 False 1938 False 1948 False 1927 False 1931 False 1952 False Name: registration_year, Length: 78, dtype: bool]
The period 1997-2009 appears to be the boom period for the used-car market in this dataset, with more than 1900 listed cars per registration year during that span.
Registration month cannot have a minimum value of 0, since months are numbered from 1 (January) to 12 (December); a value of 0 most likely marks a missing entry.
## dropping the rows whose registration month falls outside the valid range (1-12):
autos = autos.loc[autos['registration_month'].between(1,12),:]
## final exploration of the columns:
autos[['registration_month', 'registration_year']].describe()
registration_month | registration_year | |
---|---|---|
count | 42661.000000 | 42661.000000 |
mean | 6.377113 | 2003.194815 |
std | 3.352061 | 6.941008 |
min | 1.000000 | 1927.000000 |
25% | 3.000000 | 1999.000000 |
50% | 6.000000 | 2004.000000 |
75% | 9.000000 | 2008.000000 |
max | 12.000000 | 2016.000000 |
## Exploring the changes in the dataset that we made already:
autos.head(10)
date_crawled | name | price | abtest | vehicle_type | registration_year | gearbox | power_p_s | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 | Peugeot_807_160_NAVTECH_ON_BOARD | 5000 | control | bus | 2004 | manuell | 158 | andere | 150000 | 3 | lpg | peugeot | nein | 2016-03-26 | 79588 | 2016-04-06 |
1 | 2016-04-04 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | 8500 | control | limousine | 1997 | automatik | 286 | 7er | 150000 | 6 | benzin | bmw | nein | 2016-04-04 | 71034 | 2016-04-06 |
2 | 2016-03-26 | Volkswagen_Golf_1.6_United | 8990 | test | limousine | 2009 | manuell | 102 | golf | 70000 | 7 | benzin | volkswagen | nein | 2016-03-26 | 35394 | 2016-04-06 |
3 | 2016-03-12 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | 4350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70000 | 6 | benzin | smart | nein | 2016-03-12 | 33729 | 2016-03-15 |
4 | 2016-04-01 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | 1350 | test | kombi | 2003 | manuell | 0 | focus | 150000 | 7 | benzin | ford | nein | 2016-04-01 | 39218 | 2016-04-01 |
5 | 2016-03-21 | Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto... | 7900 | test | bus | 2006 | automatik | 150 | voyager | 150000 | 4 | diesel | chrysler | NaN | 2016-03-21 | 22962 | 2016-04-06 |
6 | 2016-03-20 | VW_Golf_III_GT_Special_Electronic_Green_Metall... | 300 | test | limousine | 1995 | manuell | 90 | golf | 150000 | 8 | benzin | volkswagen | NaN | 2016-03-20 | 31535 | 2016-03-23 |
7 | 2016-03-16 | Golf_IV_1.9_TDI_90PS | 1990 | control | limousine | 1998 | manuell | 90 | golf | 150000 | 12 | diesel | volkswagen | nein | 2016-03-16 | 53474 | 2016-04-07 |
8 | 2016-03-22 | Seat_Arosa | 250 | test | NaN | 2000 | manuell | 0 | arosa | 150000 | 10 | NaN | seat | nein | 2016-03-22 | 7426 | 2016-03-26 |
9 | 2016-03-16 | Renault_Megane_Scenic_1.6e_RT_Klimaanlage | 590 | control | bus | 1997 | manuell | 90 | megane | 150000 | 7 | benzin | renault | nein | 2016-03-16 | 15749 | 2016-04-06 |
autos['vehicle_type'].value_counts(dropna =False)
limousine 11886 kleinwagen 9695 kombi 8340 bus 3838 cabrio 2849 coupe 2289 suv 1894 NaN 1538 andere 332 Name: vehicle_type, dtype: int64
We see that the values in the vehicle_type column are in German. Let's translate them into English to make the dataset easier to read.
## converting the vehicle_type into english:
autos.loc[autos['vehicle_type'] == 'kleinwagen' , 'vehicle_type'] = 'compact_car'
autos.loc[autos['vehicle_type'] == 'kombi' , 'vehicle_type'] = 'station_wagon'
autos.loc[autos['vehicle_type'] == 'cabrio' , 'vehicle_type'] = 'cabriolet'
autos.loc[autos['vehicle_type'] == 'andere' , 'vehicle_type'] = 'others'
## checking the value counts in english:
autos['vehicle_type'].value_counts(dropna =False)
limousine 11886 compact_car 9695 station_wagon 8340 bus 3838 cabriolet 2849 coupe 2289 suv 1894 NaN 1538 others 332 Name: vehicle_type, dtype: int64
*We see that 'limousine', which in German simply means a sedan (not a stretch limousine), is the most common vehicle_type listed for sale on the website.*
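The repeated .loc assignments used for the translation can also be expressed as one dictionary lookup with Series.replace, which leaves unmapped values (like 'limousine') untouched; a sketch on demo data:

```python
import pandas as pd

translations = {"kleinwagen": "compact_car",
                "kombi": "station_wagon",
                "cabrio": "cabriolet",
                "andere": "others"}

demo = pd.Series(["kleinwagen", "kombi", "limousine", "cabrio"])

# values not present in the dict pass through unchanged
demo = demo.replace(translations)
print(demo.tolist())  # ['compact_car', 'station_wagon', 'limousine', 'cabriolet']
```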
## exploring the 'gearbox' column:
autos['gearbox'].value_counts(dropna = False)
manuell 32086 automatik 9384 NaN 1191 Name: gearbox, dtype: int64
We see that the gearbox values are also in German; let's translate them into English as well:
autos.loc[autos['gearbox'] == 'manuell' , 'gearbox'] = 'manual'
autos.loc[autos['gearbox'] == 'automatik' , 'gearbox'] = 'automatic'
autos['gearbox'].value_counts(normalize = True ,dropna = False)
manual 0.752116 automatic 0.219967 NaN 0.027918 Name: gearbox, dtype: float64
*We see that over 75% of the listings on the website include a 'Manual gearbox', which is more commonly known as a 'stick-shift'.*
Exploring the 'power_p_s' column:
autos['power_p_s'].describe()
count 42661.000000 mean 122.413164 std 191.183734 min 0.000000 25% 75.000000 50% 110.000000 75% 150.000000 max 17700.000000 Name: power_p_s, dtype: float64
There are major outliers in the power_p_s column: a vehicle cannot have 0 PS, and a maximum of 17,700 PS is far beyond any production car. We should remove such outliers from our data:
autos['power_p_s'].value_counts(dropna = False)
0 2979 75 2663 60 1863 150 1783 140 1742 101 1534 116 1510 90 1478 170 1375 105 1296 136 852 125 833 163 799 102 785 143 694 122 664 131 661 54 655 110 610 109 575 177 521 80 515 50 497 58 460 120 456 69 454 184 362 95 353 68 350 115 347 ... 1000 1 2729 1 168 1 3750 1 999 1 455 1 678 1 454 1 30 1 587 1 304 1 1003 1 8404 1 431 1 6045 1 157 1 1998 1 430 1 398 1 253 1 14 1 3500 1 365 1 317 1 17700 1 460 1 268 1 236 1 1771 1 16312 1 Name: power_p_s, Length: 426, dtype: int64
A quick search suggests that compact cars sit at the low end with an average of around 70 PS, while the most powerful production cars reach roughly 1500 PS. Let's limit the power range to 70-1500 to keep the dataset realistic.
## selecting only the vehicles within the power range of (70-1500):
autos = autos.loc[autos['power_p_s'].between(70,1500),:]
autos['power_p_s'].describe()
count 33658.000000 mean 140.840305 std 65.528226 min 70.000000 25% 101.000000 50% 125.000000 75% 163.000000 max 1405.000000 Name: power_p_s, dtype: float64
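As a sanity check, it is worth knowing how much data a filter like this discards. Using the row counts from the two describe() outputs above (42,661 before the filter, 33,658 after), the power filter drops roughly a fifth of the remaining listings:

```python
# row counts taken from the describe() outputs before and after the filter
before, after = 42661, 33658

dropped_pct = 100 * (before - after) / before
print(round(dropped_pct, 1))  # 21.1
```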
Exploring the 'unrepaired_damage' column:
autos['unrepaired_damage'].value_counts(dropna= False)
nein 26987 NaN 3747 ja 2924 Name: unrepaired_damage, dtype: int64
Since all of the listings are in German, let's convert them to English for relevance:
autos.loc[autos['unrepaired_damage'] == 'nein' , 'unrepaired_damage'] = 'No'
autos.loc[autos['unrepaired_damage'] == 'ja' , 'unrepaired_damage'] = 'Yes'
autos['unrepaired_damage'].value_counts(normalize = True,dropna= False)
No 0.801800 NaN 0.111326 Yes 0.086874 Name: unrepaired_damage, dtype: float64
We see that over 80% of the listed cars have no unrepaired damage; only about 8.7% are listed with unrepaired damage (the remaining ~11% are unspecified). Given that these vehicles are used, some share of damaged listings is to be expected.
**Now that we have cleaned the dataset, let's move on to drawing inferences from our data.**
**Aggregating the brand column:**
## Let's first see the number of car brands in the dataset:
print(autos['brand'].value_counts(normalize = True ,dropna = False)*100)
print('\n')
print('There are {} unique car brands in the dataset.'.format(len(autos['brand'].value_counts(dropna = False))))
volkswagen 20.004159 bmw 13.559926 mercedes_benz 11.572286 audi 10.571038 opel 9.341018 ford 6.173867 renault 3.693030 peugeot 2.757145 skoda 1.744013 mazda 1.690534 seat 1.550894 fiat 1.437994 nissan 1.283499 toyota 1.253788 citroen 1.143859 mini 1.129004 volvo 1.117119 hyundai 0.965595 mitsubishi 0.926971 honda 0.915087 alfa_romeo 0.799216 porsche 0.775447 sonstige_autos 0.754650 kia 0.736823 smart 0.695228 chevrolet 0.540733 suzuki 0.540733 chrysler 0.410007 dacia 0.320875 jeep 0.285222 land_rover 0.249569 subaru 0.216888 saab 0.205003 jaguar 0.184206 rover 0.136669 daihatsu 0.101016 daewoo 0.098045 lancia 0.074277 lada 0.044566 Name: brand, dtype: float64 There are 39 unique car brands in the dataset.
Let's aggregate the car brands by the mean price at which they are listed on the website. We also calculated that about 80% of the listed cars had no unrepaired damage; let's use that split to further check whether 'brand value' affects the listed prices for the 10 most popular brands in the dataset.
# empty dictionaries to store the values:
car_brand_prices = {}
car_brand_prices_undam = {}
car_brand_prices_dam = {}
car_brands = autos['brand'].value_counts(dropna = False).head(10).index
## looping over the dataset to aggregate and find the mean price per brand:
for brand in car_brands:
    brand_col = autos.loc[autos['brand'] == brand,:]
    brand_col_dam = brand_col.loc[brand_col['unrepaired_damage'] == 'Yes']
    brand_col_undam = brand_col.loc[brand_col['unrepaired_damage'] == 'No']
    # to find mean price:
    mean_price = brand_col['price'].sum() / len(brand_col)
    mean_price_dam = brand_col_dam['price'].sum() / len(brand_col_dam)
    mean_price_undam = brand_col_undam['price'].sum() / len(brand_col_undam)
    # adding to their respective dictionaries:
    car_brand_prices[brand] = mean_price
    car_brand_prices_undam[brand] = mean_price_undam
    car_brand_prices_dam[brand] = mean_price_dam
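The explicit loop above is essentially a group-wise mean; pandas can do the same aggregation in one line with groupby (the real call would be autos.groupby('brand')['price'].mean()). A sketch on demo data:

```python
import pandas as pd

demo = pd.DataFrame({"brand": ["bmw", "bmw", "audi"],
                     "price": [9000, 11000, 10000]})

# mean price per brand, computed group-wise in one step
mean_price_by_brand = demo.groupby("brand")["price"].mean()
print(mean_price_by_brand.to_dict())  # {'audi': 10000.0, 'bmw': 10000.0}
```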
# Ranking car brands by average price:
print("The car brands and their average prices are given as:" + '\n')
price_list = []
for brand, price in car_brand_prices.items():
    price_list.append((price, brand))
price_list.sort(reverse=True)
for price, brand in price_list:
    print(brand + ' : ' + str(price))
avg_top_3 = 0
for price, brand in price_list[:3]:
    avg_top_3 = avg_top_3 + price
avg_top_3 = avg_top_3 / 3
print("The average price of the top three car brands : {}".format(int(avg_top_3)))
The car brands and their average prices are given as: audi : 10061.087408656549 mercedes_benz : 9145.857509627727 bmw : 9115.98773006135 skoda : 7261.8313458262355 volkswagen : 7121.500816872122 ford : 5255.128970163619 mazda : 4686.669595782074 opel : 3884.320292620865 peugeot : 3802.855603448276 renault : 3350.540627514079 The average price of the top three car brands : 9440
# Ranking car brands by average price of damaged cars:
print("The car brands and their average prices of damaged cars given as:" + '\n')
price_list_dam = []
for brand, price in car_brand_prices_dam.items():
    price_list_dam.append((price, brand))
price_list_dam.sort(reverse=True)
for price, brand in price_list_dam:
    print(brand + ' : ' + str(price))
avg_top_3 = 0
for price, brand in price_list_dam[:3]:
    avg_top_3 = avg_top_3 + price
avg_top_3 = avg_top_3 / 3
print("The average price of the top three car brands in the damaged category : {}".format(int(avg_top_3)))
The car brands and their average prices of damaged cars given as: volkswagen : 5076.759656652361 skoda : 4467.222222222223 mercedes_benz : 4252.318021201413 bmw : 3846.87125748503 audi : 3518.924187725632 ford : 1741.9076305220883 opel : 1712.28125 peugeot : 1697.137254901961 renault : 1586.291390728477 mazda : 1552.72131147541 The average price of the top three car brands in the damaged category : 4598
# Ranking car brands by average price of undamaged cars:
print("The car brands and their average prices of undamaged cars are given as:" + '\n')
price_list_undam = []
for brand, price in car_brand_prices_undam.items():
    price_list_undam.append((price, brand))
price_list_undam.sort(reverse=True)
for price, brand in price_list_undam:
    print(brand + ' : ' + str(price))
avg_top_3 = 0
for price, brand in price_list_undam[:3]:
    avg_top_3 = avg_top_3 + price
avg_top_3 = avg_top_3 / 3
print("The average price of the top three car brands in the undamaged category is : {}".format(int(avg_top_3)))
The car brands and their average prices of undamaged cars are given as:

audi : 11222.07032057911
mercedes_benz : 10000.280860349127
bmw : 9667.33493846977
skoda : 7693.350806451613
volkswagen : 7590.761508596783
ford : 6222.325492689129
mazda : 5486.579676674365
opel : 4423.779074614423
peugeot : 4261.455295735901
renault : 3888.2448979591836
The average price of the top three car brands in the undamaged category is : 10296
From the price comparison above we can draw certain conclusions:

* Audi, Mercedes-Benz and BMW are the costliest brands, with an average price of 9440 overall and 10296 in the undamaged category.
* Volkswagen, Skoda and Mercedes-Benz hold the top three in the damaged-cars category, with an average price of 4598.
* Opel, Peugeot and Renault are the cheapest car brands in the dataset.

*** Since Mercedes-Benz features in the top three across all the categories, it is safe to say that it has a higher brand value, or that it holds its resale value the best amongst the given brands. However, we should also note that the dataset is from a German website and Mercedes-Benz is itself a German brand, which might affect its overall resale value. ***
AGGREGATING the odometer_km column:

## looping over the dataset to aggregate and find the mean mileage per brand:
car_brand_mil = {}
for brand in car_brands:
    brand_col = autos.loc[autos['brand'] == brand, :]
    # to find mean mileage:
    mean_mil = brand_col['odometer_km'].sum() / len(brand_col)
    # adding to the dictionary:
    car_brand_mil[brand] = mean_mil
print("The car brands and their average mileages are given as:" + '\n')
mil_list = []
for brand, mil in car_brand_mil.items():
    mil_list.append((mil, brand))
mil_list.sort(reverse=True)
for mil, brand in mil_list:
    print(brand + ' : ' + str(mil))
The car brands and their average mileages are given as:

bmw : 132300.61349693252
mercedes_benz : 130513.47881899872
audi : 127922.9904440697
opel : 127188.2951653944
volkswagen : 127079.31085697311
ford : 124983.15688161695
renault : 124883.34674175382
peugeot : 124364.22413793103
mazda : 122996.48506151143
skoda : 112827.93867120954
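The per-brand loop above is exactly what pandas `groupby` does in a single step. A minimal sketch on a toy DataFrame (the rows and numbers are invented for illustration, not taken from the dataset):

```python
import pandas as pd

# toy rows standing in for the autos DataFrame (values are made up)
toy = pd.DataFrame({
    'brand': ['bmw', 'bmw', 'audi', 'audi'],
    'odometer_km': [150000, 120000, 125000, 100000],
})

# groupby computes the per-brand mean in one step, replacing the explicit loop
mean_mil = toy.groupby('brand')['odometer_km'].mean().sort_values(ascending=False)
print(mean_mil)
```

`mean()` per group is equivalent to the `sum() / len()` computation in the loop, and `sort_values` replaces the tuple-list sort.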
*The top three car brands with the highest mean mileages are BMW, Mercedes-Benz and Audi.*
## creating a new dataframe object to check the correlation between mean prices and mean mileage:
cars = {'mean_prices': car_brand_prices, 'mean_mileage': car_brand_mil}
df_Car = pd.DataFrame(data=cars)
df_Car
| | mean_mileage | mean_prices |
|---|---|---|
| audi | 127922.990444 | 10061.087409 |
| bmw | 132300.613497 | 9115.987730 |
| ford | 124983.156882 | 5255.128970 |
| mazda | 122996.485062 | 4686.669596 |
| mercedes_benz | 130513.478819 | 9145.857510 |
| opel | 127188.295165 | 3884.320293 |
| peugeot | 124364.224138 | 3802.855603 |
| renault | 124883.346742 | 3350.540628 |
| skoda | 112827.938671 | 7261.831346 |
| volkswagen | 127079.310857 | 7121.500817 |
df_Car.corr()
| | mean_mileage | mean_prices |
|---|---|---|
| mean_mileage | 1.000000 | 0.309558 |
| mean_prices | 0.309558 | 1.000000 |
*** We see that the mean prices and mean mileage of the car brands are only weakly related, with a correlation coefficient of 0.309. ***
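The same correlation can be computed directly between two Series; pandas aligns them on their shared index before computing Pearson's r. A minimal sketch with hypothetical per-brand aggregates (keys and values are for illustration only):

```python
import pandas as pd

# hypothetical per-brand aggregates (values are illustrative, not from the dataset)
mean_prices = pd.Series({'audi': 1.0, 'bmw': 2.0, 'opel': 3.0})
mean_mileage = pd.Series({'audi': 2.0, 'bmw': 4.0, 'opel': 6.0})

# Series.corr aligns the two series on their index and returns Pearson's r
r = mean_prices.corr(mean_mileage)
```

Because the toy values are perfectly linear, r here is exactly 1.0; on the real aggregates this reproduces the 0.309558 seen in `df_Car.corr()`.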
Let's find the top 10 most common brand/model combinations in our dataset:
# finding the top 10 most common car models from our dataset:
autos['brand_model'] = autos['brand'] + '-' + autos['model']
# value_counts drops NaN by default, so head(10) gives the ten real combinations
bmods = autos['brand_model'].value_counts().head(10).index
brand_model = {}
print("The top 10 most commonly listed brand-model combinations with their number of listings are: " + '\n')
for bmod in bmods:
    car_model = autos.loc[autos['brand_model'] == bmod]
    count = len(car_model)
    brand_model[bmod] = count
mod_list = []
for mod, count in brand_model.items():
    mod_list.append((count, mod))
mod_list.sort(reverse=True)
for count, mod in mod_list:
    print(str(mod) + ' : ' + str(count))
The top 10 most commonly listed brand-model combinations with their number of listings are:

volkswagen-golf : 3001
bmw-3er : 2307
volkswagen-passat : 1170
opel-astra : 1099
audi-a4 : 1089
bmw-5er : 1032
mercedes_benz-c_klasse : 1012
mercedes_benz-e_klasse : 848
audi-a3 : 732
audi-a6 : 718
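Note that the counting loop recomputes what `value_counts` already returned; the counts can be read straight off the Series. A minimal sketch on toy data (the listings below are invented for illustration):

```python
import pandas as pd

# toy listings standing in for autos['brand_model'] (entries invented)
brand_model = pd.Series(['volkswagen-golf', 'volkswagen-golf', 'bmw-3er',
                         'volkswagen-golf', 'bmw-3er'])

# value_counts already returns the counts sorted in descending order
top = brand_model.value_counts()
for combo, n in top.items():
    print(combo + ' : ' + str(n))
```

On the real data, `autos['brand_model'].value_counts().head(10)` would print the same ranking in one pass, with no dictionary or tuple sort.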
*We see that volkswagen-golf, bmw-3er and volkswagen-passat are the most popular brand-model listings on the website.*
We explored the data from 'eBay Kleinanzeigen' and, after thorough cleaning of the dataset, we can draw the following conclusions from our analysis:

* From the ad_created, last_seen and date_crawled columns we see that the website is very active, and these columns show their highest activity during the months of March-April, hence that period seems to be popular for car sales.
* … vehicle_type listed for sale on the website.
* Volkswagen is the most popular car brand listed on the website, taking up around 20% of the listings.
* Mercedes-Benz is the costliest car brand listed on the website.
* BMW has the highest mean mileage amongst the brands listed on the website.