The given dataset contains information about used cars from eBay Kleinanzeigen, the classifieds section of the German eBay website. The dataset was originally scraped and uploaded by another user and was later modified and deliberately 'dirtied', so the data needs to be cleaned before working with it. The original dataset is no longer available on Kaggle, but it can be found here.
The aim of this project is to analyze the included used car listings and draw some conclusions from the resulting dataframe.
# Importing necessary modules - pandas and numpy
import pandas as pd
import numpy as np
# Reading the 'autos.csv' file as 'autos'
autos = pd.read_csv('autos.csv', encoding='Latin-1')
# Only first 5 rows are returned as currently there's no need to examine all the rows
autos.head()
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
# Obtaining information about the dataframe
autos.info()
# Checking how many rows are present in the dataframe
print('\nnumber of rows in df:', autos.shape[0])
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
dateCrawled            50000 non-null object
name                   50000 non-null object
seller                 50000 non-null object
offerType              50000 non-null object
price                  50000 non-null object
abtest                 50000 non-null object
vehicleType            44905 non-null object
yearOfRegistration     50000 non-null int64
gearbox                47320 non-null object
powerPS                50000 non-null int64
model                  47242 non-null object
odometer               50000 non-null object
monthOfRegistration    50000 non-null int64
fuelType               45518 non-null object
brand                  50000 non-null object
notRepairedDamage      40171 non-null object
dateCreated            50000 non-null object
nrOfPictures           50000 non-null int64
postalCode             50000 non-null int64
lastSeen               50000 non-null object
dtypes: int64(5), object(15)
memory usage: 7.6+ MB

number of rows in df: 50000
Most columns are complete, but five of them ('vehicleType', 'gearbox', 'model', 'fuelType', 'notRepairedDamage') contain missing values: their non-null counts are below 50000, the total number of rows. There are 5 integer and 15 object columns, which makes 20 columns in total.
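The exact number of missing values per column can also be obtained directly with `isnull().sum()`. A minimal sketch on a small synthetic frame (the column names and values are illustrative, not the real dataset):

```python
import pandas as pd

# Synthetic stand-in for a slice of the autos dataframe
sample = pd.DataFrame({
    'vehicleType': ['bus', None, 'limousine'],
    'gearbox': ['manuell', 'automatik', None],
    'price': ['$5,000', '$8,500', '$8,990'],
})

# Count NaN values per column in one call
null_counts = sample.isnull().sum()
print(null_counts)
```

On the real dataframe, `autos.isnull().sum()` would reproduce the per-column gaps that `info()` only shows indirectly via non-null counts.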
# Getting an array of existing column names
autos.columns
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model', 'odometer', 'monthOfRegistration', 'fuelType', 'brand', 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object')
# Editing column names and converting them from camelcase
autos.rename({'dateCrawled':'date_crawled', 'offerType':'offer_type', 'vehicleType':'vehicle_type', 'yearOfRegistration':'registration_year', 'powerPS':'power_ps', 'monthOfRegistration':'registration_month', 'fuelType':'fuel_type', 'notRepairedDamage':'unrepaired_damage', 'dateCreated':'ad_created', 'nrOfPictures':'nr_of_pictures', 'postalCode':'postal_code', 'lastSeen':'last_seen'}, axis=1, inplace=True)
# Making sure that the columns have been converted to snakecase
autos.columns
Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest', 'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model', 'odometer', 'registration_month', 'fuel_type', 'brand', 'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code', 'last_seen'], dtype='object')
# Making sure that the converted columns are now in the dataframe and haven't altered the rest of the dataframe structure
autos.head()
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
Column labels were converted from camelCase to snake_case for consistency with Python naming conventions; in addition, some column names ('yearOfRegistration', 'monthOfRegistration', 'notRepairedDamage', 'dateCreated') were reworded to make them easier to understand and to avoid the confusion they could cause.
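The explicit mapping above is the safest approach, but the purely mechanical part of the camelCase-to-snake_case conversion can also be expressed as a small regex helper (a sketch, not part of the original notebook; names that need rewording, such as 'yearOfRegistration' or the all-caps suffix in 'powerPS', would still require the manual mapping):

```python
import re

def camel_to_snake(name):
    # Insert an underscore before each uppercase letter (except at the
    # start of the string), then lowercase the whole name
    return re.sub(r'(?<!^)(?=[A-Z])', '_', name).lower()

print(camel_to_snake('dateCrawled'))   # date_crawled
print(camel_to_snake('nrOfPictures'))  # nr_of_pictures
```

A hybrid approach would apply the helper to every column and then overwrite only the special cases from a small dictionary.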
# Looking at descriptive statistics for all columns
autos.describe(include='all')
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-03-12 16:06:22 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
The statistics above show that some columns carry almost no information: 'seller' and 'offer_type' each take the same value in 49999 of the 50000 rows, and 'nr_of_pictures' is 0 everywhere. These three columns can therefore be dropped:
# Removing the 'nr_of_pictures', 'seller', 'offer_type' columns
autos.drop(columns = ['nr_of_pictures','seller','offer_type'], inplace=True)
# Making sure that the mentioned columns have been removed
autos.shape[1]
17
# Understanding what non-numeric characters are present in the 'price' column
autos['price'].head()
0    $5,000
1    $8,500
2    $8,990
3    $4,350
4    $1,350
Name: price, dtype: object
# Removing non-numeric characters (',' and '$') from the 'price' column
# and converting it to a numeric dtype; regex=False treats '$' as a
# literal character rather than a regex end-of-string anchor
autos['price'] = (autos['price']
                  .str.replace(',', '', regex=False)
                  .str.replace('$', '', regex=False)
                  .astype('int64'))
# Making sure that non-numeric values have been removed
autos['price'].head()
0    5000
1    8500
2    8990
3    4350
4    1350
Name: price, dtype: int64
# Understanding what non-numeric characters are present in the 'odometer' column
autos['odometer'].head()
0    150,000km
1    150,000km
2     70,000km
3     70,000km
4    150,000km
Name: odometer, dtype: object
# Removing non-numeric characters ('km' and ',') from the 'odometer' column
# and converting it to a numeric dtype
autos['odometer'] = (autos['odometer']
                     .str.replace('km', '', regex=False)
                     .str.replace(',', '', regex=False)
                     .astype('int64'))
# Making sure that non-numeric values have been removed
autos['odometer'].head()
0    150000
1    150000
2     70000
3     70000
4    150000
Name: odometer, dtype: int64
# Renaming the column from 'odometer' to 'odometer_km'
autos.rename({'odometer':'odometer_km'}, axis=1, inplace=True)
# Making sure that the column has been renamed
autos.columns
Index(['date_crawled', 'name', 'price', 'abtest', 'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model', 'odometer_km', 'registration_month', 'fuel_type', 'brand', 'unrepaired_damage', 'ad_created', 'postal_code', 'last_seen'], dtype='object')
Now we can find out how many unique values each numeric column has. A column that takes the same value in (almost) all rows carries no useful information and would be a candidate for removal from the dataframe.
# Calculating the number of unique values for each numeric column
print('Unique values for numeric columns:')
print('registration_year:', *autos['registration_year'].unique().shape)
print('power_ps:', *autos['power_ps'].unique().shape)
print('registration_month:', *autos['registration_month'].unique().shape)
print('postal_code:', *autos['postal_code'].unique().shape)
Unique values for numeric columns:
registration_year: 97
power_ps: 448
registration_month: 13
postal_code: 7014
As we can see, all four numeric columns have plenty of unique values, which confirms that they carry meaningful variation and should not be dropped.
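The same check can be written in a single call with `nunique()`, which counts distinct values per column. A sketch on a synthetic frame (the numbers are illustrative):

```python
import pandas as pd

sample = pd.DataFrame({
    'registration_year': [1999, 2004, 2004, 2010],
    'power_ps': [75, 90, 90, 75],
    'registration_month': [1, 6, 6, 12],
})

# nunique() returns the count of distinct values for each column at once
unique_counts = sample[['registration_year', 'power_ps', 'registration_month']].nunique()
print(unique_counts)
```

On the real data, `autos[['registration_year', 'power_ps', 'registration_month', 'postal_code']].nunique()` would replace the four separate print statements.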
Before removing the outliers from the 'odometer_km' and 'price' columns, we record their descriptive statistics (outliers included) so that they can later be compared against the cleaned versions of the same columns.
# Observing the values present in 'odometer_km' and 'price' columns before the outliers are removed
outliers_table = autos[['odometer_km', 'price']].describe()
outliers_table
odometer_km | price | |
---|---|---|
count | 50000.000000 | 5.000000e+04 |
mean | 125732.700000 | 9.840044e+03 |
std | 40042.211706 | 4.811044e+05 |
min | 5000.000000 | 0.000000e+00 |
25% | 125000.000000 | 1.100000e+03 |
50% | 150000.000000 | 2.950000e+03 |
75% | 150000.000000 | 7.200000e+03 |
max | 150000.000000 | 1.000000e+08 |
Now, to remove any outliers, we identify them with the 1.5×IQR rule. The rule keeps only the values inside the interval [Q1 - 1.5×IQR, Q3 + 1.5×IQR], where Q1 is the lower quartile (the 0.25 quantile), Q3 is the upper quartile (the 0.75 quantile), and IQR = Q3 - Q1. Quantiles can be computed with Series.quantile(q), where q ∈ [0, 1].
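Since the same filter is applied to more than one column, the rule can be wrapped in a small reusable helper (a sketch, not part of the original notebook; the demo frame is synthetic):

```python
import pandas as pd

def remove_iqr_outliers(df, column, k=1.5):
    """Keep only rows whose `column` value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    return df[df[column].between(q1 - k * iqr, q3 + k * iqr)]

# Tiny demo: 1000 falls far outside the IQR fence of the other values
demo = pd.DataFrame({'price': [100, 110, 120, 130, 140, 1000]})
filtered = remove_iqr_outliers(demo, 'price')
print(len(filtered))  # 5 rows remain; the extreme row is dropped
```

With this helper, the two filtering steps below reduce to `remove_iqr_outliers(autos, 'odometer_km')` and `remove_iqr_outliers(autos, 'price')`.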
# Removing the outliers from the 'odometer_km' column
odometer_no_outliers = autos[autos['odometer_km'].between(autos['odometer_km'].quantile(0.25)-1.5*(autos['odometer_km'].quantile(0.75)-autos['odometer_km'].quantile(0.25)),
autos['odometer_km'].quantile(0.75)+1.5*(autos['odometer_km'].quantile(0.75)-autos['odometer_km'].quantile(0.25)))]
# Printing dataframe with and without outliers
print('df with potential outliers:', *autos['odometer_km'].shape)
print('df without outliers:', *odometer_no_outliers['odometer_km'].shape)
df with potential outliers: 50000 df without outliers: 41520
The 'odometer_km' column did contain outliers: the filtered dataframe has 41520 rows, fewer than the original 50000. We can therefore replace 'autos' with 'odometer_no_outliers', which is the same dataframe minus the rows with outliers in the 'odometer_km' column.
# Removing the rows with the outliers in the 'odometer_km' column
autos = odometer_no_outliers
# Making sure that the number of rows in the new 'autos' version is less than 50000
autos.shape[0]
41520
# Removing the outliers from the 'price' column
price_no_outliers = autos[autos['price'].between(autos['price'].quantile(0.25)-1.5*(autos['price'].quantile(0.75)-autos['price'].quantile(0.25)),
autos['price'].quantile(0.75)+1.5*(autos['price'].quantile(0.75)-autos['price'].quantile(0.25)))]
# Printing dataframe with and without outliers
print('df with potential outliers:', *autos['price'].shape)
print('df without outliers:', *price_no_outliers['price'].shape)
df with potential outliers: 41520 df without outliers: 38386
The 'price' column also contained outliers: the filtered dataframe has 38386 rows, fewer than the previous 41520. We can therefore replace 'autos' with 'price_no_outliers', which is the same dataframe minus the rows with outliers in the 'price' column.
# Removing the rows with the outliers in the 'price' column
autos = price_no_outliers
# Making sure that the number of rows in the new 'autos' version is less than 41520
autos.shape[0]
38386
Now we can compare the values of the columns with outliers included with the values of the same columns without outliers:
# Printing the table with outliers
outliers_table
odometer_km | price | |
---|---|---|
count | 50000.000000 | 5.000000e+04 |
mean | 125732.700000 | 9.840044e+03 |
std | 40042.211706 | 4.811044e+05 |
min | 5000.000000 | 0.000000e+00 |
25% | 125000.000000 | 1.100000e+03 |
50% | 150000.000000 | 2.950000e+03 |
75% | 150000.000000 | 7.200000e+03 |
max | 150000.000000 | 1.000000e+08 |
# Observing the values left in 'odometer_km' and 'price' after removing outliers
autos[['odometer_km', 'price']].describe()
odometer_km | price | |
---|---|---|
count | 38386.000000 | 38386.000000 |
mean | 142674.412546 | 3127.445136 |
std | 16147.477812 | 2872.206674 |
min | 90000.000000 | 0.000000 |
25% | 150000.000000 | 950.000000 |
50% | 150000.000000 | 2131.000000 |
75% | 150000.000000 | 4500.000000 |
max | 150000.000000 | 12250.000000 |
After the outlier removal, 'odometer_km' changed noticeably. Its smallest values were the outliers, so dropping them reduced the standard deviation (16147 instead of 40042) and raised the minimum value (90000 instead of 5000).
The 'price' column changed even more significantly: its standard deviation fell from about 481104 to 2872, and its maximum dropped from 10^8 (100 million) to 12250.
Three columns ('date_crawled', 'ad_created', and 'last_seen') characterize the lifecycle of each listing. Analyzing them gives a brief picture of when the ads were discovered by the crawler, when they were created, and when they went inactive.
# Calculating the distribution of values in 'date_crawled'
autos['date_crawled'].str[:10].value_counts(normalize=True, dropna=False).sort_index()
2016-03-05 0.024697 2016-03-06 0.013911 2016-03-07 0.036628 2016-03-08 0.034414 2016-03-09 0.033736 2016-03-10 0.032642 2016-03-11 0.032017 2016-03-12 0.037540 2016-03-13 0.015058 2016-03-14 0.037383 2016-03-15 0.034127 2016-03-16 0.030037 2016-03-17 0.032121 2016-03-18 0.013234 2016-03-19 0.033398 2016-03-20 0.037514 2016-03-21 0.037488 2016-03-22 0.032564 2016-03-23 0.032538 2016-03-24 0.028630 2016-03-25 0.032199 2016-03-26 0.032069 2016-03-27 0.030819 2016-03-28 0.034674 2016-03-29 0.033867 2016-03-30 0.034127 2016-03-31 0.031991 2016-04-01 0.033267 2016-04-02 0.034492 2016-04-03 0.037800 2016-04-04 0.036862 2016-04-05 0.013442 2016-04-06 0.003569 2016-04-07 0.001146 Name: date_crawled, dtype: float64
All listings were crawled in 2016, within a window of roughly one month: most of them throughout March, and the remainder in the first week of April.
# Calculating the distribution of values in 'ad_created'
autos['ad_created'].str[:10].value_counts(normalize=True, dropna=False).sort_index()
2015-12-05 0.000026 2015-12-30 0.000026 2016-01-07 0.000026 2016-01-10 0.000026 2016-01-13 0.000026 2016-01-16 0.000026 2016-01-22 0.000026 2016-01-27 0.000052 2016-01-29 0.000026 2016-02-01 0.000026 2016-02-02 0.000052 2016-02-05 0.000026 2016-02-08 0.000026 2016-02-09 0.000026 2016-02-11 0.000026 2016-02-12 0.000078 2016-02-16 0.000026 2016-02-18 0.000052 2016-02-19 0.000026 2016-02-20 0.000052 2016-02-21 0.000078 2016-02-22 0.000026 2016-02-23 0.000026 2016-02-25 0.000052 2016-02-26 0.000026 2016-02-27 0.000130 2016-02-28 0.000208 2016-02-29 0.000130 2016-03-01 0.000130 2016-03-02 0.000078 ... 2016-03-09 0.033632 2016-03-10 0.032434 2016-03-11 0.032329 2016-03-12 0.037436 2016-03-13 0.016490 2016-03-14 0.035820 2016-03-15 0.033919 2016-03-16 0.030480 2016-03-17 0.031835 2016-03-18 0.013989 2016-03-19 0.032382 2016-03-20 0.037462 2016-03-21 0.037644 2016-03-22 0.032642 2016-03-23 0.032356 2016-03-24 0.028474 2016-03-25 0.032382 2016-03-26 0.032095 2016-03-27 0.030714 2016-03-28 0.034648 2016-03-29 0.034023 2016-03-30 0.033893 2016-03-31 0.031861 2016-04-01 0.033424 2016-04-02 0.034075 2016-04-03 0.038087 2016-04-04 0.037488 2016-04-05 0.012010 2016-04-06 0.003595 2016-04-07 0.001042 Name: ad_created, Length: 66, dtype: float64
The earliest ad in the data was created on 2015-12-05 and the latest on 2016-04-07, a span of about four months. The overwhelming majority of ads were created in March and early April of 2016; dates before March account for only a tiny fraction of the listings.
# Calculating the distribution of values in 'last_seen'
autos['last_seen'].str[:10].value_counts(normalize=True, dropna=False).sort_index()
2016-03-05 0.001094 2016-03-06 0.005002 2016-03-07 0.006174 2016-03-08 0.008701 2016-03-09 0.010707 2016-03-10 0.011384 2016-03-11 0.014120 2016-03-12 0.026051 2016-03-13 0.009821 2016-03-14 0.013755 2016-03-15 0.016985 2016-03-16 0.017871 2016-03-17 0.030141 2016-03-18 0.007685 2016-03-19 0.017090 2016-03-20 0.022300 2016-03-21 0.022091 2016-03-22 0.022638 2016-03-23 0.019434 2016-03-24 0.021049 2016-03-25 0.020320 2016-03-26 0.017845 2016-03-27 0.017064 2016-03-28 0.022326 2016-03-29 0.023368 2016-03-30 0.025712 2016-03-31 0.024879 2016-04-01 0.024488 2016-04-02 0.025817 2016-04-03 0.025634 2016-04-04 0.025426 2016-04-05 0.116944 2016-04-06 0.205309 2016-04-07 0.120773 Name: last_seen, dtype: float64
Every listing was last seen between 2016-03-05 and 2016-04-07, i.e. within roughly a month. The spike over the final three days (about 45% of all listings) marks the end of the crawling period rather than a burst of sales: after 2016-04-07 the crawler stopped, so every ad still active at that point received one of those dates as its 'last seen' value.
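Slicing the first ten characters of each string works, but converting these columns to proper datetimes makes range queries and date arithmetic simpler. A sketch with synthetic timestamps in the same format as the dataset:

```python
import pandas as pd

sample = pd.DataFrame({
    'date_crawled': ['2016-03-26 17:47:46', '2016-04-04 13:38:56'],
    'last_seen': ['2016-04-06 06:45:54', '2016-04-06 14:45:08'],
})

# Parse the string columns into datetime64 columns
for col in ['date_crawled', 'last_seen']:
    sample[col] = pd.to_datetime(sample[col])

# With real datetimes, the crawl window is a simple min/max difference
span = sample['date_crawled'].max() - sample['date_crawled'].min()
print(span.days)  # 8 days between the two synthetic crawl dates
```

On the real data, `autos['date_crawled'].dt.date.value_counts(normalize=True).sort_index()` would reproduce the daily distributions computed above via string slicing.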
# Describing the distribution of 'registration_year'
autos['registration_year'].describe()
count    38386.000000
mean      2002.499557
std         36.331672
min       1910.000000
25%       1999.000000
50%       2002.000000
75%       2006.000000
max       9000.000000
Name: registration_year, dtype: float64
The describe output reveals impossible values at both ends: the minimum registration year is 1910 and the maximum is 9000, while most of the values lie between 1999 (Q1) and 2006 (Q3).
Values before 1903 will be treated as invalid, since motor vehicle registration only became established around that year, and 2020 is taken as the upper limit. (Strictly speaking, any registration year after 2016, when the data was crawled, is already suspect, but the wider bound is kept here.) This gives an acceptance interval of [1903, 2020]; rows whose 'registration_year' falls outside it must be removed.
autos['registration_year'][autos['registration_year'].between(1903,2020)].value_counts(normalize=True)
2000 0.080839 1999 0.074144 2005 0.068699 2001 0.065963 2003 0.065911 2004 0.064635 2002 0.061456 1998 0.060049 2006 0.057965 1997 0.050410 2007 0.044835 2008 0.036889 1996 0.035300 2017 0.032461 1995 0.031002 2016 0.029725 2009 0.027354 1994 0.015553 2010 0.014771 2018 0.010812 1993 0.010655 2011 0.009222 1992 0.008858 1990 0.008363 1991 0.008180 2012 0.004012 1989 0.003960 1988 0.002918 1985 0.001902 1987 0.001615 ... 1970 0.000391 1960 0.000391 1973 0.000391 1972 0.000365 1976 0.000339 1977 0.000287 1975 0.000208 2014 0.000208 1965 0.000182 1968 0.000156 1974 0.000156 2015 0.000130 1966 0.000130 1967 0.000104 1971 0.000104 1969 0.000104 1910 0.000052 1956 0.000052 1961 0.000052 1958 0.000052 1964 0.000052 1963 0.000052 2019 0.000052 1962 0.000026 1937 0.000026 1934 0.000026 1959 0.000026 1954 0.000026 1953 0.000026 1950 0.000026 Name: registration_year, Length: 69, dtype: float64
The largest numbers of annual registrations fall in the 2000s and the late 1990s; the smallest belong to the first half of the previous century (the 1930s through the 1950s).
# Removing the rows where the 'registration_year' values are outside the interval
autos = autos.loc[autos['registration_year'].between(1903,2020)]
Next we pick the 15 most commonly listed brands; among them, 'volkswagen' appears most often and 'toyota' least often.
# Picking 15 most common values in the 'brand' column
autos['brand'].value_counts().head(15)
volkswagen 8588 opel 4700 bmw 4052 mercedes_benz 3434 audi 3045 ford 2790 renault 2071 peugeot 1229 fiat 1007 seat 743 mazda 624 nissan 568 citroen 546 skoda 525 toyota 457 Name: brand, dtype: int64
# Creating a dictionary for mean price
aggregate_price = {}
top_brands = autos['brand'].value_counts().head(15).index
for brand in top_brands:
sel_row = autos[autos['brand']==brand]
mean_row = sel_row['price'].mean()
aggregate_price[brand] = mean_row
aggregate_price
{'audi': 4322.562889983579, 'bmw': 4476.98198420533, 'citroen': 2698.168498168498, 'fiat': 1843.119165839126, 'ford': 2243.584946236559, 'mazda': 2677.7596153846152, 'mercedes_benz': 4145.680838672102, 'nissan': 2648.906690140845, 'opel': 2096.4821276595744, 'peugeot': 2303.6102522375913, 'renault': 1713.7469821342347, 'seat': 2730.265141318977, 'skoda': 4192.601904761905, 'toyota': 3775.301969365427, 'volkswagen': 3180.179902189101}
# Sorting the dictionary by its values in descending order
sorted(aggregate_price.items(), key=lambda x: x[1], reverse=True)
[('bmw', 4476.98198420533), ('audi', 4322.562889983579), ('skoda', 4192.601904761905), ('mercedes_benz', 4145.680838672102), ('toyota', 3775.301969365427), ('volkswagen', 3180.179902189101), ('seat', 2730.265141318977), ('citroen', 2698.168498168498), ('mazda', 2677.7596153846152), ('nissan', 2648.906690140845), ('peugeot', 2303.6102522375913), ('ford', 2243.584946236559), ('opel', 2096.4821276595744), ('fiat', 1843.119165839126), ('renault', 1713.7469821342347)]
'bmw' has the highest mean price among the top 15 brands, followed by 'audi' and 'skoda'. The lowest mean price among these brands belongs to 'renault'.
# Creating a dictionary for mean mileage
aggregate_mileage = {}
for brand in autos['brand'].value_counts().head(15).index:
sel_row = autos[autos['brand'] == brand]
mean_mileage = sel_row['odometer_km'].mean()
aggregate_mileage[brand] = mean_mileage
aggregate_mileage
{'audi': 146834.15435139573, 'bmw': 145826.75222112538, 'citroen': 138095.2380952381, 'fiat': 137120.1588877855, 'ford': 140992.83154121865, 'mazda': 141153.84615384616, 'mercedes_benz': 145789.16715200932, 'nissan': 138345.0704225352, 'opel': 142320.21276595743, 'peugeot': 140537.02196908055, 'renault': 141144.37469821342, 'seat': 140343.20323014804, 'skoda': 137285.7142857143, 'toyota': 137133.47921225382, 'volkswagen': 144168.60735910572}
# Creating series for mean price
bmp_series = pd.Series(aggregate_price)
bmp_series
audi              4322.562890
bmw               4476.981984
citroen           2698.168498
fiat              1843.119166
ford              2243.584946
mazda             2677.759615
mercedes_benz     4145.680839
nissan            2648.906690
opel              2096.482128
peugeot           2303.610252
renault           1713.746982
seat              2730.265141
skoda             4192.601905
toyota            3775.301969
volkswagen        3180.179902
dtype: float64
# Creating series for mean mileage
bmm_series = pd.Series(aggregate_mileage)
bmm_series
audi             146834.154351
bmw              145826.752221
citroen          138095.238095
fiat             137120.158888
ford             140992.831541
mazda            141153.846154
mercedes_benz    145789.167152
nissan           138345.070423
opel             142320.212766
peugeot          140537.021969
renault          141144.374698
seat             140343.203230
skoda            137285.714286
toyota           137133.479212
volkswagen       144168.607359
dtype: float64
# Assigning the values of mean price to df
df = pd.DataFrame(bmp_series, columns=['mean_price'])
df
mean_price | |
---|---|
audi | 4322.562890 |
bmw | 4476.981984 |
citroen | 2698.168498 |
fiat | 1843.119166 |
ford | 2243.584946 |
mazda | 2677.759615 |
mercedes_benz | 4145.680839 |
nissan | 2648.906690 |
opel | 2096.482128 |
peugeot | 2303.610252 |
renault | 1713.746982 |
seat | 2730.265141 |
skoda | 4192.601905 |
toyota | 3775.301969 |
volkswagen | 3180.179902 |
# Inserting the values of mean mileage as a new column
df.insert(1, 'mean_mileage', bmm_series)
df
mean_price | mean_mileage | |
---|---|---|
audi | 4322.562890 | 146834.154351 |
bmw | 4476.981984 | 145826.752221 |
citroen | 2698.168498 | 138095.238095 |
fiat | 1843.119166 | 137120.158888 |
ford | 2243.584946 | 140992.831541 |
mazda | 2677.759615 | 141153.846154 |
mercedes_benz | 4145.680839 | 145789.167152 |
nissan | 2648.906690 | 138345.070423 |
opel | 2096.482128 | 142320.212766 |
peugeot | 2303.610252 | 140537.021969 |
renault | 1713.746982 | 141144.374698 |
seat | 2730.265141 | 140343.203230 |
skoda | 4192.601905 | 137285.714286 |
toyota | 3775.301969 | 137133.479212 |
volkswagen | 3180.179902 | 144168.607359 |
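The dictionary loops and manual Series construction above can be collapsed into a single `groupby` aggregation, which produces both mean columns in one step. A sketch on a small synthetic frame:

```python
import pandas as pd

sample = pd.DataFrame({
    'brand': ['bmw', 'bmw', 'audi', 'audi'],
    'price': [5000, 4000, 6000, 2000],
    'odometer_km': [150000, 125000, 100000, 150000],
})

# Named aggregation: one groupby call yields both mean columns at once
agg = (sample.groupby('brand')
             .agg(mean_price=('price', 'mean'),
                  mean_mileage=('odometer_km', 'mean')))
print(agg)
```

On the real data, restricting to the top 15 brands first (e.g. `autos[autos['brand'].isin(top_brands)]`) and then grouping would reproduce the table above in two lines.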
# Sorting the dataframe by mean price in descending order
df.sort_values('mean_price', ascending=False)
mean_price | mean_mileage | |
---|---|---|
bmw | 4476.981984 | 145826.752221 |
audi | 4322.562890 | 146834.154351 |
skoda | 4192.601905 | 137285.714286 |
mercedes_benz | 4145.680839 | 145789.167152 |
toyota | 3775.301969 | 137133.479212 |
volkswagen | 3180.179902 | 144168.607359 |
seat | 2730.265141 | 140343.203230 |
citroen | 2698.168498 | 138095.238095 |
mazda | 2677.759615 | 141153.846154 |
nissan | 2648.906690 | 138345.070423 |
peugeot | 2303.610252 | 140537.021969 |
ford | 2243.584946 | 140992.831541 |
opel | 2096.482128 | 142320.212766 |
fiat | 1843.119166 | 137120.158888 |
renault | 1713.746982 | 141144.374698 |
# Calculating correlation between the two columns
df['mean_price'].corr(df['mean_mileage'])
0.43168914135180714
The correlation between 'mean_price' and 'mean_mileage' is about 0.43, a weak-to-moderate positive relationship. It is too weak to conclude that changes in one of these brand-level averages would be reflected in the other, and in any case a correlation computed on 15 brand averages says little about individual listings.
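`Series.corr` computes the Pearson coefficient by default; the same number can be cross-checked with numpy's `corrcoef`. A sketch on a few synthetic brand-level values (not the real aggregates):

```python
import numpy as np
import pandas as pd

# Illustrative brand-level means, not the real aggregation results
price = pd.Series([4477.0, 4322.0, 2096.0, 1713.0])
mileage = pd.Series([145826.0, 146834.0, 142320.0, 141144.0])

# Series.corr defaults to the Pearson correlation coefficient
r_pandas = price.corr(mileage)
# np.corrcoef returns the full 2x2 correlation matrix
r_numpy = np.corrcoef(price, mileage)[0, 1]
print(r_pandas, r_numpy)  # the two values agree
```

Both calls implement the same formula, so they should match to floating-point precision.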