The dataset we'll work with for this project contains used car listings from eBay Kleinanzeigen, a classifieds section of the German eBay website.
The dataset was originally scraped and uploaded to Kaggle by user orgesleka. The original dataset isn't available on Kaggle anymore, but you can find it here.
The aim of this project is to clean the data and analyze the included used car listings.
We will import the pandas and NumPy libraries to explore the dataset and get information about it.
# Import libraries to read the dataset
import pandas as pd
import numpy as np
autos = pd.read_csv("autos.csv", encoding = "Latin-1")
autos
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
49995 | 2016-03-27 14:38:19 | Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon | privat | Angebot | $24,900 | control | limousine | 2011 | automatik | 239 | q5 | 100,000km | 1 | diesel | audi | nein | 2016-03-27 00:00:00 | 0 | 82131 | 2016-04-01 13:47:40 |
49996 | 2016-03-28 10:50:25 | Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... | privat | Angebot | $1,980 | control | cabrio | 1996 | manuell | 75 | astra | 150,000km | 5 | benzin | opel | nein | 2016-03-28 00:00:00 | 0 | 44807 | 2016-04-02 14:18:02 |
49997 | 2016-04-02 14:44:48 | Fiat_500_C_1.2_Dualogic_Lounge | privat | Angebot | $13,200 | test | cabrio | 2014 | automatik | 69 | 500 | 5,000km | 11 | benzin | fiat | nein | 2016-04-02 00:00:00 | 0 | 73430 | 2016-04-04 11:47:27 |
49998 | 2016-03-08 19:25:42 | Audi_A3_2.0_TDI_Sportback_Ambition | privat | Angebot | $22,900 | control | kombi | 2013 | manuell | 150 | a3 | 40,000km | 11 | diesel | audi | nein | 2016-03-08 00:00:00 | 0 | 35683 | 2016-04-05 16:45:07 |
49999 | 2016-03-14 00:42:12 | Opel_Vectra_1.6_16V | privat | Angebot | $1,250 | control | limousine | 1996 | manuell | 101 | vectra | 150,000km | 1 | benzin | opel | nein | 2016-03-13 00:00:00 | 0 | 45897 | 2016-04-06 21:18:48 |
50000 rows × 20 columns
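The Latin-1 encoding is needed because the default UTF-8 read fails on the German characters in the listings. As a sketch (the helper name `read_with_fallback` is ours, not part of pandas), a small fallback loop can pick a working encoding:

```python
import pandas as pd

def read_with_fallback(path, encodings=("utf-8", "latin-1")):
    # Try each encoding in turn; Latin-1 can decode any byte sequence,
    # so it works as a last-resort catch-all.
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc)
        except UnicodeDecodeError:
            continue
    raise ValueError(f"none of {encodings} could decode {path}")
```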
# Use the info() method to get information about the data
autos.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 dateCrawled 50000 non-null object 1 name 50000 non-null object 2 seller 50000 non-null object 3 offerType 50000 non-null object 4 price 50000 non-null object 5 abtest 50000 non-null object 6 vehicleType 44905 non-null object 7 yearOfRegistration 50000 non-null int64 8 gearbox 47320 non-null object 9 powerPS 50000 non-null int64 10 model 47242 non-null object 11 odometer 50000 non-null object 12 monthOfRegistration 50000 non-null int64 13 fuelType 45518 non-null object 14 brand 50000 non-null object 15 notRepairedDamage 40171 non-null object 16 dateCreated 50000 non-null object 17 nrOfPictures 50000 non-null int64 18 postalCode 50000 non-null int64 19 lastSeen 50000 non-null object dtypes: int64(5), object(15) memory usage: 7.6+ MB
#Use the head() method to view the first few rows of the dataset
autos.head()
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
From exploring the dataset, we can make the following observations:

- Most columns are stored as strings (object dtype), including price and odometer, which actually hold numeric data.
- The vehicleType, gearbox, model, fuelType and notRepairedDamage columns contain null values.
- The column names use camelcase instead of Python's preferred snakecase.

Next we will convert the column names from camelcase to snakecase and reword some of the column names to be more descriptive.
autos.columns
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model', 'odometer', 'monthOfRegistration', 'fuelType', 'brand', 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object')
# Use the DataFrame.columns attribute to view the existing column names,
# then rename the columns, transforming camelcase to snakecase
old_column_names = autos.columns
new_column_names = ['date_crawled',
'name',
'seller',
'offer_type',
'price',
'abtest',
'vehicle_type',
'registration_year',
'gearbox',
'power_ps',
'model',
'odometer',
'registration_month',
'fuel_type',
'brand',
'unrepaired_damage',
'ad_created',
'num_of_pictures',
'postal_code',
'last_seen']
# Assign the new column names to the dataframe using the DataFrame.columns attribute
autos.columns = new_column_names
autos.head(2)
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | num_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
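Instead of writing out the new names by hand, most of the camelcase-to-snakecase conversion could be automated with a small regex helper. This is a sketch of our own, not part of the project: names like powerPS and nrOfPictures would still need manual overrides to become power_ps and num_of_pictures.

```python
import re

def camel_to_snake(name):
    # Insert an underscore before each uppercase letter, then lowercase everything
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()

print(camel_to_snake("dateCrawled"))         # date_crawled
print(camel_to_snake("yearOfRegistration"))  # year_of_registration
```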
We will continue with some basic exploration to figure out where further data cleaning is needed. We will look for:

- text columns where almost all of the values are the same, as these can often be dropped since they hold little useful information for analysis, and
- numeric data stored as text that needs to be cleaned and converted.

# We will look at the descriptive statistics of our dataset
autos.describe(include='all')
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | num_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-03-27 22:55:05 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
From the described dataset we can notice that:

- The vehicle_type, gearbox, model, fuel_type and unrepaired_damage columns have counts less than 50000, so they likely contain missing (NaN) values.
- The seller and offer_type columns have 2 unique entries each, but 'privat' and 'Angebot' account for 49999 of the 50000 entries in their respective columns. These columns can be dropped.
- The price column returns NaN for the numeric summary statistics, suggesting the prices are not stored in a numeric format.
- The num_of_pictures column seems to contain only 0 values. It can be dropped.
- The registration_year and registration_month columns are stored as int instead of a datetime format.
- The postal_code column can be converted to strings, as postal codes are not true numeric values.
- The date_crawled, ad_created and last_seen columns appear to be in string format.
- The odometer column is stored as text and should be converted to a numeric data type.

# Investigating the seller column further
autos['seller'].value_counts()
privat 49999 gewerblich 1 Name: seller, dtype: int64
# Investigating the offer_type column further
autos['offer_type'].value_counts()
Angebot 49999 Gesuch 1 Name: offer_type, dtype: int64
# Investigating the num_of_pictures column further
autos['num_of_pictures'].value_counts()
0 50000 Name: num_of_pictures, dtype: int64
From the exploration above, the seller, offer_type and num_of_pictures columns can be dropped: the first two hold a single value in all but one row, and the last holds the same value (0) in every row.
# Converting the price column to a numeric type
# regex=False makes str.replace treat "$" literally (it is a regex metacharacter)
autos["price"] = autos["price"].str.replace("$", "", regex=False).str.replace(",", "", regex=False).astype(int)
# Converting the odometer column to a numeric type
autos["odometer"] = autos["odometer"].str.replace("km", "", regex=False).str.replace(",", "", regex=False).astype(int)
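The chained replace calls could also be collapsed into a single regex that strips every non-digit character in one pass. A sketch on small sample series:

```python
import pandas as pd

prices = pd.Series(["$5,000", "$8,500", "$0"])
odometers = pd.Series(["150,000km", "70,000km", "5,000km"])

# \D matches any non-digit, so '$', ',' and 'km' are all removed at once
clean_prices = prices.str.replace(r"\D", "", regex=True).astype(int)
clean_odometers = odometers.str.replace(r"\D", "", regex=True).astype(int)
```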
# Renaming the odometer column to a more descriptive name
autos.rename({'odometer':'odometer_km'}, axis=1, inplace=True)
# Removing unwanted columns
autos.drop(['seller','offer_type', 'num_of_pictures'],
axis=1, inplace=True)
autos.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 17 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 date_crawled 50000 non-null object 1 name 50000 non-null object 2 price 50000 non-null int64 3 abtest 50000 non-null object 4 vehicle_type 44905 non-null object 5 registration_year 50000 non-null int64 6 gearbox 47320 non-null object 7 power_ps 50000 non-null int64 8 model 47242 non-null object 9 odometer_km 50000 non-null int64 10 registration_month 50000 non-null int64 11 fuel_type 45518 non-null object 12 brand 50000 non-null object 13 unrepaired_damage 40171 non-null object 14 ad_created 50000 non-null object 15 postal_code 50000 non-null int64 16 last_seen 50000 non-null object dtypes: int64(6), object(11) memory usage: 6.5+ MB
We'll continue exploring the data, specifically looking for values that don't look right. We'll begin by analyzing the odometer_km and price columns. For each, we will check how many unique values there are, view the summary statistics, and remove any rows with unrealistic values.
# Use Series.unique().shape to see how many unique values each column has
print('The shape of odometer_km column is:', autos['odometer_km'].unique().shape)
print('The shape of price column is:', autos['price'].unique().shape)
The shape of odometer_km column is: (13,) The shape of price column is: (2357,)
# Use Series.describe() to view the summary statistics
autos[['odometer_km','price']].describe()
odometer_km | price | |
---|---|---|
count | 50000.000000 | 5.000000e+04 |
mean | 125732.700000 | 9.840044e+03 |
std | 40042.211706 | 4.811044e+05 |
min | 5000.000000 | 0.000000e+00 |
25% | 125000.000000 | 1.100000e+03 |
50% | 150000.000000 | 2.950000e+03 |
75% | 150000.000000 | 7.200000e+03 |
max | 150000.000000 | 1.000000e+08 |
The price column has a minimum value of 0, which looks like an outlier since cars are not sold for $0. Similarly, it is unlikely for used cars to sell for as much as $100,000,000.
The minimum odometer_km value of 5,000 is far below the lower quartile of 125,000, so we will inspect the distribution to see whether it is an outlier.
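Before filtering, the reasoning above can be quantified by counting how many values fall outside a plausible range. A sketch on a small sample series (the real check would run on autos['price']):

```python
import pandas as pd

prices = pd.Series([0, 1350, 4350, 99_999_999, 350_000])

# between() is inclusive on both ends
plausible = prices.between(0, 350_000)
n_outliers = (~plausible).sum()
```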
# Exploring the odometer_km column
# With Series.value_counts() to see unique values
autos['odometer_km'].value_counts(ascending=False).head(13)
150000 32424 125000 5170 100000 2169 90000 1757 80000 1436 70000 1230 60000 1164 50000 1027 5000 967 40000 819 30000 789 20000 784 10000 264 Name: odometer_km, dtype: int64
The data for the odometer column looks realistic and needs no further modification at this point.
We will cap the maximum price at 350,000; values above this seem unrealistic.
# Exploring the price column
# With Series.value_counts() to see unique values
autos['price'].value_counts(ascending=False).head(20)
0 1421 500 781 1500 734 2500 643 1000 639 1200 639 600 531 800 498 3500 498 2000 460 999 434 750 433 900 420 650 419 850 410 700 395 4500 394 300 384 2200 382 950 379 Name: price, dtype: int64
We will not remove the listings priced at 0. With 1,421 entries they form a noticeable share of the dataset and may be useful as we go along.
autos = autos[autos['price'].between(0,350000)]
autos['price'].describe()
count 49986.000000 mean 5721.525167 std 8983.617820 min 0.000000 25% 1100.000000 50% 2950.000000 75% 7200.000000 max 350000.000000 Name: price, dtype: float64
There are 5 columns that should represent date values. Some of these columns were created by the crawler, while others came from the website itself:

- date_crawled: added by the crawler
- last_seen: added by the crawler
- ad_created: from the website
- registration_month: from the website
- registration_year: from the website

The date_crawled, last_seen, and ad_created columns are stored as strings; we need to convert the data into a numerical representation so we can understand it quantitatively. The other two columns are already in a numeric format, so methods like Series.describe() can help us understand the distribution without further data processing.
# First, a closer look at how the values in the three string columns are formatted
autos[['date_crawled','ad_created','last_seen']][:5]
date_crawled | ad_created | last_seen | |
---|---|---|---|
0 | 2016-03-26 17:47:46 | 2016-03-26 00:00:00 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | 2016-04-04 00:00:00 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | 2016-03-26 00:00:00 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | 2016-03-12 00:00:00 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | 2016-04-01 00:00:00 | 2016-04-01 14:38:50 |
You'll notice that the first 10 characters represent the date, formatted as year-month-day (e.g. 2016-03-12), and the last 8 characters represent the time as hour, minute and second (e.g. 14:38:50).
Going forward, we will extract just the date values to understand the date range.
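An alternative to string slicing is to parse these columns with pd.to_datetime, which yields proper datetime values instead of strings. A sketch on a small sample:

```python
import pandas as pd

crawled = pd.Series(["2016-03-26 17:47:46", "2016-04-04 13:38:56"])

# Parsing gives datetime64 values; .dt.date keeps just the calendar date
parsed = pd.to_datetime(crawled)
dates = parsed.dt.date
```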
# Select the first characters with series.str[]
# Generate a distribution, and then sort by the index.
autos['date_crawled'].str[:10].value_counts(normalize=True, dropna=False).sort_index()
2016-03-05 0.025387 2016-03-06 0.013944 2016-03-07 0.035970 2016-03-08 0.033269 2016-03-09 0.033209 2016-03-10 0.032129 2016-03-11 0.032489 2016-03-12 0.036770 2016-03-13 0.015564 2016-03-14 0.036630 2016-03-15 0.033990 2016-03-16 0.029508 2016-03-17 0.031509 2016-03-18 0.013064 2016-03-19 0.034910 2016-03-20 0.037831 2016-03-21 0.037490 2016-03-22 0.032909 2016-03-23 0.032389 2016-03-24 0.029108 2016-03-25 0.031749 2016-03-26 0.032489 2016-03-27 0.031049 2016-03-28 0.034850 2016-03-29 0.034150 2016-03-30 0.033629 2016-03-31 0.031909 2016-04-01 0.033809 2016-04-02 0.035410 2016-04-03 0.038691 2016-04-04 0.036490 2016-04-05 0.013104 2016-04-06 0.003181 2016-04-07 0.001420 Name: date_crawled, dtype: float64
The date_crawled values range between 2016-03-05 and 2016-04-07.
# Select the first characters with series.str[]
# Generate a distribution, and then sort by the index.
autos['ad_created'].str[:10].value_counts(normalize=True, dropna=False).sort_index()
2015-06-11 0.000020 2015-08-10 0.000020 2015-09-09 0.000020 2015-11-10 0.000020 2015-12-05 0.000020 ... 2016-04-03 0.038931 2016-04-04 0.036850 2016-04-05 0.011843 2016-04-06 0.003261 2016-04-07 0.001280 Name: ad_created, Length: 76, dtype: float64
Ads were created between 2015-06-11 and 2016-04-07, with the largest share of ads created on 2016-04-03.
# Select the first characters with series.str[]
# Generate a distribution, and then sort by the index.
autos['last_seen'].str[:10].value_counts(normalize=True, dropna=False).sort_index()
2016-03-05 0.001080 2016-03-06 0.004421 2016-03-07 0.005362 2016-03-08 0.007582 2016-03-09 0.009843 2016-03-10 0.010763 2016-03-11 0.012524 2016-03-12 0.023807 2016-03-13 0.008983 2016-03-14 0.012804 2016-03-15 0.015884 2016-03-16 0.016445 2016-03-17 0.027928 2016-03-18 0.007422 2016-03-19 0.015744 2016-03-20 0.020706 2016-03-21 0.020726 2016-03-22 0.021586 2016-03-23 0.018585 2016-03-24 0.019565 2016-03-25 0.019205 2016-03-26 0.016965 2016-03-27 0.016024 2016-03-28 0.020846 2016-03-29 0.022326 2016-03-30 0.024847 2016-03-31 0.023827 2016-04-01 0.023106 2016-04-02 0.024887 2016-04-03 0.025367 2016-04-04 0.024627 2016-04-05 0.124275 2016-04-06 0.220982 2016-04-07 0.130957 Name: last_seen, dtype: float64
# We will view the distribution of registration_year
autos['registration_year'].describe()
count 49986.000000 mean 2005.075721 std 105.727161 min 1000.000000 25% 1999.000000 50% 2003.000000 75% 2008.000000 max 9999.000000 Name: registration_year, dtype: float64
The minimum (1000) and maximum (9999) values of the registration_year column appear unrealistic and are likely data-entry errors.
# Generate a distribution, and then sort by the index.
registration_year_list = autos['registration_year'].value_counts(normalize=True, dropna=False, ascending=True).sort_index()
print(registration_year_list.head(20))
print('\n')
print(registration_year_list.tail(20))
1000 0.00002 1001 0.00002 1111 0.00002 1500 0.00002 1800 0.00004 1910 0.00018 1927 0.00002 1929 0.00002 1931 0.00002 1934 0.00004 1937 0.00008 1938 0.00002 1939 0.00002 1941 0.00004 1943 0.00002 1948 0.00002 1950 0.00006 1951 0.00004 1952 0.00002 1953 0.00002 Name: registration_year, dtype: float64 2011 0.032689 2012 0.026467 2013 0.016125 2014 0.013304 2015 0.007982 2016 0.026327 2017 0.029048 2018 0.009823 2019 0.000060 2800 0.000020 4100 0.000020 4500 0.000020 4800 0.000020 5000 0.000080 5911 0.000020 6200 0.000020 8888 0.000020 9000 0.000040 9996 0.000020 9999 0.000080 Name: registration_year, dtype: float64
From the exploration of the registration_year column, we observed that it contains:

- unrealistically early years such as 1000, 1111 and 1800, and
- unrealistically late years such as 2800, 9000 and 9999.

Because a car can't be first registered after the listing was seen, any vehicle with a registration year above 2016 is definitely inaccurate. At the other end, the earliest plausible years lie somewhere in the first few decades of the 1900s, but since pre-1990 listings make up only a tiny fraction of the data, we will keep registration years from 1990 (as earliest) to 2016.
# Selecting registration years from 1990 to 2016
autos = autos[autos['registration_year'].between(1990,2016)]
autos.describe()
price | registration_year | power_ps | odometer_km | registration_month | postal_code | |
---|---|---|---|---|---|---|
count | 46628.000000 | 46628.000000 | 46628.000000 | 46628.000000 | 46628.000000 | 46628.000000 |
mean | 5753.910269 | 2003.518036 | 117.910097 | 126136.119928 | 5.801021 | 50921.426782 |
std | 8908.747513 | 5.801633 | 197.405225 | 39567.381126 | 3.694594 | 25757.597852 |
min | 0.000000 | 1990.000000 | 0.000000 | 5000.000000 | 0.000000 | 1067.000000 |
25% | 1149.000000 | 1999.000000 | 75.000000 | 125000.000000 | 3.000000 | 30519.000000 |
50% | 2990.000000 | 2003.000000 | 109.000000 | 150000.000000 | 6.000000 | 49671.000000 |
75% | 7300.000000 | 2008.000000 | 150.000000 | 150000.000000 | 9.000000 | 71636.000000 |
max | 350000.000000 | 2016.000000 | 17700.000000 | 150000.000000 | 12.000000 | 99998.000000 |
autos['registration_year'].value_counts(normalize=True, ascending=True)
1991 0.007635 1992 0.008364 1990 0.008471 2015 0.008557 1993 0.009544 1994 0.014155 2014 0.014262 2013 0.017286 1995 0.028138 2016 0.028223 2012 0.028374 1996 0.030969 2010 0.034250 2011 0.035043 1997 0.043493 2009 0.044973 2008 0.047847 2007 0.049412 1998 0.052608 2002 0.054324 2001 0.057948 2006 0.058055 2003 0.058484 2004 0.058699 1999 0.064296 2005 0.064661 2000 0.071931 Name: registration_year, dtype: float64
Listings with registration years earlier than 1990 or later than 2016 have been removed. The distribution peaks around 2000, with most registrations falling between the late 1990s and the mid-2000s.
We can use aggregation to understand the brand column and the variation it contains. For this we will:

- identify the unique values we want to aggregate by,
- create an empty dictionary to store our aggregate data, and
- loop over the unique values, and for each: select the rows with that value, compute the mean of the column of interest, and store the result in the dictionary.
autos.head(3)
date_crawled | name | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | 5000 | control | bus | 2004 | manuell | 158 | andere | 150000 | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | 8500 | control | limousine | 1997 | automatik | 286 | 7er | 150000 | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | 8990 | test | limousine | 2009 | manuell | 102 | golf | 70000 | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 35394 | 2016-04-06 20:15:37 |
autos['brand'].describe()
count 46628 unique 40 top volkswagen freq 9874 Name: brand, dtype: object
autos_brand = autos['brand'].value_counts(normalize=True)
unique_brands = autos['brand'].unique()
print(autos_brand)
print('\n')
print('These are the unique brands in the data brand column:', '\n', unique_brands)
volkswagen 0.211761 bmw 0.111221 opel 0.109398 mercedes_benz 0.092670 audi 0.087866 ford 0.069851 renault 0.048426 peugeot 0.030389 fiat 0.025864 seat 0.018701 skoda 0.016406 mazda 0.015484 nissan 0.015356 smart 0.014326 citroen 0.013919 toyota 0.012632 hyundai 0.010144 volvo 0.009029 mini 0.008793 mitsubishi 0.008343 sonstige_autos 0.008042 honda 0.007871 kia 0.007313 alfa_romeo 0.006391 suzuki 0.006026 porsche 0.005404 chevrolet 0.005104 chrysler 0.003667 dacia 0.002638 daihatsu 0.002595 jeep 0.002230 subaru 0.002166 land_rover 0.002016 saab 0.001566 jaguar 0.001544 daewoo 0.001544 rover 0.001394 lancia 0.001072 lada 0.000493 trabant 0.000343 Name: brand, dtype: float64 These are the unique brands in the data brand column: ['peugeot' 'bmw' 'volkswagen' 'smart' 'ford' 'chrysler' 'seat' 'renault' 'mercedes_benz' 'audi' 'opel' 'mazda' 'porsche' 'mini' 'toyota' 'dacia' 'nissan' 'jeep' 'saab' 'volvo' 'mitsubishi' 'jaguar' 'fiat' 'skoda' 'subaru' 'sonstige_autos' 'kia' 'citroen' 'chevrolet' 'hyundai' 'honda' 'daewoo' 'suzuki' 'land_rover' 'alfa_romeo' 'rover' 'trabant' 'daihatsu' 'lancia' 'lada']
Volkswagen is the most popular brand, accounting for about 21 percent of all listings, while Trabant is the least common. Let's explore the brands that each account for more than 5% of listings.
# Select the brands for the aggregation
selected_brands = autos_brand[autos_brand > 0.05].index
print(selected_brands)
Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford'], dtype='object')
# Compute the mean price for each of the selected brands
bprice_means = {}
for brand in selected_brands:
    brand_select = autos[autos['brand'] == brand]
    avg_brand = brand_select['price'].mean()
    bprice_means[brand] = int(avg_brand)
bprice_means
{'volkswagen': 5233, 'bmw': 8135, 'opel': 2859, 'mercedes_benz': 8447, 'audi': 9181, 'ford': 3362}
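The explicit loop mirrors how aggregation works step by step; pandas' groupby can produce the same result in one line. A sketch on a hypothetical mini-frame (against the full data, the call would be autos.groupby('brand')['price'].mean()):

```python
import pandas as pd

# Hypothetical mini-frame standing in for the full autos dataframe
df = pd.DataFrame({
    "brand": ["bmw", "bmw", "ford", "ford"],
    "price": [8000, 8270, 3300, 3424],
})

# Group rows by brand and take the mean price within each group
mean_prices = df.groupby("brand")["price"].mean()
```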
From the price means we can group the car brands that each contribute over 5% of listings into 3 distinct price categories:

- Audi, Mercedes-Benz and BMW are the most expensive,
- Volkswagen is mid-priced, and
- Ford and Opel are the least expensive.
For the top 6 brands, we'll use aggregation to understand the average mileage for those cars and see if there is any visible link with the mean price.
autos.columns
Index(['date_crawled', 'name', 'price', 'abtest', 'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model', 'odometer_km', 'registration_month', 'fuel_type', 'brand', 'unrepaired_damage', 'ad_created', 'postal_code', 'last_seen'], dtype='object')
# Compute the mean mileage for each of the selected brands
mean_milage = {}
for brand in selected_brands:
    brand_mile = autos[autos['brand'] == brand]
    milage_mean = brand_mile['odometer_km'].mean()
    mean_milage[brand] = int(milage_mean)
print(mean_milage)
{'volkswagen': 129127, 'bmw': 132561, 'opel': 129836, 'mercedes_benz': 130925, 'audi': 129090, 'ford': 125164}
# Convert the brand price means and mean mileage dictionaries to series
bpm_series = pd.Series(bprice_means)
mean_mileage = pd.Series(mean_milage)
# Create a single-column dataframe from the price mean series
bpm_df = pd.DataFrame(bpm_series, columns=['price_mean'])
# Assign the mean mileage as a new column of the dataframe
bpm_df['mean_mileage'] = mean_mileage
# Display the dataframe (bpm_df) sorted by the price column
bpm_df.sort_values(by='price_mean', ascending=False)
price_mean | mean_mileage | |
---|---|---|
audi | 9181 | 129090 |
mercedes_benz | 8447 | 130925 |
bmw | 8135 | 132561 |
volkswagen | 5233 | 129127 |
ford | 3362 | 125164 |
opel | 2859 | 129836 |
From the analysis, we discovered three distinct classes of car brands based on their mean prices, with Audi the most expensive among the top six. However, the analysis also showed no clear relationship between the mean prices of the top brands and the average mileage their cars had covered: mean mileage is similar (roughly 125,000 to 133,000 km) across all six brands regardless of price.
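That conclusion can be sanity-checked with a Pearson correlation on the aggregated values, though with only six brands the coefficient is too noisy to be strong evidence either way. A sketch rebuilding the means from the table above (means_df is our own name):

```python
import pandas as pd

# Mean price and mean mileage per top brand, as computed above
means_df = pd.DataFrame({
    "price_mean":   [9181, 8447, 8135, 5233, 3362, 2859],
    "mean_mileage": [129090, 130925, 132561, 129127, 125164, 129836],
}, index=["audi", "mercedes_benz", "bmw", "volkswagen", "ford", "opel"])

# Pearson correlation between the two aggregated columns
corr = means_df["price_mean"].corr(means_df["mean_mileage"])
```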