Project Title: eBay Car Sales Data¶

eBay Motors is one of the world's largest marketplaces for trading used cars. The dataset of eBay used cars is analysed in this project. The dataset includes used cars advertisements listed in Germany between 11th June 2015 and 7th April 2016. This dataset is taken from from eBay Kleinanzeigen, a classifieds section of the German eBay website.

The original dataset can be found at https://data.world/data-society/used-cars-data. For the purpose of this project, the original dataset was reduced to 50,000 listings.The reduced dataset used for this project is saved as "autos.csv".

In the data cleaning process below, the listings are further reduced to 44,870. Among others, rows with vehicles registerd before 1986 (vehicles categorised as antique in Germany) are removed.

Project Findings - Highlights:¶

Almost 70% of the vehicles are 9 years or older, out of which, about 8% of the vehicles are 20 years older or older.
Between 11 June 2015 - 7 April 2016, overall, about 99.8% of the listings were created in a period of 38 days (in March and April 2016).

The top 6 brand-model combination have, among others, 2 things in common. More than 2/3 of these listings have their 'gearbox' listed as 'manual' and majority of these manually driven vahicles were driven for 150,000km.

Project Purpose:¶

Below, the data is analysed for vehicles registered between 1986 and 2016. The findings are produced in Part E of this project (below).

---} Part E: Data Analysis¶

--> 5. Analyse the data to:
    5.1. Calculate the distribution based on the column: 'reg_year'.
    5.2. Calculate the distribution based on the columns: 'date_crawled', 'ad_created' and 'last_seen'.
    5.3  Select brand and aggregating mean price.
    5.4. Calculate the mean mileage and mean price for each of the top brands.
    5.5. Find the most common brand/model combinations
    5.6. Find out if the average prices follows any patterns based on the mileage.
    5.7. Find out how much cheaper are cars with damage than their non-damaged counterparts.

Before Part E, the following tasks are carried out.¶

---} Part A: Original Dataset¶

---> 1. How the original dataset is organised
    1.1.  Observation

---} Part B: Reorganising Dataset¶

---> 2. Rows & Columns
    2.1. Review of unique values returned as NaN.
    2.2. Review of columns with only 2 unique values.
    2.3. Convert datetype of 'price' and 'odometer' columns from object to integer.
    2.4. Translate non-English word to English words. 
    2.5. Chang the use camelcase to snakecase in the names of columns and reorganising the columns.

---} Part C: Organised Dataset¶

---> 3. Quick Review of the organised dataset.

---} Part D: Cleaning Data Entries¶

---> 4. Data entries
    4.1. Remove data for antique vehicles from the columns: 'reg_year'.
    4.2. Remove inaccurate entries in the column: 'reg_year'.
    4.3. Review data entry for columns: 'reg_month'.
    4.4. Check for outliers in the column: 'adometer_km'.
    4.5. Check for outliers in the column: 'price_$'.

Part A: Original Dataset¶

1: How the original dataset is organised¶

In [1]:

import pandas as pd
import numpy as np
autos = pd.read_csv("autos.csv")
autos

Out[1]:

	dateCrawled	name	seller	offerType	price	abtest	vehicleType	yearOfRegistration	gearbox	powerPS	model	odometer	monthOfRegistration	fuelType	brand	notRepairedDamage	dateCreated	nrOfPictures	postalCode	lastSeen
0	2016-03-26 17:47:46	Peugeot_807_160_NAVTECH_ON_BOARD	privat	Angebot	$5,000	control	bus	2004	manuell	158	andere	150,000km	3	lpg	peugeot	nein	2016-03-26 00:00:00	0	79588	2016-04-06 06:45:54
1	2016-04-04 13:38:56	BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik	privat	Angebot	$8,500	control	limousine	1997	automatik	286	7er	150,000km	6	benzin	bmw	nein	2016-04-04 00:00:00	0	71034	2016-04-06 14:45:08
2	2016-03-26 18:57:24	Volkswagen_Golf_1.6_United	privat	Angebot	$8,990	test	limousine	2009	manuell	102	golf	70,000km	7	benzin	volkswagen	nein	2016-03-26 00:00:00	0	35394	2016-04-06 20:15:37
3	2016-03-12 16:58:10	Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...	privat	Angebot	$4,350	control	kleinwagen	2007	automatik	71	fortwo	70,000km	6	benzin	smart	nein	2016-03-12 00:00:00	0	33729	2016-03-15 03:16:28
4	2016-04-01 14:38:50	Ford_Focus_1_6_Benzin_T�V_neu_ist_sehr_gepfleg...	privat	Angebot	$1,350	test	kombi	2003	manuell	0	focus	150,000km	7	benzin	ford	nein	2016-04-01 00:00:00	0	39218	2016-04-01 14:38:50
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
49995	2016-03-27 14:38:19	Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon	privat	Angebot	$24,900	control	limousine	2011	automatik	239	q5	100,000km	1	diesel	audi	nein	2016-03-27 00:00:00	0	82131	2016-04-01 13:47:40
49996	2016-03-28 10:50:25	Opel_Astra_F_Cabrio_Bertone_Edition___T�V_neu+...	privat	Angebot	$1,980	control	cabrio	1996	manuell	75	astra	150,000km	5	benzin	opel	nein	2016-03-28 00:00:00	0	44807	2016-04-02 14:18:02
49997	2016-04-02 14:44:48	Fiat_500_C_1.2_Dualogic_Lounge	privat	Angebot	$13,200	test	cabrio	2014	automatik	69	500	5,000km	11	benzin	fiat	nein	2016-04-02 00:00:00	0	73430	2016-04-04 11:47:27
49998	2016-03-08 19:25:42	Audi_A3_2.0_TDI_Sportback_Ambition	privat	Angebot	$22,900	control	kombi	2013	manuell	150	a3	40,000km	11	diesel	audi	nein	2016-03-08 00:00:00	0	35683	2016-04-05 16:45:07
49999	2016-03-14 00:42:12	Opel_Vectra_1.6_16V	privat	Angebot	$1,250	control	limousine	1996	manuell	101	vectra	150,000km	1	benzin	opel	nein	2016-03-13 00:00:00	0	45897	2016-04-06 21:18:48

50000 rows × 20 columns

In [2]:

autos.describe(include="all")

Out[2]:

	dateCrawled	name	seller	offerType	price	abtest	vehicleType	yearOfRegistration	gearbox	powerPS	model	odometer	monthOfRegistration	fuelType	brand	notRepairedDamage	dateCreated	nrOfPictures	postalCode	lastSeen
count	50000	50000	50000	50000	50000	50000	44905	50000.000000	47320	50000.000000	47242	50000	50000.000000	45518	50000	40171	50000	50000.0	50000.000000	50000
unique	48213	38754	2	2	2357	2	8	NaN	2	NaN	245	13	NaN	7	40	2	76	NaN	NaN	39481
top	2016-03-27 22:55:05	Ford_Fiesta	privat	Angebot	$0	test	limousine	NaN	manuell	NaN	golf	150,000km	NaN	benzin	volkswagen	nein	2016-04-03 00:00:00	NaN	NaN	2016-04-07 06:17:27
freq	3	78	49999	49999	1421	25756	12859	NaN	36993	NaN	4024	32424	NaN	30107	10687	35232	1946	NaN	NaN	8
mean	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2005.073280	NaN	116.355920	NaN	NaN	5.723360	NaN	NaN	NaN	NaN	0.0	50813.627300	NaN
std	NaN	NaN	NaN	NaN	NaN	NaN	NaN	105.712813	NaN	209.216627	NaN	NaN	3.711984	NaN	NaN	NaN	NaN	0.0	25779.747957	NaN
min	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1000.000000	NaN	0.000000	NaN	NaN	0.000000	NaN	NaN	NaN	NaN	0.0	1067.000000	NaN
25%	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1999.000000	NaN	70.000000	NaN	NaN	3.000000	NaN	NaN	NaN	NaN	0.0	30451.000000	NaN
50%	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2003.000000	NaN	105.000000	NaN	NaN	6.000000	NaN	NaN	NaN	NaN	0.0	49577.000000	NaN
75%	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2008.000000	NaN	150.000000	NaN	NaN	9.000000	NaN	NaN	NaN	NaN	0.0	71540.000000	NaN
max	NaN	NaN	NaN	NaN	NaN	NaN	NaN	9999.000000	NaN	17700.000000	NaN	NaN	12.000000	NaN	NaN	NaN	NaN	0.0	99998.000000	NaN

In [3]:

autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offerType            50000 non-null  object
 4   price                50000 non-null  object
 5   abtest               50000 non-null  object
 6   vehicleType          44905 non-null  object
 7   yearOfRegistration   50000 non-null  int64 
 8   gearbox              47320 non-null  object
 9   powerPS              50000 non-null  int64 
 10  model                47242 non-null  object
 11  odometer             50000 non-null  object
 12  monthOfRegistration  50000 non-null  int64 
 13  fuelType             45518 non-null  object
 14  brand                50000 non-null  object
 15  notRepairedDamage    40171 non-null  object
 16  dateCreated          50000 non-null  object
 17  nrOfPictures         50000 non-null  int64 
 18  postalCode           50000 non-null  int64 
 19  lastSeen             50000 non-null  object
dtypes: int64(5), object(15)
memory usage: 7.6+ MB

1.1. Observation¶

There are 50,000 entries from index 0 to 49999 and 20 columns.¶

1. Unique values returned as NaN¶

There are 5 values returned as 'NaN' for number of unique values. They are 'yearOfRegistration', 'PowerPS', 'monthOfRegistration', 'nrOfPictures and 'postalCode'.

A further analysis of these rows is required to find out the number of entries and whether they would add any value to the purpose of this project.

2. Columns with 2 unique values¶

There are 5 columns with only 2 unique values. They are 'seller', 'offerType', 'abtest', 'gearbox', and 'notRepairedDamages'.

A further analysis of these rows is required to find out the number of entries attributed to these 2 unique values.

3. Price and odometer datatypes¶

The 'price' and 'odometer' columns' datatypes are 'object'. These can be changed to integar type.

4. Translate non-English word to English words.¶

Columns 'fuelType', 'gearbox', 'NotRepairedDamage' includes words that are not in English language. These words can be translated to English.

5. Column's name uses CamelcaseColumns¶

The columns' names use camelcase instead of snakecase. Python code conventions require that snakecase be used. This would make the code more readable to others using the same programming language and would avoid confusions (for example, with classes which, as a code convention, uses capitalised first alphabets).

This can be changed to Python's preferred style - snakecase. The columns can also be rearranged.

---------- End of Part A ----------¶

Part B: Reorganising Dataset¶

2. Rows & Columns¶

2.1. Review of unique values returned as NaN¶

For 5 columns, the values returned as 'NaN' for the number of unique values. They are 'yearOfRegistration', 'PowerPS', 'monthOfRegistration', 'nrOfPictures' and 'postalCode'. These require further review to find out if they include only 1 or 2 unique values.¶

In [4]:

autos["yearOfRegistration"].value_counts()

Out[4]:

2000    3354
2005    3015
1999    3000
2004    2737
2003    2727
        ... 
4500       1
1931       1
1939       1
5911       1
1952       1
Name: yearOfRegistration, Length: 97, dtype: int64

In [5]:

autos["powerPS"].value_counts()

Out[5]:

0        5500
75       3171
60       2195
150      2046
140      1884
         ... 
455         1
1016        1
952         1
696         1
16312       1
Name: powerPS, Length: 448, dtype: int64

In [6]:

autos["monthOfRegistration"].value_counts()

Out[6]:

0     5075
3     5071
6     4368
5     4107
4     4102
7     3949
10    3651
12    3447
9     3389
11    3360
1     3282
8     3191
2     3008
Name: monthOfRegistration, dtype: int64

In [7]:

autos["nrOfPictures"].value_counts()

Out[7]:

0    50000
Name: nrOfPictures, dtype: int64

In [8]:

autos["postalCode"].value_counts()

Out[8]:

10115    109
65428    104
66333     54
45888     50
44145     48
        ... 
71576      1
76776      1
76872      1
91233      1
67585      1
Name: postalCode, Length: 7014, dtype: int64

The column 'nr_of_pictures' has only one unique value: '0'. With only one value, this column will not have any influence on the purpose of this project. This column 'nr_of_pictures' is deleted from the dataset.¶

In [9]:

autos.drop(columns=["nrOfPictures"], inplace=True)

2.2. Review of columns with only 2 unique values¶

For 5 columns, the values returned as '2' for the number of unique values. They are 'seller', 'offerType', 'abtest', 'gearbox' and 'notRepairedDamage'.¶

In [10]:

autos["seller"].value_counts()

Out[10]:

privat        49999
gewerblich        1
Name: seller, dtype: int64

In [11]:

autos["offerType"].value_counts()

Out[11]:

Angebot    49999
Gesuch         1
Name: offerType, dtype: int64

In [12]:

autos["abtest"].value_counts()

Out[12]:

test       25756
control    24244
Name: abtest, dtype: int64

In [13]:

autos["gearbox"].value_counts()

Out[13]:

manuell      36993
automatik    10327
Name: gearbox, dtype: int64

In [14]:

autos["notRepairedDamage"].value_counts()

Out[14]:

nein    35232
ja       4939
Name: notRepairedDamage, dtype: int64

The column 'seller' has only one entry for "gewerblich" and the column 'offerType' has only one entry for "Gesuch"¶

In [15]:

autos[autos["seller"] == "gewerblich"]

Out[15]:

	dateCrawled	name	seller	offerType	price	abtest	vehicleType	yearOfRegistration	gearbox	powerPS	model	odometer	monthOfRegistration	fuelType	brand	notRepairedDamage	dateCreated	postalCode	lastSeen
7738	2016-03-15 18:06:22	Verkaufe_mehrere_Fahrzeuge_zum_Verschrotten	gewerblich	Angebot	$100	control	kombi	2000	manuell	0	megane	150,000km	8	benzin	renault	NaN	2016-03-15 00:00:00	65232	2016-04-06 17:15:37

In [16]:

autos[autos["offerType"] == "Gesuch"]

Out[16]:

	dateCrawled	name	seller	offerType	price	abtest	vehicleType	yearOfRegistration	gearbox	powerPS	model	odometer	monthOfRegistration	fuelType	brand	notRepairedDamage	dateCreated	postalCode	lastSeen
17541	2016-04-03 15:48:33	Suche_VW_T5_Multivan	privat	Gesuch	$0	test	bus	2005	NaN	0	transporter	150,000km	0	NaN	volkswagen	NaN	2016-04-03 00:00:00	29690	2016-04-05 15:16:06

These rows with entries "gewerblich" and "Gesuch"are deleted.¶

In [17]:

autos.drop(index=[7738, 17541], inplace=True)

After the deletion of rows '7738' and '7541', the column 'seller' is left with one unique value ('privat') and 'offerType' is left with one unique value ('Angebot').¶

In [18]:

print(autos["seller"].unique())
print(autos["offerType"].unique())

['privat']
['Angebot']

With only one value, these columns ('seller' and 'offerType') will not have any influence on the purpose of this project. Therefore these columns ('seller' and 'offerType') are deleted from the dataset.¶

In [19]:

autos.drop(columns=["seller"], inplace=True)
autos.drop(columns=["offerType"], inplace=True)

2.3. Converting datetype of 'price' and 'odometer' columns from object to integer¶

The 'price' and 'odometer' columns has non-numeric values such as "$", "KM" and ","¶

In [20]:

autos["price"].head(3)

Out[20]:

0    $5,000
1    $8,500
2    $8,990
Name: price, dtype: object

In [21]:

autos["odometer"].head(3)

Out[21]:

0    150,000km
1    150,000km
2     70,000km
Name: odometer, dtype: object

The non-numeric values are removed from the 'price' and 'odometer' columns and these columns are converted to integer.¶

In [22]:

### a function (clean_col) is created to remove non-numeric values ###
def clean_col(col):
    col = col.str.replace(",", "", regex=True)
    col = col.str.replace("$", "", regex=True)
    col = col.str.replace("km", "", regex=True)
    return col

### clean_col function is called to remove non-numeric values from 'price' and 'odometer' columns ###
autos["price"] = clean_col(autos["price"])
autos["odometer"] = clean_col(autos["odometer"])

### the 'price' and 'odometer columns are converted to integer type'
autos["price"] = autos["price"].astype(int)
autos["odometer"] = autos["odometer"].astype(int)

In [23]:

autos[["price", "odometer"]].head(3)

Out[23]:

	price	odometer
0	5000	150000
1	8500	150000
2	8990	70000

2.4. Substitute non-English word with English words¶

There are three columns with data entries that are not in English language.¶

In [24]:

### extracting unique values from the 3 columns which has data entries not in English language ###
print("Column - 'fuelType':", autos["fuelType"].unique())
print("Column - 'gearbox':", autos["gearbox"].unique())
print("Column - 'NotRepairedDamage':", autos["notRepairedDamage"].unique())

Column - 'fuelType': ['lpg' 'benzin' 'diesel' nan 'cng' 'hybrid' 'elektro' 'andere']
Column - 'gearbox': ['manuell' 'automatik' nan]
Column - 'NotRepairedDamage': ['nein' nan 'ja']

In [25]:

### using the map() function to assign the English translation to these 3 columns ###
autos["fuelType"] = autos["fuelType"].map({
    'lpg':"lpg",
    'benzin':"petrol",
    'diesel':"diesel",
    'cng':"cng",
    'hybrid':"hybrid",
    'elektro':"electric",
    'andere':"other"
    })
autos["gearbox"] = autos["gearbox"].map({'manuell':"manual",'automatik':"automatic"})
autos["notRepairedDamage"] = autos["notRepairedDamage"].map({'nein':"no",'ja':"yes"})

2.5. Changing the use camelcase to snakecase and reorganising columns¶

The columns uses camelcase style for their names. This is change to snakecase.¶

In [26]:

### changing names style from camelcase to snakecase###
autos.columns = ['date_crawled', 'name', 'price_$', 'ab_test', 'vehicle_type', 'reg_year', 'gearbox', 'power_ps', 'model',
       'odometer_km', 'reg_month', 'fuel_type', 'brand', 'unrepaired_damage', 'ad_created', 'postal_code', 'last_seen']

### rearranging columns ###
autos = autos[['brand', 'model', 'name', 'vehicle_type', 'price_$', 'odometer_km', 'reg_year', 'reg_month', 'gearbox', 
               'power_ps', 'fuel_type', 'ab_test', 'unrepaired_damage', 'postal_code', 'ad_created', 'date_crawled', 
               'last_seen']]

---------- End of Part B ----------¶

Part C: Organised Dataset¶

3. Quick Review of the organised dataset¶

In [27]:

autos.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 49998 entries, 0 to 49999
Data columns (total 17 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   brand              49998 non-null  object
 1   model              47240 non-null  object
 2   name               49998 non-null  object
 3   vehicle_type       44903 non-null  object
 4   price_$            49998 non-null  int32 
 5   odometer_km        49998 non-null  int32 
 6   reg_year           49998 non-null  int64 
 7   reg_month          49998 non-null  int64 
 8   gearbox            47319 non-null  object
 9   power_ps           49998 non-null  int64 
 10  fuel_type          45517 non-null  object
 11  ab_test            49998 non-null  object
 12  unrepaired_damage  40171 non-null  object
 13  postal_code        49998 non-null  int64 
 14  ad_created         49998 non-null  object
 15  date_crawled       49998 non-null  object
 16  last_seen          49998 non-null  object
dtypes: int32(2), int64(4), object(11)
memory usage: 6.5+ MB

After reorganising the dataset, there are 49,998 rows with 17 coumns. The columns are arranged as follows:¶

0 brand: The brand of the car.

1 model: The car model name.

2 name: Name of the car.

3 vehicle_type: The vehicle type.

4 price_$: The price on the advertisement to sell the car.

5 odometer_km: How many kilometers the car has been driven.

6 reg_year: The year in which the car was first registered.

7 reg_month: The month in which the car was first registered.

8 gearbox: The transmission type.

9 power_ps: The power of the car in PS.

10 fuel_type: What type of fuel the car uses.

11 ab_test: Whether the listing is included in an A/B test.

12 unrepaired_damage: If the car has a damage which is not yet repaired.

13 postal_code: The postal code for the location of the vehicle.

14 ad_created: The date on which the eBay listing was created.

15 date_crawled: When this advertisement was first crawled. All field-values are taken from this date.

16 last_seen: When the crawler saw this advertisement last online.

Deletion Log:¶

Rows: (Initial Count: 50,000) (Previous Count:50,000) (Current Count: 49,998)

Part 2.2: 'seller' with entry for "gewerblich"  

Part 2.2: 'offerType' with entry for "Gesuch".

Columns: (Initial Count: 20) (Previous Count:20) (Current Count: 17)

Part 2.1: ''nr_of_pictures' with one unique value: '0'.

Part 2.2: 'seller' left with one unique value: 'privat'.

Part 2.2: 'offerType' left with one unique value: 'Angebot'

---------- End of Part C ----------¶

Part D: Cleaning Data Entries¶

4. Data entries¶

4.1. Removing data for antique vehicles from the columns: 'reg_year'¶

The rules pertaining to a car being antique may differ in different countries. For example, the Antique Automobile Club of America defines antique car as over 25 years of age. In Germany, cars older than 30 years are termed Oldtimer and may be registered as historic/antique cars.¶

For the value of an antique car, the brand and performance of the car are not the only significant factor or may be irrelevant. The value of an antique car may depends on many other factors such as the registraton year, looks and historic relevance etc.¶

Therefore, to put those that are not antique in the same category as antique cars for the purposes of comparison and data analysis may not be the right approach for this project.¶

In [28]:

autos.loc[autos["reg_year"]<1986].shape

Out[28]:

(924, 17)

There are 924 listings in the dataset for antique cars. These listings are removed from the dataset.¶

In [29]:

### from the dataset, removing antique vehicles ###
autos = autos.drop(autos.loc[autos["reg_year"]<1986].index)

4.2. Removing inaccurate entries in the column: 'reg_year'¶

Checking for listings with vehicle registration year indicated beyond 2016.¶

In [30]:

len(autos[autos["reg_year"]>2016])

Out[30]:

There are 1,966 listings with vehicle registration year indicated beyond 2016. As the dataset represents advertisements until April 2016, these 1,966 listings are therefore inaccurate. These rows are deleted from the dataset.¶

In [31]:

### from the dataset, removing rows with vehicle 'reg_year' beyond 2016. ###
autos = autos.drop(autos[autos["reg_year"]>2016].index)

4.3. Reviewing data entry for columns: 'reg_month'¶

The listings with registration year beyond 2016 have already been deleted. As the dataset represents advertisements of used cars until April 2016, this code checks for listings with registration month beyond the month of 'April'.¶

In [32]:

### checking for vehicles with 'reg_year' year 2016 and 'reg_month' beyond April ###
reg_month_above_4 = autos[(autos["reg_month"]==2016) & (autos["reg_month"] >4)]

print("Number of listings with registration month indicated beyond April:", 
      len(reg_month_above_4))

Number of listings with registration month indicated beyond April: 0

There are no null values in the column 'reg_month'. However, there are 13 unique entries for the months when there should have been 12 only.¶

In [33]:

### printing non-null values, unique values and listing counts for the column: 'reg_month' ###
print("reg_month          49998 non-null  int64")
print("Unique entries for 'reg_month' is", len(autos["reg_month"].unique()))
autos["reg_month"].value_counts()

reg_month          49998 non-null  int64
Unique entries for 'reg_month' is 13

Out[33]:

3     4810
0     4383
6     4128
5     3887
4     3880
7     3684
10    3519
12    3305
9     3261
11    3247
1     3076
8     3049
2     2879
Name: reg_month, dtype: int64

There are 4,383 listings with registration month indicated as $0. The 'wait and see' approach is adopted here and these listings are NOT deleted from the dataset. This is because the indicated month in the column - 'reg_month' may not be a significant factor for the purpose of this project. If required, these listings can be removed if the column - 'reg_month' becomes relevant to the purpose of this project.¶

4.4. Checking for outliers in the column: 'odometer_km'¶

There are no nul values in the column: 'adometer_km'. There are 13 unique entries. The minimum value is 5,000km and the maximum value is 150,000km.¶

In [34]:

### printing non-null values, unique, min and max values for 'odometer_km' column ###
print("odometer_km        49998 non-null  int64")
print("Unique entries:", len(autos["odometer_km"].unique()))
print("Minimum value:", autos["odometer_km"].min())
print("Maximum value:", autos["odometer_km"].max())

### extracting value_count() ###
autos["odometer_km"].value_counts().sort_values()

odometer_km        49998 non-null  int64
Unique entries: 13
Minimum value: 5000
Maximum value: 150000

Out[34]:

10000       229
20000       737
30000       746
40000       779
5000        805
50000       974
60000      1091
70000      1153
80000      1353
90000      1654
100000     2008
125000     4880
150000    30699
Name: odometer_km, dtype: int64

Checking the registration years for the first 3 entries above: 5,000km, 10,000km and 20,000km.¶

In [35]:

### counting entries by registration year for cars driven 5,000km ###
print(autos.loc[autos["odometer_km"]==5000, ["reg_year"]].value_counts().head(15))
print('\n')

### counting entries by registration year for cars driven 10,000km ###
print(autos.loc[autos["odometer_km"]==10000, ["reg_year"]].value_counts().head(15))
print('\n')

### counting entries by registration year for cars driven 20,000km ###
print(autos.loc[autos["odometer_km"]==20000, ["reg_year"]].value_counts().head(15))

reg_year
2000        107
2015         89
2016         86
1995         50
2005         48
1998         38
2002         35
1999         35
2001         35
1990         35
1997         32
2004         24
2014         23
1996         20
2007         19
dtype: int64


reg_year
2015        113
2014         41
2013         17
2011          6
2010          6
2009          6
2012          5
2000          4
1986          3
1988          3
1995          3
2005          2
2007          2
2006          2
2016          2
dtype: int64


reg_year
2014        188
2015        149
2013         74
2012         70
2000         24
2011         23
2010         21
2005         19
2009         17
1999         15
2007         14
2001         14
1998         12
1995         11
2002         10
dtype: int64

For vehicles driven 5,000 km, there are 80+ listings with their registration years listed as 2015 and 2016. The listings drop to 23 in 2014. Thereafter, the next earliest year is 2007.¶

The same observation is made for vehicles driven 10,000 km and 20,000 km. For them, the outliers appears to be the year 2009.¶

In [36]:

### printing number of rows and columns for ###
    
    ### vehicles driven 5,000km and are registered before 2014 ###
print("Vehicles driven 5,00 km and registered before 2014 - rows & columns:",
      autos[(autos["odometer_km"] == 5000) & (autos["reg_year"] < 2014)].shape)
   
    ### vehicles driven 10,000km and are registered before 2009 ###
print("Vehicles driven 10,00 km and registered before 2009 - rows & columns:",
      autos[(autos["odometer_km"] == 10000) & (autos["reg_year"] < 2009)].shape)

    ### vehicles driven 20,000km and are registered before 2009 ###
print("Vehicles driven 20,00 km and registered before 2009 - rows & columns:",
      autos[(autos["odometer_km"] == 20000) & (autos["reg_year"] < 2009)].shape)

Vehicles driven 5,00 km and registered before 2014 - rows & columns: (607, 17)
Vehicles driven 10,00 km and registered before 2009 - rows & columns: (33, 17)
Vehicles driven 20,00 km and registered before 2009 - rows & columns: (187, 17)

For the purpose of this project, the above 3 categories are removed from the dataset:¶

In [37]:

### from the dataset, removing rows with vehicles driven 5,000km and are registered before 2014 ###
autos = autos.drop(autos[(autos["odometer_km"] <= 5000) & (autos["reg_year"] < 2014)].index) 

### from the dataset, removing rows with vehicles driven 10,000km and are registered before 2009 ###
autos = autos.drop(autos[(autos["odometer_km"] <= 10000) & (autos["reg_year"] < 2009)].index) 

### from the dataset, removing rows with vehicles driven 20,000km and are registered before 2009 ###
autos = autos.drop(autos[(autos["odometer_km"] <= 20000) & (autos["reg_year"] < 2009)].index) 

4.5. Checking for outliers in the column: 'price_$'¶

There are no null values in the 'price' column. The minimum value entered is 0 and maximum value entered is 99999999.¶

In [38]:

### printing non-null values, unique, min and max values for 'price_$' column ###
print("price_$            49998 non-null  int64")
print("Unique entries:", len(autos["price_$"].unique()))
print("Minimum value:", autos["price_$"].min())
print("Maximum value:", autos["price_$"].max())

price_$            49998 non-null  int64
Unique entries: 2301
Minimum value: 0
Maximum value: 99999999

Checking the entries for vehicles with price listed between 0-99.¶

In [39]:

### extracting number of rows with price between $0 - $99 and assigning it to a variable: price_below_100 ###
price_below_100 = autos.loc[(autos["price_$"] >=0) & (autos["price_$"] <= 99), "price_$"]
print("Number of entries with price between $0-99:", len(price_below_100))

### counting values in the variable: below_100 ###
print(price_below_100.value_counts().sort_values(ascending=False).head(10))

Number of entries with price between $0-99: 1404
0     1149
1      124
50      38
99      18
80      12
70       7
60       6
90       5
75       5
40       3
Name: price_$, dtype: int64

In [40]:

### extracting number of rows with price between $100 and assigning it to a variable: price_equal_100 ###
price_equal_100 = autos[autos["price_$"] == 100]
print("Number of entries with price equal to $100:", len(price_equal_100))

Number of entries with price equal to $100: 117

Between the price 99 to 100, there is jump in number of entries from 18 to 117.¶

This outlier (listings with price below 100) are removed from the dataset.¶

In [41]:

### from the dataset, removing rows with vehicle price below $100 ###
autos = autos.drop(autos[autos["price_$"] <= 99].index)

Checking the entries for vehicles with price above 500,000.¶

In [42]:

### extracting number of rows with price above 500k and assigning it to a variable: price_above_500k ###
price_above_500k = autos.loc[autos["price_$"] > 500000,['brand', 'model', 'vehicle_type', 'reg_year', 'price_$']]
print("Number of rows with price above $500,000:", len(price_above_500k))
print('\n')
print(price_above_500k)

Number of rows with price above $500,000: 7


                brand     model vehicle_type  reg_year   price_$
514              ford     focus        kombi      2009    999999
7814   sonstige_autos       NaN        coupe      1992   1300000
22947             bmw       NaN        kombi      1999   1234566
24384      volkswagen       NaN          NaN      1995  11111111
39705   mercedes_benz  s_klasse    limousine      1999  99999999
42221         citroen        c4    limousine      2014  27322222
47598            opel    vectra    limousine      2001  12345678

None of these brands and models would, realistically, fetch the advertised price. Therefore, these listings with price above 500,000 do not provide reliable data points with respect to price and are removed from the dataset.¶

In [43]:

### from the dataset, removing rows with vehicle price above $500,000 ###
autos = autos.drop(autos[autos["price_$"] > 500000].index)

Deletion Log:¶

Part 2¶

Rows: (Initial Count: 50,000) (Previous Count:50,000) (Current Count: 49,998)

Part 2.2: 'seller' with entry for "gewerblich"  
Part 2.2: 'offerType' with entry for "Gesuch".

Part 4¶

Rows: (Initial Count: 50,000)(Previous Count:49,998) (Current Count: 44,870)

Part 4.1: antique vehicles (registered before 1986): 924 rows.
Part 4.2: vehicle registration year indicated beyond 2016: 1,966 rows. 
Part 4.4: vehicles driven 5,00 km and registered before 2014: 607 rows.
Part 4.4: vehicles driven 10,00 km and registered before 2009: 33 rows.
Part 4.4: vehicles driven 20,00 km and registered before 2009: 187 rows.
Part 4.5: vehicle priced between $0-99: 1,404 rows.   
Part 4.5: vehicles prices above 500,000: 7 rows

Columns: No change (Initial Count: 20)(Previous Count:17) (Current Count: 17)

In [44]:

print("Number of rows and columns in the dataframe:", autos.shape)
### checking for accuracy of the number of rows ###
print(len(autos) == 49998 - (924 + 1966 + 607 + 33 + 187 + 1404 + 7))

Number of rows and columns in the dataframe: (44870, 17)
True

---------- End of Part D ----------¶

Part E: Data Analysis¶

5. Analyse the data¶

5.1 Calculating the distribution based on the column: 'reg_year'.¶

Taking into consideration that the dataset includes advertisements between June 2015 and April 2016, the following observations are made.¶

In [45]:

### counting number of listings based on the registration year and converting them to percentage ###
distr_ = autos["reg_year"].value_counts(normalize=True).sort_values(ascending=False)*100
distr_

Out[45]:

2000    6.739470
2005    6.369512
1999    6.307109
2003    5.961667
2004    5.946066
2006    5.894807
2001    5.754402
2002    5.424560
1998    5.110319
2007    5.001114
2008    4.891910
2009    4.622242
1997    4.214397
2011    3.592601
2010    3.521284
1996    2.966347
2012    2.903945
2016    2.678850
1995    2.560731
2013    1.769556
2014    1.475373
1994    1.346111
1993    0.922665
2015    0.846891
1992    0.793403
1991    0.744373
1990    0.693113
1989    0.372186
1988    0.276354
1987    0.153778
1986    0.144863
Name: reg_year, dtype: float64

About 63% of the vehicles listed are 10 years old or older.¶

In [46]:

### extracting percentage of vehicles that are 5 years old or younger ###
distr_above_2010 = distr_[distr_.index>2010]
print("Vehicles 5 years old or younger", 
      sum(distr_above_2010), 
      "%")

### extracting percentage of vehicles that are 5-9 years old ###
distr_btw_2006_2010 = distr_[(distr_.index>=2006) & (distr_.index<=2010)]
print("Vehicles 5-9 years old:", 
      sum(distr_btw_2006_2010), 
      "%")

### extracting percentage of vehicles that are 10 years old or older ###
distr_below_2006 = distr_[distr_.index<2006]
print("Vehicles 10 years old or older:", 
      sum(distr_below_2006), 
      "%")

Vehicles 5 years old or younger 13.267216402941832 %
Vehicles 5-9 years old: 23.931357254290173 %
Vehicles 10 years old or older: 62.801426342767996 %

Almost 70% of the vehicles listed are 9 years or older, out of which, about 8% of the vehicles are 20 years older or older.¶

In [47]:

### extracting percentage of vehicles that are 9 years old or older ###
print("Vehicles 9 years old or older:", 
      sum(distr_[distr_.index<=2006]), 
      "%")

### extracting percentage of vehicles that are 20 years old or older ###
print("Vehicles 20 years old or older:", 
      sum(distr_[distr_.index<1996]), 
      "%")

Vehicles 9 years old or older: 68.69623356362827 %
Vehicles 20 years old or older: 8.00757744595498 %

5.2. Calculating the distribution based on the columns: 'date_crawled', 'ad_created' and 'last_seen'¶

In [48]:

### For the 'ad_created', 'date_craweled' and 'last_seen':###
### - extracting the years and months only using the str[] function;###
### - spliting the years and months to 2 seperate columns; and###
### - converting them to integer ###

ad_year = autos["ad_created"].str[:7].str.split("-", expand=True,).astype(int)
ad_year.columns = ["yr_created", "month_created"]
ad_year[["yr_crawled","month_crawled"]]= autos["date_crawled"].str[:7].str.split("-", expand=True,).astype(int)
ad_year[["yr_seen","month_seen"]]= autos["last_seen"].str[:7].str.split("-", expand=True,).astype(int)

print(ad_year.info())
ad_year.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 44870 entries, 0 to 49999
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   yr_created     44870 non-null  int32
 1   month_created  44870 non-null  int32
 2   yr_crawled     44870 non-null  int32
 3   month_crawled  44870 non-null  int32
 4   yr_seen        44870 non-null  int32
 5   month_seen     44870 non-null  int32
dtypes: int32(6)
memory usage: 1.4 MB
None

Out[48]:

	yr_created	month_created	yr_crawled	month_crawled	yr_seen	month_seen
0	2016	3	2016	3	2016	4
1	2016	4	2016	4	2016	4
2	2016	3	2016	3	2016	4
3	2016	3	2016	3	2016	3
4	2016	4	2016	4	2016	4

The number of rows is consistent with the listings. All rows do no have null values. This reorganised dataset based on years and months is used to assist with counting the distribution.¶

In [49]:

### For 'ad_created', 'date_craweled' and 'last_seen', ###
### - counting listings by year ###
### - counting listings by month. ###

print(ad_year[["yr_created", "month_created"]].value_counts())
print('\n')

print(ad_year[["yr_crawled", "month_crawled"]].value_counts())
print('\n')

print(ad_year[["yr_seen", "month_seen"]].value_counts())

yr_created  month_created
2016        3                37540
            4                 7256
            2                   60
            1                    8
2015        12                   2
            6                    1
            8                    1
            9                    1
            11                   1
dtype: int64


yr_crawled  month_crawled
2016        3                37566
            4                 7304
dtype: int64


yr_seen  month_seen
2016     4             25910
         3             18960
dtype: int64

The above shows the followings:¶

1. Out of 44,870 listings, only 6 of them were created in 2015. The rest were created in 2016.¶

2. Out of 44,864 listings created in 2016, 37,540 were created in March and 7,256 were created in April. This means that in the year 2016, 68 listings were creating in the months of January and February and 44,796 listings were created in the months of March and April.¶

3. Between 11 June 2015 - 7 April 2016, overall, about 99.8% of the listings were created in a space of 38 days in March and April 2016.¶

4. All 44,870 listings were crawled only in the months of March and April 2016. This shows that prior to March 2016, the 68 listings created, including those in 2015, were not crawled prior to March 2016¶

5.3. Selecting brand and aggregating mean price.¶

In [50]:

### counting number of listings based on the brand and converting them to percentage ###
brand = autos["brand"].value_counts(normalize=True)*100
brand

Out[50]:

volkswagen        21.201248
bmw               11.123245
opel              10.719857
mercedes_benz      9.538667
audi               8.872298
ford               6.948964
renault            4.773791
peugeot            3.046579
fiat               2.527301
seat               1.863160
skoda              1.664809
nissan             1.555605
mazda              1.542233
smart              1.446401
citroen            1.415199
toyota             1.290394
hyundai            1.029641
mini               0.900379
volvo              0.900379
mitsubishi         0.826833
honda              0.788946
kia                0.726543
alfa_romeo         0.641854
sonstige_autos     0.639626
suzuki             0.599510
porsche            0.554936
chevrolet          0.517049
chrysler           0.347671
dacia              0.274125
daihatsu           0.247381
jeep               0.229552
subaru             0.218409
land_rover         0.207265
saab               0.167150
jaguar             0.151549
daewoo             0.147092
rover              0.138177
lancia             0.102518
trabant            0.066860
lada               0.046802
Name: brand, dtype: float64

In [51]:

### extracting brand that has more than 5% listings in total ###
brand_listing_more_5perc = brand[brand > 5].index
brand_listing_more_5perc

Out[51]:

Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford'], dtype='object')

In [52]:

### create empty dictionary ###
dic_brand_avg_price = {}
  
### loop over each brand that has more than 5% listings ###
### assign each brand to the empty dictionary as the key ###
### for each brands, calculate the average price ###
### assign the average price for that brand to the emptry dictionary as the value ###
for brand in brand_listing_more_5perc:
    avg_price = autos.loc[autos["brand"] == brand, "price_$"].mean()
    dic_brand_avg_price[brand] = int(avg_price)

dic_brand_avg_price

Out[52]:

{'volkswagen': 5436,
 'bmw': 8375,
 'opel': 3001,
 'mercedes_benz': 8556,
 'audi': 9419,
 'ford': 3490}

Except for ford, all of them have their headquerter in Germany. Audi has the highest average price and opel has the lowest average price. The average price gap between these 2 brands is about $6,400.¶

5.4. Calculating the mean mileage and mean price for each of the top brands¶

In [53]:

### create empty dictionary ###
dic_brand_avg_mileage = {}

### loop over each brand in dic_brand_avg_price ###
### assign each brand to the empty dictionary as the key ###
### for each brands, calculate the average mileage ###
### assign the average mileage for that brand to the emptry dictionary as the value ###

for brand in dic_brand_avg_price:
    avg_mileage = autos.loc[autos["brand"]==brand, "odometer_km"].mean()
    dic_brand_avg_mileage[brand] = int(avg_mileage)
    
dic_brand_avg_mileage

Out[53]:

{'volkswagen': 130877,
 'bmw': 134713,
 'opel': 132144,
 'mercedes_benz': 132517,
 'audi': 130106,
 'ford': 127134}

In [54]:

### assigning dic_brand_avg_price to a variable: brand_avg_price ###
brand_avg_price = pd.Series(dic_brand_avg_price).sort_values(ascending=False)
### assigning dic_brand_avg_mileage to a variable: brand_avg_mileage ###
brand_avg_mileage = pd.Series(dic_brand_avg_mileage).sort_values(ascending=False)

In [55]:

### converting brand_avg_price to dataframe ###
brand_price_mileage = pd.DataFrame(brand_avg_price, columns = ['avg_price'])
### to that dataframe, adding avg_mileage column ###
brand_price_mileage["avg_mileage"] = brand_avg_mileage

brand_price_mileage

Out[55]:

	avg_price	avg_mileage
audi	9419	130106
mercedes_benz	8556	132517
bmw	8375	134713
volkswagen	5436	130877
ford	3490	127134
opel	3001	132144

Even though the average mileage differs by less than 10,000K among all brands (range is 127,000km - 135,000km), there is significant difference in the average price. The average price of audi is about $6,000 more than volkswagen.¶

5.5. Finding the most common brand/model combinations¶

In [56]:

### using groupby() function to count the listings by brand and model and assigning it to a variable: 'brand_model' ### 
brand_model = autos.groupby(["brand","model"]).size().nlargest(6)
brand_model 

Out[56]:

brand       model 
volkswagen  golf      3613
bmw         3er       2545
volkswagen  polo      1558
opel        corsa     1541
volkswagen  passat    1326
opel        astra     1319
dtype: int64

In [57]:

### using groupby() function to count the listings by 'brand', 'model' and 'gearbox' and ###
### assigning it to a variable: brand_model_gear ###
brand_model_gearbox = autos.groupby(["brand","model", "gearbox"]).size().sort_values(ascending=False).head(6)

### calculating the percentage of 'brand_model_gearbox' against 'brand_model' ###
brand_model_gearbox_perc = (brand_model_gearbox/brand_model) * 100
brand_model_gearbox_perc.sort_values(ascending=False).astype(int)

Out[57]:

brand       model   gearbox
volkswagen  polo    manual     92
opel        corsa   manual     91
            astra   manual     89
volkswagen  golf    manual     84
bmw         3er     manual     76
volkswagen  passat  manual     75
dtype: int32

The top 6 brand model combination have one thing in common. More than 2/3 of these listings have their 'gearbox' listed as 'manual'.¶

In [58]:

### using groupby() function to count the listings by 'brand', 'model', 'gearbox' and 'adometer_km' and ###
### assigning it to a variable: brand_model_gearbox_mileage ###
brand_model_gearbox_mileage =autos.groupby(["brand","model", "gearbox", "odometer_km"]).size().sort_values(ascending=False).head(6)

### calculating the percentage of 'brand_model_gearbox_mileage' against 'brand_model_gearbox' ###
brand_model_gearbox_mileage_per = (brand_model_gearbox_mileage/brand_model_gearbox) * 100
brand_model_gearbox_mileage_per.sort_values(ascending=False).astype(int)

Out[58]:

brand       model   gearbox  odometer_km
volkswagen  passat  manual   150000         85
bmw         3er     manual   150000         84
volkswagen  golf    manual   150000         71
opel        astra   manual   150000         70
volkswagen  polo    manual   150000         66
opel        corsa   manual   150000         65
dtype: int32

Furthermore, majority of these manually driven vahicles have 'adomoneter_km' listed as 150,000km.¶

These findings seems to be consistent with the findings in Part 5.1 above that 70% of the vehicles listed are 9 years or older.¶

Observation¶

5.6. Finding if the average prices follows any patterns based on the mileage.¶

In [59]:

### counting number of listings based on 'odometer_km' and assigning it to varaible: mileage ###
mileage = autos["odometer_km"].value_counts()
mileage

Out[59]:

150000    29590
125000     4763
100000     1960
90000      1631
80000      1330
70000      1139
60000      1081
50000       959
40000       775
30000       736
20000       547
10000       192
5000        167
Name: odometer_km, dtype: int64

In [60]:

### extracting the index of mileage above and assigning them to a variable: mileage_ ###
mileage_ = mileage.index

### creating emptry dictionary ###
dic_mileage_avg_price = {}

### loop over each mileage / index and assigning it to the empty dictionary as the key ### 
### calculate average price of all listings based on each mileage/index ###
### assigning the average price to the dictionary as the value ###
for mile in mileage_:
    avg_price = autos.loc[autos["odometer_km"] == mile, "price_$"].mean()
    dic_mileage_avg_price[mile] = int(avg_price)

### extracting the dictionary keys and value and presenting them in a new dataframe format ###
mileage_avg_price = pd.Series(dic_mileage_avg_price).sort_values(ascending=False)
mileage_avg_price = pd.DataFrame(mileage_avg_price, columns = ["avg_price"])
mileage_avg_price

Out[60]:

	avg_price
5000	28708
10000	24005
20000	21884
30000	16710
40000	15617
50000	13823
60000	12350
70000	10946
80000	9768
90000	8373
100000	7900
125000	6213
150000	3742

There is a gradual decrease in price when the adomoter reading for km driven increases.¶

5.7. Finding out how much cheaper are cars with damage than their non-damaged counterparts¶

In [61]:

### using groupby() function to calculate the average difference in price ###
autos.groupby('unrepaired_damage')['price_$'].mean().astype(int)

Out[61]:

unrepaired_damage
no     7085
yes    2252
Name: price_$, dtype: int32

Vehicles with damages repaired are listed at higher price.¶

---------- End of Part D ----------¶