eBay Inc. is an American multinational e-commerce corporation based in San Jose, California, that facilitates consumer-to-consumer and business-to-consumer sales through its website. eBay Kleinanzeigen is a online classified advertisement market in Germany, handles from mobile devices.
The aim of the project is to analyse the car sales data scrapped from eBay Kleinanzeigen and discover patterns in the listings. The dataset was scrapped and uplaoded to Kaggle, the dataset is in german.
The following questions are answered through the analysis:-
A few columns are described as below:-
dateCrawled - When this ad was first crawled. All field-values are taken from this date.
name - Name of the car.
seller - Whether the seller is private or a dealer.
offerType - The type of listing.
price - The price on the ad to sell the car.
abtest - Whether the listing is included in an A/B test.
vehicleType - The vehicle Type.
yearOfRegistration - The year in which the car was first registered.
gearbox - The transmission type.
powerPS - The power of the car in PS.
model - The car model name.
kilometer - How many kilometers the car has driven.
monthOfRegistration - The month in which the car was first registered.
fuelType - What type of fuel the car uses.
brand - The brand of the car.
notRepairedDamage - If the car has a damage which is not yet repaired.
dateCreated - The date on which the eBay listing was created.
nrOfPictures - The number of pictures in the ad.
postalCode - The postal code for the location of the vehicle.
lastSeenOnline - When the crawler saw this ad last online.
Importing and cleaning the dataset before starting the analysis.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
import re
import datetime
from pgeocode import Nominatim
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler
from yellowbrick.cluster import KElbowVisualizer
pd.set_option('display.float_format', lambda x: '%.5f' % x)
df = pd.read_csv("autos.csv",encoding='Latin-1', parse_dates=['dateCrawled','dateCreated','lastSeen'])
print(df.shape)
df.head(5)
(50000, 20)
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 | 0 | 39218 | 2016-04-01 14:38:50 |
For ease of analysis, a few column names are changed as below.
rm_cols = [
'dateCrawled',
'lastSeen',
'nrOfPictures',
'name'
]
new_cols = {'offerType':'offer_type',
'vehicleType':'vehicle_type','yearOfRegistration':'registration_year',
'powerPS':'power_PS','monthOfRegistration':'registration_month',
'fuelType':'fuel_type','notRepairedDamage':'repaired_damage',
'dateCreated':'date_created',
'postalCode':'postal_code'}
df.rename(columns=new_cols,inplace=True)
df.drop(rm_cols,axis=1,inplace=True)
df.columns
Index(['seller', 'offer_type', 'price', 'abtest', 'vehicle_type', 'registration_year', 'gearbox', 'power_PS', 'model', 'odometer', 'registration_month', 'fuel_type', 'brand', 'repaired_damage', 'date_created', 'postal_code'], dtype='object')
df.isna().sum()
seller 0 offer_type 0 price 0 abtest 0 vehicle_type 5095 registration_year 0 gearbox 2680 power_PS 0 model 2758 odometer 0 registration_month 0 fuel_type 4482 brand 0 repaired_damage 9829 date_created 0 postal_code 0 dtype: int64
Only a few columns have null values, namely vehicle_type, registration_year, model, repaired_damage and fuel_type. Since these are real time listings, there is no way these values can be imputed and hence are dropped from the dataset.
df.dropna(inplace=True)
df.head(5)
seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_PS | model | odometer | registration_month | fuel_type | brand | repaired_damage | date_created | postal_code | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 | 79588 |
1 | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 | 71034 |
2 | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 | 35394 |
3 | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 | 33729 |
4 | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 | 39218 |
The price and the odometer columns are actually numeric in nature, but due to the addition of '$' and 'km', they have been marked as objects. These columns are cleaned for further analysis. The odometer column's name has been changed to odometer_km for intuitiveness
df.price = df.price.str.replace('$','')
df.price = df.price.str.replace(',','').astype('int32')
df.odometer = df.odometer.str.replace('km','')
df.odometer = df.odometer.str.replace(',','').astype('int32')
df.rename(columns={'odometer':'odometer_km'},inplace=True)
The first column to be analysed is the price column.
df.price.describe()
count 35047.00000 mean 7442.56476 std 89491.39433 min 0.00000 25% 1500.00000 50% 3800.00000 75% 8600.00000 max 12345678.00000 Name: price, dtype: float64
df[df.price >= 200000]
seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_PS | model | odometer_km | registration_month | fuel_type | brand | repaired_damage | date_created | postal_code | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
514 | privat | Angebot | 999999 | test | kombi | 2009 | manuell | 101 | focus | 125000 | 4 | benzin | ford | nein | 2016-03-17 | 12205 |
2897 | privat | Angebot | 11111111 | test | limousine | 1973 | manuell | 48 | escort | 50000 | 3 | benzin | ford | nein | 2016-03-12 | 94469 |
12682 | privat | Angebot | 265000 | control | coupe | 2016 | automatik | 500 | 911 | 5000 | 3 | benzin | porsche | nein | 2016-03-28 | 70193 |
34723 | privat | Angebot | 299000 | test | coupe | 1977 | manuell | 260 | 911 | 100000 | 7 | benzin | porsche | nein | 2016-03-23 | 61462 |
35923 | privat | Angebot | 295000 | test | cabrio | 2015 | automatik | 400 | 911 | 5000 | 6 | benzin | porsche | nein | 2016-04-03 | 74078 |
36818 | privat | Angebot | 350000 | control | coupe | 2016 | manuell | 500 | 911 | 5000 | 3 | benzin | porsche | nein | 2016-03-27 | 70499 |
37585 | privat | Angebot | 999990 | test | limousine | 1985 | manuell | 111 | jetta | 150000 | 12 | benzin | volkswagen | ja | 2016-03-29 | 50997 |
37840 | privat | Angebot | 220000 | test | coupe | 2008 | manuell | 415 | 911 | 30000 | 7 | benzin | porsche | nein | 2016-03-21 | 69198 |
47337 | privat | Angebot | 259000 | test | cabrio | 2001 | manuell | 400 | z_reihe | 20000 | 6 | benzin | bmw | nein | 2016-04-05 | 61462 |
47598 | privat | Angebot | 12345678 | control | limousine | 2001 | manuell | 101 | vectra | 150000 | 3 | benzin | opel | nein | 2016-03-31 | 4356 |
The price column has very high values. The mean of the column is relatively very low as compared to the max value. Thus there exist many outliers.
Upon research it was found that no vehicle sold over websites dealing with used cars, sell vehicles cheaper than $1000
. Hence the lower bound is taken to be 1000. There are cars listed upto $200000
, these are high end vehicles from top brands such as Porsche, BMW etc. Thus this is taken to be the upper bound.
df = df[df.price.between(1000,200000)]
data=[
go.Box(
y= df.price,
name= ''
)
]
layout = go.Layout(
title= {
'text':'<b>Distribution of price</b>',
'x':0.5
},
yaxis= go.layout.YAxis(
title= 'price (dollars)'
),
width= 800,
height= 600
)
fig = go.Figure(data= data, layout= layout)
fig.show()
The prices of the vehicles are now distributed between 1000 dollars and 200000 dollars.
The odometer_km column specifies the number of Kms the vehicle has already travelled. The below boxplot shows the distribution of the odometer.
df.odometer_km.describe()
count 29417.00000 mean 121130.12884 std 41008.12811 min 5000.00000 25% 90000.00000 50% 150000.00000 75% 150000.00000 max 150000.00000 Name: odometer_km, dtype: float64
data=[
go.Box(
y= df.odometer_km,
name= ''
)
]
layout = go.Layout(
title= {
'text':'<b>Distribution of odometer measurement</b>',
'x':0.5
},
yaxis= go.layout.YAxis(
title= 'odometer measurement (km)'
),
width= 800,
height= 600
)
fig = go.Figure(data= data, layout= layout)
fig.show()
The distribution of odometer measurements seem fine.
The power_PS column which describes the power of the vehicle has the distribution as below. The column contains many outliers.
data=[
go.Box(
y= df.power_PS,
name= ''
)
]
layout = go.Layout(
title= {
'text':'<b>Distribution of power of vehicle (in PS)</b>',
'x':0.5
},
yaxis= go.layout.YAxis(
title= 'power (PS)'
),
width= 800,
height= 600
)
fig = go.Figure(data= data, layout= layout)
fig.show()
All the outliers are faulty values. The cars with the highest horsepower (power) in the world are not more than 1600 bhp (equivalent to PS), similarly the car with the lowest horsepower in the world is around 55 bhp (PS). Thus the column has to be capped to 1600 maximum and cutt off at 55 minimum. Values beyond these are not valid. Since the cars with such high horsepower are rare in the world, it can be fair to assume that they are definitely not in the used cars list. Thus the upper limit is lowered to a more reasonable estimate like 900. Above 900 horsepower are the epic sport cars which are custom and rare.
df = df[df.power_PS.between(55,900)]
data=[
go.Box(
y= df.power_PS,
name= ''
)
]
layout = go.Layout(
title= {
'text':'<b>Distribution of power of vehicle (in PS)</b>',
'x':0.5
},
yaxis= go.layout.YAxis(
title= 'power (PS)'
),
width= 800,
height= 600
)
fig = go.Figure(data= data, layout= layout)
fig.show()
There are still some outliers, but they are in the acceptable range. The outliers maybe because of sports cars listed on the website. Hence these are ingnored for now.
The registration_year and registration_month column specify the year and month the car was registered. This sort of indicates when the car was bought. The two columns seem clean except that the registration_month contains 0s which is not a valid month. This can be considered as missing data. Hence is removed from the dataset.
df[['registration_year','registration_month']].describe()
registration_year | registration_month | |
---|---|---|
count | 27900.00000 | 27900.00000 |
mean | 2004.62369 | 6.27251 |
std | 6.01077 | 3.41458 |
min | 1937.00000 | 0.00000 |
25% | 2001.00000 | 3.00000 |
50% | 2005.00000 | 6.00000 |
75% | 2009.00000 | 9.00000 |
max | 2018.00000 | 12.00000 |
df = df[df.registration_month != 0]
The date_created column specifies the date the listing was created/uploaded. This can be taken as the date the car is in the market to be sold. Thus subtracting date_created and registration_year can result in the age of the car i.e. how old the vehicle is.
df['upload_year'] = df.date_created.apply(lambda x: x.year)
df['vehicle_age'] = abs(df.registration_year - df.upload_year)
df.head(5)
seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_PS | model | odometer_km | registration_month | fuel_type | brand | repaired_damage | date_created | postal_code | upload_year | vehicle_age | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | privat | Angebot | 5000 | control | bus | 2004 | manuell | 158 | andere | 150000 | 3 | lpg | peugeot | nein | 2016-03-26 | 79588 | 2016 | 12 |
1 | privat | Angebot | 8500 | control | limousine | 1997 | automatik | 286 | 7er | 150000 | 6 | benzin | bmw | nein | 2016-04-04 | 71034 | 2016 | 19 |
2 | privat | Angebot | 8990 | test | limousine | 2009 | manuell | 102 | golf | 70000 | 7 | benzin | volkswagen | nein | 2016-03-26 | 35394 | 2016 | 7 |
3 | privat | Angebot | 4350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70000 | 6 | benzin | smart | nein | 2016-03-12 | 33729 | 2016 | 9 |
7 | privat | Angebot | 1990 | control | limousine | 1998 | manuell | 90 | golf | 150000 | 12 | diesel | volkswagen | nein | 2016-03-16 | 53474 | 2016 | 18 |
df.vehicle_age.describe()
count 27458.00000 mean 11.31641 std 5.98866 min 0.00000 25% 7.00000 50% 11.00000 75% 15.00000 max 79.00000 Name: vehicle_age, dtype: float64
The listings show that the vehicles up for sale are extremely new with age in years as 0 or very old, vintage with age 85 years. The histogram below shows the distribution of ages.
fig = px.histogram(df, x='vehicle_age')
fig.update_layout(
title= dict(
text='<b>Distribution of vehicle ages</b>',
x=0.5
),
xaxis= go.layout.XAxis(
title='vehicle age'
),
yaxis= go.layout.YAxis(
title='frequency',
showticklabels=False
)
)
fig.show()
The histogram above is not very readable, thus for ease of analysis, the vehicle age column is cut into bins with the labels as given below :-
Brand new - 0 to 5 years
Fairly used - 5 to 15 years
Old - 15 to 25 years
Vintage - 25 years or more
For each of these labels, the number of vehicles listed and the average price of such vehicles is visualized in the plots below.
df['vehicle_age_label'] = pd.cut(
df.vehicle_age,
bins=[-1,5,15,25,85],
labels=['Brand new','Fairly used','Old','Vintage']
)
grouped = df[['vehicle_age_label','vehicle_age','price']].groupby('vehicle_age_label')
vehicle_ages = grouped.agg({
'price':'mean',
'vehicle_age':'size'
}).reset_index().rename(columns={'price':'avg_price','vehicle_age':'counts'})
vehicle_ages.counts
0 4288 1 17569 2 5030 3 571 Name: counts, dtype: int64
fig = make_subplots(
rows=2,
cols=1,
subplot_titles=[
'<b>Number of vehicles listed per vehicle age category</b>',
'<b>Average price of vehicles listed per vehicle age category</b>'
],
vertical_spacing=0.15
)
fig.add_trace(
go.Bar(
x= vehicle_ages.vehicle_age_label,
y= vehicle_ages.counts,
hovertemplate= '%{x}: %{y}<extra></extra>'
),
row=1,
col=1
)
fig.add_trace(
go.Bar(
x= vehicle_ages.vehicle_age_label,
y= vehicle_ages.avg_price,
hovertemplate= '%{x}: %{y}<extra></extra>'
),
row=2,
col=1
)
fig.update_yaxes(showticklabels=False)
fig.update_xaxes(title= 'vehicle age')
fig.update_layout(height=1000, showlegend=False)
fig.show()
The following conclusions are drawn from the two plots :-
This answers the first question - How old are the cars listed on the website? How are they priced?
It is concluded that most of the vehicles listed are about 5 to 15 years old. However, the prices for a Brand new vehicle, 0 to 5 years or a Vintage vehicle, 25 years or more are priced higher than the other vehicles.
The brand column specifies the manufacturer of the vehicle. There are vehicle from various manufacturers that are listed on the website. Vehicles from various manufacturers are priced differently. A brand like BMW or Porsche have vehicles listed at higher price as compared to Opel or Fiat. Hence it is important to consider the average price of vehicles from one manufacturer listed.
A top manufacturer here is broken into two segments. One those manufacturers that are listed most and, Two those manufacturers listed with highest prices.
top_10_listed = df.brand.value_counts().sort_values(ascending=False)[:10]
top_10_listed
volkswagen 5523 bmw 3655 mercedes_benz 3179 audi 2862 opel 2395 ford 1637 renault 1012 peugeot 781 skoda 579 fiat 523 Name: brand, dtype: int64
grouped = df[['brand','price']].groupby('brand')
top_10_priced = grouped.mean().sort_values(by='price',ascending=False).reset_index().iloc[:10]
top_10_priced
brand | price | |
---|---|---|
0 | porsche | 44846.11638 |
1 | land_rover | 20731.81250 |
2 | jaguar | 13543.00000 |
3 | jeep | 11981.80488 |
4 | audi | 11436.58421 |
5 | mini | 11063.33521 |
6 | mercedes_benz | 10199.51180 |
7 | bmw | 9844.65198 |
8 | chevrolet | 7896.10909 |
9 | volkswagen | 7650.45102 |
fig = make_subplots(
rows=2,
cols=1,
subplot_titles=[
'<b>Number of vehicles listed per Manufacturer</b><br>'+
'Top 10 most listed manufacturers (brands) on the website',
'<b>Average price of vehicles listed per Manufacturer</b><br>'+
'Top 10 most priced manufacturers (brands) listed on the website'
],
vertical_spacing=0.15
)
fig.add_trace(
go.Bar(
x= top_10_listed.index,
y= top_10_listed.values,
hovertemplate= '%{x}: %{y}<extra></extra>'
),
row=1,
col=1
)
fig.add_trace(
go.Bar(
x= top_10_priced.brand,
y= top_10_priced.price,
hovertemplate= '%{x}: %{y}<extra></extra>'
),
row=2,
col=1
)
fig.update_yaxes(showticklabels=False)
fig.update_xaxes(title= 'manufacturer')
fig.update_layout(height=1000, showlegend=False)
fig.show()
The most listed brand of vehicles are :-
* Volkswagen * BMW * Mercedes Benz
* Audi * Opel * Ford
* Renault * Peugeot * Skoda
* Fiat
The most priced brand of vehicles are :-
* Porsche * Land Rover * Jaguar
* Jeep * Audi * Mini cooper
* Mercedes Benz * BMW * Chevrolet
* Volkswagen
There are various manufacturers that are most listed as well as most priced - Volkswagen, Mercedes Benz, Audi and BMW. All these are German manufacturers and vehicles of these brands are most bought and are of a premium price range. This can give insights into the kind of market germany is. The following theories can be developed :-
The vehicle_type column describes the type of the chassis. The names are given in German as below:-
Limousine - Sedan
Kleinwagen - Compact
Kombi - Estate/station wagon
Coupe - Coupe
Cabrio - Convertible
Bus - Bus
SUV - SUV
Andere - Others
df.vehicle_type.unique()
array(['bus', 'limousine', 'kleinwagen', 'coupe', 'suv', 'cabrio', 'kombi', 'andere'], dtype=object)
grouped = df[['vehicle_type','price']].groupby('vehicle_type')
chassis_cnt = grouped.mean().reset_index().sort_values('price',ascending=False)
chassis_cnt
vehicle_type | price | |
---|---|---|
7 | suv | 14420.05643 |
3 | coupe | 13612.94441 |
2 | cabrio | 12036.05525 |
1 | bus | 7663.94356 |
6 | limousine | 7545.82716 |
5 | kombi | 7381.23940 |
0 | andere | 6020.02920 |
4 | kleinwagen | 4468.32776 |
fig = make_subplots(
rows=2,
cols=1,
subplot_titles=[
'<b>Number of vehicles listed per Type of vehicle (chassis)</b>',
'<b>Average price of vehicles listed per Type of vehicle (chassis)</b>'
],
vertical_spacing=0.15
)
fig.add_trace(
go.Bar(
x= df.vehicle_type.value_counts().index,
y= df.vehicle_type.value_counts().values,
hovertemplate= '%{x}: %{y}<extra></extra>'
),
row=1,
col=1
)
fig.add_trace(
go.Bar(
x= chassis_cnt.vehicle_type,
y= chassis_cnt.price,
hovertemplate= '%{x}: %{y}<extra></extra>'
),
row=2,
col=1
)
fig.update_yaxes(showticklabels=False)
fig.update_xaxes(title= 'type of vehicle (chassis)')
fig.update_layout(height=1000, showlegend=False)
fig.show()
The two plots show:-
The division of vehciles types listed as compared to all the vehicles listed on the website is shown below using a pie chart.
data= [
go.Pie(
labels= df.vehicle_type.value_counts().index,
values= df.vehicle_type.value_counts().values,
marker= dict(
line= dict(width=1),
),
marker_colors= px.colors.colorbrewer.Set2,
hole= 0.3,
hovertemplate= '%{label}: %{percent}<extra></extra>'
)
]
layout= go.Layout(
title= {
'text':'<b>Distribution of vehicle types listed on website</b>',
'x':0.5
}
)
fig = go.Figure(data= data, layout= layout)
fig.show()
As concluded earlier, more than 50% of the vehicles listed on the website are of type - limousine, Kleinwagen and Kombi.
This indicates that the used vehicles market mainly deals with cheaper and smaller cars as most of them listed are of such type. Even though luxury and high priced vehicles are listed, they are lesser in proportion. If the current conclusions are overlapped with the previous conclusions from the analysis of manufacturers, it can be said that, most vehicles listed are country made and on a bit cheaper end in terms of vehicle type.
The model column specifies the model of the car. For better understanding, the model and the brand column are merged to give the name of the vehicle. The analysis and visualizations below show the top 10 most listed vehicle on the website.
df['vehicle'] = df.brand+" "+df.model
vehicle_names = df.vehicle.value_counts()[:10][::-1]
vehicle_names
audi a6 598 audi a3 599 mercedes_benz e_klasse 695 opel astra 762 mercedes_benz c_klasse 826 bmw 5er 848 audi a4 854 volkswagen passat 864 bmw 3er 1812 volkswagen golf 2227 Name: vehicle, dtype: int64
data= [
go.Bar(
y= vehicle_names.index,
x= vehicle_names.values,
orientation='h',
marker= dict(color=['#009999','#009999','#009999',
'#009999','#009999','#009999',
'#009999','#009999','#ff9933',
'#ff9933'
]
),
hovertemplate= '%{y}: %{x}<extra></extra>'
)
]
layout= go.Layout(
title= {
'text':'<b>Top 10 vehicles by model listed on the website</b><br>'+
'Number of listings per vehicle model on the website (top 10)',
'x':0.5
},
xaxis= go.layout.XAxis(
title= 'number of vehicles',
showticklabels= False
),
yaxis= go.layout.YAxis(
title= 'vehicle model'
)
)
fig= go.Figure(data= data, layout= layout)
fig.add_shape(
x0=900,
y0=-0.5,
x1=900,
y1=9.6,
line= dict(
color= 'crimson',
width=2,
dash= 'dash'
)
)
fig.show()
The plot shows the top 10 vehicle models that are listed most on the website. The two vehicles that cross the 1000 mark are Volkswagen Golf and BMW 3 series. This indicates that these vehicles are maybe outdated in the market or are not as great as advertised and hence they are being sold the most.
All the vehicles in the top 10 are all German made. This can indicate two things -
The plot below can maybe highlight the reason for the above with respect to the two most listed model.
golf_cnts = df[df.vehicle == 'volkswagen golf'].vehicle_age_label.value_counts().iloc[[2,0,1,3]][::-1]
golf_cnts
Vintage 54 Old 634 Fairly used 1128 Brand new 411 Name: vehicle_age_label, dtype: int64
bmw_cnts = df[df.vehicle == 'bmw 3er'].vehicle_age_label.value_counts().iloc[[2,0,1,3]][::-1]
bmw_cnts
Vintage 37 Old 557 Fairly used 1093 Brand new 125 Name: vehicle_age_label, dtype: int64
data= [
go.Bar(
y= golf_cnts.index,
x= golf_cnts.values,
marker= dict(color='teal'),
orientation='h',
text= golf_cnts.values,
hovertemplate= '%{text}<extra></extra>',
name= 'Volkswagen Golf'
),
go.Bar(
y= bmw_cnts.index,
x= [-1 * i for i in bmw_cnts.values],
marker= dict(color='crimson'),
orientation='h',
text= bmw_cnts.values,
hovertemplate= '%{text}<extra></extra>',
name= 'BMW 3 Series'
)
]
layout= go.Layout(
title= {
'text':'<b>Number of vehicle models listed per age category</b><br>'+
'Number of Volkswagen Golf and BMW 3 series listed per age category',
'x':0.5
},
xaxis= go.layout.XAxis(
title= 'number of listings',
range= [-1150,1150],
showticklabels= False
),
barmode= 'overlay'
)
fig= go.Figure(data=data, layout= layout)
fig.show()
The Fairly used and Old aged vehicles i.e. 5 to 25 years are the most listed for the Volkswagen Golf and BMW 3 series. This indicates that the sellers are looking for an upgrade rather than being dissatisfied with the vehicle as they used it for a good couple of years. A few Brand new vehicles are also up for sale, out of the two Volkswagen Golf is the most listed under that, this could raise questions regarding the popularity of the vehicle, but due to lack of data it cannot be determined.
The repaired_damage specifies whether a vehicle listed has been repaired of all damages or the vehicle listed has damages and the buyer is expected to take care of them. The values of the column are nein (No) and Ja (Yes). The two bar plots below show the number of vehicles listed for both and the average prices.
grouped = df[['repaired_damage','price']].groupby('repaired_damage')
damage_repair = grouped.mean().reset_index().sort_values('price',ascending=False)
damage_repair
repaired_damage | price | |
---|---|---|
1 | nein | 8359.08142 |
0 | ja | 4180.70068 |
fig = make_subplots(
rows=1,
cols=2,
subplot_titles=[
'<b>Number of vehicles with repaired and not repaired damages</b>',
'<b>Average price of repaired vs not repaired vehicles</b>'
],
horizontal_spacing= 0.2
)
fig.add_trace(
go.Bar(
x= df.repaired_damage.value_counts(normalize=True).index,
y= df.repaired_damage.value_counts(normalize=True).values * 100,
hovertemplate= '%{y:.2f}%<extra></extra>'
),
row=1,
col=1
)
fig.add_trace(
go.Bar(
x= damage_repair.repaired_damage,
y= damage_repair.price,
hovertemplate= '%{y:.2f}<extra></extra>'
),
row=1,
col=2
)
fig.update_yaxes(showticklabels=False)
fig.update_xaxes(title= 'Are damages repaired?')
fig.update_layout(showlegend=False)
fig.show()
The plots clearly show that most vehicles listed on the website have been repaired and are on sale in a good condition from a repair point of view. The vehicles that haven't been repaired are listed at lower cost thus expecting the buyer to eventually repair it, whereas vehicles listed after repairs boast a higher price point comparatively.
The column power_PS specifies the brake horse power of the car in the German metric - PS (same as BHP). As a step in the analysis, this column is analyzed to discover the general trend of the power in the vehicles listed.
fig = px.scatter(
df,
x= 'power_PS',
y= 'price',
trendline='lowess',
trendline_color_override='crimson'
)
fig.update_layout(
title= {
'text':'<b>Price of a vehicle vs the Power of vehicle</b><br>'+
'Price of the vehicle compared to the power the vehicle offers (in PS)',
'x':0.5
},
xaxis= go.layout.XAxis(
title= 'power of vehicle (in PS)',
),
yaxis= go.layout.YAxis(
title= 'price of vehicle (in dollars)'
)
)
fig.show()
The LOWESS trend line indicates a slight linear relationship between the price of a vehicle and the power of the vehicle (in PS). As the power of the vehicle increases, so do the prices. Most of the vehicles listed have a power within the 55 to 350 range and a price upto 50k dollars, a very few vehicles as compared to the population have higher than 350 power (in PS) and higher price than 50k dollars. This hints that most of the vehicles for sale online are of lower category in terms of power of the vehicle as well as price. The market of buying and selling of used cars is of the economically middle class population.
The postal_code column identifies the general location from where this vehicle was listed on the website. Locations are a good way for people to filter vehicles and find suitable ones in their area.
For the analysis, the postal codes can be converted to Latitude and Longidtude points. This can help in visualizing the locations of the listings and even form cluster of locations where most cars have been listed from. There may be false entries in the postal_code column. These will become NaN values when converted and will be ignored.
def get_loc(row):
info = geo_coder.query_postal_code(str(row))
return [info['latitude'], info['longitude']]
geo_coder = Nominatim('DE')
locations = df.postal_code.apply(get_loc)
loc = pd.DataFrame(columns=['latitude','longitude'],index=locations.index)
loc.latitude = [i[0] for i in locations]
loc.longitude = [i[1] for i in locations]
df = pd.concat([df,loc],axis=1)
px.set_mapbox_access_token('pk.eyJ1IjoicmFqdHVsbHVyaSIsImEiOiJja2EzeDRrdXMwb3JrM2RwMDA0MzI1Yjc1In0._Ki7JTLX3lBh6gSjftf7EQ')
fig = px.scatter_mapbox(
df,
lat= 'latitude',
lon= 'longitude',
color_discrete_sequence= ['#009999'],
zoom= 4
)
fig.update_layout(
width=1000,
height=550,
title= {
'text':'<b>Locations of vehicles listed on the website</b>',
'x':0.5
}
)
fig.show()
The map shows all the locations are situated within Germany and has a good spread throughout the country. There can be many urban areas and rural areas, urban areas having more access to technology and a wide spread community will have more listings as compared to the rural areas. From the map directly it is not visible which are the hotspots.
Thus a clustering algorithm has been applied to form cluster of areas which will divide the locations into zones and can show which zones are hotspots for listings. Kmeans is applied on the latitude and longitude data derived from the postal codes. To select the optimal value for k clusters, the calinski harabasz
score is used. The score is defined as ratio between the within-cluster dispersion and the between-cluster dispersion.
The price is also taken into consideration when forming clusters. This way the clusters are dependent on the location and price of the listed vehicles. The price column had a huge range, which had to be scaled using a MinMaxScaler
. The location coordinates were not scaled, this way the price is not a dominating feature for clustering.
scaler = MinMaxScaler()
tmp = pd.DataFrame(scaler.fit_transform(pd.DataFrame(df['price'])),index=df.index)
tmp.columns= ['price_scaled']
df = pd.concat([df,tmp],axis=1)
mod = KMeans(n_init=10)
visualizer = KElbowVisualizer(mod, k=(5,15), metric='calinski_harabasz')
visualizer.fit(df[['latitude','longitude','price_scaled']].dropna())
visualizer.show()
<matplotlib.axes._subplots.AxesSubplot at 0x7f98da6b7350>
The plot below shows that an optimal value for number of clusters would be 10. The model below is trained used 10 clusters (k=10). The resulting labels will be used to assign each point in the dataset a class.
mod = KMeans(n_init=20, n_clusters= 10)
mod.fit(df[['latitude','longitude', 'price_scaled']].dropna())
labels = mod.labels_
ind = df[~df.latitude.isna()].index
label_sr = pd.DataFrame(labels, index=ind, columns=['label'])
df = pd.concat([df,label_sr],axis=1,ignore_index=False)
The Kmeans algorithm has labeled each point according to the clusters. A similar map representation is plotted again to visualize the clusters. This can show how the country is divided into zones which fall closest.
px.set_mapbox_access_token('pk.eyJ1IjoicmFqdHVsbHVyaSIsImEiOiJja2EzeDRrdXMwb3JrM2RwMDA0MzI1Yjc1In0._Ki7JTLX3lBh6gSjftf7EQ')
fig = px.scatter_mapbox(
df,
lat= 'latitude',
lon= 'longitude',
color= 'label',
color_continuous_scale= px.colors.colorbrewer.Accent,
zoom=4
)
fig.update_layout(
width=1000,
height=550,
title= {
'text':'<b>Cluster region of vehicles listed on the website</b>',
'x':0.5
},
showlegend= False
)
fig.show()
The map shows the well defined 10 clusters that have been created. This can show the hotsopts better, for example the label 5 cluster covers the busy spots of Frankfurt and Mannheim, similarly label 2 cluster surrounds Berlin. This way the major hotspots have been highlighted and grouped into zones. Below shows the mapping of labels to the hotspots
Label 0 -> Hambury (Region 0)
Label 1 -> Cologne, Leverkusen, Dortmund (Region 1)
Label 2 -> Nurenberg (Region 2)
Label 3 -> Frankfurt, Mannheim (Region 3)
Label 4 -> Berlin (Region 4)
Label 5 -> Stuttgart (Region 5)
Label 6 -> Saarburken, Strasboury (Region 6)
Label 7 -> Hanover (Region 7)
Label 8 -> Munich (Region 8)
Label 9 -> Bremen, Oldenburg (Region 9)
The analysis further done now deals with the zones.
hotspots = {0:'Region 0',1:'Region 1',2:'Region 2',3:'Region 3'
,4:'Region 4',5:'Region 5',6:'Region 6',7:'Region 7'
,8:'Region 8',9:'Region 9'}
df.label = df.label.map(hotspots)
grouped = df[['label','price']].groupby('label')
region_prices = grouped.mean().reset_index().sort_values('price',ascending=False)
region_prices
label | price | |
---|---|---|
8 | Region 8 | 10420.56463 |
5 | Region 5 | 8644.22536 |
7 | Region 7 | 8166.57445 |
3 | Region 3 | 8156.65128 |
6 | Region 6 | 8115.47087 |
2 | Region 2 | 7960.27984 |
4 | Region 4 | 7700.56178 |
1 | Region 1 | 7500.54931 |
0 | Region 0 | 7399.90443 |
9 | Region 9 | 7349.25647 |
region_cnts = df.label.value_counts()
region_cnts = region_cnts.sort_values(ascending=False)
fig = make_subplots(
rows=2,
cols=1,
subplot_titles=[
'<b>Number of vehicles listed per Cluster region</b><br>'+
'Number of vehicles listed on the website per zone (cluster region)',
'<b>Average price of vehicles listed per Cluster region</b><br>'+
'Average price of vehicles listed on the website per zone (cluster region)'
],
vertical_spacing=0.15
)
fig.add_trace(
go.Bar(
x= region_cnts.index,
y= region_cnts.values,
hovertemplate= '%{x}: %{y}<extra></extra>'
),
row=1,
col=1
)
fig.add_trace(
go.Bar(
x= region_prices.label,
y= region_prices.price,
hovertemplate= '%{x}: %{y}<extra></extra>'
),
row=2,
col=1
)
fig.update_yaxes(showticklabels=False,title='number of vehicles',row=1,col=1)
fig.update_yaxes(showticklabels=False,title='average price of vehicles',row=2,col=1)
fig.update_xaxes(title= 'cluster region')
fig.update_layout(height=1000, showlegend=False)
fig.show()
data= [
go.Pie(
labels= region_cnts.index,
values= region_cnts.values,
marker= dict(
line= dict(width=1),
),
marker_colors= px.colors.colorbrewer.Set2,
hole= 0.3,
hovertemplate= '%{label}: %{percent}<extra></extra>'
)
]
layout= go.Layout(
title= {
'text':'<b>Distribution of number of vehicle listed on website zone wise</b>',
'x':0.5
}
)
fig = go.Figure(data= data, layout= layout)
fig.show()
The three plots conclude the following :-
The line plot below visualizes the trend of listings observed int the years 2015 and 2016 zone wise i.e. for every region the number of vehicles listed based on the date is visualized below. This shows the regions where there are more listings and regions where there are less listings per time.
grouped = df[['date_created','label','price']].groupby(['date_created','label'])
region_trend = grouped.count().reset_index()
region_trend.rename(columns={'price':'counts'},inplace=True)
region_trend
date_created | label | counts | |
---|---|---|---|
0 | 2015-08-10 | Region 9 | 1 |
1 | 2015-09-09 | Region 2 | 1 |
2 | 2015-11-10 | Region 0 | 1 |
3 | 2015-12-30 | Region 4 | 1 |
4 | 2016-01-07 | Region 1 | 1 |
... | ... | ... | ... |
386 | 2016-04-07 | Region 5 | 2 |
387 | 2016-04-07 | Region 6 | 2 |
388 | 2016-04-07 | Region 7 | 3 |
389 | 2016-04-07 | Region 8 | 3 |
390 | 2016-04-07 | Region 9 | 2 |
391 rows × 3 columns
fig = px.line(
region_trend,
x= 'date_created',
y= 'counts',
color= 'label',
color_discrete_sequence= px.colors.colorbrewer.Paired_r
)
fig.update_layout(
title= {
'text':'<b>Trend in number of listings Zone wise</b><br>'+
'Number of vehicles listed zone wise (cluster region) between 2015 and 2016<br>',
'x':0.5
},
xaxis= go.layout.XAxis(
title= 'date',
rangeslider_visible=True,
rangeselector= dict(
buttons=list([
dict(count=1, label= 'month', step= 'month', stepmode= 'backward'),
dict(count=1, label= 'year', step= 'year', stepmode= 'backward')
])
)
),
yaxis= go.layout.YAxis(
title= 'number of listings',
showticklabels= False
),
height= 550
)
fig.show()
NOTE: The plot above is interactive, one can use the slider on x-axis to set the time frame or the buttons on the top left corner to set the step in the date. You can use the labels on the right, select/deselect the regions for which you want to observe the trend.
The following points are concluded from the trend lines:-
The trend lines can be helpful in determining how the regions, over time have contributed to the listings, and which regions can be said to have high vehicles listed over time. Based on the analysis, the Cologne, Leverkusen and Dortmund region has contributed the most to the total listings on the website.
region_prices.sort_values(by='label',inplace=True)
centers = pd.DataFrame(columns=['latitude','longitude'], index= region_prices.index)
region_loc_prices = pd.concat([region_prices,centers],axis=1)
region_loc_prices.latitude = [i[0] for i in mod.cluster_centers_]
region_loc_prices.longitude = [i[1] for i in mod.cluster_centers_]
region_loc_prices.price = ((region_loc_prices.price - min(region_loc_prices.price)) + 100)
region_loc_prices
label | price | latitude | longitude | |
---|---|---|---|---|
0 | Region 0 | 150.64796 | 53.76043 | 10.03050 |
1 | Region 1 | 251.29283 | 51.22468 | 7.02510 |
2 | Region 2 | 711.02337 | 49.84103 | 10.97672 |
3 | Region 3 | 907.39481 | 50.13431 | 8.51730 |
4 | Region 4 | 451.30531 | 52.66286 | 13.28823 |
5 | Region 5 | 1394.96888 | 48.53677 | 9.39189 |
6 | Region 6 | 866.21439 | 49.02196 | 7.51726 |
7 | Region 7 | 917.31797 | 51.98150 | 10.04863 |
8 | Region 8 | 3171.30816 | 48.39185 | 11.87287 |
9 | Region 9 | 100.00000 | 52.53459 | 8.28820 |
px.set_mapbox_access_token('pk.eyJ1IjoicmFqdHVsbHVyaSIsImEiOiJja2EzeDRrdXMwb3JrM2RwMDA0MzI1Yjc1In0._Ki7JTLX3lBh6gSjftf7EQ')
fig = px.scatter_mapbox(
region_loc_prices,
lat= 'latitude',
lon= 'longitude',
size= 'price',
color= 'label',
size_max= 55,
hover_name= region_prices.price,
zoom= 4
)
fig.update_layout(
title= {
'text':'<b>Average price of vehicles listed per Cluster region</b><br>'+
'Average price of vehicles listed on the website per zone (cluster region)',
'x':0.5
},
width=950,
height=550
)
fig.show()
The map above shows the region wise average prices listed on the website. The size of the bubble represents the average price of the region scaled for visualization purposes. It shows the regions of Munich, Stuttgart, Frankfurt and Mannheim and Hanover have the highest average prices of the vehicles listed.
The final conclusions drawn from the project :-
The prices of vehicles listed are mostly below 20k dollars with the exception of some luxury/sports cars or vintage cars priced upto 200k dollars.
Majority of vehicles listed are about 5 to 15 years old (Fairly used), these vehicles are sold at a comparatively lower price than the Vintage vehicles (> 25 years) or Brand new vehicles (0 to 5 years).
The most listed brand of vehicles are :-
* Volkswagen * BMW * Mercedes Benz
* Audi * Opel * Ford
* Renault * Peugeot * Skoda
* Fiat
The most priced brand of vehicles are :-
* Porsche * Land Rover * Jaguar
* Jeep * Audi * Mini cooper
* Mercedes Benz * BMW * Chevrolet
* Volkswagen
Most of the manufacturers from the above list are German manufacturers, showing that the people prefer country-made vehicles even though they are a little pricey.
A Limousine, Kombi and a Klienwagen are most listed type of vehicles and are on the cheaper end, whereas an SUV, Coupe and Cabrio are expensive vehicle types.
From the vehicles listed,
* Volkswagen Golf
* BMW 3 series
are the most listed model of vehicle. Upon analysis it shows that most of them listed are Fairly used indicating the seller is looking for an upgrade.
Repaired vehicles are sold for higher prices than damaged vehicles.
The price of the vehicle increases as the power (in PS) of the vehicle increases.
The locations when plotted on the map show a clutter of points, thus a clustering algorithm is applied to form zones (regions) within the country.
The following are the clusters identified by the KMeans Algorithm with 10 clusters found by calinski harabasz score,
Label 0 -> Hambury (Region 0)
Label 1 -> Cologne, Leverkusen, Dortmund (Region 1)
Label 2 -> Nurenberg (Region 2)
Label 3 -> Frankfurt, Mannheim (Region 3)
Label 4 -> Berlin (Region 4)
Label 5 -> Stuttgart (Region 5)
Label 6 -> Saarburken, Strasboury (Region 6)
Label 7 -> Hanover (Region 7)
Label 8 -> Munich (Region 8)
Label 9 -> Bremen, Oldenburg (Region 9)
The labels divide the locations into regions which are easier to analyze.
The Cologne, Leverkusen and Dortmund region has the highest number of listings on the website, where as Munich and Stuttgart region have the highest average price of the listings.
The conclusions answer the questions asked and give a detailed view of the scrapped listings of eBay-Kleinanzeigen.