In this guided project, we'll work with a dataset of used cars from eBay Kleinanzeigen, a classifieds section of the German eBay website. The aim of this project is to clean the data and analyze the included used car listings.
import pandas as pd
import numpy as np
autos = pd.read_csv("autos.csv", encoding= 'Latin-1')
autos.columns
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model', 'odometer', 'monthOfRegistration', 'fuelType', 'brand', 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object')
autos.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 dateCrawled 50000 non-null object 1 name 50000 non-null object 2 seller 50000 non-null object 3 offerType 50000 non-null object 4 price 50000 non-null object 5 abtest 50000 non-null object 6 vehicleType 44905 non-null object 7 yearOfRegistration 50000 non-null int64 8 gearbox 47320 non-null object 9 powerPS 50000 non-null int64 10 model 47242 non-null object 11 odometer 50000 non-null object 12 monthOfRegistration 50000 non-null int64 13 fuelType 45518 non-null object 14 brand 50000 non-null object 15 notRepairedDamage 40171 non-null object 16 dateCreated 50000 non-null object 17 nrOfPictures 50000 non-null int64 18 postalCode 50000 non-null int64 19 lastSeen 50000 non-null object dtypes: int64(5), object(15) memory usage: 7.6+ MB
autos.head()
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
Observations: from the column names we can see that they follow the camelCase naming convention; I'll convert them to snake_case. Some columns that should be numeric, such as price and odometer, are stored as text.
Cleaning Column Names
autos.columns = autos.columns.map({'dateCrawled':'date_crawled', 'name':'name', 'seller':'seller', 'offerType':'offer_type', 'price':'price', 'abtest':'abtest',
                                   'vehicleType':'vehicle_type', 'yearOfRegistration':'year_of_registration', 'gearbox':'gearbox', 'powerPS':'power_ps', 'model':'model',
                                   'odometer':'odometer', 'monthOfRegistration':'month_of_registration', 'fuelType':'fuel_type', 'brand':'brand',
                                   'notRepairedDamage':'not_repaired_damage', 'dateCreated':'date_created', 'nrOfPictures':'nr_of_pictures', 'postalCode':'postal_code',
                                   'lastSeen':'last_seen'})
autos.columns
Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest', 'vehicle_type', 'year_of_registration', 'gearbox', 'power_ps', 'model', 'odometer', 'month_of_registration', 'fuel_type', 'brand', 'not_repaired_damage', 'date_created', 'nr_of_pictures', 'postal_code', 'last_seen'], dtype='object')
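Spelling out all twenty mappings by hand is error-prone. As a sketch (assuming only simple camelCase names like the ones in this dataset), a small regex performs the same conversion:

```python
import re

def to_snake(name):
    # Put an underscore between a lowercase letter/digit and the capital
    # letter that follows it, then lowercase everything.
    return re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", name).lower()

# Usage: autos.columns = [to_snake(c) for c in autos.columns]
```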
autos.head()
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | year_of_registration | gearbox | power_ps | model | odometer | month_of_registration | fuel_type | brand | not_repaired_damage | date_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
I changed the column names to the snake_case convention; one study has found that readers can recognize snake case identifiers more quickly than camel case ones.
Initial Exploration and Cleaning
autos.describe(include ='all')
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | year_of_registration | gearbox | power_ps | model | odometer | month_of_registration | fuel_type | brand | not_repaired_damage | date_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-04-02 15:49:30 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
autos["price"] = autos["price"].str.replace("$", "", regex=False).str.replace(",", "", regex=False)
autos["odometer"] = autos["odometer"].str.replace("km", "", regex=False).str.replace(",", "", regex=False)
autos.rename({"date_created":"ad_created","month_of_registration":"registration_month","year_of_registration":"registration_year","not_repaired_damage":"unrepaired_damage",},axis = 1, inplace = True)
autos["price"].head()
0 5000 1 8500 2 8990 3 4350 4 1350 Name: price, dtype: object
autos["odometer"].head()
0 150000 1 150000 2 70000 3 70000 4 150000 Name: odometer, dtype: object
autos.rename({"odometer":"odometer_km"}, axis = 1, inplace = True)
#To rename the column to odometer_km
autos["price"] = autos["price"].astype(int)
autos["odometer_km"] = autos["odometer_km"].astype(int)
# Convert both columns from strings to integer values
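The two replace chains above can also be collapsed into a single regex that strips every non-digit character; a minimal sketch on toy data shaped like the raw columns:

```python
import pandas as pd

# Toy strings shaped like the raw price/odometer columns.
raw = pd.DataFrame({
    "price": ["$5,000", "$8,500", "$1,350"],
    "odometer": ["150,000km", "70,000km", "150,000km"],
})

# Strip everything that isn't a digit, then convert to integers.
for col in ["price", "odometer"]:
    raw[col] = raw[col].str.replace(r"\D", "", regex=True).astype(int)
```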
Exploring the Odometer and Price Columns
autos["odometer_km"].unique().shape
(13,)
autos["odometer_km"].describe()
count 50000.000000 mean 125732.700000 std 40042.211706 min 5000.000000 25% 125000.000000 50% 150000.000000 75% 150000.000000 max 150000.000000 Name: odometer_km, dtype: float64
autos["odometer_km"].value_counts().sort_index(ascending = False)
150000 32424 125000 5170 100000 2169 90000 1757 80000 1436 70000 1230 60000 1164 50000 1027 40000 819 30000 789 20000 784 10000 264 5000 967 Name: odometer_km, dtype: int64
From this column, we see that all the mileage values are rounded and that the majority of the cars sold on the eBay site have high mileage. There's also no reason to look for outliers here, since sellers pick the mileage from a fixed set of preset values.
autos["price"].unique().shape
(2357,)
autos["price"].describe()
count 5.000000e+04 mean 9.840044e+03 std 4.811044e+05 min 0.000000e+00 25% 1.100000e+03 50% 2.950000e+03 75% 7.200000e+03 max 1.000000e+08 Name: price, dtype: float64
autos["price"].value_counts().sort_index(ascending = True).head(10)
0 1421 1 156 2 3 3 1 5 2 8 1 9 1 10 7 11 2 12 3 Name: price, dtype: int64
autos["price"].value_counts().sort_index(ascending = True).tail(10)
999990 1 999999 2 1234566 1 1300000 1 3890000 1 10000000 1 11111111 2 12345678 3 27322222 1 99999999 1 Name: price, dtype: int64
autos = autos[autos["price"].between(0,350000)]
From the price column, we discovered listings priced as high as 99,999,999, which doesn't sound realistic for a bidding site. We kept prices between $0 and $350,000, treating everything above $350,000 as an outlier.
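Note that `Series.between` is inclusive at both endpoints by default, which is why the $0 listings survive the filter above; a toy check (the `inclusive` keyword that controls this is available in pandas 1.3+):

```python
import pandas as pd

prices = pd.Series([0, 1, 350000, 350001, 99999999])

# Both endpoints are kept by default.
kept = prices[prices.between(0, 350000)]

# Dropping the $0 listings as well would need between(1, 350000),
# or inclusive="right" on newer pandas versions.
```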
Exploring the Date Columns
autos["date_crawled"].str[:10].value_counts(normalize = True, dropna = False).sort_index()
2016-03-05 0.025387 2016-03-06 0.013944 2016-03-07 0.035970 2016-03-08 0.033269 2016-03-09 0.033209 2016-03-10 0.032129 2016-03-11 0.032489 2016-03-12 0.036770 2016-03-13 0.015564 2016-03-14 0.036630 2016-03-15 0.033990 2016-03-16 0.029508 2016-03-17 0.031509 2016-03-18 0.013064 2016-03-19 0.034910 2016-03-20 0.037831 2016-03-21 0.037490 2016-03-22 0.032909 2016-03-23 0.032389 2016-03-24 0.029108 2016-03-25 0.031749 2016-03-26 0.032489 2016-03-27 0.031049 2016-03-28 0.034850 2016-03-29 0.034150 2016-03-30 0.033629 2016-03-31 0.031909 2016-04-01 0.033809 2016-04-02 0.035410 2016-04-03 0.038691 2016-04-04 0.036490 2016-04-05 0.013104 2016-04-06 0.003181 2016-04-07 0.001420 Name: date_crawled, dtype: float64
autos["ad_created"].str[:10].value_counts(normalize = True, dropna = False).sort_index()
2015-06-11 0.000020 2015-08-10 0.000020 2015-09-09 0.000020 2015-11-10 0.000020 2015-12-05 0.000020 ... 2016-04-03 0.038931 2016-04-04 0.036850 2016-04-05 0.011843 2016-04-06 0.003261 2016-04-07 0.001280 Name: ad_created, Length: 76, dtype: float64
autos["last_seen"].str[:10].value_counts(normalize = True, dropna = False).sort_index()
2016-03-05 0.001080 2016-03-06 0.004421 2016-03-07 0.005362 2016-03-08 0.007582 2016-03-09 0.009843 2016-03-10 0.010763 2016-03-11 0.012524 2016-03-12 0.023807 2016-03-13 0.008983 2016-03-14 0.012804 2016-03-15 0.015884 2016-03-16 0.016445 2016-03-17 0.027928 2016-03-18 0.007422 2016-03-19 0.015744 2016-03-20 0.020706 2016-03-21 0.020726 2016-03-22 0.021586 2016-03-23 0.018585 2016-03-24 0.019565 2016-03-25 0.019205 2016-03-26 0.016965 2016-03-27 0.016024 2016-03-28 0.020846 2016-03-29 0.022326 2016-03-30 0.024847 2016-03-31 0.023827 2016-04-01 0.023106 2016-04-02 0.024887 2016-04-03 0.025367 2016-04-04 0.024627 2016-04-05 0.124275 2016-04-06 0.220982 2016-04-07 0.130957 Name: last_seen, dtype: float64
I observed that the share of last_seen dates spikes over the final three days of the crawl (2016-04-05 to 2016-04-07, peaking on 2016-04-06); this more likely reflects the crawl ending than a burst of sales.
autos["registration_year"].describe()
count 49986.000000 mean 2005.075721 std 105.727161 min 1000.000000 25% 1999.000000 50% 2003.000000 75% 2008.000000 max 9999.000000 Name: registration_year, dtype: float64
It's observed that the minimum and maximum registration years are 1000 and 9999, which is clearly impossible.
autos["registration_year"].value_counts().sort_index().head(15)
1000 1 1001 1 1111 1 1500 1 1800 2 1910 9 1927 1 1929 1 1931 1 1934 2 1937 4 1938 1 1939 1 1941 2 1943 1 Name: registration_year, dtype: int64
I explored the frequencies of the registration_year column in ascending order and decided to keep only entries between 1934 and 2016: years before that are implausibly early, and a car can't be registered after the 2016 crawl date.
autos = autos[autos["registration_year"].between(1934, 2016)]
autos["registration_year"].value_counts(normalize = True).head(10)
2000 0.069869 2005 0.062807 1999 0.062453 2004 0.057016 2003 0.056808 2006 0.056391 2001 0.056287 2002 0.052766 1998 0.051100 2007 0.047996 Name: registration_year, dtype: float64
autos["registration_year"].value_counts(normalize = True).head(10).sum()
0.5734938755103741
I observed that the ten most frequent registration years all fall between 1998 and 2007, and together they account for about 57.35% of the vehicles.
Exploring Cars by Brand
autos["brand"].unique()
array(['peugeot', 'bmw', 'volkswagen', 'smart', 'ford', 'chrysler', 'seat', 'renault', 'mercedes_benz', 'audi', 'sonstige_autos', 'opel', 'mazda', 'porsche', 'mini', 'toyota', 'dacia', 'nissan', 'jeep', 'saab', 'volvo', 'mitsubishi', 'jaguar', 'fiat', 'skoda', 'subaru', 'kia', 'citroen', 'chevrolet', 'hyundai', 'honda', 'daewoo', 'suzuki', 'trabant', 'land_rover', 'alfa_romeo', 'lada', 'rover', 'daihatsu', 'lancia'], dtype=object)
Percentage of each brand
autos["brand"].value_counts(normalize = True)
volkswagen 0.212170 bmw 0.110032 opel 0.108158 mercedes_benz 0.095388 audi 0.086430 ford 0.069744 renault 0.047350 peugeot 0.029539 fiat 0.025873 seat 0.018186 skoda 0.016040 mazda 0.015145 nissan 0.015103 citroen 0.013916 smart 0.013916 toyota 0.012478 sonstige_autos 0.010791 hyundai 0.009853 volvo 0.009249 mini 0.008645 mitsubishi 0.008145 honda 0.007854 kia 0.007104 alfa_romeo 0.006624 porsche 0.006104 suzuki 0.005916 chevrolet 0.005708 chrysler 0.003666 daihatsu 0.002562 dacia 0.002562 jeep 0.002250 subaru 0.002187 land_rover 0.002041 saab 0.001604 jaguar 0.001583 trabant 0.001542 daewoo 0.001500 rover 0.001354 lancia 0.001083 lada 0.000604 Name: brand, dtype: float64
Most common brands
top_brand = autos["brand"].value_counts().iloc[0:19].index  # note: 0:19 selects the top 19 brands
top_brand
Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford', 'renault', 'peugeot', 'fiat', 'seat', 'skoda', 'mazda', 'nissan', 'citroen', 'smart', 'toyota', 'sonstige_autos', 'hyundai', 'volvo'], dtype='object')
Average price for the most common brands
top_brand = autos["brand"].value_counts().iloc[0:19].index
top_brand_by_mean_price = {}
for each_brand in top_brand:
    top_brand_row = autos[autos["brand"] == each_brand]["price"].mean()
    top_brand_by_mean_price[each_brand] = top_brand_row
top_brand_by_mean_price
{'volkswagen': 5231.081983308787, 'bmw': 8101.893032942067, 'opel': 2877.7224576271187, 'mercedes_benz': 8485.239571958942, 'audi': 9093.65003615329, 'ford': 3641.134707287933, 'renault': 2396.2067751869777, 'peugeot': 3039.4682651622, 'fiat': 2711.8011272141707, 'seat': 4296.492554410081, 'skoda': 6334.91948051948, 'mazda': 4010.7716643741405, 'nissan': 4664.891034482758, 'citroen': 3699.935628742515, 'smart': 3542.706586826347, 'toyota': 5115.33388981636, 'sonstige_autos': 10909.372586872587, 'hyundai': 5308.53911205074, 'volvo': 4757.108108108108}
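The same per-brand average can be computed without an explicit loop using `groupby`; a minimal sketch on a toy frame (column names match the project's):

```python
import pandas as pd

toy = pd.DataFrame({
    "brand": ["bmw", "bmw", "ford", "ford", "audi"],
    "price": [8000, 9000, 3000, 4000, 9500],
})

# One aggregate per brand, with no dictionary bookkeeping.
mean_price_per_brand = toy.groupby("brand")["price"].mean()
```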
Brands that individually make up more than 5% of all listings
brands_above_5 = autos["brand"].value_counts(normalize = True)
brands_above_5 = brands_above_5[brands_above_5 > 0.05].index
brands_above_5_mean = {}
for each_brand in brands_above_5:
    brands_above_5_row = autos[autos["brand"] == each_brand]["price"].mean().round()
    brands_above_5_mean[each_brand] = brands_above_5_row
brands_above_5_mean
{'volkswagen': 5231.0, 'bmw': 8102.0, 'opel': 2878.0, 'mercedes_benz': 8485.0, 'audi': 9094.0, 'ford': 3641.0}
The maximum price for each brand
max_price_per_brand = {}
for each_brand in autos["brand"].unique():
    # .max() replaces sorting the whole slice just to read its top row
    max_price_per_brand[each_brand] = autos[autos["brand"] == each_brand]["price"].max()
max_price_per_brand
{'peugeot': 25500, 'bmw': 259000, 'volkswagen': 64500, 'smart': 16850, 'ford': 130000, 'chrysler': 20000, 'seat': 33500, 'renault': 93000, 'mercedes_benz': 180000, 'audi': 175000, 'sonstige_autos': 345000, 'opel': 38990, 'mazda': 38900, 'porsche': 350000, 'mini': 34500, 'toyota': 49500, 'dacia': 16350, 'nissan': 34999, 'jeep': 55800, 'saab': 18900, 'volvo': 29500, 'mitsubishi': 37800, 'jaguar': 58900, 'fiat': 25000, 'skoda': 32700, 'subaru': 31800, 'kia': 28450, 'citroen': 33000, 'chevrolet': 56500, 'hyundai': 29999, 'honda': 41900, 'daewoo': 2900, 'suzuki': 23200, 'trabant': 6480, 'land_rover': 99000, 'alfa_romeo': 55000, 'lada': 6500, 'rover': 23500, 'daihatsu': 9900, 'lancia': 29500}
Below we convert the brands_above_5_mean dictionary to a series, then create a new dataframe named df and assign the series as the "mean_price" column.
bmp_series = pd.Series(brands_above_5_mean)
print(bmp_series)
volkswagen 5231.0 bmw 8102.0 opel 2878.0 mercedes_benz 8485.0 audi 9094.0 ford 3641.0 dtype: float64
df= pd.DataFrame(bmp_series, columns = ["mean_price"])
df
mean_price | |
---|---|
volkswagen | 5231.0 |
bmw | 8102.0 |
opel | 2878.0 |
mercedes_benz | 8485.0 |
audi | 9094.0 |
ford | 3641.0 |
Let's find the average mileage for the brands above that individually account for over 5% of the cars in the dataset.
top_brand_mileage = {}
unique_brand = autos["brand"].value_counts(normalize = True)
unique_brand = unique_brand[unique_brand > 0.05].index
for each_brand in unique_brand:
    unique_brand_row = autos[autos["brand"] == each_brand]["odometer_km"].mean()
    top_brand_mileage[each_brand] = unique_brand_row
top_brand_mileage
{'volkswagen': 128724.10407461954, 'bmw': 132455.5092767891, 'opel': 129224.7688751926, 'mercedes_benz': 130856.0821139987, 'audi': 129287.78018799711, 'ford': 124138.29151732377}
top_brand_mileage_series = pd.Series(top_brand_mileage).round()
top_brand_mileage_series
volkswagen 128724.0 bmw 132456.0 opel 129225.0 mercedes_benz 130856.0 audi 129288.0 ford 124138.0 dtype: float64
Then we add it to the new dataframe df
df["mileage"] = top_brand_mileage_series
df
mean_price | mileage | |
---|---|---|
volkswagen | 5231.0 | 128724.0 |
bmw | 8102.0 | 132456.0 |
opel | 2878.0 | 129225.0 |
mercedes_benz | 8485.0 | 130856.0 |
audi | 9094.0 | 129288.0 |
ford | 3641.0 | 124138.0 |
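Instead of assembling `df` column by column from two dictionaries, a single `groupby(...).agg(...)` call can produce both aggregates at once; a sketch on toy data (named aggregation requires pandas 0.25+):

```python
import pandas as pd

toy = pd.DataFrame({
    "brand": ["bmw", "bmw", "ford"],
    "price": [8000, 9000, 3000],
    "odometer_km": [150000, 125000, 100000],
})

# mean_price and mileage in one DataFrame, indexed by brand.
summary = toy.groupby("brand").agg(
    mean_price=("price", "mean"),
    mileage=("odometer_km", "mean"),
)
```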
Identify categorical columns that use German words, translate them, and map the values to their English counterparts
autos[["vehicle_type","gearbox","fuel_type","unrepaired_damage"]]
vehicle_type | gearbox | fuel_type | unrepaired_damage | |
---|---|---|---|---|
0 | bus | manuell | lpg | nein |
1 | limousine | automatik | benzin | nein |
2 | limousine | manuell | benzin | nein |
3 | kleinwagen | automatik | benzin | nein |
4 | kombi | manuell | benzin | nein |
... | ... | ... | ... | ... |
49995 | limousine | automatik | diesel | nein |
49996 | cabrio | manuell | benzin | nein |
49997 | cabrio | automatik | benzin | nein |
49998 | kombi | manuell | diesel | nein |
49999 | limousine | manuell | benzin | nein |
48004 rows × 4 columns
autos["vehicle_type"].unique()
array(['bus', 'limousine', 'kleinwagen', 'kombi', nan, 'coupe', 'suv', 'cabrio', 'andere'], dtype=object)
Convert German words in the vehicle_type column to English
# Values absent from the dict (including NaN) are mapped to NaN, so no "nan" entry is needed.
autos["vehicle_type"] = autos["vehicle_type"].map({"bus":"bus","limousine":"limousine","coupe":"coupe","suv":"suv","kleinwagen":"mini car","kombi":"estate car","cabrio":"convertible","andere":"other"})
autos["vehicle_type"].unique()
array(['bus', 'limousine', 'mini car', 'estate car', nan, 'coupe', 'suv', 'convertible', 'other'], dtype=object)
Convert German words in the gearbox column to English
autos["gearbox"].unique()
array(['manuell', 'automatik', nan], dtype=object)
autos["gearbox"] = autos["gearbox"].map({"manuell":"manual","automatik":"automatic"})
autos["gearbox"].unique()
array(['manual', 'automatic', nan], dtype=object)
Convert German words in the fuel_type column to English
autos["fuel_type"].unique()
array(['lpg', 'benzin', 'diesel', nan, 'cng', 'hybrid', 'elektro', 'andere'], dtype=object)
autos["fuel_type"] = autos["fuel_type"].map({"lpg":"lpg","benzin":"petrol","diesel":"diesel","cng":"cng","hybrid":"hybrid","elektro":"electric","andere":"other"})
autos["fuel_type"].unique()
array(['lpg', 'petrol', 'diesel', nan, 'cng', 'hybrid', 'electric', 'other'], dtype=object)
Convert German words in the unrepaired_damage column to English
autos["unrepaired_damage"].unique()
array(['nein', nan, 'ja'], dtype=object)
autos["unrepaired_damage"] = autos["unrepaired_damage"].map({"nein":"none","ja":"yes"})
autos["unrepaired_damage"].unique()
array(['none', nan, 'yes'], dtype=object)
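A note on the translation cells above: `Series.map` sends every value missing from the dictionary to NaN, so NaN values survive only by accident. `Series.replace` leaves unmapped values untouched, which is often safer for partial translations; a toy comparison:

```python
import pandas as pd

s = pd.Series(["nein", "ja", None])

mapped = s.map({"nein": "none"})        # "ja" is not in the dict, so it becomes NaN
replaced = s.replace({"nein": "none"})  # "ja" passes through unchanged
```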
autos["ad_created"] = autos["ad_created"].str.replace("-","").str.split().str[0]  # drop the dashes, keep only the date part
Convert the dates to be uniform numeric data, so "2016-03-21" becomes the integer 20160321.
autos["ad_created"] = autos["ad_created"].astype(int)
autos["ad_created"]
0 20160326 1 20160404 2 20160326 3 20160312 4 20160401 ... 49995 20160327 49996 20160328 49997 20160402 49998 20160308 49999 20160313 Name: ad_created, Length: 48004, dtype: int64
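An alternative sketch that validates the dates while converting: parse with `pd.to_datetime`, then format back to `YYYYMMDD` integers. Malformed dates would raise an error here instead of slipping through the string surgery.

```python
import pandas as pd

ads = pd.Series(["2016-03-21 00:00:00", "2016-04-07 00:00:00"])

as_int = pd.to_datetime(ads).dt.strftime("%Y%m%d").astype(int)
```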
1. Find the most common brand/model combinations
most_brand_model = {}
most_brand = autos["brand"].value_counts(ascending = False).index
for each_brand in most_brand:
    # Note: iloc[0] returns the model of the *first* listing for each brand,
    # not the most frequent model - hence the NaN entries in the result.
    most_brand_row = autos[autos["brand"] == each_brand]["model"].iloc[0]
    most_brand_model[each_brand] = most_brand_row
most_brand_model
{'volkswagen': 'golf', 'bmw': '7er', 'opel': 'vectra', 'mercedes_benz': nan, 'audi': 'a3', 'ford': 'focus', 'renault': 'megane', 'peugeot': 'andere', 'fiat': 'punto', 'seat': 'arosa', 'skoda': 'octavia', 'mazda': 'andere', 'nissan': 'primera', 'citroen': 'andere', 'smart': 'fortwo', 'toyota': 'andere', 'sonstige_autos': nan, 'hyundai': 'i_reihe', 'volvo': 'andere', 'mini': 'cooper', 'mitsubishi': 'colt', 'honda': 'civic', 'kia': 'andere', 'alfa_romeo': '156', 'porsche': '911', 'suzuki': 'grand', 'chevrolet': 'andere', 'chrysler': 'voyager', 'daihatsu': 'terios', 'dacia': 'sandero', 'jeep': 'wrangler', 'subaru': 'andere', 'land_rover': 'freelander', 'saab': 'andere', 'jaguar': 'andere', 'trabant': nan, 'daewoo': 'kalos', 'rover': 'andere', 'lancia': 'lybra', 'lada': nan}
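A mode-based variant returns the genuinely most frequent model per brand rather than the first listed one; a sketch on toy data:

```python
import pandas as pd

toy = pd.DataFrame({
    "brand": ["vw", "vw", "vw", "bmw", "bmw"],
    "model": ["golf", "golf", "passat", "3er", "3er"],
})

# value_counts().idxmax() picks each brand's most frequent model.
most_common_model = toy.groupby("brand")["model"].agg(
    lambda models: models.value_counts().idxmax()
)
```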
From the analysis, we can see that volkswagen is the most popular brand and golf is the most popular model.
2. Split odometer_km into groups, and use aggregation to see whether average prices follow any pattern based on mileage.
unique_km_per_mean_price = {}
unique_km = autos["odometer_km"].value_counts(ascending = False).sort_index(ascending = False).index
for km in unique_km:
    unique_km_row = autos[autos["odometer_km"] == km]["price"].mean().round()
    unique_km_per_mean_price[km] = unique_km_row
unique_km_per_mean_price
{150000: 3654.0, 125000: 6086.0, 100000: 7940.0, 90000: 8350.0, 80000: 9576.0, 70000: 10818.0, 60000: 12287.0, 50000: 13634.0, 40000: 15441.0, 30000: 16414.0, 20000: 17941.0, 10000: 19860.0, 5000: 7699.0}
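Since odometer_km takes only 13 preset values here, the per-value loop works, but `pd.cut` generalizes the "split into groups" step to real bins; a sketch with hypothetical band edges:

```python
import pandas as pd

toy = pd.DataFrame({
    "odometer_km": [5000, 40000, 90000, 150000],
    "price": [7000, 15000, 8000, 3600],
})

# Bin mileage into three hypothetical bands, then average price per band.
bands = pd.cut(toy["odometer_km"], bins=[0, 50000, 100000, 150000],
               labels=["low", "mid", "high"])
price_by_band = toy.groupby(bands)["price"].mean()
```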
km_5000 = autos[autos["odometer_km"] == 5000]["price"]
km_5000
52 3500 71 0 76 31999 102 320 106 150 ... 49722 200 49844 22222 49845 1 49865 5200 49997 13200 Name: price, Length: 901, dtype: int64
I analyzed the average price for each unique mileage value and observed that they are inversely correlated, i.e. the higher the mileage, the lower the average price, except for the lowest bracket, 5,000 km. I probed further and discovered that many of its prices were 0, which drags its average below expectation.
3. How much cheaper are cars with damage than their non-damaged counterparts?
autos["unrepaired_damage"].unique()
array(['none', nan, 'yes'], dtype=object)
difference = (autos[autos["unrepaired_damage"]== "none"]["price"].mean()) - (autos[autos["unrepaired_damage"] == "yes"]["price"].mean())
difference.round(2)
4951.75
On average, damaged cars are $4,951.75 cheaper than their undamaged counterparts.