Guided project: Analyzing used car market in Germany¶

In this guided project, we will be cleaning and analyze the dataset of the car market in Germany, as well as try to get an overview of the market.

During the import phase, there was a problem while reading the file in UTF-8, after switching to Latin-1 it worked fine.

In [1]:

import numpy as np
import pandas as pd
autos = pd.read_csv('autos.csv', encoding = 'Latin-1')

Let's explore the data we have here.

In [2]:

autos
autos.info()
autos.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
dateCrawled            50000 non-null object
name                   50000 non-null object
seller                 50000 non-null object
offerType              50000 non-null object
price                  50000 non-null object
abtest                 50000 non-null object
vehicleType            44905 non-null object
yearOfRegistration     50000 non-null int64
gearbox                47320 non-null object
powerPS                50000 non-null int64
model                  47242 non-null object
odometer               50000 non-null object
monthOfRegistration    50000 non-null int64
fuelType               45518 non-null object
brand                  50000 non-null object
notRepairedDamage      40171 non-null object
dateCreated            50000 non-null object
nrOfPictures           50000 non-null int64
postalCode             50000 non-null int64
lastSeen               50000 non-null object
dtypes: int64(5), object(15)
memory usage: 7.6+ MB

Out[2]:

	dateCrawled	name	seller	offerType	price	abtest	vehicleType	yearOfRegistration	gearbox	powerPS	model	odometer	monthOfRegistration	fuelType	brand	notRepairedDamage	dateCreated	postalCode	lastSeen
0	2016-03-26 17:47:46	Peugeot_807_160_NAVTECH_ON_BOARD	privat	Angebot	$5,000	control	bus	2004	manuell	158	andere	150,000km	3	lpg	peugeot	nein	2016-03-26 00:00:00	79588	2016-04-06 06:45:54
1	2016-04-04 13:38:56	BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik	privat	Angebot	$8,500	control	limousine	1997	automatik	286	7er	150,000km	6	benzin	bmw	nein	2016-04-04 00:00:00	71034	2016-04-06 14:45:08
2	2016-03-26 18:57:24	Volkswagen_Golf_1.6_United	privat	Angebot	$8,990	test	limousine	2009	manuell	102	golf	70,000km	7	benzin	volkswagen	nein	2016-03-26 00:00:00	35394	2016-04-06 20:15:37
3	2016-03-12 16:58:10	Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...	privat	Angebot	$4,350	control	kleinwagen	2007	automatik	71	fortwo	70,000km	6	benzin	smart	nein	2016-03-12 00:00:00	33729	2016-03-15 03:16:28
4	2016-04-01 14:38:50	Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...	privat	Angebot	$1,350	test	kombi	2003	manuell	0	focus	150,000km	7	benzin	ford	nein	2016-04-01 00:00:00	39218	2016-04-01 14:38:50

In this step, we convert column names from camelcase to snakecase, and rename some of the column names so that they are shorter and easier to read.

The columns 'seller' and 'offerType' contains mostly one value each, which doesn't' bring any practical meanings to our analysis, so it is perhaps best that they are removed.

In [3]:

autos = autos.drop(['seller', 'offerType'], axis=1)

In [4]:

autos.columns = ['date_crawled', 'name', 'price', 'abtest',
       'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code',
       'last_seen']
autos.head()

Out[4]:

	date_crawled	name	price	abtest	vehicle_type	registration_year	gearbox	power_ps	model	odometer	registration_month	fuel_type	brand	unrepaired_damage	ad_created	postal_code	last_seen
0	2016-03-26 17:47:46	Peugeot_807_160_NAVTECH_ON_BOARD	$5,000	control	bus	2004	manuell	158	andere	150,000km	3	lpg	peugeot	nein	2016-03-26 00:00:00	79588	2016-04-06 06:45:54
1	2016-04-04 13:38:56	BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik	$8,500	control	limousine	1997	automatik	286	7er	150,000km	6	benzin	bmw	nein	2016-04-04 00:00:00	71034	2016-04-06 14:45:08
2	2016-03-26 18:57:24	Volkswagen_Golf_1.6_United	$8,990	test	limousine	2009	manuell	102	golf	70,000km	7	benzin	volkswagen	nein	2016-03-26 00:00:00	35394	2016-04-06 20:15:37
3	2016-03-12 16:58:10	Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...	$4,350	control	kleinwagen	2007	automatik	71	fortwo	70,000km	6	benzin	smart	nein	2016-03-12 00:00:00	33729	2016-03-15 03:16:28
4	2016-04-01 14:38:50	Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...	$1,350	test	kombi	2003	manuell	0	focus	150,000km	7	benzin	ford	nein	2016-04-01 00:00:00	39218	2016-04-01 14:38:50

Some observation:¶

Columns that have mostly one value need to be dropped: 'nr_of_pictures'
Columns containing numeric data stored as text need to be converted to number type: 'price' and 'odometer'.

Other columns that need more investigation:¶

There are 5 columns that contain null values. We might have to deal with them later on.
The dataset uses both German and English, which causes incoherence. We need to use just one.
The column 'lastSeen' and 'dateCreated' 's format as datetime should be considered conversion as well.
The maximum value of 'registration_year' is 9999, which is impossible.
'power_ps' maximum value is 17700, which is considerably higher than the mean

In [5]:

autos.describe(include='all')

Out[5]:

	date_crawled	name	price	abtest	vehicle_type	registration_year	gearbox	power_ps	model	odometer	registration_month	fuel_type	brand	unrepaired_damage	ad_created	nr_of_pictures	postal_code	last_seen
count	50000	50000	50000	50000	44905	50000.000000	47320	50000.000000	47242	50000	50000.000000	45518	50000	40171	50000	50000.0	50000.000000	50000
unique	48213	38754	2357	2	8	NaN	2	NaN	245	13	NaN	7	40	2	76	NaN	NaN	39481
top	2016-03-05 16:57:05	Ford_Fiesta	$0	test	limousine	NaN	manuell	NaN	golf	150,000km	NaN	benzin	volkswagen	nein	2016-04-03 00:00:00	NaN	NaN	2016-04-07 06:17:27
freq	3	78	1421	25756	12859	NaN	36993	NaN	4024	32424	NaN	30107	10687	35232	1946	NaN	NaN	8
mean	NaN	NaN	NaN	NaN	NaN	2005.073280	NaN	116.355920	NaN	NaN	5.723360	NaN	NaN	NaN	NaN	0.0	50813.627300	NaN
std	NaN	NaN	NaN	NaN	NaN	105.712813	NaN	209.216627	NaN	NaN	3.711984	NaN	NaN	NaN	NaN	0.0	25779.747957	NaN
min	NaN	NaN	NaN	NaN	NaN	1000.000000	NaN	0.000000	NaN	NaN	0.000000	NaN	NaN	NaN	NaN	0.0	1067.000000	NaN
25%	NaN	NaN	NaN	NaN	NaN	1999.000000	NaN	70.000000	NaN	NaN	3.000000	NaN	NaN	NaN	NaN	0.0	30451.000000	NaN
50%	NaN	NaN	NaN	NaN	NaN	2003.000000	NaN	105.000000	NaN	NaN	6.000000	NaN	NaN	NaN	NaN	0.0	49577.000000	NaN
75%	NaN	NaN	NaN	NaN	NaN	2008.000000	NaN	150.000000	NaN	NaN	9.000000	NaN	NaN	NaN	NaN	0.0	71540.000000	NaN
max	NaN	NaN	NaN	NaN	NaN	9999.000000	NaN	17700.000000	NaN	NaN	12.000000	NaN	NaN	NaN	NaN	0.0	99998.000000	NaN

In this step, we convert the 'price' and 'odometer' column to number type, as well as rename 'odometer' to 'odometer_km'.

In [6]:

autos['price'] = autos['price'].str.replace('$','')
autos['price'] = autos['price'].str.replace(',','')
print(autos['price'])
autos['price'] = autos['price'].astype(int)

0         5000
1         8500
2         8990
3         4350
4         1350
5         7900
6          300
7         1990
8          250
9          590
10         999
11         350
12        5299
13        1350
14        3999
15       18900
16         350
17        5500
18         300
19        4150
20        3500
21       41500
22       25450
23        7999
24       48500
25          90
26         777
27           0
28        5250
29        4999
         ...  
49970    15800
49971      950
49972     3300
49973     6000
49974        0
49975     9700
49976     5900
49977     5500
49978      900
49979    11000
49980      400
49981     2000
49982     1950
49983      600
49984        0
49985     1000
49986    15900
49987    21990
49988     9550
49989      150
49990    17500
49991      500
49992     4800
49993     1650
49994     5000
49995    24900
49996     1980
49997    13200
49998    22900
49999     1250
Name: price, Length: 50000, dtype: object

In [7]:

autos['odometer'] = autos['odometer'].str.replace('km','')
autos['odometer'] = autos['odometer'].str.replace(',','')
autos['odometer'] = autos['odometer'].astype(int)
autos.rename(columns={'odometer':'odometer_km'},inplace=True)

print(autos.columns)
print(autos['price'].value_counts())

Index(['date_crawled', 'name', 'price', 'abtest', 'vehicle_type',
       'registration_year', 'gearbox', 'power_ps', 'model', 'odometer_km',
       'registration_month', 'fuel_type', 'brand', 'unrepaired_damage',
       'ad_created', 'nr_of_pictures', 'postal_code', 'last_seen'],
      dtype='object')
0         1421
500        781
1500       734
2500       643
1000       639
1200       639
600        531
800        498
3500       498
2000       460
999        434
750        433
900        420
650        419
850        410
700        395
4500       394
300        384
2200       382
950        379
1100       376
1300       371
3000       365
550        356
1800       355
5500       340
1250       335
350        335
1600       327
1999       322
          ... 
46200        1
29600        1
13480        1
21700        1
7373         1
3279         1
4286         1
188          1
17830        1
9130         1
910          1
238          1
2671         1
69900        1
151990       1
2479         1
4510         1
86500        1
47499        1
16998        1
27299        1
41850        1
4780         1
686          1
6495         1
20790        1
8970         1
846          1
2895         1
33980        1
Name: price, Length: 2357, dtype: int64

In [8]:

autos['odometer_km'].unique().shape
autos['odometer_km'].describe()
autos['odometer_km'].value_counts().sort_index(ascending=False).head()

Out[8]:

150000    32424
125000     5170
100000     2169
90000      1757
80000      1436
Name: odometer_km, dtype: int64

The odometer value looks pretty normal, so no changes are needed here.

In [9]:

autos['price'].unique().shape
autos['price'].describe()

Out[9]:

count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64

Let's take a look at the min and max values of the 'price' column to see if there are any anomalies.

In [10]:

autos['price'].value_counts().sort_index(ascending=False).head(100)

Out[10]:

99999999    1
27322222    1
12345678    3
11111111    2
10000000    1
3890000     1
1300000     1
1234566     1
999999      2
999990      1
350000      1
345000      1
299000      1
295000      1
265000      1
259000      1
250000      1
220000      1
198000      1
197000      1
194000      1
190000      1
180000      1
175000      1
169999      1
169000      1
163991      1
163500      1
155000      1
151990      1
           ..
75900       1
75000       1
74999       2
74900       3
73996       1
73900       1
73500       1
72900       1
72600       1
72500       1
71000       1
70850       1
70000       1
69999       1
69997       1
69993       1
69900       1
69500       1
68900       1
68750       1
68500       1
68300       1
68000       1
67911       1
67000       1
66964       1
66500       1
65990       1
65700       1
65699       1
Name: price, Length: 100, dtype: int64

In [11]:

autos = autos[autos['price'].between(200,200000)]
autos.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 47637 entries, 0 to 49999
Data columns (total 18 columns):
date_crawled          47637 non-null object
name                  47637 non-null object
price                 47637 non-null int64
abtest                47637 non-null object
vehicle_type          43389 non-null object
registration_year     47637 non-null int64
gearbox               45532 non-null object
power_ps              47637 non-null int64
model                 45307 non-null object
odometer_km           47637 non-null int64
registration_month    47637 non-null int64
fuel_type             43907 non-null object
brand                 47637 non-null object
unrepaired_damage     39034 non-null object
ad_created            47637 non-null object
nr_of_pictures        47637 non-null int64
postal_code           47637 non-null int64
last_seen             47637 non-null object
dtypes: int64(7), object(11)
memory usage: 6.9+ MB

As the mean is just around 10,000, we can see that some of the top and bottom values are very irrelevant to the rest of the dataset. Let's choose the appropriate range between 200 and 200,000. After the outliers are removed, we can see there are 47,637 items left, still enough for our analysis.

In [12]:

autos['date_crawled'].str[:10].value_counts(normalize=True, dropna=False).sort_index()

Out[12]:

2016-03-05    0.025358
2016-03-06    0.014065
2016-03-07    0.036001
2016-03-08    0.033126
2016-03-09    0.033042
2016-03-10    0.032328
2016-03-11    0.032706
2016-03-12    0.036883
2016-03-13    0.015702
2016-03-14    0.036568
2016-03-15    0.034238
2016-03-16    0.029452
2016-03-17    0.031551
2016-03-18    0.012826
2016-03-19    0.034616
2016-03-20    0.037807
2016-03-21    0.037345
2016-03-22    0.032706
2016-03-23    0.032370
2016-03-24    0.029326
2016-03-25    0.031425
2016-03-26    0.032223
2016-03-27    0.031194
2016-03-28    0.034994
2016-03-29    0.033986
2016-03-30    0.033881
2016-03-31    0.031845
2016-04-01    0.033818
2016-04-02    0.035687
2016-04-03    0.038709
2016-04-04    0.036568
2016-04-05    0.013120
2016-04-06    0.003170
2016-04-07    0.001364
Name: date_crawled, dtype: float64

The crawling date range is one month, from early March to early April, 2016. The values in date_crawled are distributed quite evenly, around 3% per day - typical of machine's work I would say.

In [13]:

autos['ad_created'].str[:10].value_counts(normalize=True, dropna=False).sort_index()

Out[13]:

2015-06-11    0.000021
2015-08-10    0.000021
2015-09-09    0.000021
2015-11-10    0.000021
2015-12-05    0.000021
2015-12-30    0.000021
2016-01-03    0.000021
2016-01-07    0.000021
2016-01-10    0.000042
2016-01-13    0.000021
2016-01-14    0.000021
2016-01-16    0.000021
2016-01-22    0.000021
2016-01-27    0.000063
2016-01-29    0.000021
2016-02-01    0.000021
2016-02-02    0.000042
2016-02-05    0.000042
2016-02-07    0.000021
2016-02-08    0.000021
2016-02-09    0.000021
2016-02-11    0.000021
2016-02-12    0.000042
2016-02-14    0.000042
2016-02-16    0.000021
2016-02-17    0.000021
2016-02-18    0.000042
2016-02-19    0.000063
2016-02-20    0.000042
2016-02-21    0.000063
                ...   
2016-03-09    0.033147
2016-03-10    0.032034
2016-03-11    0.033021
2016-03-12    0.036694
2016-03-13    0.017109
2016-03-14    0.035183
2016-03-15    0.033965
2016-03-16    0.029956
2016-03-17    0.031215
2016-03-18    0.013456
2016-03-19    0.033524
2016-03-20    0.037891
2016-03-21    0.037555
2016-03-22    0.032538
2016-03-23    0.032202
2016-03-24    0.029263
2016-03-25    0.031551
2016-03-26    0.032265
2016-03-27    0.031131
2016-03-28    0.035078
2016-03-29    0.033965
2016-03-30    0.033692
2016-03-31    0.031887
2016-04-01    0.033797
2016-04-02    0.035372
2016-04-03    0.038961
2016-04-04    0.036925
2016-04-05    0.011840
2016-04-06    0.003254
2016-04-07    0.001218
Name: ad_created, Length: 76, dtype: float64

The values of ad_created is much more varied, but concentrate mainly in March. The number of ads created in Feb and Jan 2016 is comparably small.

In [14]:

autos['last_seen'].str[:10].value_counts(normalize=True, dropna=False).sort_index()

Out[14]:

2016-03-05    0.001092
2016-03-06    0.004303
2016-03-07    0.005374
2016-03-08    0.007179
2016-03-09    0.009614
2016-03-10    0.010475
2016-03-11    0.012301
2016-03-12    0.023910
2016-03-13    0.008901
2016-03-14    0.012532
2016-03-15    0.015723
2016-03-16    0.016290
2016-03-17    0.028087
2016-03-18    0.007284
2016-03-19    0.015618
2016-03-20    0.020656
2016-03-21    0.020509
2016-03-22    0.021412
2016-03-23    0.018494
2016-03-24    0.019628
2016-03-25    0.019082
2016-03-26    0.016689
2016-03-27    0.015492
2016-03-28    0.020740
2016-03-29    0.022189
2016-03-30    0.024582
2016-03-31    0.023868
2016-04-01    0.022902
2016-04-02    0.024813
2016-04-03    0.025086
2016-04-04    0.024624
2016-04-05    0.125449
2016-04-06    0.222432
2016-04-07    0.132670
Name: last_seen, dtype: float64

The values from last_seen column are very unevenly distributed. Almost half of the last seen items are from three days: 5-7Apr.

In [15]:

autos['registration_year'].describe()

Out[15]:

count    47637.000000
mean      2004.799631
std         88.431141
min       1000.000000
25%       1999.000000
50%       2004.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

min and max value (1000 and 9999 respectively) of 'registration_year' are not possible, therefore they need to be removed. Based on the mean of the series and 'common sense', we can safely suggest that the acceptable values can be somewhere between early 1900's and 2016 (the date this data is published).

In [16]:

auto_reg_year = autos.loc[: ,'registration_year']
int_reg_year = auto_reg_year.astype(int)
autos = autos[int_reg_year.between(1900,2016)]

At this stage, I kept encoutering 'SettingWithCopyWarning'. So I worked around by separating the instructions into individual ones, which mean not using chained assignments. Let's check if the data has been successfully cleaned.

In [17]:

autos['registration_year'].value_counts(normalize=True).sort_values(ascending=False)

Out[17]:

2000    0.065949
2005    0.063480
1999    0.062148
2004    0.058849
2003    0.058806
2006    0.058259
2001    0.057014
2002    0.053956
1998    0.049915
2007    0.049631
2008    0.048167
2009    0.045415
1997    0.040456
2011    0.035323
2010    0.034667
2012    0.028529
1996    0.028442
2016    0.025318
1995    0.024444
2013    0.017410
2014    0.014330
1994    0.012954
1993    0.008869
2015    0.008170
1992    0.007755
1991    0.007209
1990    0.006990
1989    0.003692
1988    0.002905
1985    0.001922
          ...   
1960    0.000459
1976    0.000459
1969    0.000415
1975    0.000393
1965    0.000371
1964    0.000262
1963    0.000175
1961    0.000131
1959    0.000131
1937    0.000087
1958    0.000087
1962    0.000087
1956    0.000087
1955    0.000044
1954    0.000044
1957    0.000044
1951    0.000044
1910    0.000044
1934    0.000044
1941    0.000044
1948    0.000022
1953    0.000022
1939    0.000022
1938    0.000022
1943    0.000022
1931    0.000022
1950    0.000022
1929    0.000022
1927    0.000022
1952    0.000022
Name: registration_year, Length: 78, dtype: float64

Most used cars were registered from late 90s to the 00s

In [18]:

autos['brand'].value_counts(normalize=True)

Out[18]:

volkswagen        0.211281
bmw               0.111232
opel              0.105924
mercedes_benz     0.097645
audi              0.087619
ford              0.068745
renault           0.046551
peugeot           0.029949
fiat              0.025099
seat              0.018175
skoda             0.016514
nissan            0.015291
mazda             0.015138
smart             0.014374
citroen           0.014133
toyota            0.012932
hyundai           0.010114
sonstige_autos    0.009524
volvo             0.009197
mini              0.008891
mitsubishi        0.008170
honda             0.007930
kia               0.007143
alfa_romeo        0.006684
porsche           0.005985
suzuki            0.005811
chevrolet         0.005723
chrysler          0.003561
dacia             0.002687
daihatsu          0.002490
jeep              0.002316
land_rover        0.002141
subaru            0.002097
saab              0.001660
jaguar            0.001529
daewoo            0.001485
rover             0.001333
trabant           0.001289
lancia            0.001049
lada              0.000590
Name: brand, dtype: float64

From the value distribution table above, we can see that European brands are the dominating ones in the German market. Asian and even most American brands are quite far behind. In the next step, we are going to aggregate the most common brands, defined as those whose share is more than 2%.

In [19]:

import pprint
selected_brands = ['volkswagen','bmw','opel','mercedes_benz','audi','ford','renault','peugeot','fiat']

print(autos['price'].mean())
top_brands = {}
for brand in selected_brands:
    selected_rows = autos[autos['brand'] == brand]
    top_brands[brand] = selected_rows['price'].mean()

sorted_d_price = sorted(top_brands.items(),reverse=True, key=lambda x: x[1])
pprint.pprint(sorted_d_price)

6043.787714622745
[('audi', 9406.093243580155),
 ('mercedes_benz', 8691.71744966443),
 ('bmw', 8306.00569520817),
 ('volkswagen', 5506.4409636062865),
 ('ford', 3883.292024149984),
 ('peugeot', 3142.0160466812545),
 ('opel', 3077.5772324190557),
 ('fiat', 2925.9547432550044),
 ('renault', 2552.5152510558423)]

Among the selected brands (those with market share above 2 percent), the most common price point is around 3000 USD with the presence of brands such as Ford, Peugeot, Opel and Fiat. The higher end cars (Audi, Mercedes, and BMW) can be as twice as much as those mentioned above. Renault has the cheapest car in the list.

In [20]:

mean_mileage = {}
for brand in selected_brands:
    selected_rows = autos[autos['brand'] == brand]
    mean_mileage[brand] = selected_rows['odometer_km'].mean()

sr_price = pd.Series(top_brands, name='mean_price')
sr_price = sr_price.astype(int)
# pprint.pprint(sr_price)
sr_mileage = pd.Series(mean_mileage, name='mean_mileage')
sr_mileage = sr_mileage.astype(int)
# pprint.pprint(sr_mileage)

df = pd.concat([sr_price,sr_mileage],axis=1)
print(df)

               mean_price  mean_mileage
audi                 9406        129260
bmw                  8306        132839
fiat                 2925        116949
ford                 3883        124095
mercedes_benz        8691        131091
opel                 3077        129231
peugeot              3142        126929
renault              2552        128052
volkswagen           5506        128774

From the dataframe above, we can see that the number of miles driven are quite similar among the most common brands. That might lead to a conclusion that the price in this case depends more on the brand rather than the miles. We can see that the more luxury brands such as Audi, BMW and Mercedes Benz are much more expensive than the rest of the selected brands.

Additional Challenges #1:¶

Replace German words with English counterparts

In [21]:

autos = autos.replace(
{
    'kleinwagen': 'small car',
    'kombi': 'station wagon',
    'andere': 'others',
    'cabrio': 'convertible',
    'bus': 'bus',
    'limousine': 'limousine',
    'coupe': 'coupe',
    'suv': 'suv',
    'manuell': 'manual',
    'automatik': 'automatic',
    'benzin': 'gasoline',
    'nein': 'no',
    'ja': 'yes'
})

print(autos['vehicle_type'].head(10))

0              bus
1        limousine
2        limousine
3        small car
4    station wagon
5              bus
6        limousine
7        limousine
8              NaN
9              bus
Name: vehicle_type, dtype: object

Additional Challenges #2: Convert the dates to integer:¶

There are 2 columns with datetime format, namely 'last_seen' and 'date_created'

In [22]:

autos['last_seen'] = autos['last_seen'].str[:10]
autos['last_seen'] = autos['last_seen'].str.replace('-','').astype(int)
print(autos['last_seen'].head())

# autos['last_seen'] = autos['last_seen'].astype(int)

0    20160406
1    20160406
2    20160406
3    20160315
4    20160401
Name: last_seen, dtype: int64

Additional Challenges #3: Find the most common brand/model combinations¶

Below is a list of 10 most common brand/model combinations

In [23]:

model_brand = autos['brand'] + " " + autos['model']
print(model_brand.value_counts().head(10))

volkswagen golf           3639
bmw 3er                   2596
volkswagen polo           1533
opel corsa                1504
volkswagen passat         1343
opel astra                1318
audi a4                   1223
mercedes_benz c_klasse    1134
bmw 5er                   1123
mercedes_benz e_klasse     952
dtype: int64

Additional Challenges #4: Find the relationship between price and mileage¶

Before segmenting the column, we need to explore to see the minimum and maximum values. As the minimum value is near 0 and maximum value is 150000, 3 segments would be quite appropriate:

0 - 50,000 km
50,000 - 100,000 km
100,000 - 150,000 km

In [24]:

print(autos['odometer_km'].max())
print(autos['odometer_km'].min())

150000
5000

In [25]:

mileage_under_50k = autos[autos['odometer_km'] < 50000]['price'].mean()
mileage_50k_150k = autos[(autos['odometer_km'] >= 50000) & (autos['odometer_km'] <= 100000)]['price'].mean()
mileage_above_150k = autos[autos['odometer_km'] >= 100000]['price'].mean()
print('Average price of used cars with under 50,000 km:', mileage_under_50k.round())
print('Average price of used cars in the 50,000 - 100,000 km range:',mileage_50k_150k.round())
print('Average price of used cars in the 100,000 - 150,000 km range:',mileage_above_150k.round())

Average price of used cars with under 50,000 km: 15318.0
Average price of used cars in the 50,000 - 100,000 km range: 10147.0
Average price of used cars in the 100,000 - 150,000 km range: 4406.0

From the prices shown above, we can see that a used car's price tends to decline at the same rate as it is used.

Additional Challenges #5: Compare the price of damaged and non-damaged cars¶

To do this, we will compare the average value of damaged and non-damaged cars.

In [36]:

price_unrepaired_damage = autos[autos['unrepaired_damage'] == 'yes']['price'].mean()
price_repaired_damage = autos[autos['unrepaired_damage'] == 'no']['price'].mean()
print('Average price of unrepaired cars:', price_unrepaired_damage.round())
print('Average price of repaired cars:', price_repaired_damage.round())

Average price of unrepaired cars: 2362.0
Average price of repaired cars: 7140.0

As we can see, price of repaired cars is much higher (x3) than unrepaired cars.

Conclusion:¶

In this project, we try to understand the landscape of German used car market by analyzing the data gathered from various sources. Here are some conclusions drawn from the available data:

Most used cars on the market are around 20 years old, as of 2016 (the date of registration ranges from 90s to early 2000s).
European brands dominate the German used car market. The four most common brands are: Volkswagen, BMW, Opel, Mercedes Benz, and Audi.
Most used cars cost about USD3,000, but the higher end cars can be as twice as expensive.
The data shows that the more a car is used (higher miles driven), the less costly it becomes. However, we are not sure if the relationship is linear or not, and whether different brands have the same depreciation rate.