In this guided project, we will be cleaning and analyze the dataset of the car market in Germany, as well as try to get an overview of the market.
During the import phase, there was a problem while reading the file in UTF-8, after switching to Latin-1 it worked fine.
import numpy as np
import pandas as pd
autos = pd.read_csv('autos.csv', encoding = 'Latin-1')
Let's explore the data we have here.
autos
autos.info()
autos.head()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): dateCrawled 50000 non-null object name 50000 non-null object seller 50000 non-null object offerType 50000 non-null object price 50000 non-null object abtest 50000 non-null object vehicleType 44905 non-null object yearOfRegistration 50000 non-null int64 gearbox 47320 non-null object powerPS 50000 non-null int64 model 47242 non-null object odometer 50000 non-null object monthOfRegistration 50000 non-null int64 fuelType 45518 non-null object brand 50000 non-null object notRepairedDamage 40171 non-null object dateCreated 50000 non-null object nrOfPictures 50000 non-null int64 postalCode 50000 non-null int64 lastSeen 50000 non-null object dtypes: int64(5), object(15) memory usage: 7.6+ MB
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
In this step, we convert column names from camelcase to snakecase, and rename some of the column names so that they are shorter and easier to read.
The columns 'seller' and 'offerType' contains mostly one value each, which doesn't' bring any practical meanings to our analysis, so it is perhaps best that they are removed.
autos = autos.drop(['seller', 'offerType'], axis=1)
autos.columns = ['date_crawled', 'name', 'price', 'abtest',
'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
'odometer', 'registration_month', 'fuel_type', 'brand',
'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code',
'last_seen']
autos.head()
date_crawled | name | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
autos.describe(include='all')
date_crawled | name | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-03-05 16:57:05 | Ford_Fiesta | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
In this step, we convert the 'price' and 'odometer' column to number type, as well as rename 'odometer' to 'odometer_km'.
autos['price'] = autos['price'].str.replace('$','')
autos['price'] = autos['price'].str.replace(',','')
print(autos['price'])
autos['price'] = autos['price'].astype(int)
0 5000 1 8500 2 8990 3 4350 4 1350 5 7900 6 300 7 1990 8 250 9 590 10 999 11 350 12 5299 13 1350 14 3999 15 18900 16 350 17 5500 18 300 19 4150 20 3500 21 41500 22 25450 23 7999 24 48500 25 90 26 777 27 0 28 5250 29 4999 ... 49970 15800 49971 950 49972 3300 49973 6000 49974 0 49975 9700 49976 5900 49977 5500 49978 900 49979 11000 49980 400 49981 2000 49982 1950 49983 600 49984 0 49985 1000 49986 15900 49987 21990 49988 9550 49989 150 49990 17500 49991 500 49992 4800 49993 1650 49994 5000 49995 24900 49996 1980 49997 13200 49998 22900 49999 1250 Name: price, Length: 50000, dtype: object
autos['odometer'] = autos['odometer'].str.replace('km','')
autos['odometer'] = autos['odometer'].str.replace(',','')
autos['odometer'] = autos['odometer'].astype(int)
autos.rename(columns={'odometer':'odometer_km'},inplace=True)
print(autos.columns)
print(autos['price'].value_counts())
Index(['date_crawled', 'name', 'price', 'abtest', 'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model', 'odometer_km', 'registration_month', 'fuel_type', 'brand', 'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code', 'last_seen'], dtype='object') 0 1421 500 781 1500 734 2500 643 1000 639 1200 639 600 531 800 498 3500 498 2000 460 999 434 750 433 900 420 650 419 850 410 700 395 4500 394 300 384 2200 382 950 379 1100 376 1300 371 3000 365 550 356 1800 355 5500 340 1250 335 350 335 1600 327 1999 322 ... 46200 1 29600 1 13480 1 21700 1 7373 1 3279 1 4286 1 188 1 17830 1 9130 1 910 1 238 1 2671 1 69900 1 151990 1 2479 1 4510 1 86500 1 47499 1 16998 1 27299 1 41850 1 4780 1 686 1 6495 1 20790 1 8970 1 846 1 2895 1 33980 1 Name: price, Length: 2357, dtype: int64
autos['odometer_km'].unique().shape
autos['odometer_km'].describe()
autos['odometer_km'].value_counts().sort_index(ascending=False).head()
150000 32424 125000 5170 100000 2169 90000 1757 80000 1436 Name: odometer_km, dtype: int64
The odometer value looks pretty normal, so no changes are needed here.
autos['price'].unique().shape
autos['price'].describe()
count 5.000000e+04 mean 9.840044e+03 std 4.811044e+05 min 0.000000e+00 25% 1.100000e+03 50% 2.950000e+03 75% 7.200000e+03 max 1.000000e+08 Name: price, dtype: float64
Let's take a look at the min and max values of the 'price' column to see if there are any anomalies.
autos['price'].value_counts().sort_index(ascending=False).head(100)
99999999 1 27322222 1 12345678 3 11111111 2 10000000 1 3890000 1 1300000 1 1234566 1 999999 2 999990 1 350000 1 345000 1 299000 1 295000 1 265000 1 259000 1 250000 1 220000 1 198000 1 197000 1 194000 1 190000 1 180000 1 175000 1 169999 1 169000 1 163991 1 163500 1 155000 1 151990 1 .. 75900 1 75000 1 74999 2 74900 3 73996 1 73900 1 73500 1 72900 1 72600 1 72500 1 71000 1 70850 1 70000 1 69999 1 69997 1 69993 1 69900 1 69500 1 68900 1 68750 1 68500 1 68300 1 68000 1 67911 1 67000 1 66964 1 66500 1 65990 1 65700 1 65699 1 Name: price, Length: 100, dtype: int64
autos = autos[autos['price'].between(200,200000)]
autos.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 47637 entries, 0 to 49999 Data columns (total 18 columns): date_crawled 47637 non-null object name 47637 non-null object price 47637 non-null int64 abtest 47637 non-null object vehicle_type 43389 non-null object registration_year 47637 non-null int64 gearbox 45532 non-null object power_ps 47637 non-null int64 model 45307 non-null object odometer_km 47637 non-null int64 registration_month 47637 non-null int64 fuel_type 43907 non-null object brand 47637 non-null object unrepaired_damage 39034 non-null object ad_created 47637 non-null object nr_of_pictures 47637 non-null int64 postal_code 47637 non-null int64 last_seen 47637 non-null object dtypes: int64(7), object(11) memory usage: 6.9+ MB
As the mean is just around 10,000, we can see that some of the top and bottom values are very irrelevant to the rest of the dataset. Let's choose the appropriate range between 200 and 200,000. After the outliers are removed, we can see there are 47,637 items left, still enough for our analysis.
autos['date_crawled'].str[:10].value_counts(normalize=True, dropna=False).sort_index()
2016-03-05 0.025358 2016-03-06 0.014065 2016-03-07 0.036001 2016-03-08 0.033126 2016-03-09 0.033042 2016-03-10 0.032328 2016-03-11 0.032706 2016-03-12 0.036883 2016-03-13 0.015702 2016-03-14 0.036568 2016-03-15 0.034238 2016-03-16 0.029452 2016-03-17 0.031551 2016-03-18 0.012826 2016-03-19 0.034616 2016-03-20 0.037807 2016-03-21 0.037345 2016-03-22 0.032706 2016-03-23 0.032370 2016-03-24 0.029326 2016-03-25 0.031425 2016-03-26 0.032223 2016-03-27 0.031194 2016-03-28 0.034994 2016-03-29 0.033986 2016-03-30 0.033881 2016-03-31 0.031845 2016-04-01 0.033818 2016-04-02 0.035687 2016-04-03 0.038709 2016-04-04 0.036568 2016-04-05 0.013120 2016-04-06 0.003170 2016-04-07 0.001364 Name: date_crawled, dtype: float64
The crawling date range is one month, from early March to early April, 2016. The values in date_crawled are distributed quite evenly, around 3% per day - typical of machine's work I would say.
autos['ad_created'].str[:10].value_counts(normalize=True, dropna=False).sort_index()
2015-06-11 0.000021 2015-08-10 0.000021 2015-09-09 0.000021 2015-11-10 0.000021 2015-12-05 0.000021 2015-12-30 0.000021 2016-01-03 0.000021 2016-01-07 0.000021 2016-01-10 0.000042 2016-01-13 0.000021 2016-01-14 0.000021 2016-01-16 0.000021 2016-01-22 0.000021 2016-01-27 0.000063 2016-01-29 0.000021 2016-02-01 0.000021 2016-02-02 0.000042 2016-02-05 0.000042 2016-02-07 0.000021 2016-02-08 0.000021 2016-02-09 0.000021 2016-02-11 0.000021 2016-02-12 0.000042 2016-02-14 0.000042 2016-02-16 0.000021 2016-02-17 0.000021 2016-02-18 0.000042 2016-02-19 0.000063 2016-02-20 0.000042 2016-02-21 0.000063 ... 2016-03-09 0.033147 2016-03-10 0.032034 2016-03-11 0.033021 2016-03-12 0.036694 2016-03-13 0.017109 2016-03-14 0.035183 2016-03-15 0.033965 2016-03-16 0.029956 2016-03-17 0.031215 2016-03-18 0.013456 2016-03-19 0.033524 2016-03-20 0.037891 2016-03-21 0.037555 2016-03-22 0.032538 2016-03-23 0.032202 2016-03-24 0.029263 2016-03-25 0.031551 2016-03-26 0.032265 2016-03-27 0.031131 2016-03-28 0.035078 2016-03-29 0.033965 2016-03-30 0.033692 2016-03-31 0.031887 2016-04-01 0.033797 2016-04-02 0.035372 2016-04-03 0.038961 2016-04-04 0.036925 2016-04-05 0.011840 2016-04-06 0.003254 2016-04-07 0.001218 Name: ad_created, Length: 76, dtype: float64
The values of ad_created is much more varied, but concentrate mainly in March. The number of ads created in Feb and Jan 2016 is comparably small.
autos['last_seen'].str[:10].value_counts(normalize=True, dropna=False).sort_index()
2016-03-05 0.001092 2016-03-06 0.004303 2016-03-07 0.005374 2016-03-08 0.007179 2016-03-09 0.009614 2016-03-10 0.010475 2016-03-11 0.012301 2016-03-12 0.023910 2016-03-13 0.008901 2016-03-14 0.012532 2016-03-15 0.015723 2016-03-16 0.016290 2016-03-17 0.028087 2016-03-18 0.007284 2016-03-19 0.015618 2016-03-20 0.020656 2016-03-21 0.020509 2016-03-22 0.021412 2016-03-23 0.018494 2016-03-24 0.019628 2016-03-25 0.019082 2016-03-26 0.016689 2016-03-27 0.015492 2016-03-28 0.020740 2016-03-29 0.022189 2016-03-30 0.024582 2016-03-31 0.023868 2016-04-01 0.022902 2016-04-02 0.024813 2016-04-03 0.025086 2016-04-04 0.024624 2016-04-05 0.125449 2016-04-06 0.222432 2016-04-07 0.132670 Name: last_seen, dtype: float64
The values from last_seen column are very unevenly distributed. Almost half of the last seen items are from three days: 5-7Apr.
autos['registration_year'].describe()
count 47637.000000 mean 2004.799631 std 88.431141 min 1000.000000 25% 1999.000000 50% 2004.000000 75% 2008.000000 max 9999.000000 Name: registration_year, dtype: float64
min and max value (1000 and 9999 respectively) of 'registration_year' are not possible, therefore they need to be removed. Based on the mean of the series and 'common sense', we can safely suggest that the acceptable values can be somewhere between early 1900's and 2016 (the date this data is published).
auto_reg_year = autos.loc[: ,'registration_year']
int_reg_year = auto_reg_year.astype(int)
autos = autos[int_reg_year.between(1900,2016)]
At this stage, I kept encoutering 'SettingWithCopyWarning'. So I worked around by separating the instructions into individual ones, which mean not using chained assignments. Let's check if the data has been successfully cleaned.
autos['registration_year'].value_counts(normalize=True).sort_values(ascending=False)
2000 0.065949 2005 0.063480 1999 0.062148 2004 0.058849 2003 0.058806 2006 0.058259 2001 0.057014 2002 0.053956 1998 0.049915 2007 0.049631 2008 0.048167 2009 0.045415 1997 0.040456 2011 0.035323 2010 0.034667 2012 0.028529 1996 0.028442 2016 0.025318 1995 0.024444 2013 0.017410 2014 0.014330 1994 0.012954 1993 0.008869 2015 0.008170 1992 0.007755 1991 0.007209 1990 0.006990 1989 0.003692 1988 0.002905 1985 0.001922 ... 1960 0.000459 1976 0.000459 1969 0.000415 1975 0.000393 1965 0.000371 1964 0.000262 1963 0.000175 1961 0.000131 1959 0.000131 1937 0.000087 1958 0.000087 1962 0.000087 1956 0.000087 1955 0.000044 1954 0.000044 1957 0.000044 1951 0.000044 1910 0.000044 1934 0.000044 1941 0.000044 1948 0.000022 1953 0.000022 1939 0.000022 1938 0.000022 1943 0.000022 1931 0.000022 1950 0.000022 1929 0.000022 1927 0.000022 1952 0.000022 Name: registration_year, Length: 78, dtype: float64
Most used cars were registered from late 90s to the 00s
autos['brand'].value_counts(normalize=True)
volkswagen 0.211281 bmw 0.111232 opel 0.105924 mercedes_benz 0.097645 audi 0.087619 ford 0.068745 renault 0.046551 peugeot 0.029949 fiat 0.025099 seat 0.018175 skoda 0.016514 nissan 0.015291 mazda 0.015138 smart 0.014374 citroen 0.014133 toyota 0.012932 hyundai 0.010114 sonstige_autos 0.009524 volvo 0.009197 mini 0.008891 mitsubishi 0.008170 honda 0.007930 kia 0.007143 alfa_romeo 0.006684 porsche 0.005985 suzuki 0.005811 chevrolet 0.005723 chrysler 0.003561 dacia 0.002687 daihatsu 0.002490 jeep 0.002316 land_rover 0.002141 subaru 0.002097 saab 0.001660 jaguar 0.001529 daewoo 0.001485 rover 0.001333 trabant 0.001289 lancia 0.001049 lada 0.000590 Name: brand, dtype: float64
From the value distribution table above, we can see that European brands are the dominating ones in the German market. Asian and even most American brands are quite far behind. In the next step, we are going to aggregate the most common brands, defined as those whose share is more than 2%.
import pprint
selected_brands = ['volkswagen','bmw','opel','mercedes_benz','audi','ford','renault','peugeot','fiat']
print(autos['price'].mean())
top_brands = {}
for brand in selected_brands:
selected_rows = autos[autos['brand'] == brand]
top_brands[brand] = selected_rows['price'].mean()
sorted_d_price = sorted(top_brands.items(),reverse=True, key=lambda x: x[1])
pprint.pprint(sorted_d_price)
6043.787714622745 [('audi', 9406.093243580155), ('mercedes_benz', 8691.71744966443), ('bmw', 8306.00569520817), ('volkswagen', 5506.4409636062865), ('ford', 3883.292024149984), ('peugeot', 3142.0160466812545), ('opel', 3077.5772324190557), ('fiat', 2925.9547432550044), ('renault', 2552.5152510558423)]
Among the selected brands (those with market share above 2 percent), the most common price point is around 3000 USD with the presence of brands such as Ford, Peugeot, Opel and Fiat. The higher end cars (Audi, Mercedes, and BMW) can be as twice as much as those mentioned above. Renault has the cheapest car in the list.
mean_mileage = {}
for brand in selected_brands:
selected_rows = autos[autos['brand'] == brand]
mean_mileage[brand] = selected_rows['odometer_km'].mean()
sr_price = pd.Series(top_brands, name='mean_price')
sr_price = sr_price.astype(int)
# pprint.pprint(sr_price)
sr_mileage = pd.Series(mean_mileage, name='mean_mileage')
sr_mileage = sr_mileage.astype(int)
# pprint.pprint(sr_mileage)
df = pd.concat([sr_price,sr_mileage],axis=1)
print(df)
mean_price mean_mileage audi 9406 129260 bmw 8306 132839 fiat 2925 116949 ford 3883 124095 mercedes_benz 8691 131091 opel 3077 129231 peugeot 3142 126929 renault 2552 128052 volkswagen 5506 128774
From the dataframe above, we can see that the number of miles driven are quite similar among the most common brands. That might lead to a conclusion that the price in this case depends more on the brand rather than the miles. We can see that the more luxury brands such as Audi, BMW and Mercedes Benz are much more expensive than the rest of the selected brands.
Replace German words with English counterparts
autos = autos.replace(
{
'kleinwagen': 'small car',
'kombi': 'station wagon',
'andere': 'others',
'cabrio': 'convertible',
'bus': 'bus',
'limousine': 'limousine',
'coupe': 'coupe',
'suv': 'suv',
'manuell': 'manual',
'automatik': 'automatic',
'benzin': 'gasoline',
'nein': 'no',
'ja': 'yes'
})
print(autos['vehicle_type'].head(10))
0 bus 1 limousine 2 limousine 3 small car 4 station wagon 5 bus 6 limousine 7 limousine 8 NaN 9 bus Name: vehicle_type, dtype: object
There are 2 columns with datetime format, namely 'last_seen' and 'date_created'
autos['last_seen'] = autos['last_seen'].str[:10]
autos['last_seen'] = autos['last_seen'].str.replace('-','').astype(int)
print(autos['last_seen'].head())
# autos['last_seen'] = autos['last_seen'].astype(int)
0 20160406 1 20160406 2 20160406 3 20160315 4 20160401 Name: last_seen, dtype: int64
Below is a list of 10 most common brand/model combinations
model_brand = autos['brand'] + " " + autos['model']
print(model_brand.value_counts().head(10))
volkswagen golf 3639 bmw 3er 2596 volkswagen polo 1533 opel corsa 1504 volkswagen passat 1343 opel astra 1318 audi a4 1223 mercedes_benz c_klasse 1134 bmw 5er 1123 mercedes_benz e_klasse 952 dtype: int64
Before segmenting the column, we need to explore to see the minimum and maximum values. As the minimum value is near 0 and maximum value is 150000, 3 segments would be quite appropriate:
print(autos['odometer_km'].max())
print(autos['odometer_km'].min())
150000 5000
mileage_under_50k = autos[autos['odometer_km'] < 50000]['price'].mean()
mileage_50k_150k = autos[(autos['odometer_km'] >= 50000) & (autos['odometer_km'] <= 100000)]['price'].mean()
mileage_above_150k = autos[autos['odometer_km'] >= 100000]['price'].mean()
print('Average price of used cars with under 50,000 km:', mileage_under_50k.round())
print('Average price of used cars in the 50,000 - 100,000 km range:',mileage_50k_150k.round())
print('Average price of used cars in the 100,000 - 150,000 km range:',mileage_above_150k.round())
Average price of used cars with under 50,000 km: 15318.0 Average price of used cars in the 50,000 - 100,000 km range: 10147.0 Average price of used cars in the 100,000 - 150,000 km range: 4406.0
From the prices shown above, we can see that a used car's price tends to decline at the same rate as it is used.
To do this, we will compare the average value of damaged and non-damaged cars.
price_unrepaired_damage = autos[autos['unrepaired_damage'] == 'yes']['price'].mean()
price_repaired_damage = autos[autos['unrepaired_damage'] == 'no']['price'].mean()
print('Average price of unrepaired cars:', price_unrepaired_damage.round())
print('Average price of repaired cars:', price_repaired_damage.round())
Average price of unrepaired cars: 2362.0 Average price of repaired cars: 7140.0
As we can see, price of repaired cars is much higher (x3) than unrepaired cars.
In this project, we try to understand the landscape of German used car market by analyzing the data gathered from various sources. Here are some conclusions drawn from the available data: