In this guided project, we'll work with a dataset of used cars from eBay Kleinanzeigen, a classifieds section of the German eBay website. The data is 50,000 sampled data points from the full dataset.
# Importing packages and reading in the csv data
import pandas as pd
autos = pd.read_csv("autos.csv", encoding = "Latin-1")
autos
date_crawled | name | seller | offer_type | price_usd | abtest | vehicle_type | registration_year | gearbox | power_PS | model | kilometers | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | 5000 | control | bus | 2004 | manuell | 158 | andere | 150000 | 3 | lpg | peugeot | nein | 2016-03-26 | 0 | 79588 | 2016-04-06 |
1 | 2016-04-04 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | 8500 | control | limousine | 1997 | automatik | 286 | 7er | 150000 | 6 | benzin | bmw | nein | 2016-04-04 | 0 | 71034 | 2016-04-06 |
2 | 2016-03-26 | Volkswagen_Golf_1.6_United | privat | Angebot | 8990 | test | limousine | 2009 | manuell | 102 | golf | 70000 | 7 | benzin | volkswagen | nein | 2016-03-26 | 0 | 35394 | 2016-04-06 |
3 | 2016-03-12 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | 4350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70000 | 6 | benzin | smart | nein | 2016-03-12 | 0 | 33729 | 2016-03-15 |
4 | 2016-04-01 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | 1350 | test | kombi | 2003 | manuell | 0 | focus | 150000 | 7 | benzin | ford | nein | 2016-04-01 | 0 | 39218 | 2016-04-01 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
49995 | 2016-03-27 | Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon | privat | Angebot | 24900 | control | limousine | 2011 | automatik | 239 | q5 | 100000 | 1 | diesel | audi | nein | 2016-03-27 | 0 | 82131 | 2016-04-01 |
49996 | 2016-03-28 | Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... | privat | Angebot | 1980 | control | cabrio | 1996 | manuell | 75 | astra | 150000 | 5 | benzin | opel | nein | 2016-03-28 | 0 | 44807 | 2016-04-02 |
49997 | 2016-04-02 | Fiat_500_C_1.2_Dualogic_Lounge | privat | Angebot | 13200 | test | cabrio | 2014 | automatik | 69 | 500 | 5000 | 11 | benzin | fiat | nein | 2016-04-02 | 0 | 73430 | 2016-04-04 |
49998 | 2016-03-08 | Audi_A3_2.0_TDI_Sportback_Ambition | privat | Angebot | 22900 | control | kombi | 2013 | manuell | 150 | a3 | 40000 | 11 | diesel | audi | nein | 2016-03-08 | 0 | 35683 | 2016-04-05 |
49999 | 2016-03-14 | Opel_Vectra_1.6_16V | privat | Angebot | 1250 | control | limousine | 1996 | manuell | 101 | vectra | 150000 | 1 | benzin | opel | nein | 2016-03-13 | 0 | 45897 | 2016-04-06 |
50000 rows × 20 columns
The dataset contains 50,000 rows and 20 columns.
autos.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 dateCrawled 50000 non-null object 1 name 50000 non-null object 2 seller 50000 non-null object 3 offerType 50000 non-null object 4 price 50000 non-null object 5 abtest 50000 non-null object 6 vehicleType 44905 non-null object 7 yearOfRegistration 50000 non-null int64 8 gearbox 47320 non-null object 9 powerPS 50000 non-null int64 10 model 47242 non-null object 11 odometer 50000 non-null object 12 monthOfRegistration 50000 non-null int64 13 fuelType 45518 non-null object 14 brand 50000 non-null object 15 notRepairedDamage 40171 non-null object 16 dateCreated 50000 non-null object 17 nrOfPictures 50000 non-null int64 18 postalCode 50000 non-null int64 19 lastSeen 50000 non-null object dtypes: int64(5), object(15) memory usage: 7.6+ MB
for i in autos.columns:
print(i)
dateCrawled name seller offerType price abtest vehicleType yearOfRegistration gearbox powerPS model odometer monthOfRegistration fuelType brand notRepairedDamage dateCreated nrOfPictures postalCode lastSeen
old_columns = autos.columns.copy()
new_columns = [
'date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
'vehicle_type', 'registration_year', 'gearbox', 'power_PS', 'model',
'kilometers', 'registration_month', 'fuel_type', 'brand',
'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code',
'last_seen'
]
autos.columns = new_columns
autos.head()
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_PS | model | kilometers | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
These are the new column names that will be used from now on in the dataframe. The names were changed from the original camelCase to snake_case, and a few were renamed for clarity (e.g. `odometer` to `kilometers`, `notRepairedDamage` to `unrepaired_damage`).
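Instead of overwriting the whole `columns` attribute, the rename can also be done with `DataFrame.rename`, which only touches the columns you name. A minimal sketch on a toy frame (the three camelCase names are taken from the `autos.info()` output above):

```python
import pandas as pd

# Toy frame with a subset of the original camelCase column names
autos_demo = pd.DataFrame(columns=["dateCrawled", "offerType", "yearOfRegistration"])

# rename maps old -> new; any column not in the dict is left untouched
autos_demo = autos_demo.rename(columns={
    "dateCrawled": "date_crawled",
    "offerType": "offer_type",
    "yearOfRegistration": "registration_year",
})
print(list(autos_demo.columns))
# ['date_crawled', 'offer_type', 'registration_year']
```

This is safer than assigning a full list, because a list assignment silently misaligns names if the column order ever changes.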
autos.describe(include='all')
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-03-27 22:55:05 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
The seller and offer_type columns appear to contain almost a single value each (49,999 of 50,000 rows share one value), so they are investigated further below. The abtest column, by contrast, splits roughly evenly between its two values.
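Near-constant columns like these can also be found programmatically rather than by eyeballing `describe()`. A sketch (not from the original notebook) that flags any column whose most frequent value covers more than 95% of rows:

```python
import pandas as pd

# Toy frame: 'seller' is near-constant, 'price' is not
df = pd.DataFrame({
    "seller": ["privat"] * 99 + ["gewerblich"],
    "price": list(range(100)),
})

# Share of the single most common value per column; a share near 1.0
# means the column carries almost no information.
top_share = df.apply(lambda col: col.value_counts(normalize=True).iloc[0])
print(top_share[top_share > 0.95].index.tolist())
# ['seller']
```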
autos["seller"].unique()
array(['privat', 'gewerblich'], dtype=object)
autos.loc[autos['seller'] == 'privat', 'seller'].value_counts()
privat 49999 Name: seller, dtype: int64
autos["offer_type"].unique()
array(['Angebot', 'Gesuch'], dtype=object)
autos.loc[autos['offer_type'] == 'Angebot', 'offer_type'].value_counts()
Angebot 49999 Name: offer_type, dtype: int64
The seller and offer_type columns could be dropped, since all but one row in each contains the same value.
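Although the notebook keeps these columns for now, dropping them would be a one-liner with `DataFrame.drop`. A sketch on a toy frame:

```python
import pandas as pd

autos_demo = pd.DataFrame({
    "seller": ["privat", "privat"],
    "offer_type": ["Angebot", "Angebot"],
    "price": [5000, 8500],
})

# drop(columns=...) returns a new frame without the named columns
autos_demo = autos_demo.drop(columns=["seller", "offer_type"])
print(list(autos_demo.columns))
# ['price']
```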
autos.dtypes
date_crawled object name object seller object offer_type object price object abtest object vehicle_type object registration_year int64 gearbox object power_PS int64 model object kilometers object registration_month int64 fuel_type object brand object unrepaired_damage object ad_created object nr_of_pictures int64 postal_code int64 last_seen object dtype: object
The information in the price and kilometers columns should be integers, while ad_created and last_seen are dates stored as strings.
autos["price"].tail()
49995 $24,900 49996 $1,980 49997 $13,200 49998 $22,900 49999 $1,250 Name: price, dtype: object
For price, the dollar sign and commas are removed, the type is converted to integer, and the column is renamed to price_usd.
# regex=False treats '$' as a literal character rather than a regex anchor
autos['price'] = autos['price'].str.replace('$', "", regex=False)\
                               .str.replace(",", "", regex=False)\
                               .astype(int)
autos.rename(columns={'price': 'price_usd'}, inplace=True)
autos["price_usd"].head()
0 5000 1 8500 2 8990 3 4350 4 1350 Name: price_usd, dtype: int64
autos["kilometers"].head()
0 150,000km 1 150,000km 2 70,000km 3 70,000km 4 150,000km Name: kilometers, dtype: object
For kilometers, the "km" suffix and commas are removed and the type is converted to integer.
autos['kilometers'] = autos['kilometers'].str.replace('km', "", regex=False)\
                                         .str.replace(",", "", regex=False)\
                                         .astype(int)
autos["kilometers"].head()
0 150000 1 150000 2 70000 3 70000 4 150000 Name: kilometers, dtype: int64
print(autos["kilometers"].unique().shape)
autos["kilometers"].describe()
(13,)
count 50000.000000 mean 125732.700000 std 40042.211706 min 5000.000000 25% 125000.000000 50% 150000.000000 75% 150000.000000 max 150000.000000 Name: kilometers, dtype: float64
autos["kilometers"].value_counts()
150000 32424 125000 5170 100000 2169 90000 1757 80000 1436 70000 1230 60000 1164 50000 1027 5000 967 40000 819 30000 789 20000 784 10000 264 Name: kilometers, dtype: int64
print(autos["price_usd"].unique().shape)
autos['price_usd'].describe().apply(lambda x: format(x, 'f'))
(2357,)
count 50000.000000 mean 9840.043760 std 481104.380500 min 0.000000 25% 1100.000000 50% 2950.000000 75% 7200.000000 max 99999999.000000 Name: price_usd, dtype: object
autos["price_usd"].value_counts().head(10)
0 1421 500 781 1500 734 2500 643 1000 639 1200 639 600 531 800 498 3500 498 2000 460 Name: price_usd, dtype: int64
autos["price_usd"].value_counts().sort_index(ascending=False).head(50)
99999999 1 27322222 1 12345678 3 11111111 2 10000000 1 3890000 1 1300000 1 1234566 1 999999 2 999990 1 350000 1 345000 1 299000 1 295000 1 265000 1 259000 1 250000 1 220000 1 198000 1 197000 1 194000 1 190000 1 180000 1 175000 1 169999 1 169000 1 163991 1 163500 1 155000 1 151990 1 145000 1 139997 1 137999 1 135000 1 130000 1 129000 1 128000 1 120000 2 119900 1 119500 1 116000 1 115991 1 115000 1 114400 1 109999 1 105000 2 104900 1 99900 2 99000 2 98500 1 Name: price_usd, dtype: int64
autos["price_usd"].value_counts().sort_index(ascending=True)[100:150]
330 8 333 17 340 11 349 15 350 335 356 1 359 1 360 8 369 1 370 21 375 7 378 1 379 1 380 29 385 1 388 1 390 26 395 2 399 71 400 321 410 1 414 1 420 9 425 4 430 13 435 1 440 1 444 8 449 12 450 265 459 1 460 5 470 7 475 4 480 25 485 2 490 34 495 4 499 92 500 781 501 1 510 2 517 1 520 8 525 4 530 8 540 2 549 13 550 356 554 1 Name: price_usd, dtype: int64
For price, we keep only values between $100 and $900,000. The handful of listings above that (up to $99,999,999) look like placeholder values rather than real prices, and listings under $100 are unlikely to be genuine sales.
autos = autos[autos["price_usd"].between(100,900000)]
autos.shape
(48224, 20)
autos["price_usd"].describe().apply(lambda x: format(x, "f"))
count 48224.000000 mean 5930.371433 std 9078.372762 min 100.000000 25% 1250.000000 50% 3000.000000 75% 7499.000000 max 350000.000000 Name: price_usd, dtype: object
autos["price_usd"].value_counts().sort_index(ascending=False).head()
350000 1 345000 1 299000 1 295000 1 265000 1 Name: price_usd, dtype: int64
The remaining values look much more plausible now that the outliers have been removed.
autos[["date_crawled", "ad_created", "last_seen"]].head()
date_crawled | ad_created | last_seen | |
---|---|---|---|
0 | 2016-03-26 17:47:46 | 2016-03-26 00:00:00 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | 2016-04-04 00:00:00 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | 2016-03-26 00:00:00 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | 2016-03-12 00:00:00 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | 2016-04-01 00:00:00 | 2016-04-01 14:38:50 |
autos.loc[:, "date_crawled"] = autos.loc[:, "date_crawled"].str[:10]
autos["date_crawled"].value_counts(normalize=True, dropna=False).sort_index(ascending=True)
2016-03-05 0.025361 2016-03-06 0.014039 2016-03-07 0.036061 2016-03-08 0.033179 2016-03-09 0.033013 2016-03-10 0.032287 2016-03-11 0.032598 2016-03-12 0.036911 2016-03-13 0.015677 2016-03-14 0.036662 2016-03-15 0.034319 2016-03-16 0.029467 2016-03-17 0.031499 2016-03-18 0.012898 2016-03-19 0.034734 2016-03-20 0.037803 2016-03-21 0.037201 2016-03-22 0.032888 2016-03-23 0.032287 2016-03-24 0.029446 2016-03-25 0.031499 2016-03-26 0.032308 2016-03-27 0.031126 2016-03-28 0.034962 2016-03-29 0.034112 2016-03-30 0.033738 2016-03-31 0.031851 2016-04-01 0.033697 2016-04-02 0.035605 2016-04-03 0.038611 2016-04-04 0.036538 2016-04-05 0.013064 2016-04-06 0.003173 2016-04-07 0.001389 Name: date_crawled, dtype: float64
The data was crawled daily between 5 March and 7 April 2016. Most days each account for roughly 1 to 4 percent of the listings, except 6 and 7 April, which each account for less than 1 percent.
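Slicing the first 10 characters keeps these columns as strings. An alternative (not used in this notebook) is to parse them with `pd.to_datetime`, which makes date arithmetic possible. A sketch on two timestamps from the table above:

```python
import pandas as pd

s = pd.Series(["2016-03-26 17:47:46", "2016-04-04 13:38:56"])

dates = pd.to_datetime(s)     # full timestamps as datetime64
days = dates.dt.normalize()   # truncated to midnight, i.e. just the date part

# Timedelta between the latest and earliest timestamp
print((dates.max() - dates.min()).days)
# 8
```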
pd.options.display.max_rows = 100
autos.loc[:, "ad_created"] = autos.loc[:, "ad_created"].str[:10]
autos["ad_created"].value_counts(normalize=True, dropna=False).sort_index(ascending=True)
2015-06-11 0.000021 2015-08-10 0.000021 2015-09-09 0.000021 2015-11-10 0.000021 2015-12-05 0.000021 2015-12-30 0.000021 2016-01-03 0.000021 2016-01-07 0.000021 2016-01-10 0.000041 2016-01-13 0.000021 2016-01-14 0.000021 2016-01-16 0.000021 2016-01-22 0.000021 2016-01-27 0.000062 2016-01-29 0.000021 2016-02-01 0.000021 2016-02-02 0.000041 2016-02-05 0.000041 2016-02-07 0.000021 2016-02-08 0.000021 2016-02-09 0.000021 2016-02-11 0.000021 2016-02-12 0.000041 2016-02-14 0.000041 2016-02-16 0.000021 2016-02-17 0.000021 2016-02-18 0.000041 2016-02-19 0.000062 2016-02-20 0.000041 2016-02-21 0.000062 2016-02-22 0.000021 2016-02-23 0.000083 2016-02-24 0.000041 2016-02-25 0.000062 2016-02-26 0.000041 2016-02-27 0.000124 2016-02-28 0.000207 2016-02-29 0.000166 2016-03-01 0.000104 2016-03-02 0.000104 2016-03-03 0.000871 2016-03-04 0.001493 2016-03-05 0.022914 2016-03-06 0.015304 2016-03-07 0.034796 2016-03-08 0.033179 2016-03-09 0.033096 2016-03-10 0.031997 2016-03-11 0.032909 2016-03-12 0.036745 2016-03-13 0.017045 2016-03-14 0.035294 2016-03-15 0.034049 2016-03-16 0.029964 2016-03-17 0.031167 2016-03-18 0.013582 2016-03-19 0.033614 2016-03-20 0.037865 2016-03-21 0.037429 2016-03-22 0.032702 2016-03-23 0.032121 2016-03-24 0.029384 2016-03-25 0.031623 2016-03-26 0.032370 2016-03-27 0.031043 2016-03-28 0.035066 2016-03-29 0.034070 2016-03-30 0.033552 2016-03-31 0.031893 2016-04-01 0.033676 2016-04-02 0.035294 2016-04-03 0.038860 2016-04-04 0.036890 2016-04-05 0.011799 2016-04-06 0.003256 2016-04-07 0.001244 Name: ad_created, dtype: float64
The ads were created between 11 June 2015 and 7 April 2016. Ad creation starts slowly from June 2015, ramps up sharply in early March 2016, remains roughly stable for most of the crawl period, and tails off again at the beginning of April 2016.
autos.loc[:, "last_seen"] = autos.loc[:, "last_seen"].str[:10]
autos["last_seen"].value_counts(normalize=True, dropna=False).sort_index(ascending=True)
2016-03-05 0.001078 2016-03-06 0.004313 2016-03-07 0.005433 2016-03-08 0.007320 2016-03-09 0.009580 2016-03-10 0.010638 2016-03-11 0.012400 2016-03-12 0.023785 2016-03-13 0.008875 2016-03-14 0.012629 2016-03-15 0.015863 2016-03-16 0.016444 2016-03-17 0.028098 2016-03-18 0.007320 2016-03-19 0.015760 2016-03-20 0.020654 2016-03-21 0.020550 2016-03-22 0.021359 2016-03-23 0.018580 2016-03-24 0.019762 2016-03-25 0.019098 2016-03-26 0.016672 2016-03-27 0.015552 2016-03-28 0.020840 2016-03-29 0.022292 2016-03-30 0.024697 2016-03-31 0.023826 2016-04-01 0.022852 2016-04-02 0.024884 2016-04-03 0.025133 2016-04-04 0.024531 2016-04-05 0.125062 2016-04-06 0.221964 2016-04-07 0.132154 Name: last_seen, dtype: float64
The last_seen dates also fall between 5 March and 7 April 2016, but they are far from uniform: the final three days (5 to 7 April) account for nearly half of all values. This spike is almost certainly an artifact of the crawl ending on 7 April, not a surge of cars being sold on those days.
autos["registration_year"].head()
0 2004 1 1997 2 2009 3 2007 4 2003 Name: registration_year, dtype: int64
autos["registration_year"].describe()
count 48224.000000 mean 2004.730964 std 87.897388 min 1000.000000 25% 1999.000000 50% 2004.000000 75% 2008.000000 max 9999.000000 Name: registration_year, dtype: float64
The minimum registration year of 1000 and the maximum of 9999 are clearly invalid: cars did not exist in the year 1000, and 9999 is in the future. These rows should be removed.
# Determining number of listings with cars that fall outside
# the 1900 - 2016 interval
print(autos[(autos["registration_year"] < 1900) |
(autos["registration_year"] > 2016)].shape)
autos.shape
(1872, 20)
(48224, 20)
1,872 of the 48,224 rows (approximately 4%) fall outside the 1900-2016 interval, so they can safely be removed.
autos = autos[autos["registration_year"].between(1900,2016)]
autos["registration_year"].value_counts(normalize=True).sort_index(ascending=True)
1910 0.000043 1927 0.000022 1929 0.000022 1931 0.000022 1934 0.000043 1937 0.000086 1938 0.000022 1939 0.000022 1941 0.000043 1943 0.000022 1948 0.000022 1950 0.000022 1951 0.000043 1952 0.000022 1953 0.000022 1954 0.000043 1955 0.000043 1956 0.000086 1957 0.000043 1958 0.000086 1959 0.000129 1960 0.000475 1961 0.000129 1962 0.000086 1963 0.000173 1964 0.000259 1965 0.000367 1966 0.000475 1967 0.000561 1968 0.000561 1969 0.000410 1970 0.000798 1971 0.000561 1972 0.000712 1973 0.000496 1974 0.000518 1975 0.000388 1976 0.000453 1977 0.000475 1978 0.000906 1979 0.000734 1980 0.001747 1981 0.000582 1982 0.000885 1983 0.001100 1984 0.001100 1985 0.002006 1986 0.001532 1987 0.001553 1988 0.002869 1989 0.003689 1990 0.007163 1991 0.007292 1992 0.007918 1993 0.009061 1994 0.013505 1995 0.025738 1996 0.029233 1997 0.041530 1998 0.050483 1999 0.062112 2000 0.066966 2001 0.056718 2002 0.053439 2003 0.058099 2004 0.058228 2005 0.062802 2006 0.057560 2007 0.049038 2008 0.047679 2009 0.044874 2010 0.034238 2011 0.034907 2012 0.028219 2013 0.017281 2014 0.014282 2015 0.008198 2016 0.025932 Name: registration_year, dtype: float64
Registrations increase steadily from 1910, peak roughly between 1996 and 2009, and drop off from then on.
autos["brand"].unique()
array(['peugeot', 'bmw', 'volkswagen', 'smart', 'ford', 'chrysler', 'seat', 'renault', 'mercedes_benz', 'audi', 'sonstige_autos', 'opel', 'mazda', 'porsche', 'mini', 'toyota', 'dacia', 'nissan', 'jeep', 'saab', 'volvo', 'mitsubishi', 'jaguar', 'fiat', 'skoda', 'subaru', 'kia', 'citroen', 'chevrolet', 'hyundai', 'honda', 'daewoo', 'suzuki', 'trabant', 'land_rover', 'alfa_romeo', 'lada', 'rover', 'daihatsu', 'lancia'], dtype=object)
I chose to aggregate only over the brands that account for at least 5% of listings: volkswagen, bmw, mercedes_benz, opel, audi, and ford.
brands = autos["brand"].value_counts(normalize=True, sort=True)
selected_brands = brands[brands>=0.05]
selected_brands
volkswagen 0.211404 bmw 0.110179 opel 0.107245 mercedes_benz 0.096652 audi 0.086771 ford 0.069835 Name: brand, dtype: float64
# Create an empty dictionary to store the aggregated data
brand_data = {}
# Iterate over the selected brands and compute the mean price for each
for brand in selected_brands.index:
    selected_rows = autos[autos["brand"] == brand]
    # Store the mean price with the brand name as key
    brand_data[brand] = selected_rows['price_usd'].mean()
# Round each mean price to 3 decimal places
res = {key: round(value, 3) for key, value in brand_data.items()}
res
{'volkswagen': 5436.95, 'bmw': 8381.677, 'opel': 3005.496, 'mercedes_benz': 8672.654, 'audi': 9380.719, 'ford': 3779.269}
Among the brands with at least 5% of listings, Audi is the most expensive, followed by Mercedes-Benz and BMW. Opel and Ford are the cheapest, and Volkswagen sits in the middle.
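The loop-and-round pattern above can be collapsed into a single `groupby`, which computes the mean price per brand in one pass. A sketch with toy data:

```python
import pandas as pd

autos_demo = pd.DataFrame({
    "brand": ["bmw", "bmw", "ford", "ford", "opel"],
    "price_usd": [8000, 9000, 3500, 4100, 3000],
})

# One mean per brand, rounded to 3 decimal places
mean_prices = autos_demo.groupby("brand")["price_usd"].mean().round(3)
print(mean_prices.to_dict())
# {'bmw': 8500.0, 'ford': 3800.0, 'opel': 3000.0}
```

Filtering to the selected brands first (e.g. with `autos[autos["brand"].isin(selected_brands.index)]`) would reproduce the dictionary built by the loop.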
autos["kilometers"].head()
0 150000 1 150000 2 70000 3 70000 4 150000 Name: kilometers, dtype: int64
brands = autos["brand"].value_counts(normalize=True, sort=True)
brand_labels = brands[brands >= 0.05].index
brand_labels
Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford'], dtype='object')
brand_mean_prices = {}
for brand in brand_labels:
price = autos[autos["brand"] == brand]['price_usd']
brand_mean_prices[brand] = int(price.mean())
brand_mean_prices
{'volkswagen': 5436, 'bmw': 8381, 'opel': 3005, 'mercedes_benz': 8672, 'audi': 9380, 'ford': 3779}
brand_mean_kilos = {}
for brand in brand_labels:
kilos = autos[autos["brand"] == brand]['kilometers']
brand_mean_kilos[brand] = int(kilos.mean())
brand_mean_kilos
{'volkswagen': 128799, 'bmw': 132695, 'opel': 129384, 'mercedes_benz': 131025, 'audi': 129245, 'ford': 124277}
bmp_series = pd.Series(brand_mean_prices)
bmk_series = pd.Series(brand_mean_kilos)
df = pd.DataFrame(bmp_series,columns=['mean_price'])
df['mean_km'] = bmk_series
df.sort_values(by=['mean_price', 'mean_km'], ascending=False)
mean_price | mean_km | |
---|---|---|
audi | 9380 | 129245 |
mercedes_benz | 8672 | 131025 |
bmw | 8381 | 132695 |
volkswagen | 5436 | 128799 |
ford | 3779 | 124277 |
opel | 3005 | 129384 |
The three most expensive brands, Audi, Mercedes-Benz and BMW, also have the highest mean mileage in this analysis, so their higher prices are not explained by lower mileage.
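Both dictionaries and the resulting dataframe can also be built in a single step with named aggregation in `groupby(...).agg(...)`. A sketch with toy numbers:

```python
import pandas as pd

autos_demo = pd.DataFrame({
    "brand": ["bmw", "bmw", "ford"],
    "price_usd": [8000, 9000, 3800],
    "kilometers": [150000, 125000, 100000],
})

# Named aggregation: each keyword becomes an output column
summary = (autos_demo.groupby("brand")
                     .agg(mean_price=("price_usd", "mean"),
                          mean_km=("kilometers", "mean"))
                     .sort_values("mean_price", ascending=False))
print(summary.loc["bmw", "mean_price"])
# 8500.0
```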
autos.columns
Index(['date_crawled', 'name', 'seller', 'offer_type', 'price_usd', 'abtest', 'vehicle_type', 'registration_year', 'gearbox', 'power_PS', 'model', 'kilometers', 'registration_month', 'fuel_type', 'brand', 'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code', 'last_seen'], dtype='object')
autos["seller"].unique()
array(['privat', 'gewerblich'], dtype=object)
seller = {"privat":"private", "gewerblich":"commercial"}
autos["seller"] = autos["seller"].map(seller)
autos["seller"].unique()
array(['private', 'commercial'], dtype=object)
autos["vehicle_type"].unique()
array(['bus', 'limousine', 'kleinwagen', 'kombi', nan, 'coupe', 'suv', 'cabrio', 'andere'], dtype=object)
vehicle = {"bus":"bus", "limousine":"limousine", "kleinwagen": "compact",
"kombi": "station wagon", "coupe": "coupe", "suv": "suv",
"cabrio": "convertible", "andere": "other"}
autos["vehicle_type"] = autos["vehicle_type"].map(vehicle)
autos["vehicle_type"].unique()
array(['bus', 'limousine', 'compact', 'station wagon', nan, 'coupe', 'suv', 'convertible', 'other'], dtype=object)
autos["unrepaired_damage"].unique()
array(['nein', nan, 'ja'], dtype=object)
damage = {"nein":"no", "ja": "yes"}
autos["unrepaired_damage"] = autos["unrepaired_damage"].map(damage)
autos["unrepaired_damage"].unique()
array(['no', nan, 'yes'], dtype=object)
autos["kilometers"].head()
0 150000 1 150000 2 70000 3 70000 4 150000 Name: kilometers, dtype: int64
gp = autos.groupby('brand')['model'].value_counts().\
sort_values(ascending=False)[0:10]
gp
brand model volkswagen golf 3684 bmw 3er 2602 volkswagen polo 1592 opel corsa 1568 volkswagen passat 1345 opel astra 1337 audi a4 1226 mercedes_benz c_klasse 1135 bmw 5er 1123 mercedes_benz e_klasse 953 Name: model, dtype: int64
VW, BMW and Opel supply the most common brand/model combinations, with the Volkswagen Golf the single most listed model.
autos.groupby('kilometers')['price_usd'].mean()
kilometers 5000 9794.102672 10000 21255.304721 20000 18523.200271 30000 16674.564069 40000 15518.952261 50000 13853.911111 60000 12406.989343 70000 10954.826014 80000 9743.110058 90000 8470.087321 100000 8188.223581 125000 6238.277181 150000 3792.320241 Name: price_usd, dtype: float64
The mean price generally decreases as the mileage increases.
pd.options.display.max_rows = 200
top_brands = df.index.to_list()
top_autos = autos[autos['brand'].isin(top_brands)]
damage_cost = top_autos.groupby(['brand', 'unrepaired_damage'])['price_usd'].mean()
damage_cost
brand unrepaired_damage audi no 10929.202288 yes 3324.684685 bmw no 9454.446587 yes 3538.860000 ford no 4666.671089 yes 1408.277228 mercedes_benz no 9815.369760 yes 3933.283626 opel no 3672.347530 yes 1396.977509 volkswagen no 6483.153377 yes 2194.585511 Name: price_usd, dtype: float64
damage_df = pd.DataFrame(damage_cost).reset_index()
damage_df
brand | unrepaired_damage | price_usd | |
---|---|---|---|
0 | audi | no | 10929.202288 |
1 | audi | yes | 3324.684685 |
2 | bmw | no | 9454.446587 |
3 | bmw | yes | 3538.860000 |
4 | ford | no | 4666.671089 |
5 | ford | yes | 1408.277228 |
6 | mercedes_benz | no | 9815.369760 |
7 | mercedes_benz | yes | 3933.283626 |
8 | opel | no | 3672.347530 |
9 | opel | yes | 1396.977509 |
10 | volkswagen | no | 6483.153377 |
11 | volkswagen | yes | 2194.585511 |
Looking only at the top brands, cars with unrepaired damage sell for roughly one-third to one-half of the price of undamaged cars of the same brand.
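That damaged-to-undamaged ratio can be computed directly by pivoting the grouped means into wide form. A sketch using two of the brand figures from the table above:

```python
import pandas as pd

damage_df = pd.DataFrame({
    "brand": ["audi", "audi", "ford", "ford"],
    "unrepaired_damage": ["no", "yes", "no", "yes"],
    "price_usd": [10929.2, 3324.7, 4666.7, 1408.3],
})

# Wide form: one row per brand, one column per damage status
wide = damage_df.pivot(index="brand", columns="unrepaired_damage",
                       values="price_usd")

# Damaged price as a fraction of the undamaged price
ratio = (wide["yes"] / wide["no"]).round(2)
print(ratio.to_dict())
# {'audi': 0.3, 'ford': 0.3}
```

Both brands here come out at about 0.3, consistent with the "one-third to one-half" observation.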