In this guided project, we'll work with a dataset of used cars from eBay Kleinanzeigen, a classifieds section of the German eBay website. The aim of this project is to clean the data and analyze the included used car listings.
import pandas as pd
import numpy as np
autos = pd.read_csv("autos.csv", encoding= 'Latin-1')
autos.columns
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model', 'odometer', 'monthOfRegistration', 'fuelType', 'brand', 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object')
autos.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 dateCrawled 50000 non-null object 1 name 50000 non-null object 2 seller 50000 non-null object 3 offerType 50000 non-null object 4 price 50000 non-null object 5 abtest 50000 non-null object 6 vehicleType 44905 non-null object 7 yearOfRegistration 50000 non-null int64 8 gearbox 47320 non-null object 9 powerPS 50000 non-null int64 10 model 47242 non-null object 11 odometer 50000 non-null object 12 monthOfRegistration 50000 non-null int64 13 fuelType 45518 non-null object 14 brand 50000 non-null object 15 notRepairedDamage 40171 non-null object 16 dateCreated 50000 non-null object 17 nrOfPictures 50000 non-null int64 18 postalCode 50000 non-null int64 19 lastSeen 50000 non-null object dtypes: int64(5), object(15) memory usage: 7.6+ MB
autos.head()
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
Observations: from the column names we can see that they follow the camelCase naming convention; I'll convert them to snake_case. Some columns that should be numeric, such as price and odometer, are stored as text.
Cleaning Column Names
autos.columns = autos.columns.map({'dateCrawled':'date_crawled', 'name':'name', 'seller':'seller', 'offerType':'offer_type', 'price':'price', 'abtest':'abtest',
                                   'vehicleType':'vehicle_type', 'yearOfRegistration':'year_of_registration', 'gearbox':'gearbox', 'powerPS':'power_ps', 'model':'model',
                                   'odometer':'odometer', 'monthOfRegistration':'month_of_registration', 'fuelType':'fuel_type', 'brand':'brand',
                                   'notRepairedDamage':'not_repaired_damage', 'dateCreated':'date_created', 'nrOfPictures':'nr_of_pictures', 'postalCode':'postal_code',
                                   'lastSeen':'last_seen'})
autos.columns
Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest', 'vehicle_type', 'year_of_registration', 'gearbox', 'power_ps', 'model', 'odometer', 'month_of_registration', 'fuel_type', 'brand', 'not_repaired_damage', 'date_created', 'nr_of_pictures', 'postal_code', 'last_seen'], dtype='object')
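Spelling out all twenty mappings by hand is error-prone. As a sketch (assuming only simple camelCase names like the ones in this dataset), a small regex performs the same conversion:

```python
import re

def to_snake(name):
    # Put an underscore between a lowercase letter/digit and the capital
    # letter that follows it, then lowercase everything.
    return re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", name).lower()

# Usage: autos.columns = [to_snake(c) for c in autos.columns]
```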
autos.head()
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | year_of_registration | gearbox | power_ps | model | odometer | month_of_registration | fuel_type | brand | not_repaired_damage | date_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
I changed the column names to the snake_case convention; one study has found that readers can recognize snake case identifiers more quickly than camel case ones.
Initial Exploration and Cleaning
autos.describe(include ='all')
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | year_of_registration | gearbox | power_ps | model | odometer | month_of_registration | fuel_type | brand | not_repaired_damage | date_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-04-02 15:49:30 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
autos["price"] = autos["price"].str.replace("$", "", regex=False).str.replace(",", "", regex=False)
autos["odometer"] = autos["odometer"].str.replace("km", "", regex=False).str.replace(",", "", regex=False)
autos.rename({"date_created":"ad_created","month_of_registration":"registration_month","year_of_registration":"registration_year","not_repaired_damage":"unrepaired_damage",},axis = 1, inplace = True)
autos["price"].head()
0 5000 1 8500 2 8990 3 4350 4 1350 Name: price, dtype: object
autos["odometer"].head()
0 150000 1 150000 2 70000 3 70000 4 150000 Name: odometer, dtype: object
autos.rename({"odometer":"odometer_km"}, axis = 1, inplace = True)
#To rename the column to odometer_km
autos["price"] = autos["price"].astype(int)
autos["odometer_km"] = autos["odometer_km"].astype(int)
# Convert both columns from strings to integer values
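The two replace chains above can also be collapsed into a single regex that strips every non-digit character; a minimal sketch on toy data shaped like the raw columns:

```python
import pandas as pd

# Toy strings shaped like the raw price/odometer columns.
raw = pd.DataFrame({
    "price": ["$5,000", "$8,500", "$1,350"],
    "odometer": ["150,000km", "70,000km", "150,000km"],
})

# Strip everything that isn't a digit, then convert to integers.
for col in ["price", "odometer"]:
    raw[col] = raw[col].str.replace(r"\D", "", regex=True).astype(int)
```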
Exploring the Odometer and Price Columns
autos["odometer_km"].unique().shape
(13,)
autos["odometer_km"].describe()
count 50000.000000 mean 125732.700000 std 40042.211706 min 5000.000000 25% 125000.000000 50% 150000.000000 75% 150000.000000 max 150000.000000 Name: odometer_km, dtype: float64
autos["odometer_km"].value_counts().sort_index(ascending = False)
150000 32424 125000 5170 100000 2169 90000 1757 80000 1436 70000 1230 60000 1164 50000 1027 40000 819 30000 789 20000 784 10000 264 5000 967 Name: odometer_km, dtype: int64
From this column, we see that all the mileage values are rounded and that the majority of the cars sold on the eBay site have high mileage. There's also no reason to look for outliers here, since sellers pick the mileage from a fixed set of preset values.
autos["price"].unique().shape
(2357,)
autos["price"].describe()
count 5.000000e+04 mean 9.840044e+03 std 4.811044e+05 min 0.000000e+00 25% 1.100000e+03 50% 2.950000e+03 75% 7.200000e+03 max 1.000000e+08 Name: price, dtype: float64
autos["price"].value_counts().sort_index(ascending = True).head(10)
0 1421 1 156 2 3 3 1 5 2 8 1 9 1 10 7 11 2 12 3 Name: price, dtype: int64
autos["price"].value_counts().sort_index(ascending = True).tail(10)
999990 1 999999 2 1234566 1 1300000 1 3890000 1 10000000 1 11111111 2 12345678 3 27322222 1 99999999 1 Name: price, dtype: int64
autos = autos[autos["price"].between(0,350000)]
From the price column, we discovered listings priced as high as 99,999,999, which doesn't sound realistic for a bidding site. We kept prices between $0 and $350,000, treating everything above $350,000 as an outlier.
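Note that `Series.between` is inclusive at both endpoints by default, which is why the $0 listings survive the filter above; a toy check (the `inclusive` keyword that controls this is available in pandas 1.3+):

```python
import pandas as pd

prices = pd.Series([0, 1, 350000, 350001, 99999999])

# Both endpoints are kept by default.
kept = prices[prices.between(0, 350000)]

# Dropping the $0 listings as well would need between(1, 350000),
# or inclusive="right" on newer pandas versions.
```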
Exploring the Date Columns
autos["date_crawled"].str[:10].value_counts(normalize = True, dropna = False).sort_index()
2016-03-05 0.025387 2016-03-06 0.013944 2016-03-07 0.035970 2016-03-08 0.033269 2016-03-09 0.033209 2016-03-10 0.032129 2016-03-11 0.032489 2016-03-12 0.036770 2016-03-13 0.015564 2016-03-14 0.036630 2016-03-15 0.033990 2016-03-16 0.029508 2016-03-17 0.031509 2016-03-18 0.013064 2016-03-19 0.034910 2016-03-20 0.037831 2016-03-21 0.037490 2016-03-22 0.032909 2016-03-23 0.032389 2016-03-24 0.029108 2016-03-25 0.031749 2016-03-26 0.032489 2016-03-27 0.031049 2016-03-28 0.034850 2016-03-29 0.034150 2016-03-30 0.033629 2016-03-31 0.031909 2016-04-01 0.033809 2016-04-02 0.035410 2016-04-03 0.038691 2016-04-04 0.036490 2016-04-05 0.013104 2016-04-06 0.003181 2016-04-07 0.001420 Name: date_crawled, dtype: float64
autos["ad_created"].str[:10].value_counts(normalize = True, dropna = False).sort_index()
2015-06-11 0.000020 2015-08-10 0.000020 2015-09-09 0.000020 2015-11-10 0.000020 2015-12-05 0.000020 ... 2016-04-03 0.038931 2016-04-04 0.036850 2016-04-05 0.011843 2016-04-06 0.003261 2016-04-07 0.001280 Name: ad_created, Length: 76, dtype: float64
autos["last_seen"].str[:10].value_counts(normalize = True, dropna = False).sort_index()
2016-03-05 0.001080 2016-03-06 0.004421 2016-03-07 0.005362 2016-03-08 0.007582 2016-03-09 0.009843 2016-03-10 0.010763 2016-03-11 0.012524 2016-03-12 0.023807 2016-03-13 0.008983 2016-03-14 0.012804 2016-03-15 0.015884 2016-03-16 0.016445 2016-03-17 0.027928 2016-03-18 0.007422 2016-03-19 0.015744 2016-03-20 0.020706 2016-03-21 0.020726 2016-03-22 0.021586 2016-03-23 0.018585 2016-03-24 0.019565 2016-03-25 0.019205 2016-03-26 0.016965 2016-03-27 0.016024 2016-03-28 0.020846 2016-03-29 0.022326 2016-03-30 0.024847 2016-03-31 0.023827 2016-04-01 0.023106 2016-04-02 0.024887 2016-04-03 0.025367 2016-04-04 0.024627 2016-04-05 0.124275 2016-04-06 0.220982 2016-04-07 0.130957 Name: last_seen, dtype: float64
I observed that the share of last_seen dates spikes over the final three days of the crawl (2016-04-05 to 2016-04-07, peaking on 2016-04-06); this more likely reflects the crawl ending than a burst of sales.
autos["registration_year"].describe()
count 49986.000000 mean 2005.075721 std 105.727161 min 1000.000000 25% 1999.000000 50% 2003.000000 75% 2008.000000 max 9999.000000 Name: registration_year, dtype: float64
It's observed that the minimum and maximum registration years are 1000 and 9999, which is clearly impossible.
autos["registration_year"].value_counts().sort_index().head(15)
1000 1 1001 1 1111 1 1500 1 1800 2 1910 9 1927 1 1929 1 1931 1 1934 2 1937 4 1938 1 1939 1 1941 2 1943 1 Name: registration_year, dtype: int64
I explored the frequencies of the registration_year column in ascending order and decided to keep only entries between 1934 and 2016: years before that are implausibly early, and a car can't be registered after the 2016 crawl date.
autos = autos[autos["registration_year"].between(1934, 2016)]
autos["registration_year"].value_counts(normalize = True).head(10)
2000 0.069869 2005 0.062807 1999 0.062453 2004 0.057016 2003 0.056808 2006 0.056391 2001 0.056287 2002 0.052766 1998 0.051100 2007 0.047996 Name: registration_year, dtype: float64
autos["registration_year"].value_counts(normalize = True).head(10).sum()
0.5734938755103741
I observed that the ten most frequent registration years all fall between 1998 and 2007, and together they account for about 57.35% of the vehicles.
Exploring Cars by Brand
autos["brand"].unique()
array(['peugeot', 'bmw', 'volkswagen', 'smart', 'ford', 'chrysler', 'seat', 'renault', 'mercedes_benz', 'audi', 'sonstige_autos', 'opel', 'mazda', 'porsche', 'mini', 'toyota', 'dacia', 'nissan', 'jeep', 'saab', 'volvo', 'mitsubishi', 'jaguar', 'fiat', 'skoda', 'subaru', 'kia', 'citroen', 'chevrolet', 'hyundai', 'honda', 'daewoo', 'suzuki', 'trabant', 'land_rover', 'alfa_romeo', 'lada', 'rover', 'daihatsu', 'lancia'], dtype=object)
Percentage of each brand
autos["brand"].value_counts(normalize = True)
volkswagen 0.212170 bmw 0.110032 opel 0.108158 mercedes_benz 0.095388 audi 0.086430 ford 0.069744 renault 0.047350 peugeot 0.029539 fiat 0.025873 seat 0.018186 skoda 0.016040 mazda 0.015145 nissan 0.015103 citroen 0.013916 smart 0.013916 toyota 0.012478 sonstige_autos 0.010791 hyundai 0.009853 volvo 0.009249 mini 0.008645 mitsubishi 0.008145 honda 0.007854 kia 0.007104 alfa_romeo 0.006624 porsche 0.006104 suzuki 0.005916 chevrolet 0.005708 chrysler 0.003666 daihatsu 0.002562 dacia 0.002562 jeep 0.002250 subaru 0.002187 land_rover 0.002041 saab 0.001604 jaguar 0.001583 trabant 0.001542 daewoo 0.001500 rover 0.001354 lancia 0.001083 lada 0.000604 Name: brand, dtype: float64
Most common brands
top_brand = autos["brand"].value_counts().iloc[0:19].index  # note: 0:19 selects the top 19 brands
top_brand
Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford', 'renault', 'peugeot', 'fiat', 'seat', 'skoda', 'mazda', 'nissan', 'citroen', 'smart', 'toyota', 'sonstige_autos', 'hyundai', 'volvo'], dtype='object')
Average price for the most common brands
top_brand = autos["brand"].value_counts().iloc[0:19].index
top_brand_by_mean_price = {}
for each_brand in top_brand:
    top_brand_row = autos[autos["brand"] == each_brand]["price"].mean()
    top_brand_by_mean_price[each_brand] = top_brand_row
top_brand_by_mean_price
{'volkswagen': 5231.081983308787, 'bmw': 8101.893032942067, 'opel': 2877.7224576271187, 'mercedes_benz': 8485.239571958942, 'audi': 9093.65003615329, 'ford': 3641.134707287933, 'renault': 2396.2067751869777, 'peugeot': 3039.4682651622, 'fiat': 2711.8011272141707, 'seat': 4296.492554410081, 'skoda': 6334.91948051948, 'mazda': 4010.7716643741405, 'nissan': 4664.891034482758, 'citroen': 3699.935628742515, 'smart': 3542.706586826347, 'toyota': 5115.33388981636, 'sonstige_autos': 10909.372586872587, 'hyundai': 5308.53911205074, 'volvo': 4757.108108108108}
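The same per-brand average can be computed without an explicit loop using `groupby`; a minimal sketch on a toy frame (column names match the project's):

```python
import pandas as pd

toy = pd.DataFrame({
    "brand": ["bmw", "bmw", "ford", "ford", "audi"],
    "price": [8000, 9000, 3000, 4000, 9500],
})

# One aggregate per brand, with no dictionary bookkeeping.
mean_price_per_brand = toy.groupby("brand")["price"].mean()
```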
Brands that individually make up more than 5% of all listings
brands_above_5 = autos["brand"].value_counts(normalize = True)
brands_above_5 = brands_above_5[brands_above_5 > 0.05].index
brands_above_5_mean = {}
for each_brand in brands_above_5:
    brands_above_5_row = autos[autos["brand"] == each_brand]["price"].mean().round()
    brands_above_5_mean[each_brand] = brands_above_5_row
brands_above_5_mean
{'volkswagen': 5231.0, 'bmw': 8102.0, 'opel': 2878.0, 'mercedes_benz': 8485.0, 'audi': 9094.0, 'ford': 3641.0}
The maximum price for each brand
max_price_per_brand = {}
for each_brand in autos["brand"].unique():
    # .max() replaces sorting the whole slice just to read its top row
    max_price_per_brand[each_brand] = autos[autos["brand"] == each_brand]["price"].max()
max_price_per_brand
{'peugeot': 25500, 'bmw': 259000, 'volkswagen': 64500, 'smart': 16850, 'ford': 130000, 'chrysler': 20000, 'seat': 33500, 'renault': 93000, 'mercedes_benz': 180000, 'audi': 175000, 'sonstige_autos': 345000, 'opel': 38990, 'mazda': 38900, 'porsche': 350000, 'mini': 34500, 'toyota': 49500, 'dacia': 16350, 'nissan': 34999, 'jeep': 55800, 'saab': 18900, 'volvo': 29500, 'mitsubishi': 37800, 'jaguar': 58900, 'fiat': 25000, 'skoda': 32700, 'subaru': 31800, 'kia': 28450, 'citroen': 33000, 'chevrolet': 56500, 'hyundai': 29999, 'honda': 41900, 'daewoo': 2900, 'suzuki': 23200, 'trabant': 6480, 'land_rover': 99000, 'alfa_romeo': 55000, 'lada': 6500, 'rover': 23500, 'daihatsu': 9900, 'lancia': 29500}
Below we convert the brands_above_5_mean dictionary to a series, then create a new dataframe named df and assign the series as the "mean_price" column.
bmp_series = pd.Series(brands_above_5_mean)
print(bmp_series)
volkswagen 5231.0 bmw 8102.0 opel 2878.0 mercedes_benz 8485.0 audi 9094.0 ford 3641.0 dtype: float64
df= pd.DataFrame(bmp_series, columns = ["mean_price"])
df
mean_price | |
---|---|
volkswagen | 5231.0 |
bmw | 8102.0 |
opel | 2878.0 |
mercedes_benz | 8485.0 |
audi | 9094.0 |
ford | 3641.0 |
Let's find the average mileage for the brands above that individually account for over 5% of the cars in the dataset.
top_brand_mileage = {}
unique_brand = autos["brand"].value_counts(normalize = True)
unique_brand = unique_brand[unique_brand > 0.05].index
for each_brand in unique_brand:
    unique_brand_row = autos[autos["brand"] == each_brand]["odometer_km"].mean()
    top_brand_mileage[each_brand] = unique_brand_row
top_brand_mileage
{'volkswagen': 128724.10407461954, 'bmw': 132455.5092767891, 'opel': 129224.7688751926, 'mercedes_benz': 130856.0821139987, 'audi': 129287.78018799711, 'ford': 124138.29151732377}
top_brand_mileage_series = pd.Series(top_brand_mileage).round()
top_brand_mileage_series
volkswagen 128724.0 bmw 132456.0 opel 129225.0 mercedes_benz 130856.0 audi 129288.0 ford 124138.0 dtype: float64
Then we add it to the new dataframe df
df["mileage"] = top_brand_mileage_series
df
mean_price | mileage | |
---|---|---|
volkswagen | 5231.0 | 128724.0 |
bmw | 8102.0 | 132456.0 |
opel | 2878.0 | 129225.0 |
mercedes_benz | 8485.0 | 130856.0 |
audi | 9094.0 | 129288.0 |
ford | 3641.0 | 124138.0 |
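Instead of assembling `df` column by column from two dictionaries, a single `groupby(...).agg(...)` call can produce both aggregates at once; a sketch on toy data (named aggregation requires pandas 0.25+):

```python
import pandas as pd

toy = pd.DataFrame({
    "brand": ["bmw", "bmw", "ford"],
    "price": [8000, 9000, 3000],
    "odometer_km": [150000, 125000, 100000],
})

# mean_price and mileage in one DataFrame, indexed by brand.
summary = toy.groupby("brand").agg(
    mean_price=("price", "mean"),
    mileage=("odometer_km", "mean"),
)
```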
Identify categorical columns that use German words, translate them, and map the values to their English counterparts
autos[["vehicle_type","gearbox","fuel_type","unrepaired_damage"]]
vehicle_type | gearbox | fuel_type | unrepaired_damage | |
---|---|---|---|---|
0 | bus | manuell | lpg | nein |
1 | limousine | automatik | benzin | nein |
2 | limousine | manuell | benzin | nein |
3 | kleinwagen | automatik | benzin | nein |
4 | kombi | manuell | benzin | nein |
... | ... | ... | ... | ... |
49995 | limousine | automatik | diesel | nein |
49996 | cabrio | manuell | benzin | nein |
49997 | cabrio | automatik | benzin | nein |
49998 | kombi | manuell | diesel | nein |
49999 | limousine | manuell | benzin | nein |
48004 rows × 4 columns
autos["vehicle_type"].unique()
array(['bus', 'limousine', 'kleinwagen', 'kombi', nan, 'coupe', 'suv', 'cabrio', 'andere'], dtype=object)
Convert German words in the vehicle_type column to English
# Values absent from the dict (including NaN) are mapped to NaN, so no "nan" entry is needed.
autos["vehicle_type"] = autos["vehicle_type"].map({"bus":"bus","limousine":"limousine","coupe":"coupe","suv":"suv","kleinwagen":"mini car","kombi":"estate car","cabrio":"convertible","andere":"other"})
autos["vehicle_type"].unique()
array(['bus', 'limousine', 'mini car', 'estate car', nan, 'coupe', 'suv', 'convertible', 'other'], dtype=object)
Convert German words in the gearbox column to English
autos["gearbox"].unique()
array(['manuell', 'automatik', nan], dtype=object)
autos["gearbox"] = autos["gearbox"].map({"manuell":"manual","automatik":"automatic"})
autos["gearbox"].unique()
array(['manual', 'automatic', nan], dtype=object)
Convert German words in the fuel_type column to English
autos["fuel_type"].unique()
array(['lpg', 'benzin', 'diesel', nan, 'cng', 'hybrid', 'elektro', 'andere'], dtype=object)
autos["fuel_type"] = autos["fuel_type"].map({"lpg":"lpg","benzin":"petrol","diesel":"diesel","cng":"cng","hybrid":"hybrid","elektro":"electric","andere":"other"})
autos["fuel_type"].unique()
array(['lpg', 'petrol', 'diesel', nan, 'cng', 'hybrid', 'electric', 'other'], dtype=object)
Convert German words in the unrepaired_damage column to English
autos["unrepaired_damage"].unique()
array(['nein', nan, 'ja'], dtype=object)
autos["unrepaired_damage"] = autos["unrepaired_damage"].map({"nein":"none","ja":"yes"})
autos["unrepaired_damage"].unique()
array(['none', nan, 'yes'], dtype=object)
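A note on the translation cells above: `Series.map` sends every value missing from the dictionary to NaN, so NaN values survive only by accident. `Series.replace` leaves unmapped values untouched, which is often safer for partial translations; a toy comparison:

```python
import pandas as pd

s = pd.Series(["nein", "ja", None])

mapped = s.map({"nein": "none"})        # "ja" is not in the dict, so it becomes NaN
replaced = s.replace({"nein": "none"})  # "ja" passes through unchanged
```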
autos["ad_created"] = autos["ad_created"].str.replace("-","").str.split().str[0]  # drop the dashes, keep only the date part
Convert the dates to be uniform numeric data, so "2016-03-21" becomes the integer 20160321.
autos["ad_created"] = autos["ad_created"].astype(int)
autos["ad_created"]
0 20160326 1 20160404 2 20160326 3 20160312 4 20160401 ... 49995 20160327 49996 20160328 49997 20160402 49998 20160308 49999 20160313 Name: ad_created, Length: 48004, dtype: int64
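An alternative sketch that validates the dates while converting: parse with `pd.to_datetime`, then format back to `YYYYMMDD` integers. Malformed dates would raise an error here instead of slipping through the string surgery.

```python
import pandas as pd

ads = pd.Series(["2016-03-21 00:00:00", "2016-04-07 00:00:00"])

as_int = pd.to_datetime(ads).dt.strftime("%Y%m%d").astype(int)
```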
1. Find the most common brand/model combinations
most_brand_model = {}
most_brand = autos["brand"].value_counts(ascending = False).index
for each_brand in most_brand:
    # Note: iloc[0] returns the model of the *first* listing for each brand,
    # not the most frequent model - hence the NaN entries in the result.
    most_brand_row = autos[autos["brand"] == each_brand]["model"].iloc[0]
    most_brand_model[each_brand] = most_brand_row
most_brand_model
{'volkswagen': 'golf', 'bmw': '7er', 'opel': 'vectra', 'mercedes_benz': nan, 'audi': 'a3', 'ford': 'focus', 'renault': 'megane', 'peugeot': 'andere', 'fiat': 'punto', 'seat': 'arosa', 'skoda': 'octavia', 'mazda': 'andere', 'nissan': 'primera', 'citroen': 'andere', 'smart': 'fortwo', 'toyota': 'andere', 'sonstige_autos': nan, 'hyundai': 'i_reihe', 'volvo': 'andere', 'mini': 'cooper', 'mitsubishi': 'colt', 'honda': 'civic', 'kia': 'andere', 'alfa_romeo': '156', 'porsche': '911', 'suzuki': 'grand', 'chevrolet': 'andere', 'chrysler': 'voyager', 'daihatsu': 'terios', 'dacia': 'sandero', 'jeep': 'wrangler', 'subaru': 'andere', 'land_rover': 'freelander', 'saab': 'andere', 'jaguar': 'andere', 'trabant': nan, 'daewoo': 'kalos', 'rover': 'andere', 'lancia': 'lybra', 'lada': nan}
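A mode-based variant returns the genuinely most frequent model per brand rather than the first listed one; a sketch on toy data:

```python
import pandas as pd

toy = pd.DataFrame({
    "brand": ["vw", "vw", "vw", "bmw", "bmw"],
    "model": ["golf", "golf", "passat", "3er", "3er"],
})

# value_counts().idxmax() picks each brand's most frequent model.
most_common_model = toy.groupby("brand")["model"].agg(
    lambda models: models.value_counts().idxmax()
)
```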
From the analysis, we can see that volkswagen is the most popular brand and golf is the most popular model.
2. Split odometer_km into groups, and use aggregation to see whether average prices follow any pattern based on mileage.
unique_km_per_mean_price = {}
unique_km = autos["odometer_km"].value_counts(ascending = False).sort_index(ascending = False).index
for km in unique_km:
    unique_km_row = autos[autos["odometer_km"] == km]["price"].mean().round()
    unique_km_per_mean_price[km] = unique_km_row
unique_km_per_mean_price
{150000: 3654.0, 125000: 6086.0, 100000: 7940.0, 90000: 8350.0, 80000: 9576.0, 70000: 10818.0, 60000: 12287.0, 50000: 13634.0, 40000: 15441.0, 30000: 16414.0, 20000: 17941.0, 10000: 19860.0, 5000: 7699.0}
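Since odometer_km takes only 13 preset values here, the per-value loop works, but `pd.cut` generalizes the "split into groups" step to real bins; a sketch with hypothetical band edges:

```python
import pandas as pd

toy = pd.DataFrame({
    "odometer_km": [5000, 40000, 90000, 150000],
    "price": [7000, 15000, 8000, 3600],
})

# Bin mileage into three hypothetical bands, then average price per band.
bands = pd.cut(toy["odometer_km"], bins=[0, 50000, 100000, 150000],
               labels=["low", "mid", "high"])
price_by_band = toy.groupby(bands)["price"].mean()
```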
km_5000 = autos[autos["odometer_km"] == 5000]["price"]
km_5000
52 3500 71 0 76 31999 102 320 106 150 ... 49722 200 49844 22222 49845 1 49865 5200 49997 13200 Name: price, Length: 901, dtype: int64
I analyzed the average price for each unique mileage value and observed that they are inversely correlated, i.e. the higher the mileage, the lower the average price, except for the lowest bracket, 5,000 km. I probed further and discovered that many of its prices were 0, which drags its average below expectation.
3. How much cheaper are cars with damage than their non-damaged counterparts?
autos["unrepaired_damage"].unique()
array(['none', nan, 'yes'], dtype=object)
difference = (autos[autos["unrepaired_damage"]== "none"]["price"].mean()) - (autos[autos["unrepaired_damage"] == "yes"]["price"].mean())
difference.round(2)
4951.75
On average, damaged cars are $4,951.75 cheaper than their undamaged counterparts.