we'll work with a dataset of used cars from eBay Kleinanzeigen, a classifieds section of the German eBay website. The aim of this project is to clean the data and analyze the included used car listings. Observations: The dataset contains 20 columns, most of which are strings. Some columns have null values, but none have more than ~20% null values.
import numpy as np
import pandas as pd
autos = pd.read_csv("autos.csv", encoding="Windows-1252")
#autos.head()
#autos.info()
#autos.columns
#Let's convert the column names from camelcase to snakecase
# and reword some of the column names based on the data
# dictionary to be more descriptive
autos.columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'ab_test',
'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
'odometer_km', 'registration_month', 'fuel_type', 'brand',
'unrepaired_damage', 'ad_created', 'num_photos', 'postal_code',
'last_seen']
autos.head()
date_crawled | name | seller | offer_type | price | ab_test | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | num_photos | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
# seller columns contains the value private for 49999 rows
# offer_type contains the value Angebot for 49999 rows which means
# both columns doesn't affect the data because the value is the same
# for almost all the rows so we can just drop them
autos["price"] = autos["price"].str.replace(",", "").str.replace("$", "").astype(float)
autos["odometer_km"] = autos ["odometer_km"].str.replace(",", "").str.replace("km", "").astype(float)
# drop seller & offer_type columns
autos.drop(['seller', 'offer_type'], axis=1, inplace=True)
# Their are 1421 row with price as 0 which is about 2% of the
# data so we can remove them
# Their are some prices that are very low starting from 1 dollar
# but it's normal to find bits from 1 dollar on Ebay so we will
# just leave them
# Their are some very high prices up to millions which is unrealistic
# so we will delete prices that is higher than 350000
# For odometer_km the data looks okay
autos = autos[autos['price'].between(1, 350001)]
autos["price"].describe()
count 48224.000000 mean 5930.371433 std 9078.372762 min 100.000000 25% 1250.000000 50% 3000.000000 75% 7499.000000 max 350000.000000 Name: price, dtype: float64
autos[['date_crawled','ad_created','last_seen']][0:5]
date_crawled | ad_created | last_seen | |
---|---|---|---|
0 | 2016-03-26 17:47:46 | 2016-03-26 00:00:00 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | 2016-04-04 00:00:00 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | 2016-03-26 00:00:00 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | 2016-03-12 00:00:00 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | 2016-04-01 00:00:00 | 2016-04-01 14:38:50 |
autos['date_crawled'].str[:10].value_counts(normalize=True, dropna=False).sort_index(ascending=True)
2016-03-05 0.025361 2016-03-06 0.014039 2016-03-07 0.036061 2016-03-08 0.033179 2016-03-09 0.033013 2016-03-10 0.032287 2016-03-11 0.032598 2016-03-12 0.036911 2016-03-13 0.015677 2016-03-14 0.036662 2016-03-15 0.034319 2016-03-16 0.029467 2016-03-17 0.031499 2016-03-18 0.012898 2016-03-19 0.034734 2016-03-20 0.037803 2016-03-21 0.037201 2016-03-22 0.032888 2016-03-23 0.032287 2016-03-24 0.029446 2016-03-25 0.031499 2016-03-26 0.032308 2016-03-27 0.031126 2016-03-28 0.034962 2016-03-29 0.034112 2016-03-30 0.033738 2016-03-31 0.031851 2016-04-01 0.033697 2016-04-02 0.035605 2016-04-03 0.038611 2016-04-04 0.036538 2016-04-05 0.013064 2016-04-06 0.003173 2016-04-07 0.001389 Name: date_crawled, dtype: float64
autos["ad_created"].str[:10].value_counts(normalize=True, dropna=False).sort_index(ascending=True)
2015-06-11 0.000021 2015-08-10 0.000021 2015-09-09 0.000021 2015-11-10 0.000021 2015-12-05 0.000021 2015-12-30 0.000021 2016-01-03 0.000021 2016-01-07 0.000021 2016-01-10 0.000041 2016-01-13 0.000021 2016-01-14 0.000021 2016-01-16 0.000021 2016-01-22 0.000021 2016-01-27 0.000062 2016-01-29 0.000021 2016-02-01 0.000021 2016-02-02 0.000041 2016-02-05 0.000041 2016-02-07 0.000021 2016-02-08 0.000021 2016-02-09 0.000021 2016-02-11 0.000021 2016-02-12 0.000041 2016-02-14 0.000041 2016-02-16 0.000021 2016-02-17 0.000021 2016-02-18 0.000041 2016-02-19 0.000062 2016-02-20 0.000041 2016-02-21 0.000062 ... 2016-03-09 0.033096 2016-03-10 0.031997 2016-03-11 0.032909 2016-03-12 0.036745 2016-03-13 0.017045 2016-03-14 0.035294 2016-03-15 0.034049 2016-03-16 0.029964 2016-03-17 0.031167 2016-03-18 0.013582 2016-03-19 0.033614 2016-03-20 0.037865 2016-03-21 0.037429 2016-03-22 0.032702 2016-03-23 0.032121 2016-03-24 0.029384 2016-03-25 0.031623 2016-03-26 0.032370 2016-03-27 0.031043 2016-03-28 0.035066 2016-03-29 0.034070 2016-03-30 0.033552 2016-03-31 0.031893 2016-04-01 0.033676 2016-04-02 0.035294 2016-04-03 0.038860 2016-04-04 0.036890 2016-04-05 0.011799 2016-04-06 0.003256 2016-04-07 0.001244 Name: ad_created, Length: 76, dtype: float64
autos["last_seen"].str[:10].value_counts(normalize=True, dropna=False).sort_index(ascending=True)
2016-03-05 0.001078 2016-03-06 0.004313 2016-03-07 0.005433 2016-03-08 0.007320 2016-03-09 0.009580 2016-03-10 0.010638 2016-03-11 0.012400 2016-03-12 0.023785 2016-03-13 0.008875 2016-03-14 0.012629 2016-03-15 0.015863 2016-03-16 0.016444 2016-03-17 0.028098 2016-03-18 0.007320 2016-03-19 0.015760 2016-03-20 0.020654 2016-03-21 0.020550 2016-03-22 0.021359 2016-03-23 0.018580 2016-03-24 0.019762 2016-03-25 0.019098 2016-03-26 0.016672 2016-03-27 0.015552 2016-03-28 0.020840 2016-03-29 0.022292 2016-03-30 0.024697 2016-03-31 0.023826 2016-04-01 0.022852 2016-04-02 0.024884 2016-04-03 0.025133 2016-04-04 0.024531 2016-04-05 0.125062 2016-04-06 0.221964 2016-04-07 0.132154 Name: last_seen, dtype: float64
autos = autos[autos['registration_year'].between(1901, 2020)]
autos.describe()
price | registration_year | power_ps | odometer_km | registration_month | num_photos | postal_code | |
---|---|---|---|---|---|---|---|
count | 46352.000000 | 46352.000000 | 46352.000000 | 46352.000000 | 46352.000000 | 46352.0 | 46352.000000 |
mean | 6019.937090 | 2002.939787 | 118.382680 | 125734.488264 | 5.846889 | 0.0 | 51112.317548 |
std | 9196.683182 | 7.127146 | 185.411088 | 39608.848152 | 3.661282 | 0.0 | 25745.539291 |
min | 100.000000 | 1910.000000 | 0.000000 | 5000.000000 | 0.000000 | 0.0 | 1067.000000 |
25% | 1250.000000 | 1999.000000 | 75.000000 | 100000.000000 | 3.000000 | 0.0 | 30853.000000 |
50% | 3150.000000 | 2003.000000 | 109.000000 | 150000.000000 | 6.000000 | 0.0 | 49828.000000 |
75% | 7563.750000 | 2008.000000 | 150.000000 | 150000.000000 | 9.000000 | 0.0 | 71739.000000 |
max | 350000.000000 | 2016.000000 | 17700.000000 | 150000.000000 | 12.000000 | 0.0 | 99998.000000 |
autos["brand"].unique().shape
(40,)
brands = autos['brand'].value_counts(normalize=True).sort_values(ascending=False)[0:20].index
brands
Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford', 'renault', 'peugeot', 'fiat', 'seat', 'skoda', 'nissan', 'mazda', 'smart', 'citroen', 'toyota', 'hyundai', 'sonstige_autos', 'volvo', 'mini'], dtype='object')
price_by_brand = {}
for brand in brands:
price_mean = autos.loc[autos['brand'] == brand, 'price'].mean()
price_by_brand[brand] = price_mean
price_by_brand
# Audi, BMW and Mercedes Benz are more expensive
# Ford and Opel are less expensive
# Volkswagen is in between
{'audi': 9380.718547986076, 'bmw': 8381.677305658899, 'citroen': 3796.26267281106, 'fiat': 2836.8736310025274, 'ford': 3779.269076305221, 'hyundai': 5411.075431034483, 'mazda': 4129.774787535411, 'mercedes_benz': 8672.654241071428, 'mini': 10639.450980392157, 'nissan': 4756.659634317863, 'opel': 3005.4960772480385, 'peugeot': 3113.860549132948, 'renault': 2496.070577451879, 'seat': 4433.419621749409, 'skoda': 6409.609724047306, 'smart': 3596.40273556231, 'sonstige_autos': 12784.556561085972, 'toyota': 5167.091062394604, 'volkswagen': 5436.950096948668, 'volvo': 4993.208037825059}
bmp_series = pd.Series(price_by_brand)
print(bmp_series)
audi 9380.718548 bmw 8381.677306 citroen 3796.262673 fiat 2836.873631 ford 3779.269076 hyundai 5411.075431 mazda 4129.774788 mercedes_benz 8672.654241 mini 10639.450980 nissan 4756.659634 opel 3005.496077 peugeot 3113.860549 renault 2496.070577 seat 4433.419622 skoda 6409.609724 smart 3596.402736 sonstige_autos 12784.556561 toyota 5167.091062 volkswagen 5436.950097 volvo 4993.208038 dtype: float64
df = pd.DataFrame(bmp_series, columns=['mean_price'])
df
mean_price | |
---|---|
audi | 9380.718548 |
bmw | 8381.677306 |
citroen | 3796.262673 |
fiat | 2836.873631 |
ford | 3779.269076 |
hyundai | 5411.075431 |
mazda | 4129.774788 |
mercedes_benz | 8672.654241 |
mini | 10639.450980 |
nissan | 4756.659634 |
opel | 3005.496077 |
peugeot | 3113.860549 |
renault | 2496.070577 |
seat | 4433.419622 |
skoda | 6409.609724 |
smart | 3596.402736 |
sonstige_autos | 12784.556561 |
toyota | 5167.091062 |
volkswagen | 5436.950097 |
volvo | 4993.208038 |
mileage_by_brand = {}
for brand in brands:
mileage_mean = autos.loc[autos['brand'] == brand, 'odometer_km'].mean()
mileage_by_brand[brand] = mileage_mean
mileage_by_brand
{'audi': 129245.40029835903, 'bmw': 132695.32014881534, 'citroen': 119554.53149001536, 'fiat': 116950.29486099411, 'ford': 124277.10843373495, 'hyundai': 106885.77586206897, 'mazda': 124553.82436260623, 'mercedes_benz': 131025.66964285714, 'mini': 88308.82352941176, 'nissan': 118326.30098452883, 'opel': 129384.42969221485, 'peugeot': 127127.8901734104, 'renault': 128281.3932172319, 'seat': 121536.64302600473, 'skoda': 110906.70170827858, 'smart': 99734.0425531915, 'sonstige_autos': 90395.92760180995, 'toyota': 115944.35075885328, 'volkswagen': 128799.87753852434, 'volvo': 138581.56028368796}
mbb_series = pd.Series(mileage_by_brand)
d = {'mean_price': bmp_series, 'mean_mileage': mbb_series}
dff = pd.DataFrame(data=d)
dff
mean_mileage | mean_price | |
---|---|---|
audi | 129245.400298 | 9380.718548 |
bmw | 132695.320149 | 8381.677306 |
citroen | 119554.531490 | 3796.262673 |
fiat | 116950.294861 | 2836.873631 |
ford | 124277.108434 | 3779.269076 |
hyundai | 106885.775862 | 5411.075431 |
mazda | 124553.824363 | 4129.774788 |
mercedes_benz | 131025.669643 | 8672.654241 |
mini | 88308.823529 | 10639.450980 |
nissan | 118326.300985 | 4756.659634 |
opel | 129384.429692 | 3005.496077 |
peugeot | 127127.890173 | 3113.860549 |
renault | 128281.393217 | 2496.070577 |
seat | 121536.643026 | 4433.419622 |
skoda | 110906.701708 | 6409.609724 |
smart | 99734.042553 | 3596.402736 |
sonstige_autos | 90395.927602 | 12784.556561 |
toyota | 115944.350759 | 5167.091062 |
volkswagen | 128799.877539 | 5436.950097 |
volvo | 138581.560284 | 4993.208038 |