In this project we'll take a look at a dataset of used cars from eBay Kleinanzeigen, a classifieds section of the German eBay website. The dataset contains many details about each listed car.
A data dictionary was provided with the data.
The original dataset is available here; we work with a sample of 50,000 data points from it.
Our aim is to clean the data and analyze the included used car listings, in particular to find the most popular brands and models.
Let's open the file first.
import pandas as pd
#Open the file, setting encoding to 'Latin-1' to avoid a decoding error
autos = pd.read_csv('autos.csv', encoding='Latin-1')
#Print the first 5 rows
autos.head()
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
#Exploring our dataset's columns
autos.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): dateCrawled 50000 non-null object name 50000 non-null object seller 50000 non-null object offerType 50000 non-null object price 50000 non-null object abtest 50000 non-null object vehicleType 44905 non-null object yearOfRegistration 50000 non-null int64 gearbox 47320 non-null object powerPS 50000 non-null int64 model 47242 non-null object odometer 50000 non-null object monthOfRegistration 50000 non-null int64 fuelType 45518 non-null object brand 50000 non-null object notRepairedDamage 40171 non-null object dateCreated 50000 non-null object nrOfPictures 50000 non-null int64 postalCode 50000 non-null int64 lastSeen 50000 non-null object dtypes: int64(5), object(15) memory usage: 7.6+ MB
We can see that our dataset contains 20 columns, most of which hold string (object) values; only 5 columns hold integers.
First, we'll rename the columns. They are written in camelCase, so we'll convert them to snake_case, which is more common in Python and easier to work with. Since there aren't many columns, the quickest way is to do it manually.
#column names in camelCase, before renaming
autos.columns
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model', 'odometer', 'monthOfRegistration', 'fuelType', 'brand', 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object')
#column names in snake_case, after renaming
autos.columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'ab_test',
'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
'odometer', 'registration_month', 'fuel_type', 'brand',
'unrepaired_damage', 'ad_created', 'num_photos', 'postal_code',
'last_seen']
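For reference, the camelCase-to-snake_case conversion could also be automated with a small regex helper (a sketch; note that the manual list above additionally renames some columns semantically, e.g. yearOfRegistration to registration_year, which a mechanical converter cannot do):

```python
import re

def camel_to_snake(name):
    # Insert an underscore before every uppercase letter that is not at the
    # start of the string, then lowercase the whole name.
    return re.sub(r'(?<!^)(?=[A-Z])', '_', name).lower()
```

Acronyms need extra care with this approach: 'powerPS' would become 'power_p_s' rather than 'power_ps'.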
autos.head()
date_crawled | name | seller | offer_type | price | ab_test | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | num_photos | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
autos.describe(include ='all')
date_crawled | name | seller | offer_type | price | ab_test | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | num_photos | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-03-27 22:55:05 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
The summary above already reveals a few issues: the seller, offer_type and num_photos columns contain (almost) exclusively one value, price and odometer are stored as text, and registration_year has implausible minimum and maximum values.
Let's check the num_photos column first.
autos['num_photos'].value_counts()
0 50000 Name: num_photos, dtype: int64
Every row contains 0, so we'll drop this column, along with the seller and offer_type columns, since each of those has only two values, one of which appears in all rows but one.
#delete columns
autos = autos.drop(["num_photos", "seller", "offer_type"], axis=1)
Now let's convert the price column from text to numeric values. To do that we have to remove the '$' and ',' characters:
#converting values in the price column from text to numeric
autos["price"] = (autos['price'].str.replace('$','')
.str.replace(',','')
.str.replace(' ', '')
.astype(int)
)
#changing column name
autos.rename({'price' : 'price_dolar'}, axis=1, inplace=True)
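A slightly more defensive variant (a sketch, not what we ran above) does the stripping in one regex pass and uses pd.to_numeric with errors='coerce', so unexpected strings become NaN instead of raising:

```python
import pandas as pd

# Strip '$', ',' and whitespace with a single regex, then coerce to numbers.
# errors='coerce' turns anything unparsable into NaN, which is safer on
# messy scraped data than .astype(int).
raw_prices = pd.Series(['$5,000', '$8,500', ' $1,350 ', 'Preis auf Anfrage'])
clean_prices = pd.to_numeric(raw_prices.str.replace(r'[$,\s]', '', regex=True),
                             errors='coerce')
```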
autos['price_dolar'].unique().shape
(2357,)
autos['price_dolar'].describe()
count 5.000000e+04 mean 9.840044e+03 std 4.811044e+05 min 0.000000e+00 25% 1.100000e+03 50% 2.950000e+03 75% 7.200000e+03 max 1.000000e+08 Name: price_dolar, dtype: float64
autos['price_dolar'].value_counts().head(20)
0 1421 500 781 1500 734 2500 643 1000 639 1200 639 600 531 800 498 3500 498 2000 460 999 434 750 433 900 420 650 419 850 410 700 395 4500 394 300 384 2200 382 950 379 Name: price_dolar, dtype: int64
We have 2,357 unique values in the price_dolar column, and most prices are rounded, which seems like normal behaviour. There are 1,421 listings with a price of $0; we can consider removing them, since doing so shouldn't affect the analysis.
Let's check the highest prices in our dataset first.
autos['price_dolar'].value_counts().sort_index(ascending=False).head(20)
99999999 1 27322222 1 12345678 3 11111111 2 10000000 1 3890000 1 1300000 1 1234566 1 999999 2 999990 1 350000 1 345000 1 299000 1 295000 1 265000 1 259000 1 250000 1 220000 1 198000 1 197000 1 Name: price_dolar, dtype: int64
autos['price_dolar'].value_counts().sort_index(ascending=True).head(20)
0 1421 1 156 2 3 3 1 5 2 8 1 9 1 10 7 11 2 12 3 13 2 14 1 15 2 17 3 18 1 20 4 25 5 29 1 30 7 35 1 Name: price_dolar, dtype: int64
As we can see, there are some prices under 100 and some above 300,000. Listings priced below about $100 may have been set artificially low by sellers to attract attention, so the real price is likely higher, while the extreme prices at the top look equally unrealistic. To keep the analysis focused on typical listings, we'll keep prices between $99 and $29,999.
#keeping prices in the range $99 - $29,999
autos = autos[autos["price_dolar"].between(99,29999)]
autos['price_dolar'].describe()
count 47453.000000 mean 5204.943123 std 5688.115710 min 99.000000 25% 1200.000000 50% 2999.000000 75% 7000.000000 max 29999.000000 Name: price_dolar, dtype: float64
Now the mean price is about $5,205 and three quarters of the cars cost $7,000 or less; the values look much more realistic.
We'll convert the values in the odometer column in the same way.
#converting values in the odometer column from text to numeric
autos['odometer'] = (autos['odometer'].str.replace('km','')
.str.replace(',','')
.astype(int)
)
#changing column name
autos.rename({'odometer' : 'odometer_km'}, axis=1, inplace=True)
autos.head()
date_crawled | name | price_dolar | ab_test | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | 5000 | control | bus | 2004 | manuell | 158 | andere | 150000 | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | 8500 | control | limousine | 1997 | automatik | 286 | 7er | 150000 | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | 8990 | test | limousine | 2009 | manuell | 102 | golf | 70000 | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | 4350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70000 | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | 1350 | test | kombi | 2003 | manuell | 0 | focus | 150000 | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 39218 | 2016-04-01 14:38:50 |
Let's explore the odometer_km column to check whether all the values look right.
autos["odometer_km"].unique().shape
(13,)
autos["odometer_km"].describe()
count 47453.000000 mean 127011.885445 std 38558.524149 min 5000.000000 25% 125000.000000 50% 150000.000000 75% 150000.000000 max 150000.000000 Name: odometer_km, dtype: float64
odometer_values = autos["odometer_km"].value_counts()
odometer_values.sort_index(ascending=True)
5000 714 10000 198 20000 654 30000 699 40000 742 50000 931 60000 1092 70000 1162 80000 1369 90000 1689 100000 2054 125000 4995 150000 31154 Name: odometer_km, dtype: int64
We can observe that most cars have a mileage of 125,000 km or more, and the mean is about 127,012 km. The minimum mileage is 5,000 km and the maximum is 150,000 km. Moreover, there are only 13 unique values, which suggests sellers picked the mileage from a predefined set of ranges when creating the ad.
Translating German words in the dataset to English
Some columns contain German words; we'll translate them and swap in their English equivalents.
autos['vehicle_type'].unique()
array(['bus', 'limousine', 'kleinwagen', 'kombi', nan, 'coupe', 'suv', 'cabrio', 'andere'], dtype=object)
mapping_vehicle_type = {'bus' : 'bus',
'limousine' : 'limo',
'kleinwagen' : 'supermini',
'kombi' : 'combi',
'coupe' : 'coupe',
'suv' : 'suv',
                        'cabrio' : 'cabriolet',
'andere' : 'other'
}
autos['vehicle_type'] = autos['vehicle_type'].map(mapping_vehicle_type)
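One caveat with Series.map: any value missing from the mapping dict becomes NaN, so a misspelled key (e.g. 'carbio' instead of 'cabrio') would silently wipe out a whole category. A minimal sketch contrasting it with Series.replace, which leaves unmapped values untouched:

```python
import pandas as pd

s = pd.Series(['bus', 'cabrio'])
incomplete = {'bus': 'bus'}        # deliberately missing 'cabrio'

mapped = s.map(incomplete)         # 'cabrio' -> NaN, silently dropped
replaced = s.replace(incomplete)   # 'cabrio' stays 'cabrio'
```

Because we translate every unique value here, map is fine, but it's worth double-checking the keys against the unique() output.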
autos['fuel_type'].unique()
mapping_fuel_type = {'benzin' : 'petrol',
'lpg' : 'lpg',
'elektro' : 'electric',
'andere' : 'other',
'diesel' : 'diesel',
'cng' : 'cng',
'hybrid' : 'hybrid'
}
autos['fuel_type'] = autos['fuel_type'].map(mapping_fuel_type)
autos['gearbox'].unique()
mapping_gearbox_type = {'manuell' : 'manual',
'automatik' : 'automatic'
}
autos['gearbox'] = autos['gearbox'].map(mapping_gearbox_type)
autos['unrepaired_damage'].unique()
mapping_unrepaired_damage = {'nein' : 'no',
'ja' : 'yes'
}
autos['unrepaired_damage'] = autos['unrepaired_damage'].map(mapping_unrepaired_damage)
autos.head()
date_crawled | name | price_dolar | ab_test | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | 5000 | control | bus | 2004 | manual | 158 | andere | 150000 | 3 | lpg | peugeot | no | 2016-03-26 00:00:00 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | 8500 | control | limo | 1997 | automatic | 286 | 7er | 150000 | 6 | petrol | bmw | no | 2016-04-04 00:00:00 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | 8990 | test | limo | 2009 | manual | 102 | golf | 70000 | 7 | petrol | volkswagen | no | 2016-03-26 00:00:00 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | 4350 | control | supermini | 2007 | automatic | 71 | fortwo | 70000 | 6 | petrol | smart | no | 2016-03-12 00:00:00 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | 1350 | test | combi | 2003 | manual | 0 | focus | 150000 | 7 | petrol | ford | no | 2016-04-01 00:00:00 | 39218 | 2016-04-01 14:38:50 |
We have three columns that hold dates: date_crawled, ad_created and last_seen.
Let's take a closer look.
autos[['date_crawled', 'ad_created', 'last_seen']].head()
date_crawled | ad_created | last_seen | |
---|---|---|---|
0 | 2016-03-26 17:47:46 | 2016-03-26 00:00:00 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | 2016-04-04 00:00:00 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | 2016-03-26 00:00:00 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | 2016-03-12 00:00:00 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | 2016-04-01 00:00:00 | 2016-04-01 14:38:50 |
#take the first 10 characters to keep only the date part (str[:10])
(autos['date_crawled'].str[:10]
.value_counts(normalize=True, dropna=False)
.sort_values()
)
2016-04-07 0.001391 2016-04-06 0.003140 2016-03-18 0.012876 2016-04-05 0.012939 2016-03-06 0.014098 2016-03-13 0.015679 2016-03-05 0.025415 2016-03-24 0.029376 2016-03-16 0.029545 2016-03-27 0.030957 2016-03-25 0.031421 2016-03-17 0.031610 2016-03-31 0.031821 2016-03-23 0.032285 2016-03-10 0.032306 2016-03-26 0.032516 2016-03-11 0.032580 2016-03-22 0.032727 2016-03-09 0.033022 2016-03-08 0.033296 2016-04-01 0.033591 2016-03-30 0.033718 2016-03-29 0.034244 2016-03-15 0.034287 2016-03-19 0.034666 2016-03-28 0.034856 2016-04-02 0.035425 2016-03-07 0.036099 2016-04-04 0.036668 2016-03-14 0.036731 2016-03-21 0.037089 2016-03-12 0.037153 2016-03-20 0.037785 2016-04-03 0.038691 Name: date_crawled, dtype: float64
(autos['date_crawled'].str[:10]
.value_counts(normalize=True, dropna=False)
.sort_index()
)
2016-03-05 0.025415 2016-03-06 0.014098 2016-03-07 0.036099 2016-03-08 0.033296 2016-03-09 0.033022 2016-03-10 0.032306 2016-03-11 0.032580 2016-03-12 0.037153 2016-03-13 0.015679 2016-03-14 0.036731 2016-03-15 0.034287 2016-03-16 0.029545 2016-03-17 0.031610 2016-03-18 0.012876 2016-03-19 0.034666 2016-03-20 0.037785 2016-03-21 0.037089 2016-03-22 0.032727 2016-03-23 0.032285 2016-03-24 0.029376 2016-03-25 0.031421 2016-03-26 0.032516 2016-03-27 0.030957 2016-03-28 0.034856 2016-03-29 0.034244 2016-03-30 0.033718 2016-03-31 0.031821 2016-04-01 0.033591 2016-04-02 0.035425 2016-04-03 0.038691 2016-04-04 0.036668 2016-04-05 0.012939 2016-04-06 0.003140 2016-04-07 0.001391 Name: date_crawled, dtype: float64
Looks like the site was crawled daily over roughly a one month period in March and April 2016. The distribution of listings crawled on each day is roughly uniform.
(autos['last_seen'].str[:10]
.value_counts(normalize=True, dropna=False)
.sort_index()
)
2016-03-05 0.001096 2016-03-06 0.004362 2016-03-07 0.005437 2016-03-08 0.007439 2016-03-09 0.009694 2016-03-10 0.010769 2016-03-11 0.012539 2016-03-12 0.024024 2016-03-13 0.008977 2016-03-14 0.012707 2016-03-15 0.015995 2016-03-16 0.016437 2016-03-17 0.028323 2016-03-18 0.007312 2016-03-19 0.015932 2016-03-20 0.020736 2016-03-21 0.020652 2016-03-22 0.021432 2016-03-23 0.018713 2016-03-24 0.019851 2016-03-25 0.019282 2016-03-26 0.016690 2016-03-27 0.015658 2016-03-28 0.021052 2016-03-29 0.022506 2016-03-30 0.024930 2016-03-31 0.023939 2016-04-01 0.023012 2016-04-02 0.024993 2016-04-03 0.025288 2016-04-04 0.024803 2016-04-05 0.124017 2016-04-06 0.220682 2016-04-07 0.130719 Name: last_seen, dtype: float64
We can treat the dates in this column as approximate selling dates.
The last three dates in last_seen, at the beginning of April, account for a disproportionate share compared with the rest. This spike is most likely an artifact of the crawl ending rather than a genuine surge in sales: every listing still active when crawling stopped received one of those final dates as its last_seen value.
(autos['ad_created'].str[:10]
.value_counts(normalize=True, dropna=False)
.sort_index()
)
2015-08-10 0.000021 2015-09-09 0.000021 2015-11-10 0.000021 2015-12-05 0.000021 2015-12-30 0.000021 2016-01-03 0.000021 2016-01-07 0.000021 2016-01-10 0.000042 2016-01-13 0.000021 2016-01-14 0.000021 2016-01-16 0.000021 2016-01-22 0.000021 2016-01-27 0.000063 2016-01-29 0.000021 2016-02-01 0.000021 2016-02-02 0.000042 2016-02-05 0.000042 2016-02-07 0.000021 2016-02-08 0.000021 2016-02-09 0.000021 2016-02-11 0.000021 2016-02-12 0.000042 2016-02-14 0.000042 2016-02-16 0.000021 2016-02-17 0.000021 2016-02-18 0.000042 2016-02-19 0.000063 2016-02-20 0.000042 2016-02-21 0.000063 2016-02-22 0.000021 ... 2016-03-09 0.033128 2016-03-10 0.032011 2016-03-11 0.032917 2016-03-12 0.036984 2016-03-13 0.017027 2016-03-14 0.035340 2016-03-15 0.034055 2016-03-16 0.030051 2016-03-17 0.031231 2016-03-18 0.013592 2016-03-19 0.033549 2016-03-20 0.037848 2016-03-21 0.037279 2016-03-22 0.032580 2016-03-23 0.032137 2016-03-24 0.029313 2016-03-25 0.031547 2016-03-26 0.032559 2016-03-27 0.030852 2016-03-28 0.035003 2016-03-29 0.034160 2016-03-30 0.033549 2016-03-31 0.031863 2016-04-01 0.033570 2016-04-02 0.035129 2016-04-03 0.038923 2016-04-04 0.037047 2016-04-05 0.011654 2016-04-06 0.003203 2016-04-07 0.001264 Name: ad_created, Length: 75, dtype: float64
There is a much wider variety of ad-creation dates (75 distinct dates). Most ads, however, were created within 1-2 months of the crawl date; a handful are much older, but they are rare.
autos['registration_year'].describe()
count 47453.000000 mean 2004.576887 std 86.476957 min 1000.000000 25% 1999.000000 50% 2003.000000 75% 2008.000000 max 9999.000000 Name: registration_year, dtype: float64
The registration_year column seems to contain some unrealistic years: the minimum is 1000 and the maximum is 9999. Most values look correct, but to get an accurate description we need to deal with the invalid registration years.
Converting dates to uniform numeric data
Let's convert the dates from strings to numeric values. We only need the day-level information, so we'll drop the time of day.
autos['date_crawled'] = (autos['date_crawled']
.str[:10]
.str.replace('-','')
.astype(int)
)
autos['ad_created'] = (autos['ad_created']
.str[:10]
.str.replace('-','')
.astype(int)
)
autos['last_seen'] = (autos['last_seen']
.str[:10]
.str.replace('-','')
.astype(int)
)
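Encoding dates as integers like 20160326 keeps them sortable, but an alternative sketch parses them with pd.to_datetime instead, which makes later date arithmetic straightforward (sample timestamps taken from the table above):

```python
import pandas as pd

# Parse the timestamp strings into real datetimes instead of integers.
dates = pd.Series(['2016-03-26 17:47:46', '2016-04-06 06:45:54'])
parsed = pd.to_datetime(dates)

# Date arithmetic then comes for free, e.g. whole days between first crawl
# and last_seen for a listing:
days_online = (parsed.iloc[1] - parsed.iloc[0]).days
```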
autos.head()
date_crawled | name | price_dolar | ab_test | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 20160326 | Peugeot_807_160_NAVTECH_ON_BOARD | 5000 | control | bus | 2004 | manual | 158 | andere | 150000 | 3 | lpg | peugeot | no | 20160326 | 79588 | 20160406 |
1 | 20160404 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | 8500 | control | limo | 1997 | automatic | 286 | 7er | 150000 | 6 | petrol | bmw | no | 20160404 | 71034 | 20160406 |
2 | 20160326 | Volkswagen_Golf_1.6_United | 8990 | test | limo | 2009 | manual | 102 | golf | 70000 | 7 | petrol | volkswagen | no | 20160326 | 35394 | 20160406 |
3 | 20160312 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | 4350 | control | supermini | 2007 | automatic | 71 | fortwo | 70000 | 6 | petrol | smart | no | 20160312 | 33729 | 20160315 |
4 | 20160401 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | 1350 | test | combi | 2003 | manual | 0 | focus | 150000 | 7 | petrol | ford | no | 20160401 | 39218 | 20160401 |
autos.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 47453 entries, 0 to 49999 Data columns (total 17 columns): date_crawled 47453 non-null int64 name 47453 non-null object price_dolar 47453 non-null int64 ab_test 47453 non-null object vehicle_type 40195 non-null object registration_year 47453 non-null int64 gearbox 45260 non-null object power_ps 47453 non-null int64 model 45092 non-null object odometer_km 47453 non-null int64 registration_month 47453 non-null int64 fuel_type 43598 non-null object brand 47453 non-null object unrepaired_damage 38601 non-null object ad_created 47453 non-null int64 postal_code 47453 non-null int64 last_seen 47453 non-null int64 dtypes: int64(9), object(8) memory usage: 6.5+ MB
Because a car can't be first registered after the listing was seen, any vehicle with a registration year above 2016 is definitely inaccurate. Determining the earliest valid year is more difficult. Realistically, it could be somewhere in the first few decades of the 1900s.
We'll choose the 1900 - 2016 interval.
(~autos['registration_year'].between(1900,2016)).sum() / autos.shape[0]
0.03932311971845826
Listings with a registration year above 2016 or below 1900 make up about 3.9% of all rows, so we can safely remove them.
autos = autos[autos['registration_year'].between(1900,2016)]
autos['registration_year'].value_counts(normalize=True)
2000 0.068068 2005 0.063768 1999 0.063132 2003 0.059030 2004 0.058942 2006 0.058350 2001 0.057604 2002 0.054226 1998 0.051308 2007 0.049378 2008 0.047821 2009 0.045210 1997 0.042227 2010 0.033825 2011 0.033803 1996 0.029723 2012 0.026104 1995 0.026060 2016 0.025950 2013 0.015487 1994 0.013644 2014 0.012087 1993 0.009147 1992 0.007963 1991 0.007349 1990 0.007195 2015 0.005725 1989 0.003751 1988 0.002896 1985 0.002018 ... 1960 0.000461 1977 0.000439 1976 0.000417 1966 0.000417 1975 0.000395 1969 0.000373 1965 0.000307 1964 0.000241 1963 0.000154 1959 0.000110 1961 0.000110 1956 0.000088 1962 0.000088 1937 0.000088 1958 0.000066 1954 0.000044 1934 0.000044 1957 0.000044 1910 0.000044 1941 0.000044 1953 0.000022 1951 0.000022 1931 0.000022 1927 0.000022 1950 0.000022 1929 0.000022 1948 0.000022 1938 0.000022 1939 0.000022 1952 0.000022 Name: registration_year, Length: 76, dtype: float64
autos['registration_year'].mean()
2002.8427841270538
Most cars were first registered in 2000. We can also spot some cars from the early 1900s. The mean registration year is about 2003.
Let's see which car brands are the most popular.
autos['brand'].value_counts(normalize=True)
volkswagen 0.213482 bmw 0.109110 opel 0.109088 mercedes_benz 0.095049 audi 0.084454 ford 0.070656 renault 0.047886 peugeot 0.030360 fiat 0.026126 seat 0.018536 skoda 0.016671 nissan 0.015553 mazda 0.015377 smart 0.014434 citroen 0.014280 toyota 0.012964 hyundai 0.010200 volvo 0.009279 sonstige_autos 0.008928 mini 0.008906 mitsubishi 0.008270 honda 0.007963 kia 0.007195 alfa_romeo 0.006756 suzuki 0.006032 chevrolet 0.005681 chrysler 0.003576 porsche 0.002939 dacia 0.002698 daihatsu 0.002545 jeep 0.002128 subaru 0.002128 saab 0.001689 land_rover 0.001667 daewoo 0.001514 jaguar 0.001470 trabant 0.001382 rover 0.001360 lancia 0.001075 lada 0.000592 Name: brand, dtype: float64
German brands dominate: the top five alone (Volkswagen, BMW, Opel, Mercedes-Benz, Audi) account for about 61% of all listings, with Volkswagen in first place. In our analysis we'll focus on these top 5 brands, since each of the remaining brands holds only a small share of the listings.
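As a quick sanity check on the German-brand share, we can sum the proportions printed above (values copied, rounded, from the output):

```python
# Listing shares of the top five German brands, rounded from the
# value_counts(normalize=True) output above.
top_five_german = {
    'volkswagen': 0.2135, 'bmw': 0.1091, 'opel': 0.1091,
    'mercedes_benz': 0.0950, 'audi': 0.0845,
}
german_share = sum(top_five_german.values())  # roughly 0.61
```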
common_brands = autos['brand'].value_counts(normalize=True)[:5].index
common_brands
Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi'], dtype='object')
top_5_autos_price_dict = {}
for brand in common_brands:
    brand_df = autos[autos['brand'] == brand]
    price_mean = brand_df['price_dolar'].mean()
    top_5_autos_price_dict[brand] = int(price_mean)
top_5_autos_price_dict
{'audi': 7974, 'bmw': 7447, 'mercedes_benz': 7460, 'opel': 2990, 'volkswagen': 5201}
As we can see, Audi, Mercedes-Benz and BMW are the three most expensive of the top brands, Volkswagen sits in the middle, and Opel is clearly the cheapest.
bmp_series = pd.Series(top_5_autos_price_dict).sort_values(ascending=False)
bmp_df = pd.DataFrame(bmp_series, columns= ['mean_price'])
bmp_df
mean_price | |
---|---|
audi | 7974 |
mercedes_benz | 7460 |
bmw | 7447 |
volkswagen | 5201 |
opel | 2990 |
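The dict-building loop above works fine; for reference, pandas groupby expresses the same per-brand mean in one line. A sketch on a toy frame:

```python
import pandas as pd

# Toy frame standing in for the filtered listings; groupby computes the
# mean price per brand without an explicit loop.
toy = pd.DataFrame({
    'brand': ['audi', 'audi', 'opel', 'opel'],
    'price_dolar': [8000, 6000, 3000, 1000],
})
mean_prices = toy.groupby('brand')['price_dolar'].mean().sort_values(ascending=False)
```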
top_5_autos_odometer_dict = {}
for brand in common_brands:
    brand_df = autos[autos['brand'] == brand]
    odometer_mean = brand_df['odometer_km'].mean()
    top_5_autos_odometer_dict[brand] = int(odometer_mean)
top_5_autos_odometer_dict
{'audi': 132724, 'bmw': 134627, 'mercedes_benz': 133175, 'opel': 129421, 'volkswagen': 129353}
odkm_series = pd.Series(top_5_autos_odometer_dict).sort_values(ascending=False)
odkm_df = pd.DataFrame(odkm_series, columns=['odometer_km_mean'])
odkm_df
odometer_km_mean | |
---|---|
bmw | 134627 |
mercedes_benz | 133175 |
audi | 132724 |
opel | 129421 |
volkswagen | 129353 |
These brands have very similar mean mileage: BMW is at the top and Volkswagen at the bottom.
odkm_df['mean_price'] = bmp_df['mean_price']
odkm_df
odometer_km_mean | mean_price | |
---|---|---|
bmw | 134627 | 7447 |
mercedes_benz | 133175 | 7460 |
audi | 132724 | 7974 |
opel | 129421 | 2990 |
volkswagen | 129353 | 5201 |
Finding the most popular model
Let's check which model is the most popular within each brand.
brands = odkm_df.index
top_models_dict = {}
for brand in brands:
    selected = autos[autos['brand'] == brand]
    model_counts = selected['model'].value_counts()
    top_models_dict[brand] = model_counts.index[0]
models_series = pd.Series(top_models_dict)
odkm_df['common_model'] = models_series
top_autos = odkm_df
top_autos
odometer_km_mean | mean_price | common_model | |
---|---|---|---|
bmw | 134627 | 7447 | 3er |
mercedes_benz | 133175 | 7460 | c_klasse |
audi | 132724 | 7974 | a4 |
opel | 129421 | 2990 | corsa |
volkswagen | 129353 | 5201 | golf |
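Likewise, the most-common-model loop above has a one-line groupby equivalent; a sketch on a toy frame (the lambda picks the top entry of value_counts):

```python
import pandas as pd

# Toy frame: most frequent model per brand, without an explicit loop.
toy = pd.DataFrame({
    'brand': ['bmw', 'bmw', 'bmw', 'opel', 'opel', 'opel'],
    'model': ['3er', '3er', '5er', 'corsa', 'corsa', 'astra'],
})
top_models = toy.groupby('brand')['model'].agg(lambda s: s.value_counts().index[0])
```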
Looking for a correlation between mileage and mean price within the top 5 brands
top_autos = autos[autos['brand'].isin(['bmw', 'mercedes_benz', 'audi',
                                       'opel', 'volkswagen'])]
mileage = top_autos['odometer_km'].unique()
mileage_meanprice_dict = {}
for m in mileage:
    selected_rows = top_autos[top_autos['odometer_km'] == m]
    price = selected_rows['price_dolar'].mean()
    mileage_meanprice_dict[m] = int(price)
mileage_meanprice_dict
{5000: 4457, 10000: 16437, 20000: 15151, 30000: 14966, 40000: 15525, 50000: 14222, 60000: 13150, 70000: 12695, 80000: 11244, 90000: 9908, 100000: 9167, 125000: 7444, 150000: 4251}
odkm_meanprice_series = pd.Series(mileage_meanprice_dict).sort_index(ascending=False)
odkm_meanprice_series
150000 4251 125000 7444 100000 9167 90000 9908 80000 11244 70000 12695 60000 13150 50000 14222 40000 15525 30000 14966 20000 15151 10000 16437 5000 4457 dtype: int64
We can see that the higher the mileage, the lower the average price: cars with 150,000 km cost $4,251 on average. The 5,000 km group looks odd, though, with an average price of only $4,457. Let's check whether the cars with a mileage of 5,000 km are damaged.
mileage_5000 = top_autos[top_autos['odometer_km'] == 5000]
mileage_5000['unrepaired_damage'].value_counts(dropna=False)
no 157 NaN 155 yes 50 Name: unrepaired_damage, dtype: int64
Some of these cars are damaged, which could be one reason why they are so cheap. Sellers may also have picked the lowest mileage option on purpose, with the real mileage being much higher.
top_autos['unrepaired_damage'].value_counts(dropna=False)
no 20399 NaN 4973 yes 2490 Name: unrepaired_damage, dtype: int64
Most cars have no damage, but some are listed with unrepaired damage. Let's check the difference in average price between undamaged and damaged cars.
mean_price_damage = int(top_autos.loc[top_autos['unrepaired_damage'] == 'yes', 'price_dolar'].mean())
mean_price_damage
2511
mean_price_no_damage = int(top_autos.loc[top_autos['unrepaired_damage'] == 'no', 'price_dolar'].mean())
mean_price_no_damage
6976
price_difference = mean_price_no_damage - mean_price_damage
price_difference
4465
On average, undamaged cars among the popular brands are $4,465 more expensive than damaged ones.
We analyzed a dataset of used cars from eBay Kleinanzeigen and identified the 5 most popular brands, which together make up the majority of all listings.