In this project, we'll work with a dataset of used cars from eBay Kleinanzeigen, a classifieds section of the German eBay website.
The dataset was originally scraped and uploaded to Kaggle by user orgesleka. The original dataset isn't available on Kaggle anymore, but you can find it here.
We've made a few modifications to the original dataset:
The data dictionary provided with the data is as follows:
The aim of this project is to clean the data and analyze the included used car listings.
# importing the libraries we need and reading the dataset into pandas
import numpy as np
import pandas as pd
autos = pd.read_csv('autos.csv', encoding='Latin-1')
autos
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
49995 | 2016-03-27 14:38:19 | Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon | privat | Angebot | $24,900 | control | limousine | 2011 | automatik | 239 | q5 | 100,000km | 1 | diesel | audi | nein | 2016-03-27 00:00:00 | 0 | 82131 | 2016-04-01 13:47:40 |
49996 | 2016-03-28 10:50:25 | Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... | privat | Angebot | $1,980 | control | cabrio | 1996 | manuell | 75 | astra | 150,000km | 5 | benzin | opel | nein | 2016-03-28 00:00:00 | 0 | 44807 | 2016-04-02 14:18:02 |
49997 | 2016-04-02 14:44:48 | Fiat_500_C_1.2_Dualogic_Lounge | privat | Angebot | $13,200 | test | cabrio | 2014 | automatik | 69 | 500 | 5,000km | 11 | benzin | fiat | nein | 2016-04-02 00:00:00 | 0 | 73430 | 2016-04-04 11:47:27 |
49998 | 2016-03-08 19:25:42 | Audi_A3_2.0_TDI_Sportback_Ambition | privat | Angebot | $22,900 | control | kombi | 2013 | manuell | 150 | a3 | 40,000km | 11 | diesel | audi | nein | 2016-03-08 00:00:00 | 0 | 35683 | 2016-04-05 16:45:07 |
49999 | 2016-03-14 00:42:12 | Opel_Vectra_1.6_16V | privat | Angebot | $1,250 | control | limousine | 1996 | manuell | 101 | vectra | 150,000km | 1 | benzin | opel | nein | 2016-03-13 00:00:00 | 0 | 45897 | 2016-04-06 21:18:48 |
50000 rows × 20 columns
autos.info()
autos.head()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 dateCrawled 50000 non-null object 1 name 50000 non-null object 2 seller 50000 non-null object 3 offerType 50000 non-null object 4 price 50000 non-null object 5 abtest 50000 non-null object 6 vehicleType 44905 non-null object 7 yearOfRegistration 50000 non-null int64 8 gearbox 47320 non-null object 9 powerPS 50000 non-null int64 10 model 47242 non-null object 11 odometer 50000 non-null object 12 monthOfRegistration 50000 non-null int64 13 fuelType 45518 non-null object 14 brand 50000 non-null object 15 notRepairedDamage 40171 non-null object 16 dateCreated 50000 non-null object 17 nrOfPictures 50000 non-null int64 18 postalCode 50000 non-null int64 19 lastSeen 50000 non-null object dtypes: int64(5), object(15) memory usage: 7.6+ MB
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
Our dataset contains 20 columns: 15 are stored as strings (object dtype) and the remaining 5 as integers. A few columns have missing values, but none of them exceeds 20% nulls.
The column names use camelCase instead of Python's preferred snake_case, so we will start by cleaning the column names to make the data easier to work with.
We will convert the column names from camelCase to snake_case and reword some of them based on the data dictionary to be more descriptive.
# Exploring the existing column names
autos.columns
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model', 'odometer', 'monthOfRegistration', 'fuelType', 'brand', 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object')
#Copying the dataframe:
autos_copy_1 = autos.copy()
#W1: Change name of columns
autos_copy_1.rename({'dateCrawled':'date_crawled', 'offerType':'offer_type', 'vehicleType':'vehicle_type',
                     'yearOfRegistration':'registration_year', 'gearbox':'gear_box', 'powerPS':'power_PS',
                     'monthOfRegistration':'registration_month', 'fuelType':'fuel_type',
                     'notRepairedDamage':'unrepaired_damage', 'dateCreated':'ad_created',
                     'nrOfPictures':'nr_of_pictures', 'postalCode':'postal_code',
                     'lastSeen':'last_seen'}, axis=1, inplace=True)
autos_copy_1.head()
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gear_box | power_PS | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
#Copying the dataframe:
autos_copy_2 = autos.copy()
#W2: another way to change names of columns:
autos_copy_2.columns = ['date_crawled','name', 'seller', 'offer_type', 'price', 'abtest',
'vehicle_type', 'year_of_registration', 'gear_box', 'power_PS', 'model',
'odometer', 'month_of_registration', 'fuel_type', 'brand',
'not_repaired_damage', 'date_created', 'nr_of_pictures', 'postal_code',
'last_seen']
autos_copy_2.head()
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | year_of_registration | gear_box | power_PS | model | odometer | month_of_registration | fuel_type | brand | not_repaired_damage | date_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
#W3: alternative way to change names of columns: Series.map(mapping_dict)
# Note: if a value from our series doesn't exist as a key in our dictionary, it will convert that value to NaN
mapping_dict = {'dateCrawled':'date_crawled',
'name':'name',
'seller':'seller',
'price':'price',
'abtest':'abtest',
'model':'model',
'odometer':'odometer',
'brand':'brand',
'offerType':'offer_type',
'vehicleType':'vehicle_type',
'yearOfRegistration':'registration_year',
'gearbox':'gear_box',
'powerPS':'power_PS',
'monthOfRegistration':'registration_month',
'fuelType':'fuel_type',
'notRepairedDamage':'unrepaired_damage',
'dateCreated':'ad_created',
'nrOfPictures':'nr_of_pictures',
'postalCode':'postal_code',
'lastSeen':'last_seen'}
autos.columns=autos.columns.map(mapping_dict)
autos.head()
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gear_box | power_PS | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
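To illustrate the NaN caveat noted in the comment above, here is a minimal sketch on a toy Series (the value 'halbautomatik' is made up for the example and has no key in the dict):

```python
import pandas as pd

# 'halbautomatik' has no key in the mapping dict, so map() turns it into NaN
s = pd.Series(['manuell', 'automatik', 'halbautomatik'])
mapped = s.map({'manuell': 'manual', 'automatik': 'automatic'})
```

This is why the mapping dictionary above must include every existing column name, even the ones that keep their name unchanged.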
We will begin with some basic data exploration to determine what other cleaning tasks need to be done. First, we will look for:
The following methods are helpful for exploring the data:
# descriptive statistics for all columns:
autos.describe(include = 'all')
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gear_box | power_PS | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-03-21 20:37:19 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
Our first observation is that several columns contain common German words, so we will find and translate them into English.
print(autos['gear_box'].unique())
print(autos['fuel_type'].unique())
print(autos['unrepaired_damage'].unique())
['manuell' 'automatik' nan] ['lpg' 'benzin' 'diesel' nan 'cng' 'hybrid' 'elektro' 'andere'] ['nein' nan 'ja']
mapping_dic ={'manuell': 'manual',
'automatik':'automatic',
'lpg':'lpg',
'benzin': 'petrol',
'diesel':'diesel',
'cng': 'cng',
'hybrid':'hybrid',
'elektro':'electric',
'andere':'others',
'nein':'no',
'ja':'yes'
}
cols_translated = ['gear_box','fuel_type','unrepaired_damage']
for c in cols_translated:
autos[c]=autos[c].map(mapping_dic)
Second, one column's content looks suspicious: nr_of_pictures. Let's investigate further:
autos['nr_of_pictures'].value_counts()
0 50000 Name: nr_of_pictures, dtype: int64
This column contains only the value 0, so it carries no information.
Studying the dataset further, we find there are two seller types: privat and gewerblich. In our sample of 50,000 listings, only one car was offered by a gewerblich (commercial) seller, so the 'seller' column has almost no statistical value and should be removed. The same pattern appears in the 'offer_type' column.
autos['seller'].value_counts()
privat 49999 gewerblich 1 Name: seller, dtype: int64
autos['offer_type'].value_counts()
Angebot 49999 Gesuch 1 Name: offer_type, dtype: int64
So we will remove these three columns: 'nr_of_pictures', 'seller', and 'offer_type'.
autos.drop(['nr_of_pictures','seller','offer_type'],axis = 1, inplace = True)
Third, two columns, 'price' and 'odometer', are stored as strings but can be converted to numeric types for analysis:
print(autos['price'].head())
print('\n')
print(autos['odometer'].head())
0 $5,000 1 $8,500 2 $8,990 3 $4,350 4 $1,350 Name: price, dtype: object 0 150,000km 1 150,000km 2 70,000km 3 70,000km 4 150,000km Name: odometer, dtype: object
In these columns, we will strip the '$', 'km', and ',' characters, convert the values to integers, and then rename the columns to be more descriptive.
# removing special sign and characters, transform data type:
autos['price'] = autos['price'].str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype(int)
autos['odometer'] = autos['odometer'].str.replace('km', '', regex=False).str.replace(',', '', regex=False).astype(int)
print(autos['price'].head())
print('\n')
print(autos['odometer'].head())
0 5000 1 8500 2 8990 3 4350 4 1350 Name: price, dtype: int32 0 150000 1 150000 2 70000 3 70000 4 150000 Name: odometer, dtype: int32
#rename columns:
autos.rename({'price':'price_usd','odometer':'odometer_km'}, inplace = True, axis = 1)
autos.head()
date_crawled | name | price_usd | abtest | vehicle_type | registration_year | gear_box | power_PS | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | 5000 | control | bus | 2004 | manual | 158 | andere | 150000 | 3 | lpg | peugeot | no | 2016-03-26 00:00:00 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | 8500 | control | limousine | 1997 | automatic | 286 | 7er | 150000 | 6 | petrol | bmw | no | 2016-04-04 00:00:00 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | 8990 | test | limousine | 2009 | manual | 102 | golf | 70000 | 7 | petrol | volkswagen | no | 2016-03-26 00:00:00 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | 4350 | control | kleinwagen | 2007 | automatic | 71 | fortwo | 70000 | 6 | petrol | smart | no | 2016-03-12 00:00:00 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | 1350 | test | kombi | 2003 | manual | 0 | focus | 150000 | 7 | petrol | ford | no | 2016-04-01 00:00:00 | 39218 | 2016-04-01 14:38:50 |
We will continue exploring the data, specifically looking for values that don't look right. We'll start by analyzing the 'odometer_km' and 'price_usd' columns. Here are the steps we will take:
Because Series.value_counts() returns a Series, we can chain Series.sort_index() with ascending=True or False to view the lowest and highest values with their counts (and chain .head() to limit the output). To remove outliers, we could write df[(df["col"] >= x) & (df["col"] <= y)], but df[df["col"].between(x, y)] is more readable.
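The value_counts / sort_index chaining described above can be sketched on a toy price Series (the values here are invented for illustration):

```python
import pandas as pd

# value_counts() counts occurrences per price, sort_index() orders by the
# price itself (not the count), and head() shows only the extremes
prices = pd.Series([100, 100, 250, 99999999])
highest = prices.value_counts().sort_index(ascending=False).head(2)
```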
# price_usd column:
print(autos['price_usd'].describe())
print('\n')
print('Number of unique values:',autos['price_usd'].unique().shape[0])
print('\n')
print('15 lowest price values:', autos['price_usd'].value_counts().sort_index(ascending = True).head(15))
print('\n')
print('15 highest price values:', autos['price_usd'].value_counts().sort_index(ascending = False).head(15))
count 5.000000e+04 mean 9.840044e+03 std 4.811044e+05 min 0.000000e+00 25% 1.100000e+03 50% 2.950000e+03 75% 7.200000e+03 max 1.000000e+08 Name: price_usd, dtype: float64 Number of unique values: 2357 15 lowest price values: 0 1421 1 156 2 3 3 1 5 2 8 1 9 1 10 7 11 2 12 3 13 2 14 1 15 2 17 3 18 1 Name: price_usd, dtype: int64 15 highest price values: 99999999 1 27322222 1 12345678 3 11111111 2 10000000 1 3890000 1 1300000 1 1234566 1 999999 2 999990 1 350000 1 345000 1 299000 1 295000 1 265000 1 Name: price_usd, dtype: int64
All listings in our sample include a price. There are only 2,357 unique values, which likely reflects sellers' tendency to round prices on the site. The lowest value is $0 (1,421 listings, about 2.8% of the total), which we can remove; the highest value is 99999999. Let's take a closer look at that row:
autos[autos['price_usd'] == 99999999]
date_crawled | name | price_usd | abtest | vehicle_type | registration_year | gear_box | power_PS | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
39705 | 2016-03-22 14:58:27 | Tausch_gegen_gleichwertiges | 99999999 | control | limousine | 1999 | automatic | 224 | s_klasse | 150000 | 9 | petrol | mercedes_benz | NaN | 2016-03-22 00:00:00 | 73525 | 2016-04-06 05:15:30 |
The price looks implausible: the listing's title, 'Tausch_gegen_gleichwertiges' ('trade for something of equal value'), suggests the price is a placeholder rather than a real asking price, and a 1999 registration means the car is neither new nor an antique. We will drop this row.
From our results, there is a huge jump in price from $350,000 to $999,990. The share of listings priced between $1 and $350,000 is:
per_1_350000 = (autos[autos['price_usd'].between(1,350000)].shape[0]/50000)*100
print(round(per_1_350000,2),'%')
97.13 %
Therefore, we will drop these rows as follows:
# drop rows with price = 0 usd:
autos.drop(autos.index[autos['price_usd']==0].tolist(),axis = 0, inplace = True)
# drop rows with price > 350000 usd:
list_index_350000 = autos.index[autos['price_usd'] > 350000].tolist() #find the index of rows meeting the condition
autos.drop(list_index_350000, axis = 0, inplace = True) #drop those rows
autos['price_usd'].describe()
count 48565.000000 mean 5888.935591 std 9059.854754 min 1.000000 25% 1200.000000 50% 3000.000000 75% 7490.000000 max 350000.000000 Name: price_usd, dtype: float64
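The two drops above can equivalently be expressed as a single boolean filter with between, sketched here on a toy frame standing in for autos:

```python
import pandas as pd

# between(1, 350000) is inclusive on both bounds, so it removes the $0
# listings and the implausibly high prices in one readable step
df = pd.DataFrame({'price_usd': [0, 500, 350000, 99999999]})
kept = df[df['price_usd'].between(1, 350000)]
```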
# 'odometer_km' column
autos['odometer_km'].describe()
count 48565.000000 mean 125770.101925 std 39788.636804 min 5000.000000 25% 125000.000000 50% 150000.000000 75% 150000.000000 max 150000.000000 Name: odometer_km, dtype: float64
The odometer readings appear rounded as well. In addition, most cars listed are high-mileage rather than low-mileage.
There are 5 columns that should represent date values. Some of these columns were created by the crawler, some came from the website itself. We can differentiate by referring to the data dictionary:

- date_crawled: added by the crawler
- last_seen: added by the crawler
- ad_created: from the website
- registration_month: from the website
- registration_year: from the website

autos.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 48565 entries, 0 to 49999 Data columns (total 17 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 date_crawled 48565 non-null object 1 name 48565 non-null object 2 price_usd 48565 non-null int32 3 abtest 48565 non-null object 4 vehicle_type 43979 non-null object 5 registration_year 48565 non-null int64 6 gear_box 46222 non-null object 7 power_PS 48565 non-null int64 8 model 46107 non-null object 9 odometer_km 48565 non-null int32 10 registration_month 48565 non-null int64 11 fuel_type 44535 non-null object 12 brand 48565 non-null object 13 unrepaired_damage 39464 non-null object 14 ad_created 48565 non-null object 15 postal_code 48565 non-null int64 16 last_seen 48565 non-null object dtypes: int32(2), int64(4), object(11) memory usage: 6.3+ MB
The table above shows that the three columns 'date_crawled', 'last_seen', and 'ad_created' have object dtype, which indicates that pandas has read them as strings.
Therefore, we need to convert them into a numerical representation so we can understand them quantitatively. The other two date columns are already numeric, so we can use methods like Series.describe() to understand their distribution without any extra processing.
Let's first understand how the values in the three string columns are formatted. These columns all represent full timestamp values, like so:
autos[['date_crawled', 'last_seen', 'ad_created']].head()
date_crawled | last_seen | ad_created | |
---|---|---|---|
0 | 2016-03-26 17:47:46 | 2016-04-06 06:45:54 | 2016-03-26 00:00:00 |
1 | 2016-04-04 13:38:56 | 2016-04-06 14:45:08 | 2016-04-04 00:00:00 |
2 | 2016-03-26 18:57:24 | 2016-04-06 20:15:37 | 2016-03-26 00:00:00 |
3 | 2016-03-12 16:58:10 | 2016-03-15 03:16:28 | 2016-03-12 00:00:00 |
4 | 2016-04-01 14:38:50 | 2016-04-01 14:38:50 | 2016-04-01 00:00:00 |
Notice that the first 10 characters represent the date (e.g. 2016-03-26). To understand the date range, we can extract just the date values, use Series.value_counts() to generate a distribution, and then sort by the index.
# date_crawled column
autos['date_crawled'].str[:10].value_counts(normalize = True, dropna = False).sort_index()
2016-03-05 0.025327 2016-03-06 0.014043 2016-03-07 0.036014 2016-03-08 0.033296 2016-03-09 0.033090 2016-03-10 0.032184 2016-03-11 0.032575 2016-03-12 0.036920 2016-03-13 0.015670 2016-03-14 0.036549 2016-03-15 0.034284 2016-03-16 0.029610 2016-03-17 0.031628 2016-03-18 0.012911 2016-03-19 0.034778 2016-03-20 0.037887 2016-03-21 0.037373 2016-03-22 0.032987 2016-03-23 0.032225 2016-03-24 0.029342 2016-03-25 0.031607 2016-03-26 0.032204 2016-03-27 0.031092 2016-03-28 0.034860 2016-03-29 0.034099 2016-03-30 0.033687 2016-03-31 0.031834 2016-04-01 0.033687 2016-04-02 0.035478 2016-04-03 0.038608 2016-04-04 0.036487 2016-04-05 0.013096 2016-04-06 0.003171 2016-04-07 0.001400 Name: date_crawled, dtype: float64
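As an alternative to string slicing, pd.to_datetime parses the full timestamps and the .dt.date accessor extracts the date part. A minimal sketch on a few sample timestamps in the same format as 'date_crawled':

```python
import pandas as pd

# parse full timestamps, keep only the date, then build the same
# normalized distribution sorted by date
dates = pd.Series(['2016-03-26 17:47:46', '2016-03-26 18:57:24',
                   '2016-04-04 13:38:56'])
dist = pd.to_datetime(dates).dt.date.value_counts(normalize=True).sort_index()
```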
Data were crawled continuously from 2016-03-05 to 2016-04-07. The crawl frequency varies little from day to day (roughly 1-4% of listings per day); in other words, the distribution of listings crawled on each day is roughly uniform.
# last_seen column
autos['last_seen'].str[:10].value_counts(normalize = True, dropna = False).sort_index()
2016-03-05 0.001071 2016-03-06 0.004324 2016-03-07 0.005395 2016-03-08 0.007413 2016-03-09 0.009595 2016-03-10 0.010666 2016-03-11 0.012375 2016-03-12 0.023783 2016-03-13 0.008895 2016-03-14 0.012602 2016-03-15 0.015876 2016-03-16 0.016452 2016-03-17 0.028086 2016-03-18 0.007351 2016-03-19 0.015834 2016-03-20 0.020653 2016-03-21 0.020632 2016-03-22 0.021373 2016-03-23 0.018532 2016-03-24 0.019767 2016-03-25 0.019211 2016-03-26 0.016802 2016-03-27 0.015649 2016-03-28 0.020859 2016-03-29 0.022341 2016-03-30 0.024771 2016-03-31 0.023783 2016-04-01 0.022794 2016-04-02 0.024915 2016-04-03 0.025203 2016-04-04 0.024483 2016-04-05 0.124761 2016-04-06 0.221806 2016-04-07 0.131947 Name: last_seen, dtype: float64
autos['last_seen'].str[:10].value_counts(normalize = True, dropna = False).sort_values()
2016-03-05 0.001071 2016-03-06 0.004324 2016-03-07 0.005395 2016-03-18 0.007351 2016-03-08 0.007413 2016-03-13 0.008895 2016-03-09 0.009595 2016-03-10 0.010666 2016-03-11 0.012375 2016-03-14 0.012602 2016-03-27 0.015649 2016-03-19 0.015834 2016-03-15 0.015876 2016-03-16 0.016452 2016-03-26 0.016802 2016-03-23 0.018532 2016-03-25 0.019211 2016-03-24 0.019767 2016-03-21 0.020632 2016-03-20 0.020653 2016-03-28 0.020859 2016-03-22 0.021373 2016-03-29 0.022341 2016-04-01 0.022794 2016-03-31 0.023783 2016-03-12 0.023783 2016-04-04 0.024483 2016-03-30 0.024771 2016-04-02 0.024915 2016-04-03 0.025203 2016-03-17 0.028086 2016-04-05 0.124761 2016-04-07 0.131947 2016-04-06 0.221806 Name: last_seen, dtype: float64
'last_seen' records the date the crawler last saw a listing online. We can use it to identify when a listing was removed, presumably because the car was sold.
A disproportionate share of values falls in the last three days of the crawl (6-11 times that of the other days). This hardly implies a massive spike in sales; more likely these values are an artifact of the crawl period ending rather than a peak in car sales.
# ad_created column
autos['ad_created'].str[:10].value_counts(normalize = True, dropna = False).sort_values()
#autos['ad_created'].str[:10].describe()
2016-02-09 0.000021 2016-01-13 0.000021 2016-01-29 0.000021 2015-11-10 0.000021 2016-02-07 0.000021 ... 2016-03-12 0.036755 2016-04-04 0.036858 2016-03-21 0.037579 2016-03-20 0.037949 2016-04-03 0.038855 Name: ad_created, Length: 76, dtype: float64
There is a wide spread of ad-creation dates, ranging from 2015-06-11 to 2016-04-07. Understandably, the older the creation date, the fewer listings remain.
# Registration_year
autos['registration_year'].describe()
count 48565.000000 mean 2004.755421 std 88.643887 min 1000.000000 25% 1999.000000 50% 2004.000000 75% 2008.000000 max 9999.000000 Name: registration_year, dtype: float64
Our first realization is that some of this data is incorrect: the first car in the world was built in 1885, so the year of first registration can't be before that.
Because a car can't be first registered after the listing was seen, any vehicle with a registration year above 2016 is definitely inaccurate. Determining the earliest valid year is more difficult. Realistically, it could be somewhere in the first few decades of the 1900s.
Let's count the number of listings with cars that fall outside the 1900 - 2016 interval and see if it's safe to remove those rows entirely, or if we need more custom logic.
bool_arr = (autos['registration_year']<1900)|(autos['registration_year']>2016)
percent= (autos[bool_arr].shape[0]/autos.shape[0])*100
# alternative code to count the number of listings with cars that fall outside the 1900 - 2016 interval:
#(~autos['registration_year'].between(1900,2016)).sum()
print(round(percent,2),'%')
3.88 %
Given that these rows account for less than 4% of our data, we will remove them, keeping registration years between 1910 and 2016.
autos= autos[autos['registration_year'].between(1910,2016)]
print(autos['registration_year'].describe())
autos['registration_year'].value_counts(normalize = True, dropna = False).sort_values()[-15:]
count 46681.000000 mean 2002.910756 std 7.185103 min 1910.000000 25% 1999.000000 50% 2003.000000 75% 2008.000000 max 2016.000000 Name: registration_year, dtype: float64
2010 0.034040 2011 0.034768 1997 0.041794 2009 0.044665 2008 0.047450 2007 0.048778 1998 0.050620 2002 0.053255 2001 0.056468 2006 0.057197 2003 0.057818 2004 0.057904 1999 0.062060 2005 0.062895 2000 0.067608 Name: registration_year, dtype: float64
Registration years span a wide range, but roughly two-thirds of the cars were registered between 1997 and 2008.
We will use aggregation to understand the 'brand' column. First, let's get an overview of it:
sorted_value = autos['brand'].value_counts(normalize = True,dropna = False).sort_values(ascending = False)[:10]
print(sorted_value)
volkswagen       0.211264
bmw              0.110045
opel             0.107581
mercedes_benz    0.096463
audi             0.086566
ford             0.069900
renault          0.047150
peugeot          0.029841
fiat             0.025642
seat             0.018273
Name: brand, dtype: float64
Our initial observation is that German manufacturers dominate their home market: five of the top ten brands are German (volkswagen, bmw, opel, mercedes_benz and audi; opel was a GM subsidiary at the time), together accounting for roughly 61% of the listings. Ford, the only American marque in the top ten, holds about 7%, and the remaining top-ten brands come from France (renault, peugeot), Italy (fiat) and Spain (seat). Volkswagen is by far the most popular brand, with roughly double the listings of the runner-up, bmw. Since many other brands have no significant share of listings, we will limit our analysis to the ten most popular brands above. Next we will calculate the mean price for each of these brands to identify the most popular segment in Germany.
Given 'brand' and 'price_usd', we can classify these brands into three price segments and identify the most promising segment of the German market.
brands = sorted_value.index
print(brands)
Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford', 'renault', 'peugeot', 'fiat', 'seat'], dtype='object')
brand_price = {}
for b in brands:
mean_price = autos.loc[autos['brand']==b,'price_usd'].mean()
brand_price[b] = round(mean_price,2)
print(brand_price)
{'volkswagen': 5402.41, 'bmw': 8332.82, 'opel': 2975.24, 'mercedes_benz': 8628.45, 'audi': 9336.69, 'ford': 3749.47, 'renault': 2474.86, 'peugeot': 3094.02, 'fiat': 2813.75, 'seat': 4397.23}
Given this data, we can find out which brand's cars are the most expensive and which are the cheapest. To make this easier, we will sort the data in descending order.
# sort a dictionary
table_display = []
for key in brand_price:
table_display.append([brand_price[key],key])
brand_price_sort=sorted(table_display,reverse = True)
print(brand_price_sort)
[[9336.69, 'audi'], [8628.45, 'mercedes_benz'], [8332.82, 'bmw'], [5402.41, 'volkswagen'], [4397.23, 'seat'], [3749.47, 'ford'], [3094.02, 'peugeot'], [2975.24, 'opel'], [2813.75, 'fiat'], [2474.86, 'renault']]
# alternative code to sort a dictionary:
sorted_price = sorted(brand_price.items(), key=lambda x: x[1], reverse = True)
print(sorted_price)
[('audi', 9336.69), ('mercedes_benz', 8628.45), ('bmw', 8332.82), ('volkswagen', 5402.41), ('seat', 4397.23), ('ford', 3749.47), ('peugeot', 3094.02), ('opel', 2975.24), ('fiat', 2813.75), ('renault', 2474.86)]
We can see distinct price gaps that split the top ten brands into three segments:

- Upper middle class: audi, mercedes_benz, bmw (mean price roughly $8,300 to $9,300)
- Middle class: volkswagen, seat (roughly $4,400 to $5,400)
- Lower middle class: ford, peugeot, opel, fiat, renault (below $3,800)
So German manufacturers not only cover the whole upper middle class (audi, bmw, mercedes_benz) and, through volkswagen, the middle class; with opel (about 10.8% of listings) they also supply the lower middle class.
This coincides with our interpretation above that German brands are the most popular in the German second-hand auto market. Moreover, the upper middle class appears to be the most active segment of the German used-car market.
In this step, we will use aggregation to understand the average mileage for those cars and whether there is any visible link with mean price. While our natural instinct may be to display both aggregated series objects and visually compare them, this has a few limitations, such as only being able to compare a handful of values at a time and having no guaranteed shared ordering between the two series.
Instead, we can combine the data from both series objects into a single dataframe (with a shared index) and display the dataframe directly. To do this, we'll need two pandas constructors: pd.Series and pd.DataFrame.
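In miniature, the pattern looks like this (the brand names and values below are toy placeholders, not the real aggregates):

```python
import pandas as pd

# hypothetical aggregates keyed by brand
price = {'a_brand': 5000.0, 'b_brand': 8000.0}
mileage = {'a_brand': 120000.0, 'b_brand': 130000.0}

# pd.Series turns a dictionary into a labelled series (keys become the index)
price_series = pd.Series(price)

# pd.DataFrame builds a one-column dataframe from the series;
# assigning a second series aligns it on the shared index
df = pd.DataFrame(price_series, columns=['mean_price'])
df['mean_mileage'] = pd.Series(mileage)
print(df)
```

Because alignment happens on the index, the order in which the dictionaries list their keys does not matter.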
We have already created a dictionary with a brand key and mean price as value. Now we will continue to calculate the mean mileage for these top 10 brands, and store the results in a dictionary:
# brand_mileage dictionary
brand_mileage = {}
for b in brands:
mean = autos.loc[autos['brand'] == b,'odometer_km'].mean()
brand_mileage[b] = round(mean,2)
print(brand_mileage)
{'volkswagen': 128707.16, 'bmw': 132572.51, 'opel': 129310.04, 'mercedes_benz': 130788.36, 'audi': 129157.39, 'ford': 124266.01, 'renault': 128071.33, 'peugeot': 127153.63, 'fiat': 117121.97, 'seat': 121131.3}
# Convert both dictionaries to series objects, using the series constructor.
bmp_series = pd.Series(brand_price)
print(bmp_series)
print('\n')
bm_series = pd.Series(brand_mileage)
print(bm_series)
volkswagen        5402.41
bmw               8332.82
opel              2975.24
mercedes_benz     8628.45
audi              9336.69
ford              3749.47
renault           2474.86
peugeot           3094.02
fiat              2813.75
seat              4397.23
dtype: float64

volkswagen       128707.16
bmw              132572.51
opel             129310.04
mercedes_benz    130788.36
audi             129157.39
ford             124266.01
renault          128071.33
peugeot          127153.63
fiat             117121.97
seat             121131.30
dtype: float64
# Create a dataframe from the first series object using the dataframe constructor.
dataFrame = pd.DataFrame(bmp_series, columns = ['mean_price'])
dataFrame
mean_price | |
---|---|
volkswagen | 5402.41 |
bmw | 8332.82 |
opel | 2975.24 |
mercedes_benz | 8628.45 |
audi | 9336.69 |
ford | 3749.47 |
renault | 2474.86 |
peugeot | 3094.02 |
fiat | 2813.75 |
seat | 4397.23 |
# Assign the other series as a new column in this dataframe.
dataFrame['mean_mileage']=bm_series
dataFrame
mean_price | mean_mileage | |
---|---|---|
volkswagen | 5402.41 | 128707.16 |
bmw | 8332.82 | 132572.51 |
opel | 2975.24 | 129310.04 |
mercedes_benz | 8628.45 | 130788.36 |
audi | 9336.69 | 129157.39 |
ford | 3749.47 | 124266.01 |
renault | 2474.86 | 128071.33 |
peugeot | 3094.02 | 127153.63 |
fiat | 2813.75 | 117121.97 |
seat | 4397.23 | 121131.30 |
Mean mileage varies little across brands. There may still be a trend across the three segments: on average, upper-middle-class cars have run more kilometres than cars in the other two segments. However, this relation is not clear-cut, because price is also strongly affected by brand reputation.
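One quick numeric check is the Pearson correlation between the two aggregated columns, using exactly the brand-level means printed above (a weakly positive value would support the idea that pricier brands also tend to carry slightly higher mileage):

```python
import pandas as pd

# brand-level aggregates copied from the output above
mean_price = {'volkswagen': 5402.41, 'bmw': 8332.82, 'opel': 2975.24,
              'mercedes_benz': 8628.45, 'audi': 9336.69, 'ford': 3749.47,
              'renault': 2474.86, 'peugeot': 3094.02, 'fiat': 2813.75,
              'seat': 4397.23}
mean_mileage = {'volkswagen': 128707.16, 'bmw': 132572.51, 'opel': 129310.04,
                'mercedes_benz': 130788.36, 'audi': 129157.39, 'ford': 124266.01,
                'renault': 128071.33, 'peugeot': 127153.63, 'fiat': 117121.97,
                'seat': 121131.30}

df = pd.DataFrame({'mean_price': pd.Series(mean_price),
                   'mean_mileage': pd.Series(mean_mileage)})

# Pearson correlation of the two aggregate columns
print(round(df['mean_price'].corr(df['mean_mileage']), 2))
```

Note this is a correlation over only ten brand-level averages, so it is suggestive at best.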
To get a clearer picture of the correlation between mileage and price, we will:

- isolate the upper-middle-class brands (bmw, audi, mercedes_benz);
- split their odometer_km values into three mileage groups;
- calculate the average price of each group.
upper_middle_class = autos[(autos['brand'] == 'bmw')|(autos['brand'] == 'audi')|(autos['brand'] == 'mercedes_benz')]
upper_middle_class['brand'].value_counts()
bmw              5137
mercedes_benz    4503
audi             4041
Name: brand, dtype: int64
# splitting odometer_km into 3 groups
mileage_groups = upper_middle_class["odometer_km"].value_counts(bins = 3).sort_index()
mileage_groups
(4854.999, 53333.333]       986
(53333.333, 101666.667]    1689
(101666.667, 150000.0]    11006
Name: odometer_km, dtype: int64
# calculate the average price of each group
price1 = upper_middle_class.loc[upper_middle_class["odometer_km"] <= 53333.333,'price_usd'].mean() #group 1
print('(4854.999, 53333.333] __ Average price: $', round(price1,2))
price2 = upper_middle_class.loc[upper_middle_class["odometer_km"].between(53333.333, 101666.667),'price_usd'].mean() #group 2
print('(53333.333, 101666.667] __ Average price: $', round(price2,2))
price3 = upper_middle_class.loc[upper_middle_class["odometer_km"] >= 101666.667,'price_usd'].mean() #group 3
print('(101666.667, 150000.0] __ Average price: $', round(price3,2))
(4854.999, 53333.333] __ Average price: $ 23037.13
(53333.333, 101666.667] __ Average price: $ 16511.87
(101666.667, 150000.0] __ Average price: $ 6249.86
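The same split-then-average logic can be expressed in one step with `pd.cut` and `groupby`. A minimal sketch on toy data (the column names match the dataset, but the rows below are invented):

```python
import pandas as pd

# invented listings: low-mileage cars priced high, high-mileage cars priced low
toy = pd.DataFrame({
    'odometer_km': [30000, 45000, 80000, 90000, 120000, 150000],
    'price_usd':   [22000, 24000, 17000, 16000,   7000,   5000],
})

# cut odometer_km into 3 equal-width bins, then average price per bin
bins = pd.cut(toy['odometer_km'], bins=3)
print(toy.groupby(bins, observed=True)['price_usd'].mean())
```

This avoids hard-coding the bin edges from `value_counts(bins=3)` into three separate boolean filters.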
Our conclusion for this relation: within the same segment, price falls sharply as mileage rises. The average price drops from about $23,000 for cars under roughly 53,000 km to about $6,250 for cars above roughly 102,000 km.
We would also like more specific information about the best-selling model of each of these top 10 brands.
# creating model dictionary
top_brand_model = {}
def hot_model(name):
    model = autos[autos['brand'] == name]['model'].value_counts()
    model_name = model.index[0]
    model_size = model.iloc[0]   # .iloc for positional access; model[0] is deprecated
    return model_name, model_size
for b in brands:
    top_brand_model[b] = hot_model(b)
print('Hottest models of top 10 brands: ')
top_brand_model
Hottest models of top 10 brands:
{'volkswagen': ('golf', 3707), 'bmw': ('3er', 2615), 'opel': ('corsa', 1592), 'mercedes_benz': ('c_klasse', 1136), 'audi': ('a4', 1231), 'ford': ('focus', 762), 'renault': ('twingo', 615), 'peugeot': ('2_reihe', 600), 'fiat': ('punto', 415), 'seat': ('ibiza', 328)}
Which are the top 10 best-selling models, and which brands do they belong to?
This time, instead of manual aggregation, we will use groupby(). A groupby operation involves splitting the data into groups, applying a function to each group, and combining the results.
autos.groupby('brand')['model'].value_counts().sort_values(ascending=False).head(10)
# another way to select a series from a DataFrame is dot notation, so an alternative:
#autos.groupby('brand').model.value_counts().sort_values(ascending = False).head(10)
brand          model   
volkswagen     golf        3707
bmw            3er         2615
volkswagen     polo        1609
opel           corsa       1592
volkswagen     passat      1349
opel           astra       1348
audi           a4          1231
mercedes_benz  c_klasse    1136
bmw            5er         1132
mercedes_benz  e_klasse     958
Name: model, dtype: int64
So the brand/model combination that appears most often in the market is the Volkswagen golf, with 3707 listings. Three of the top 10 models are Volkswagens, which fits the fact that Volkswagen has the highest share of listings (~21.1%).
# let's practice more with groupby()
autos.groupby(['unrepaired_damage']).price_usd.mean().sort_values(ascending=False)
unrepaired_damage
no     7164.033103
yes    2241.146035
Name: price_usd, dtype: float64
upper_middle_class.groupby('unrepaired_damage').brand.value_counts()
unrepaired_damage  brand        
no                 bmw              3948
                   mercedes_benz    3465
                   audi             3064
yes                bmw               403
                   mercedes_benz     343
                   audi              333
Name: brand, dtype: int64
upper_middle_class.groupby(['unrepaired_damage', 'brand']).price_usd.mean()
unrepaired_damage  brand        
no                 audi             10914.959856
                   bmw               9437.709980
                   mercedes_benz     9798.396537
yes                audi              3324.684685
                   bmw               3512.637717
                   mercedes_benz     3921.819242
Name: price_usd, dtype: float64
# note: brand values in the dataset are lowercase, so we must filter on 'seat', not 'Seat'
middle_class = autos[(autos['brand'] == 'volkswagen') | (autos['brand'] == 'seat')]
middle_class.groupby(['unrepaired_damage', 'brand']).price_usd.mean()
unrepaired_damage  brand     
no                 volkswagen    6469.407759
yes                volkswagen    2179.405660
Name: price_usd, dtype: float64
From our calculations on two different groups, we can conclude that within the same brand, undamaged cars usually sell for about three times the price of cars with unrepaired damage.
autos.groupby('gear_box').price_usd.mean()
gear_box
automatic    10972.718547
manual        4716.709175
Name: price_usd, dtype: float64
# Calculate the price difference between automatic and manual
gearbox_means = autos.groupby('gear_box').price_usd.mean()
gearbox_means['automatic'] - gearbox_means['manual']
6256.009372369583
Unsurprisingly, automatic cars are more expensive than manual ones; the average price gap is about $6,256.
Through this project we have completed the following tasks:

- cleaned the dataset and removed rows with unrealistic registration years;
- aggregated the top ten brands by mean price and mean mileage;
- explored the relationship between mileage and price;
- identified the best-selling model of each top brand;
- compared mean prices by damage status and by gearbox type.