The main goal of this project is to clean and analyze used car listings scraped from the classifieds section of the German eBay website and uploaded to Kaggle. The dataset we will use was modified by the Dataquest team, since this project is part of the requirements for the Dataquest Data Scientist certificate.
In this sense, our project is structured as a sequence of steps: first, we import the libraries needed for data analysis, pandas and NumPy, and open the data file. Next, we analyze each column of the dataset in depth, looking for patterns. Moving forward, we drop some inaccurate records and some outliers to make the analysis easier and clearer. After this quick cleaning process, we move on to the data analysis itself and make some observations about the dataset.
As mentioned above, we first need to import two important libraries for data analysis: pandas and NumPy. Next, we read the dataset and display it.
import pandas as pd
import numpy as np
autos = pd.read_csv('autos.csv', encoding = 'Latin-1')
# Verifying the dataset
autos
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
49995 | 2016-03-27 14:38:19 | Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon | privat | Angebot | $24,900 | control | limousine | 2011 | automatik | 239 | q5 | 100,000km | 1 | diesel | audi | nein | 2016-03-27 00:00:00 | 0 | 82131 | 2016-04-01 13:47:40 |
49996 | 2016-03-28 10:50:25 | Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... | privat | Angebot | $1,980 | control | cabrio | 1996 | manuell | 75 | astra | 150,000km | 5 | benzin | opel | nein | 2016-03-28 00:00:00 | 0 | 44807 | 2016-04-02 14:18:02 |
49997 | 2016-04-02 14:44:48 | Fiat_500_C_1.2_Dualogic_Lounge | privat | Angebot | $13,200 | test | cabrio | 2014 | automatik | 69 | 500 | 5,000km | 11 | benzin | fiat | nein | 2016-04-02 00:00:00 | 0 | 73430 | 2016-04-04 11:47:27 |
49998 | 2016-03-08 19:25:42 | Audi_A3_2.0_TDI_Sportback_Ambition | privat | Angebot | $22,900 | control | kombi | 2013 | manuell | 150 | a3 | 40,000km | 11 | diesel | audi | nein | 2016-03-08 00:00:00 | 0 | 35683 | 2016-04-05 16:45:07 |
49999 | 2016-03-14 00:42:12 | Opel_Vectra_1.6_16V | privat | Angebot | $1,250 | control | limousine | 1996 | manuell | 101 | vectra | 150,000km | 1 | benzin | opel | nein | 2016-03-13 00:00:00 | 0 | 45897 | 2016-04-06 21:18:48 |
50000 rows × 20 columns
# Checking information about each column
autos.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 dateCrawled 50000 non-null object 1 name 50000 non-null object 2 seller 50000 non-null object 3 offerType 50000 non-null object 4 price 50000 non-null object 5 abtest 50000 non-null object 6 vehicleType 44905 non-null object 7 yearOfRegistration 50000 non-null int64 8 gearbox 47320 non-null object 9 powerPS 50000 non-null int64 10 model 47242 non-null object 11 odometer 50000 non-null object 12 monthOfRegistration 50000 non-null int64 13 fuelType 45518 non-null object 14 brand 50000 non-null object 15 notRepairedDamage 40171 non-null object 16 dateCreated 50000 non-null object 17 nrOfPictures 50000 non-null int64 18 postalCode 50000 non-null int64 19 lastSeen 50000 non-null object dtypes: int64(5), object(15) memory usage: 7.6+ MB
# printing the first few rows of the dataset
autos.head()
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
According to the information above, there are null values in 5 different columns (but none with more than 20% nulls): vehicleType, gearbox, model, fuelType and notRepairedDamage. Besides that, the values in the price and odometer columns are stored as objects (strings). We will therefore convert these values to integers in order to make data manipulation easier.
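The "none with more than 20% nulls" observation can be checked directly with `DataFrame.isnull()`. A minimal sketch on a toy frame (hypothetical values, not the real listings):

```python
import pandas as pd

# Toy frame standing in for `autos` (hypothetical values, not the real listings)
df = pd.DataFrame({
    'vehicleType': ['bus', None, 'kombi', None, 'limousine'],
    'gearbox': ['manuell', 'automatik', None, 'manuell', 'manuell'],
})

# Fraction of missing values per column; anything above 0.20 would deserve a closer look
null_share = df.isnull().mean()
print(null_share)
```

On the real dataset the same one-liner reports the shares shown by `autos.info()` above.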
Another point worth observing is that the column names are in camelCase, which is not Python's preferred convention (snake_case). That is the first issue we are going to work on.
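Instead of typing the new names entirely by hand, the camelCase-to-snake_case conversion could also be sketched with a small regex helper. Note this is only an illustration: acronyms like `powerPS` and semantic renames like `notRepairedDamage` to `unrepaired_damage` still need manual treatment, which is why the notebook assigns the full list explicitly.

```python
import re

def camel_to_snake(name):
    # Insert an underscore before each capital letter, then lowercase everything
    return re.sub(r'(?<!^)(?=[A-Z])', '_', name).lower()

print(camel_to_snake('yearOfRegistration'))  # year_of_registration
print(camel_to_snake('dateCrawled'))         # date_crawled
```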
# checking the present column names
autos.columns
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model', 'odometer', 'monthOfRegistration', 'fuelType', 'brand', 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object')
# changing the column names
autos.columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
'odometer', 'registration_month', 'fuel_type', 'brand',
'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code',
'last_seen']
# checking the modified dataset
autos.head()
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
In this step, we take a quick look at the descriptive statistics of each column to identify any odd data.
autos.describe(include = 'all')
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-03-30 17:37:35 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
# checking on seller, offer_type and nr_of_pictures columns
autos['seller'].value_counts()
privat 49999 gewerblich 1 Name: seller, dtype: int64
autos['offer_type'].value_counts()
Angebot 49999 Gesuch 1 Name: offer_type, dtype: int64
autos['nr_of_pictures'].value_counts()
0 50000 Name: nr_of_pictures, dtype: int64
The output above shows that two columns, seller and offer_type, consist almost entirely of a single value (49,999 out of 50,000 rows, to be exact). It is also possible to see that the nr_of_pictures column contains only zeros. Since these three columns add no information to our analysis, we are going to drop them.
autos.drop(labels = ['seller', 'offer_type', 'nr_of_pictures'], axis = 1, inplace = True)
# checking the modified dataset
autos.shape
(50000, 17)
autos.head()
date_crawled | name | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 39218 | 2016-04-01 14:38:50 |
As mentioned before, the price and odometer columns are stored as objects but should hold numeric values instead. We are now going to remove the non-numeric characters and convert the strings in these two columns to integers.
autos['price'].unique()
array(['$5,000', '$8,500', '$8,990', ..., '$385', '$22,200', '$16,995'], dtype=object)
autos['price'] = autos['price'].str.replace('$', '', regex=False).str.replace(',', '', regex=False)  # regex=False so '$' is treated literally, not as a regex anchor
# checking the new values
autos['price'].unique()
array(['5000', '8500', '8990', ..., '385', '22200', '16995'], dtype=object)
autos['price'] = autos['price'].astype(int)
# checking the cleaned data
autos['price'].dtype
dtype('int32')
autos['price'].head()
0 5000 1 8500 2 8990 3 4350 4 1350 Name: price, dtype: int32
autos['odometer'].unique()
array(['150,000km', '70,000km', '50,000km', '80,000km', '10,000km', '30,000km', '125,000km', '90,000km', '20,000km', '60,000km', '5,000km', '100,000km', '40,000km'], dtype=object)
autos['odometer'] = autos['odometer'].str.replace(',', '', regex=False).str.replace('km', '', regex=False)
# checking the new values
autos['odometer'].unique()
array(['150000', '70000', '50000', '80000', '10000', '30000', '125000', '90000', '20000', '60000', '5000', '100000', '40000'], dtype=object)
autos['odometer'] = autos['odometer'].astype(int)
# checking the cleaned data
autos['odometer'].head()
0 150000 1 150000 2 70000 3 70000 4 150000 Name: odometer, dtype: int32
To finish this first cleaning pass, we rename the price and odometer columns to price_dollar and odometer_km, making their units explicit.
autos.rename({'price': 'price_dollar', 'odometer': 'odometer_km'}, axis = 1, inplace = True)
# checking our modified dataset
autos.head()
date_crawled | name | price_dollar | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | 5000 | control | bus | 2004 | manuell | 158 | andere | 150000 | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | 8500 | control | limousine | 1997 | automatik | 286 | 7er | 150000 | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | 8990 | test | limousine | 2009 | manuell | 102 | golf | 70000 | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | 4350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70000 | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | 1350 | test | kombi | 2003 | manuell | 0 | focus | 150000 | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 39218 | 2016-04-01 14:38:50 |
autos[['price_dollar', 'odometer_km']].info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 2 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 price_dollar 50000 non-null int32 1 odometer_km 50000 non-null int32 dtypes: int32(2) memory usage: 390.8 KB
In this section, we look for outliers in the price_dollar and odometer_km columns so we can drop them and keep the analysis clean. To do so, we will use the Series.unique(), Series.describe() and Series.value_counts() methods on each column.
autos['price_dollar'].unique().shape
(2357,)
According to the output above, the price_dollar column has 2,357 unique values.
autos[['price_dollar']].describe()
price_dollar | |
---|---|
count | 5.000000e+04 |
mean | 9.840044e+03 |
std | 4.811044e+05 |
min | 0.000000e+00 |
25% | 1.100000e+03 |
50% | 2.950000e+03 |
75% | 7.200000e+03 |
max | 1.000000e+08 |
We can see in the output above that the minimum price in our dataset is $0 and the maximum is $100,000,000, which is clearly unrealistic. After a quick look at the eBay website, we found that there is no significant number of cars costing more than one million dollars, and that most of the listings priced under $50 are actually donation or "make an offer" ads.
autos['price_dollar'].value_counts().sort_index(ascending = True).head(50)
0 1421 1 156 2 3 3 1 5 2 8 1 9 1 10 7 11 2 12 3 13 2 14 1 15 2 17 3 18 1 20 4 25 5 29 1 30 7 35 1 40 6 45 4 47 1 49 4 50 49 55 2 59 1 60 9 65 5 66 1 70 10 75 5 79 1 80 15 89 1 90 5 99 19 100 134 110 3 111 2 115 2 117 1 120 39 122 1 125 8 129 1 130 15 135 1 139 1 140 9 Name: price_dollar, dtype: int64
After this quick analysis, we are now going to remove the outliers (rows whose price is below $50 or above $1,000,000).
# before removal
autos['price_dollar'].shape
(50000,)
autos[['price_dollar']].info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 1 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 price_dollar 50000 non-null int32 dtypes: int32(1) memory usage: 195.4 KB
autos = autos[autos['price_dollar'].between(50, 1000000)] # code to remove the outliers
# after removal
autos['price_dollar'].shape
(48350,)
autos['odometer_km'].unique()
array([150000, 70000, 50000, 80000, 10000, 30000, 125000, 90000, 20000, 60000, 5000, 100000, 40000])
autos['odometer_km'].describe()
count 48350.000000 mean 125876.732161 std 39617.575951 min 5000.000000 25% 125000.000000 50% 150000.000000 75% 150000.000000 max 150000.000000 Name: odometer_km, dtype: float64
autos['odometer_km'].value_counts()
150000 31290 125000 5046 100000 2105 90000 1733 80000 1414 70000 1215 60000 1153 50000 1011 40000 815 5000 782 30000 778 20000 759 10000 249 Name: odometer_km, dtype: int64
As we can see from the output above, there are no apparent outliers in the odometer_km column, since it is entirely plausible to find cars that have been driven 150,000 km. Therefore, we will leave the column as it is.
autos[['date_crawled', 'last_seen', 'ad_created', 'registration_month', 'registration_year']].head()
date_crawled | last_seen | ad_created | registration_month | registration_year | |
---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | 2016-04-06 06:45:54 | 2016-03-26 00:00:00 | 3 | 2004 |
1 | 2016-04-04 13:38:56 | 2016-04-06 14:45:08 | 2016-04-04 00:00:00 | 6 | 1997 |
2 | 2016-03-26 18:57:24 | 2016-04-06 20:15:37 | 2016-03-26 00:00:00 | 7 | 2009 |
3 | 2016-03-12 16:58:10 | 2016-03-15 03:16:28 | 2016-03-12 00:00:00 | 6 | 2007 |
4 | 2016-04-01 14:38:50 | 2016-04-01 14:38:50 | 2016-04-01 00:00:00 | 7 | 2003 |
It is possible to see that the registration_month and registration_year columns are represented as numbers, whereas date_crawled, last_seen and ad_created are stored as strings, so pandas does not yet treat them as dates. To understand the date range of each of these three string columns, we will first isolate the date portion of each value (its first 10 characters).
autos['date_crawled'].str[:10].value_counts(normalize = True, dropna = False).sort_index(ascending = True)
2016-03-05 0.025357 2016-03-06 0.014064 2016-03-07 0.036029 2016-03-08 0.033257 2016-03-09 0.033051 2016-03-10 0.032223 2016-03-11 0.032554 2016-03-12 0.036980 2016-03-13 0.015657 2016-03-14 0.036670 2016-03-15 0.034292 2016-03-16 0.029493 2016-03-17 0.031603 2016-03-18 0.012906 2016-03-19 0.034747 2016-03-20 0.037787 2016-03-21 0.037311 2016-03-22 0.032947 2016-03-23 0.032265 2016-03-24 0.029431 2016-03-25 0.031499 2016-03-26 0.032265 2016-03-27 0.031086 2016-03-28 0.034891 2016-03-29 0.034126 2016-03-30 0.033713 2016-03-31 0.031830 2016-04-01 0.033733 2016-04-02 0.035533 2016-04-03 0.038573 2016-04-04 0.036525 2016-04-05 0.013051 2016-04-06 0.003164 2016-04-07 0.001386 Name: date_crawled, dtype: float64
autos['ad_created'].str[:10].value_counts(normalize = True, dropna = False).sort_index(ascending = True)
2015-06-11 0.000021 2015-08-10 0.000021 2015-09-09 0.000021 2015-11-10 0.000021 2015-12-05 0.000021 ... 2016-04-03 0.038821 2016-04-04 0.036898 2016-04-05 0.011768 2016-04-06 0.003247 2016-04-07 0.001241 Name: ad_created, Length: 76, dtype: float64
autos['last_seen'].str[:10].value_counts(normalize = True, dropna = False).sort_index(ascending = True)
2016-03-05 0.001075 2016-03-06 0.004323 2016-03-07 0.005419 2016-03-08 0.007342 2016-03-09 0.009638 2016-03-10 0.010610 2016-03-11 0.012389 2016-03-12 0.023744 2016-03-13 0.008873 2016-03-14 0.012637 2016-03-15 0.015884 2016-03-16 0.016463 2016-03-17 0.028087 2016-03-18 0.007322 2016-03-19 0.015801 2016-03-20 0.020620 2016-03-21 0.020600 2016-03-22 0.021406 2016-03-23 0.018594 2016-03-24 0.019772 2016-03-25 0.019173 2016-03-26 0.016774 2016-03-27 0.015574 2016-03-28 0.020869 2016-03-29 0.022337 2016-03-30 0.024736 2016-03-31 0.023826 2016-04-01 0.022875 2016-04-02 0.024881 2016-04-03 0.025171 2016-04-04 0.024509 2016-04-05 0.124964 2016-04-06 0.221634 2016-04-07 0.132079 Name: last_seen, dtype: float64
Checking the three columns together, we can conclude that ads created before March 2016 are outliers, since the date_crawled and last_seen columns only contain values from March 2016 onward.
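The notebook keeps these columns as strings; if we actually wanted to filter out ads created before March 2016, a sketch with `pd.to_datetime` on toy values (hypothetical stand-ins for the real column) might look like this:

```python
import pandas as pd

# Toy ad_created values (hypothetical stand-ins for the real column)
ad_created = pd.Series(['2015-06-11 00:00:00',
                        '2016-03-26 00:00:00',
                        '2016-04-04 00:00:00'])

created = pd.to_datetime(ad_created)
mask = created >= pd.Timestamp('2016-03-01')  # keep only ads created from March 2016 on
print(mask.tolist())  # [False, True, True]
```

On the real dataset, `autos = autos[mask]` would then drop the long-running ads.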
autos['registration_year'].value_counts().sort_index(ascending = True)
1000 1 1001 1 1111 1 1800 2 1910 2 .. 5911 1 6200 1 8888 1 9000 1 9999 3 Name: registration_year, Length: 95, dtype: int64
From the code cell above, it is possible to see that we have some outliers in the registration_year column. For example, some cars appear to have been registered before 1800 (impossible, since the automobile had not yet been invented) and others after 2016, which is definitely inaccurate since a car cannot be registered after its listing was last seen. Therefore, we will drop these outliers, keeping only years between 1900 and 2016, considering that car registration began in the early decades of the 1900s.
# before removal
autos['registration_year'].shape
(48350,)
autos = autos[autos['registration_year'].between(1900, 2016)]
# after removal
autos['registration_year'].shape
(46476,)
Now we are going to calculate the distribution of the remaining registration_year values.
autos['registration_year'].value_counts(normalize = True).sort_index(ascending = True)
1910 0.000043 1927 0.000022 1929 0.000022 1931 0.000022 1934 0.000043 ... 2012 0.028165 2013 0.017235 2014 0.014244 2015 0.008262 2016 0.025970 Name: registration_year, Length: 78, dtype: float64
According to the output from the code cell above, most of the cars were registered after 1995.
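This claim can be quantified with a simple boolean mean. A sketch on hypothetical years (stand-ins for autos['registration_year']):

```python
import pandas as pd

# Hypothetical registration years standing in for autos['registration_year']
years = pd.Series([1990, 1993, 1996, 1999, 2003, 2008, 2012, 2016])

# Fraction of cars registered after 1995: comparison gives booleans, mean() their share
share_after_1995 = (years > 1995).mean()
print(share_after_1995)  # 0.75
```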
In order to identify the top brands in the brand column, we will aggregate by brand, looping over its unique values. As we can see below, there are 40 different brands in the dataset.
autos['brand'].unique().shape[0]
40
autos['brand'].value_counts(normalize = True)
volkswagen 0.211356 bmw 0.109949 opel 0.107303 mercedes_benz 0.096501 audi 0.086604 ford 0.069993 renault 0.047186 peugeot 0.029908 fiat 0.025712 seat 0.018246 skoda 0.016417 nissan 0.015320 mazda 0.015255 smart 0.014179 citroen 0.014050 toyota 0.012759 hyundai 0.010048 sonstige_autos 0.009618 volvo 0.009101 mini 0.008779 mitsubishi 0.008176 honda 0.007854 kia 0.007100 alfa_romeo 0.006649 porsche 0.006046 suzuki 0.005939 chevrolet 0.005702 chrysler 0.003529 dacia 0.002647 daihatsu 0.002517 jeep 0.002281 subaru 0.002109 land_rover 0.002109 saab 0.001657 jaguar 0.001571 daewoo 0.001485 trabant 0.001377 rover 0.001334 lancia 0.001054 lada 0.000581 Name: brand, dtype: float64
Now we are going to loop over the brand column and add the brands that account for more than 5% of the cars in the dataset to a dictionary. The main purpose of this step is to calculate the mean price for each of these top brands.
top_brands = autos['brand'].value_counts(normalize = True, ascending = False)
common_brands = top_brands[top_brands > 0.05].index
brand_mean_price = {}
for b in common_brands:
mean_price = autos[autos['brand'] == b]['price_dollar'].mean()
brand_mean_price[b] = int(mean_price)
brand_mean_price
{'volkswagen': 5627, 'bmw': 8376, 'opel': 2996, 'mercedes_benz': 8663, 'audi': 9373, 'ford': 4068}
In order to compare the mean price with the average mileage of each brand, we will build another dictionary for the mean mileage, just as we did for the mean price. Then it will be necessary to convert both dictionaries into Series and combine them into a DataFrame, using the Series and DataFrame constructors.
First, we are going to transform our previous dictionary (brand_mean_price) into a Series.
bmp_series = pd.Series(data=brand_mean_price)
bmp_series
volkswagen 5627 bmw 8376 opel 2996 mercedes_benz 8663 audi 9373 ford 4068 dtype: int64
Now we can compute the mean mileage, looping over the top brands (stored earlier in common_brands).
brand_mean_milage = {}
for b in common_brands:
mean_milage = autos[autos['brand'] == b]['odometer_km'].mean()
brand_mean_milage[b] = int(mean_milage)
brand_mean_milage
{'volkswagen': 128775, 'bmw': 132651, 'opel': 129440, 'mercedes_benz': 130923, 'audi': 129254, 'ford': 124296}
Moving forward, we can now transform the brand_mean_milage dictionary into a Series, just like bmp_series.
bmm_series = pd.Series(data=brand_mean_milage)
bmm_series
volkswagen 128775 bmw 132651 opel 129440 mercedes_benz 130923 audi 129254 ford 124296 dtype: int64
In this final step, we are going to create a DataFrame containing both the brand mean price and the brand mean mileage Series. The main goal is to make the comparison between the two easier and to see whether they are related.
mean_price_milage = pd.DataFrame(data = bmp_series, columns = ['mean_price_dollar'])
mean_price_milage
mean_price_dollar | |
---|---|
volkswagen | 5627 |
bmw | 8376 |
opel | 2996 |
mercedes_benz | 8663 |
audi | 9373 |
ford | 4068 |
Now, we must add the bmm_series as a new column of our dataframe.
mean_price_milage['mean_milage_km'] = bmm_series
mean_price_milage.sort_values('mean_price_dollar', ascending = False)
mean_price_dollar | mean_milage_km | |
---|---|---|
audi | 9373 | 129254 |
mercedes_benz | 8663 | 130923 |
bmw | 8376 | 132651 |
volkswagen | 5627 | 128775 |
ford | 4068 | 124296 |
opel | 2996 | 129440 |
Apparently, there is no direct relation between the price and the mileage of the cars when we consider just the top 6 brands: the mean mileage varies little across brands (roughly 124,000 to 133,000 km), while the mean price ranges from about $3,000 to $9,400.
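As a rough numeric check (a sketch using the mean values computed above; six aggregated points are far too few for a firm conclusion), we can compute the Pearson correlation between the two series:

```python
import pandas as pd

# Mean prices and mileages of the six top brands, taken from the tables above
bmp = pd.Series({'volkswagen': 5627, 'bmw': 8376, 'opel': 2996,
                 'mercedes_benz': 8663, 'audi': 9373, 'ford': 4068})
bmm = pd.Series({'volkswagen': 128775, 'bmw': 132651, 'opel': 129440,
                 'mercedes_benz': 130923, 'audi': 129254, 'ford': 124296})

# Pearson correlation across the six brands (pandas aligns the series by label)
r = bmp.corr(bmm)
print(round(r, 2))
```

Whatever moderate value comes out here mostly reflects that ford sits low and bmw high on both axes; with only six points it does not overturn the observation above.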