In this guided project, we'll work with a dataset of used cars from eBay Kleinanzeigen, a classifieds section of the German eBay website. However, the dataset we are going to use was modified by Dataquest team since this project is part of one of Dataquest requirements for accomplishing Data Scientist certificate. The data dictionary provided with data is as follows:
# Import the pandas and NumPy libraries
import pandas as pd
import numpy as np
# Read the autos.csv CSV file into pandas, and assign it to the variable name autos
autos = pd.read_csv('autos.csv', encoding = 'Latin-1')
autos.info()
# print first five rows
print(autos.head())
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): dateCrawled 50000 non-null object name 50000 non-null object seller 50000 non-null object offerType 50000 non-null object price 50000 non-null object abtest 50000 non-null object vehicleType 44905 non-null object yearOfRegistration 50000 non-null int64 gearbox 47320 non-null object powerPS 50000 non-null int64 model 47242 non-null object odometer 50000 non-null object monthOfRegistration 50000 non-null int64 fuelType 45518 non-null object brand 50000 non-null object notRepairedDamage 40171 non-null object dateCreated 50000 non-null object nrOfPictures 50000 non-null int64 postalCode 50000 non-null int64 lastSeen 50000 non-null object dtypes: int64(5), object(15) memory usage: 7.6+ MB dateCrawled name \ 0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD 1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik 2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United 3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... 4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... seller offerType price abtest vehicleType yearOfRegistration \ 0 privat Angebot $5,000 control bus 2004 1 privat Angebot $8,500 control limousine 1997 2 privat Angebot $8,990 test limousine 2009 3 privat Angebot $4,350 control kleinwagen 2007 4 privat Angebot $1,350 test kombi 2003 gearbox powerPS model odometer monthOfRegistration fuelType \ 0 manuell 158 andere 150,000km 3 lpg 1 automatik 286 7er 150,000km 6 benzin 2 manuell 102 golf 70,000km 7 benzin 3 automatik 71 fortwo 70,000km 6 benzin 4 manuell 0 focus 150,000km 7 benzin brand notRepairedDamage dateCreated nrOfPictures \ 0 peugeot nein 2016-03-26 00:00:00 0 1 bmw nein 2016-04-04 00:00:00 0 2 volkswagen nein 2016-03-26 00:00:00 0 3 smart nein 2016-03-12 00:00:00 0 4 ford nein 2016-04-01 00:00:00 0 postalCode lastSeen 0 79588 2016-04-06 06:45:54 1 71034 2016-04-06 14:45:08 2 35394 2016-04-06 20:15:37 3 33729 2016-03-15 03:16:28 4 39218 2016-04-01 14:38:50
The file has twenty columns. The following columns have null values:
Also, The column names use camelCase instead of Python's preferred snake_case.
# checking the present column names
autos.columns
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model', 'odometer', 'monthOfRegistration', 'fuelType', 'brand', 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object')
# changing the column names
autos.rename(columns = {'dateCrawled':'date_crawled',
'offerType':'offer_type',
'vehicleType':'vehicle_type',
'yearOfRegistration':'registration_year',
'powerPS':'power_ps',
'monthOfRegistration':'registration_month',
'fuelType':'fuel_type',
'notRepairedDamage':'unrepaired_damage',
'dateCreated':'ad_created',
'nrOfPictures':'num_pictures',
'postalCode':'postal_code',
'lastSeen':'last_seen'}, inplace=True)
# checking the modified dataset
autos.head()
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | num_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
We converted the column names from camelcase to snakecase and reworded some of the column names based on the data dictionary to be more descriptive. Then we printed the first five rows.
In this step, we are going to make a quick view on descriptive statistics of each column to identify any odd data.
autos.describe(include = 'all')
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | num_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-03-23 18:39:34 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
The following observations can be made:
seller and _offertype have nearly all values same; except one entry
num_pictures are 0 for all the entries
seller , offer_type and num_pictures are candidates to be dropped
prices and odometer are stored as strings they must be changed to numerical data type
registration_year has min value 1000 which was long before cars were invented and max value 9000 which is many years in the future
registration_month has min value 0 which is invalid as months range in 1 to 12
autos = autos.drop(['seller','offer_type','num_pictures'], axis = 1)
# checking the modified dataset
autos.shape
autos.head
<bound method DataFrame.head of date_crawled name \ 0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD 1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik 2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United 3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... 4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... 5 2016-03-21 13:47:45 Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto... 6 2016-03-20 17:55:21 VW_Golf_III_GT_Special_Electronic_Green_Metall... 7 2016-03-16 18:55:19 Golf_IV_1.9_TDI_90PS 8 2016-03-22 16:51:34 Seat_Arosa 9 2016-03-16 13:47:02 Renault_Megane_Scenic_1.6e_RT_Klimaanlage 10 2016-03-15 01:41:36 VW_Golf_Tuning_in_siber/grau 11 2016-03-16 18:45:34 Mercedes_A140_Motorschaden 12 2016-03-31 19:48:22 Smart_smart_fortwo_coupe_softouch_pure_MHD_Pan... 13 2016-03-23 10:48:32 Audi_A3_1.6_tuning 14 2016-03-23 11:50:46 Renault_Clio_3__Dynamique_1.2__16_V;_viele_Ver... 15 2016-04-01 12:06:20 Corvette_C3_Coupe_T_Top_Crossfire_Injection 16 2016-03-16 14:59:02 Opel_Vectra_B_Kombi 17 2016-03-29 11:46:22 Volkswagen_Scirocco_2_G60 18 2016-03-26 19:57:44 Verkaufen_mein_bmw_e36_320_i_touring 19 2016-03-17 13:36:21 mazda_tribute_2.0_mit_gas_und_tuev_neu_2018 20 2016-03-05 19:57:31 Audi_A4_Avant_1.9_TDI_*6_Gang*AHK*Klimatronik*... 21 2016-03-06 19:07:10 Porsche_911_Carrera_4S_Cabrio 22 2016-03-28 20:50:54 MINI_Cooper_S_Cabrio 23 2016-03-10 19:55:34 Peugeot_Boxer_2_2_HDi_120_Ps_9_Sitzer_inkl_Klima 24 2016-04-03 11:57:02 BMW_535i_xDrive_Sport_Aut. 25 2016-03-21 21:56:18 Ford_escort_kombi_an_bastler_mit_ghia_ausstattung 26 2016-04-03 22:46:28 Volkswagen_Polo_Fox 27 2016-03-27 18:45:01 Hat_einer_Ahnung_mit_Ford_Galaxy_HILFE 28 2016-03-19 21:56:19 MINI_Cooper_D 29 2016-04-02 12:45:44 Mercedes_Benz_E_320_T_CDI_Avantgarde_DPF7_Sitz... ... ... ... 49970 2016-03-21 22:47:37 c4_Grand_Picasso_mit_Automatik_Leder_Navi_Temp... 49971 2016-03-29 14:54:12 W.Lupo_1.0 49972 2016-03-26 22:25:23 Mercedes_Benz_Vito_115_CDI_Extralang_Aut. 49973 2016-03-27 05:32:39 Mercedes_Benz_SLK_200_Kompressor 49974 2016-03-20 10:52:31 Golf_1_Cabrio_Tuev_Neu_viele_Extras_alles_eing... 49975 2016-03-27 20:51:39 Honda_Jazz_1.3_DSi_i_VTEC_IMA_CVT_Comfort 49976 2016-03-19 18:56:05 Audi_80_Avant_2.6_E__Vollausstattung!!_Einziga... 49977 2016-03-31 18:37:18 Mercedes_Benz_C200_Cdi_W203 49978 2016-04-04 10:37:14 Mercedes_Benz_E_200_Classic 49979 2016-03-20 18:38:40 Volkswagen_Polo_1.6_TDI_Style 49980 2016-03-12 10:55:54 Ford_Escort_Turnier_16V 49981 2016-03-15 09:38:21 Opel_Astra_Kombi_mit_Anhaengerkupplung 49982 2016-03-29 18:51:08 Skoda_Fabia_4_Tuerer_Bj:2004__85.000Tkm 49983 2016-03-06 12:43:04 Ford_focus_99 49984 2016-03-31 22:48:48 Student_sucht_ein__Anfaengerauto___ab_2000_BJ_... 49985 2016-04-02 16:38:23 Verkaufe_meinen_vw_vento! 49986 2016-04-04 20:46:02 Chrysler_300C_3.0_CRD_DPF_Automatik_Voll_Ausst... 49987 2016-03-22 20:47:27 Audi_A3_Limousine_2.0_TDI_DPF_Ambition__NAVI__... 49988 2016-03-28 19:49:51 BMW_330_Ci 49989 2016-03-11 19:50:37 VW_Polo_zum_Ausschlachten_oder_Wiederaufbau 49990 2016-03-21 19:54:19 Mercedes_Benz_A_200__BlueEFFICIENCY__Urban 49991 2016-03-06 15:25:19 Kleinwagen 49992 2016-03-10 19:37:38 Fiat_Grande_Punto_1.4_T_Jet_16V_Sport 49993 2016-03-15 18:47:35 Audi_A3__1_8l__Silber;_schoenes_Fahrzeug 49994 2016-03-22 17:36:42 Audi_A6__S6__Avant_4.2_quattro_eventuell_Tausc... 49995 2016-03-27 14:38:19 Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon 49996 2016-03-28 10:50:25 Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... 49997 2016-04-02 14:44:48 Fiat_500_C_1.2_Dualogic_Lounge 49998 2016-03-08 19:25:42 Audi_A3_2.0_TDI_Sportback_Ambition 49999 2016-03-14 00:42:12 Opel_Vectra_1.6_16V price abtest vehicle_type registration_year gearbox power_ps \ 0 $5,000 control bus 2004 manuell 158 1 $8,500 control limousine 1997 automatik 286 2 $8,990 test limousine 2009 manuell 102 3 $4,350 control kleinwagen 2007 automatik 71 4 $1,350 test kombi 2003 manuell 0 5 $7,900 test bus 2006 automatik 150 6 $300 test limousine 1995 manuell 90 7 $1,990 control limousine 1998 manuell 90 8 $250 test NaN 2000 manuell 0 9 $590 control bus 1997 manuell 90 10 $999 test NaN 2017 manuell 90 11 $350 control NaN 2000 NaN 0 12 $5,299 control kleinwagen 2010 automatik 71 13 $1,350 control limousine 1999 manuell 101 14 $3,999 test kleinwagen 2007 manuell 75 15 $18,900 test coupe 1982 automatik 203 16 $350 test kombi 1999 manuell 101 17 $5,500 test coupe 1990 manuell 205 18 $300 control bus 1995 manuell 150 19 $4,150 control suv 2004 manuell 124 20 $3,500 test kombi 2003 manuell 131 21 $41,500 test cabrio 2004 manuell 320 22 $25,450 control cabrio 2015 manuell 184 23 $7,999 control bus 2010 manuell 120 24 $48,500 control limousine 2014 automatik 306 25 $90 control kombi 1996 manuell 116 26 $777 control kleinwagen 1992 manuell 54 27 $0 control NaN 2005 NaN 0 28 $5,250 control kleinwagen 2007 manuell 110 29 $4,999 test kombi 2004 automatik 204 ... ... ... ... ... ... ... 49970 $15,800 control bus 2010 automatik 136 49971 $950 test kleinwagen 2001 manuell 50 49972 $3,300 control bus 2004 automatik 150 49973 $6,000 control cabrio 2004 manuell 163 49974 $0 control cabrio 1983 manuell 70 49975 $9,700 control kleinwagen 2012 automatik 88 49976 $5,900 test kombi 1992 automatik 150 49977 $5,500 control limousine 2003 manuell 116 49978 $900 control limousine 1996 automatik 136 49979 $11,000 test kleinwagen 2011 manuell 90 49980 $400 control kombi 1995 manuell 105 49981 $2,000 control kombi 1998 manuell 115 49982 $1,950 control kleinwagen 2004 manuell 0 49983 $600 test kleinwagen 1999 manuell 101 49984 $0 test NaN 2000 NaN 0 49985 $1,000 control NaN 1995 automatik 0 49986 $15,900 control limousine 2010 automatik 218 49987 $21,990 control limousine 2013 manuell 150 49988 $9,550 control coupe 2001 manuell 231 49989 $150 test kleinwagen 1997 manuell 0 49990 $17,500 test limousine 2012 manuell 156 49991 $500 control NaN 2016 manuell 0 49992 $4,800 control kleinwagen 2009 manuell 120 49993 $1,650 control kleinwagen 1997 manuell 0 49994 $5,000 control kombi 2001 automatik 299 49995 $24,900 control limousine 2011 automatik 239 49996 $1,980 control cabrio 1996 manuell 75 49997 $13,200 test cabrio 2014 automatik 69 49998 $22,900 control kombi 2013 manuell 150 49999 $1,250 control limousine 1996 manuell 101 model odometer registration_month fuel_type brand \ 0 andere 150,000km 3 lpg peugeot 1 7er 150,000km 6 benzin bmw 2 golf 70,000km 7 benzin volkswagen 3 fortwo 70,000km 6 benzin smart 4 focus 150,000km 7 benzin ford 5 voyager 150,000km 4 diesel chrysler 6 golf 150,000km 8 benzin volkswagen 7 golf 150,000km 12 diesel volkswagen 8 arosa 150,000km 10 NaN seat 9 megane 150,000km 7 benzin renault 10 NaN 150,000km 4 benzin volkswagen 11 NaN 150,000km 0 benzin mercedes_benz 12 fortwo 50,000km 9 benzin smart 13 a3 150,000km 11 benzin audi 14 clio 150,000km 9 benzin renault 15 NaN 80,000km 6 benzin sonstige_autos 16 vectra 150,000km 5 benzin opel 17 scirocco 150,000km 6 benzin volkswagen 18 3er 150,000km 0 benzin bmw 19 andere 150,000km 2 lpg mazda 20 a4 150,000km 5 diesel audi 21 911 150,000km 4 benzin porsche 22 cooper 10,000km 1 benzin mini 23 NaN 150,000km 2 diesel peugeot 24 5er 30,000km 12 benzin bmw 25 NaN 150,000km 4 benzin ford 26 polo 125,000km 2 benzin volkswagen 27 NaN 150,000km 0 NaN ford 28 cooper 150,000km 7 diesel mini 29 e_klasse 150,000km 10 diesel mercedes_benz ... ... ... ... ... ... 49970 c4 60,000km 4 diesel citroen 49971 lupo 150,000km 4 benzin volkswagen 49972 vito 150,000km 10 diesel mercedes_benz 49973 slk 150,000km 11 benzin mercedes_benz 49974 golf 150,000km 2 benzin volkswagen 49975 jazz 100,000km 11 hybrid honda 49976 80 150,000km 12 benzin audi 49977 c_klasse 150,000km 2 diesel mercedes_benz 49978 e_klasse 150,000km 9 benzin mercedes_benz 49979 polo 70,000km 11 diesel volkswagen 49980 escort 125,000km 3 benzin ford 49981 astra 150,000km 12 benzin opel 49982 fabia 90,000km 7 benzin skoda 49983 focus 150,000km 4 benzin ford 49984 NaN 150,000km 0 NaN sonstige_autos 49985 NaN 150,000km 0 benzin volkswagen 49986 300c 125,000km 11 diesel chrysler 49987 a3 50,000km 11 diesel audi 49988 3er 150,000km 10 benzin bmw 49989 polo 150,000km 5 benzin volkswagen 49990 a_klasse 30,000km 12 benzin mercedes_benz 49991 twingo 150,000km 0 benzin renault 49992 andere 125,000km 9 lpg fiat 49993 NaN 150,000km 7 benzin audi 49994 a6 150,000km 1 benzin audi 49995 q5 100,000km 1 diesel audi 49996 astra 150,000km 5 benzin opel 49997 500 5,000km 11 benzin fiat 49998 a3 40,000km 11 diesel audi 49999 vectra 150,000km 1 benzin opel unrepaired_damage ad_created postal_code last_seen 0 nein 2016-03-26 00:00:00 79588 2016-04-06 06:45:54 1 nein 2016-04-04 00:00:00 71034 2016-04-06 14:45:08 2 nein 2016-03-26 00:00:00 35394 2016-04-06 20:15:37 3 nein 2016-03-12 00:00:00 33729 2016-03-15 03:16:28 4 nein 2016-04-01 00:00:00 39218 2016-04-01 14:38:50 5 NaN 2016-03-21 00:00:00 22962 2016-04-06 09:45:21 6 NaN 2016-03-20 00:00:00 31535 2016-03-23 02:48:59 7 nein 2016-03-16 00:00:00 53474 2016-04-07 03:17:32 8 nein 2016-03-22 00:00:00 7426 2016-03-26 18:18:10 9 nein 2016-03-16 00:00:00 15749 2016-04-06 10:46:35 10 nein 2016-03-14 00:00:00 86157 2016-04-07 03:16:21 11 NaN 2016-03-16 00:00:00 17498 2016-03-16 18:45:34 12 nein 2016-03-31 00:00:00 34590 2016-04-06 14:17:52 13 nein 2016-03-23 00:00:00 12043 2016-04-01 14:17:13 14 NaN 2016-03-23 00:00:00 81737 2016-04-01 15:46:47 15 nein 2016-04-01 00:00:00 61276 2016-04-02 21:10:48 16 nein 2016-03-16 00:00:00 57299 2016-03-18 05:29:37 17 nein 2016-03-29 00:00:00 74821 2016-04-05 20:46:26 18 NaN 2016-03-26 00:00:00 54329 2016-04-02 12:16:41 19 nein 2016-03-17 00:00:00 40878 2016-03-17 14:45:58 20 NaN 2016-03-05 00:00:00 53913 2016-03-07 05:46:46 21 nein 2016-03-06 00:00:00 65428 2016-04-05 23:46:19 22 nein 2016-03-28 00:00:00 44789 2016-04-01 06:45:30 23 nein 2016-03-10 00:00:00 30900 2016-03-17 08:45:17 24 nein 2016-04-03 00:00:00 22547 2016-04-07 13:16:50 25 ja 2016-03-21 00:00:00 27574 2016-04-01 05:16:49 26 nein 2016-04-03 00:00:00 38110 2016-04-05 23:46:48 27 NaN 2016-03-27 00:00:00 66701 2016-03-27 18:45:01 28 ja 2016-03-19 00:00:00 15745 2016-04-07 14:58:48 29 nein 2016-04-02 00:00:00 47638 2016-04-02 12:45:44 ... ... ... ... ... 49970 nein 2016-03-21 00:00:00 14947 2016-04-07 04:17:34 49971 nein 2016-03-29 00:00:00 65197 2016-03-29 20:41:51 49972 ja 2016-03-26 00:00:00 65326 2016-03-28 11:28:18 49973 nein 2016-03-27 00:00:00 53567 2016-03-27 08:25:24 49974 nein 2016-03-20 00:00:00 8209 2016-03-27 19:48:16 49975 nein 2016-03-27 00:00:00 84385 2016-04-05 19:45:34 49976 nein 2016-03-19 00:00:00 36100 2016-04-07 06:16:44 49977 nein 2016-03-31 00:00:00 33739 2016-04-06 12:16:11 49978 ja 2016-04-04 00:00:00 24405 2016-04-06 12:44:20 49979 nein 2016-03-20 00:00:00 48455 2016-04-07 01:45:12 49980 NaN 2016-03-12 00:00:00 56218 2016-04-06 17:16:49 49981 nein 2016-03-15 00:00:00 86859 2016-04-05 17:21:46 49982 NaN 2016-03-29 00:00:00 45884 2016-03-29 18:51:08 49983 NaN 2016-03-06 00:00:00 52477 2016-03-09 06:16:08 49984 NaN 2016-03-31 00:00:00 12103 2016-04-02 19:44:53 49985 NaN 2016-04-02 00:00:00 30900 2016-04-06 15:17:52 49986 nein 2016-04-04 00:00:00 73527 2016-04-06 23:16:00 49987 nein 2016-03-22 00:00:00 94362 2016-03-26 22:46:06 49988 nein 2016-03-28 00:00:00 83646 2016-04-07 02:17:40 49989 ja 2016-03-11 00:00:00 21244 2016-03-12 10:17:55 49990 nein 2016-03-21 00:00:00 58239 2016-04-06 22:46:57 49991 NaN 2016-03-06 00:00:00 61350 2016-03-06 18:24:19 49992 nein 2016-03-10 00:00:00 68642 2016-03-13 01:44:51 49993 NaN 2016-03-15 00:00:00 65203 2016-04-06 19:46:53 49994 nein 2016-03-22 00:00:00 46537 2016-04-06 08:16:39 49995 nein 2016-03-27 00:00:00 82131 2016-04-01 13:47:40 49996 nein 2016-03-28 00:00:00 44807 2016-04-02 14:18:02 49997 nein 2016-04-02 00:00:00 73430 2016-04-04 11:47:27 49998 nein 2016-03-08 00:00:00 35683 2016-04-05 16:45:07 49999 nein 2016-03-13 00:00:00 45897 2016-04-06 21:18:48 [50000 rows x 17 columns]>
As it was said before, we have an issue with the price and odometer columns which are classified as object but should contain numeric values instead. Therefore, we are now going to remove any non-numeric characters and convert the strings in these two columns to integers.
autos['price'].unique()
array(['$5,000', '$8,500', '$8,990', ..., '$385', '$22,200', '$16,995'], dtype=object)
# For price column- Removing non-numeric character ("$" and ",")
autos['price'] = (autos['price']
.str.replace('$','')
.str.replace(',','')
.astype(int))
# checking the cleaned data
autos['price'].dtype
dtype('int64')
autos['odometer'].unique()
array(['150,000km', '70,000km', '50,000km', '80,000km', '10,000km', '30,000km', '125,000km', '90,000km', '20,000km', '60,000km', '5,000km', '100,000km', '40,000km'], dtype=object)
# For odometer column- Removing non-numeric character ("," and "km")
autos['odometer'] = (autos['odometer']
.str.replace('km','')
.str.replace(',','')
.astype(int))
# checking the cleaned data
autos['odometer'].dtype
dtype('int64')
autos.rename({'price': 'price_dollar', 'odometer': 'odometer_km'}, axis = 1, inplace = True)
autos.head()
date_crawled | name | price_dollar | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | 5000 | control | bus | 2004 | manuell | 158 | andere | 150000 | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | 8500 | control | limousine | 1997 | automatik | 286 | 7er | 150000 | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | 8990 | test | limousine | 2009 | manuell | 102 | golf | 70000 | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | 4350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70000 | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | 1350 | test | kombi | 2003 | manuell | 0 | focus | 150000 | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 39218 | 2016-04-01 14:38:50 |
In this section, we are going to look for outliers in price_dollar and odometer_km columns aiming to drop them to make our analysis easier. To do so, we are going to use to use Series.unique(), Series.describe() and Series.value_counts() methods, for each column.
autos['price_dollar'].unique().shape
(2357,)
According to the code line above, our dataset has 2357 unique values.
autos['price_dollar'].describe()
count 5.000000e+04 mean 9.840044e+03 std 4.811044e+05 min 0.000000e+00 25% 1.100000e+03 50% 2.950000e+03 75% 7.200000e+03 max 1.000000e+08 Name: price_dollar, dtype: float64
We can see in the code line above that the minimum value of the cars in our dataset is 0 and the maximum is 100,000,000 dollars which is technically impossible. After a quick research on eBay website, we found out that there are no significative amount of cars that costs more than 1 million dollars.
autos['price_dollar'].value_counts().sort_index(ascending=False).head(50)
99999999 1 27322222 1 12345678 3 11111111 2 10000000 1 3890000 1 1300000 1 1234566 1 999999 2 999990 1 350000 1 345000 1 299000 1 295000 1 265000 1 259000 1 250000 1 220000 1 198000 1 197000 1 194000 1 190000 1 180000 1 175000 1 169999 1 169000 1 163991 1 163500 1 155000 1 151990 1 145000 1 139997 1 137999 1 135000 1 130000 1 129000 1 128000 1 120000 2 119900 1 119500 1 116000 1 115991 1 115000 1 114400 1 109999 1 105000 2 104900 1 99900 2 99000 2 98500 1 Name: price_dollar, dtype: int64
From the result above, 99999999 is the highest price followed by 27322222
autos['price_dollar'].value_counts().sort_index().head(50)
0 1421 1 156 2 3 3 1 5 2 8 1 9 1 10 7 11 2 12 3 13 2 14 1 15 2 17 3 18 1 20 4 25 5 29 1 30 7 35 1 40 6 45 4 47 1 49 4 50 49 55 2 59 1 60 9 65 5 66 1 70 10 75 5 79 1 80 15 89 1 90 5 99 19 100 134 110 3 111 2 115 2 117 1 120 39 122 1 125 8 129 1 130 15 135 1 139 1 140 9 Name: price_dollar, dtype: int64
1421 entries are 0 - given that this is only 2% of the total cars, we might consider removing these. There are a number of listings with prices less than 50, and a few listings with prices more than 1 million. Given that eBay is an auction site, there could be items where the bid opens at 1. We will however remove items listed above 27322222 since the next price is way higher than that.
autos = autos[autos['price_dollar'].between(1,27322222)] # code to remove the outliers
autos['price_dollar'].describe()
count 4.857800e+04 mean 8.069542e+03 std 1.800356e+05 min 1.000000e+00 25% 1.200000e+03 50% 3.000000e+03 75% 7.490000e+03 max 2.732222e+07 Name: price_dollar, dtype: float64
autos['odometer_km'].unique().shape
(13,)
autos['odometer_km'].describe()
count 48578.000000 mean 125765.675820 std 39793.193823 min 5000.000000 25% 125000.000000 50% 150000.000000 75% 150000.000000 max 150000.000000 Name: odometer_km, dtype: float64
autos['odometer_km'].value_counts().sort_index()
5000 837 10000 253 20000 762 30000 780 40000 816 50000 1014 60000 1155 70000 1217 80000 1415 90000 1734 100000 2116 125000 5058 150000 31421 Name: odometer_km, dtype: int64
There are more high mileage cars for sale.
Step 8: Exploring the date columns (date_crawled, last_seen, ad_created, registration_month and registration_year)
autos[['date_crawled', 'last_seen', 'ad_created', 'registration_month', 'registration_year']].head()
date_crawled | last_seen | ad_created | registration_month | registration_year | |
---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | 2016-04-06 06:45:54 | 2016-03-26 00:00:00 | 3 | 2004 |
1 | 2016-04-04 13:38:56 | 2016-04-06 14:45:08 | 2016-04-04 00:00:00 | 6 | 1997 |
2 | 2016-03-26 18:57:24 | 2016-04-06 20:15:37 | 2016-03-26 00:00:00 | 7 | 2009 |
3 | 2016-03-12 16:58:10 | 2016-03-15 03:16:28 | 2016-03-12 00:00:00 | 6 | 2007 |
4 | 2016-04-01 14:38:50 | 2016-04-01 14:38:50 | 2016-04-01 00:00:00 | 7 | 2003 |
It is possible to see that registration_month and registration_year columns are represented as numerical data, whereas data_crawled, last_seen and ad_created are represented as strings. Because of that, we need to convert these three columns into numerical data. The dates are stored as YYYY-MM-DD, the first 10 characters represent the date. so we select the first 10 characters in each column.
autos['date_crawled'].str[:10].value_counts(normalize = True, dropna = False).sort_index()
2016-03-05 0.025320 2016-03-06 0.014039 2016-03-07 0.036004 2016-03-08 0.033328 2016-03-09 0.033101 2016-03-10 0.032175 2016-03-11 0.032566 2016-03-12 0.036930 2016-03-13 0.015666 2016-03-14 0.036539 2016-03-15 0.034275 2016-03-16 0.029602 2016-03-17 0.031640 2016-03-18 0.012907 2016-03-19 0.034769 2016-03-20 0.037877 2016-03-21 0.037404 2016-03-22 0.032998 2016-03-23 0.032216 2016-03-24 0.029334 2016-03-25 0.031599 2016-03-26 0.032196 2016-03-27 0.031084 2016-03-28 0.034851 2016-03-29 0.034131 2016-03-30 0.033678 2016-03-31 0.031846 2016-04-01 0.033678 2016-04-02 0.035469 2016-04-03 0.038598 2016-04-04 0.036519 2016-04-05 0.013092 2016-04-06 0.003170 2016-04-07 0.001400 Name: date_crawled, dtype: float64
All the entries are from March 5 to April 7, 2016. The site was probably crawled around that time. The percentage number of listings on each day is roughly uniform.
autos['ad_created'].str[:10].value_counts(normalize = True, dropna = False).sort_index()
2015-06-11 0.000021 2015-08-10 0.000021 2015-09-09 0.000021 2015-11-10 0.000021 2015-12-05 0.000021 2015-12-30 0.000021 2016-01-03 0.000021 2016-01-07 0.000021 2016-01-10 0.000041 2016-01-13 0.000021 2016-01-14 0.000021 2016-01-16 0.000021 2016-01-22 0.000021 2016-01-27 0.000062 2016-01-29 0.000021 2016-02-01 0.000021 2016-02-02 0.000041 2016-02-05 0.000041 2016-02-07 0.000021 2016-02-08 0.000021 2016-02-09 0.000021 2016-02-11 0.000021 2016-02-12 0.000041 2016-02-14 0.000041 2016-02-16 0.000021 2016-02-17 0.000021 2016-02-18 0.000041 2016-02-19 0.000062 2016-02-20 0.000041 2016-02-21 0.000062 ... 2016-03-09 0.033163 2016-03-10 0.031887 2016-03-11 0.032896 2016-03-12 0.036766 2016-03-13 0.017004 2016-03-14 0.035181 2016-03-15 0.034007 2016-03-16 0.030117 2016-03-17 0.031290 2016-03-18 0.013586 2016-03-19 0.033678 2016-03-20 0.037939 2016-03-21 0.037610 2016-03-22 0.032813 2016-03-23 0.032052 2016-03-24 0.029273 2016-03-25 0.031743 2016-03-26 0.032257 2016-03-27 0.030981 2016-03-28 0.034975 2016-03-29 0.034069 2016-03-30 0.033493 2016-03-31 0.031887 2016-04-01 0.033678 2016-04-02 0.035139 2016-04-03 0.038845 2016-04-04 0.036889 2016-04-05 0.011816 2016-04-06 0.003253 2016-04-07 0.001256 Name: ad_created, Length: 76, dtype: float64
The range of dates spread over 10 months. However, most fall within March and April 2016.
autos['last_seen'].str[:10].value_counts(normalize = True, dropna = False).sort_index()
2016-03-05 0.001070 2016-03-06 0.004323 2016-03-07 0.005393 2016-03-08 0.007431 2016-03-09 0.009613 2016-03-10 0.010663 2016-03-11 0.012372 2016-03-12 0.023797 2016-03-13 0.008893 2016-03-14 0.012598 2016-03-15 0.015871 2016-03-16 0.016448 2016-03-17 0.028079 2016-03-18 0.007349 2016-03-19 0.015830 2016-03-20 0.020647 2016-03-21 0.020647 2016-03-22 0.021368 2016-03-23 0.018527 2016-03-24 0.019762 2016-03-25 0.019206 2016-03-26 0.016798 2016-03-27 0.015645 2016-03-28 0.020874 2016-03-29 0.022356 2016-03-30 0.024764 2016-03-31 0.023797 2016-04-01 0.022788 2016-04-02 0.024929 2016-04-03 0.025197 2016-04-04 0.024476 2016-04-05 0.124768 2016-04-06 0.221808 2016-04-07 0.131912 Name: last_seen, dtype: float64
The last seen values records the date the last time that listing was seen and probably the day a car was sold, assuming that is why the listing was removed.
Note that the last three days show a disproportionate amount of spike, as much as ten times the other days.
autos['registration_year'].describe()
count 48578.000000 mean 2004.753119 std 88.632571 min 1000.000000 25% 1999.000000 50% 2004.000000 75% 2008.000000 max 9999.000000 Name: registration_year, dtype: float64
The minimum registration year is 1000, before cars were invented and the maximum is 9999, many years into the future.
Let's count the number of listings with cars that fall outside the 1900 - 2016 interval and see if it's safe to remove those rows entirely, or if we need more custom logic.
(~autos['registration_year'].between(1900,2016)).sum()
1886
After seeing the data, it is safe to assume that valid listings have registration years between 1910 and 2016. We remove all other entries.
autos = autos[autos['registration_year'].between(1910,2016)]
autos['registration_year'].value_counts(normalize=True)
2000 0.067592 2005 0.062880 1999 0.062066 2004 0.057890 2003 0.057804 2006 0.057205 2001 0.056476 2002 0.053243 1998 0.050608 2007 0.048766 2008 0.047439 2009 0.044676 1997 0.041784 2011 0.034760 2010 0.034032 1996 0.029405 2012 0.028056 1995 0.026300 2016 0.026129 2013 0.017198 2014 0.014221 1994 0.013471 1993 0.009102 2015 0.008395 1992 0.007946 1990 0.007432 1991 0.007260 1989 0.003727 1988 0.002891 1985 0.002056 ... 1966 0.000471 1976 0.000450 1969 0.000407 1975 0.000386 1965 0.000364 1964 0.000257 1963 0.000171 1959 0.000129 1961 0.000129 1910 0.000107 1956 0.000086 1958 0.000086 1937 0.000086 1962 0.000086 1950 0.000064 1954 0.000043 1941 0.000043 1951 0.000043 1934 0.000043 1957 0.000043 1955 0.000043 1953 0.000021 1943 0.000021 1929 0.000021 1939 0.000021 1938 0.000021 1948 0.000021 1927 0.000021 1931 0.000021 1952 0.000021 Name: registration_year, Length: 78, dtype: float64
From the result above, most of the cars were registered between 1998 and 2005.
brand_percentages =autos['brand'].value_counts(normalize = True)
print (brand_percentages)
volkswagen 0.211278 bmw 0.110040 opel 0.107577 mercedes_benz 0.096441 audi 0.086546 ford 0.069926 renault 0.047139 peugeot 0.029834 fiat 0.025636 seat 0.018269 skoda 0.016405 nissan 0.015270 mazda 0.015185 smart 0.014157 citroen 0.014028 toyota 0.012700 hyundai 0.010023 sonstige_autos 0.009873 volvo 0.009145 mini 0.008760 mitsubishi 0.008224 honda 0.007839 kia 0.007068 alfa_romeo 0.006639 porsche 0.006125 suzuki 0.005932 chevrolet 0.005697 chrysler 0.003512 dacia 0.002634 daihatsu 0.002506 jeep 0.002270 subaru 0.002142 land_rover 0.002099 saab 0.001649 jaguar 0.001563 daewoo 0.001499 trabant 0.001392 rover 0.001328 lancia 0.001071 lada 0.000578 Name: brand, dtype: float64
Volkswagen is the most popular brand. Now we are going to loop over the brand column to add the brands that respond for more than 5% of the total of cars in the dataset into a dictionary. The main purpose of this method will be to calculate the mean price of each of the top brands.
popular = brand_percentages[brand_percentages > 0.05].index
brand_mean_prices = {}
for brand in popular:
rows = autos[autos['brand']==brand]
mean_price = rows['price_dollar'].mean()
brand_mean_prices[brand] = int(mean_price)
brand_mean_prices
{'audi': 9336, 'bmw': 8571, 'ford': 7456, 'mercedes_benz': 8628, 'opel': 5432, 'volkswagen': 6729}
From the brands that have over 5% of the toal values, the three most expensive cars are Audis, Mercedez Benz and BMW respectively. Ford and Volkswagen are in between and Opel is the least expensive.
In order to compare the mean price to the average milage of each brand, we are going to create another dictionary for Mean Milage just like we did for the mean price. Next, it will be necessary to transform both dictonaries into data series and then into a dataframe, using the series and dataframe constructors.
First, we are going to transform our previous dictionary(brand_mean_price) into a series.
brand_mean_mileage = {}
for brand in popular:
rows = autos[autos['brand']==brand]
mean_mileage = rows['odometer_km'].mean()
brand_mean_mileage[brand] = int(mean_mileage)
brand_mean_mileage
{'audi': 129157, 'bmw': 132575, 'ford': 124243, 'mercedes_benz': 130788, 'opel': 129314, 'volkswagen': 128713}
mean_mileage = pd.Series(brand_mean_mileage).sort_values(ascending=False)
mean_prices = pd.Series(brand_mean_prices).sort_values(ascending=False)
print(mean_mileage)
print(mean_prices)
bmw 132575 mercedes_benz 130788 opel 129314 audi 129157 volkswagen 128713 ford 124243 dtype: int64 audi 9336 mercedes_benz 8628 bmw 8571 ford 7456 volkswagen 6729 opel 5432 dtype: int64
In this final step, we are going to create a DataFrame containing both the brand mean price and the brand mean milage series. The main goal of this step is to make the comparison between both series easier and to identify if there is any link with each other.
brand_info = pd.DataFrame(mean_mileage,columns=["mean_mileage"])
brand_info["mean_price"] = mean_prices
brand_info
mean_mileage | mean_price | |
---|---|---|
bmw | 132575 | 8571 |
mercedes_benz | 130788 | 8628 |
opel | 129314 | 5432 |
audi | 129157 | 9336 |
volkswagen | 128713 | 6729 |
ford | 124243 | 7456 |
Apparently, there is not a direct relation between the price and the milage of the cars when we consider just the top 6 brands, since the mean milage doesn't varies significatively with the price.