Exploring Ebay Car Sales Data
In this project we'll work with a dataset of used cars from ebay Kleinanzeigen, a classifieds section of the German eBay website.
The aim of this project is to clean the data and analyze the included used car listings.
# Importing the pandas and numpy libraries
import pandas as pd
import numpy as np
# Reading the csv files into pandas
autos = pd.read_csv("autos.csv", encoding = "Latin-1")
print(autos)
dateCrawled name \ 0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD 1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik 2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United 3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... 4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... 5 2016-03-21 13:47:45 Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto... 6 2016-03-20 17:55:21 VW_Golf_III_GT_Special_Electronic_Green_Metall... 7 2016-03-16 18:55:19 Golf_IV_1.9_TDI_90PS 8 2016-03-22 16:51:34 Seat_Arosa 9 2016-03-16 13:47:02 Renault_Megane_Scenic_1.6e_RT_Klimaanlage 10 2016-03-15 01:41:36 VW_Golf_Tuning_in_siber/grau 11 2016-03-16 18:45:34 Mercedes_A140_Motorschaden 12 2016-03-31 19:48:22 Smart_smart_fortwo_coupe_softouch_pure_MHD_Pan... 13 2016-03-23 10:48:32 Audi_A3_1.6_tuning 14 2016-03-23 11:50:46 Renault_Clio_3__Dynamique_1.2__16_V;_viele_Ver... 15 2016-04-01 12:06:20 Corvette_C3_Coupe_T_Top_Crossfire_Injection 16 2016-03-16 14:59:02 Opel_Vectra_B_Kombi 17 2016-03-29 11:46:22 Volkswagen_Scirocco_2_G60 18 2016-03-26 19:57:44 Verkaufen_mein_bmw_e36_320_i_touring 19 2016-03-17 13:36:21 mazda_tribute_2.0_mit_gas_und_tuev_neu_2018 20 2016-03-05 19:57:31 Audi_A4_Avant_1.9_TDI_*6_Gang*AHK*Klimatronik*... 21 2016-03-06 19:07:10 Porsche_911_Carrera_4S_Cabrio 22 2016-03-28 20:50:54 MINI_Cooper_S_Cabrio 23 2016-03-10 19:55:34 Peugeot_Boxer_2_2_HDi_120_Ps_9_Sitzer_inkl_Klima 24 2016-04-03 11:57:02 BMW_535i_xDrive_Sport_Aut. 25 2016-03-21 21:56:18 Ford_escort_kombi_an_bastler_mit_ghia_ausstattung 26 2016-04-03 22:46:28 Volkswagen_Polo_Fox 27 2016-03-27 18:45:01 Hat_einer_Ahnung_mit_Ford_Galaxy_HILFE 28 2016-03-19 21:56:19 MINI_Cooper_D 29 2016-04-02 12:45:44 Mercedes_Benz_E_320_T_CDI_Avantgarde_DPF7_Sitz... ... ... ... 49970 2016-03-21 22:47:37 c4_Grand_Picasso_mit_Automatik_Leder_Navi_Temp... 49971 2016-03-29 14:54:12 W.Lupo_1.0 49972 2016-03-26 22:25:23 Mercedes_Benz_Vito_115_CDI_Extralang_Aut. 49973 2016-03-27 05:32:39 Mercedes_Benz_SLK_200_Kompressor 49974 2016-03-20 10:52:31 Golf_1_Cabrio_Tuev_Neu_viele_Extras_alles_eing... 49975 2016-03-27 20:51:39 Honda_Jazz_1.3_DSi_i_VTEC_IMA_CVT_Comfort 49976 2016-03-19 18:56:05 Audi_80_Avant_2.6_E__Vollausstattung!!_Einziga... 49977 2016-03-31 18:37:18 Mercedes_Benz_C200_Cdi_W203 49978 2016-04-04 10:37:14 Mercedes_Benz_E_200_Classic 49979 2016-03-20 18:38:40 Volkswagen_Polo_1.6_TDI_Style 49980 2016-03-12 10:55:54 Ford_Escort_Turnier_16V 49981 2016-03-15 09:38:21 Opel_Astra_Kombi_mit_Anhaengerkupplung 49982 2016-03-29 18:51:08 Skoda_Fabia_4_Tuerer_Bj:2004__85.000Tkm 49983 2016-03-06 12:43:04 Ford_focus_99 49984 2016-03-31 22:48:48 Student_sucht_ein__Anfaengerauto___ab_2000_BJ_... 49985 2016-04-02 16:38:23 Verkaufe_meinen_vw_vento! 49986 2016-04-04 20:46:02 Chrysler_300C_3.0_CRD_DPF_Automatik_Voll_Ausst... 49987 2016-03-22 20:47:27 Audi_A3_Limousine_2.0_TDI_DPF_Ambition__NAVI__... 49988 2016-03-28 19:49:51 BMW_330_Ci 49989 2016-03-11 19:50:37 VW_Polo_zum_Ausschlachten_oder_Wiederaufbau 49990 2016-03-21 19:54:19 Mercedes_Benz_A_200__BlueEFFICIENCY__Urban 49991 2016-03-06 15:25:19 Kleinwagen 49992 2016-03-10 19:37:38 Fiat_Grande_Punto_1.4_T_Jet_16V_Sport 49993 2016-03-15 18:47:35 Audi_A3__1_8l__Silber;_schoenes_Fahrzeug 49994 2016-03-22 17:36:42 Audi_A6__S6__Avant_4.2_quattro_eventuell_Tausc... 49995 2016-03-27 14:38:19 Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon 49996 2016-03-28 10:50:25 Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... 49997 2016-04-02 14:44:48 Fiat_500_C_1.2_Dualogic_Lounge 49998 2016-03-08 19:25:42 Audi_A3_2.0_TDI_Sportback_Ambition 49999 2016-03-14 00:42:12 Opel_Vectra_1.6_16V seller offerType price abtest vehicleType yearOfRegistration \ 0 privat Angebot $5,000 control bus 2004 1 privat Angebot $8,500 control limousine 1997 2 privat Angebot $8,990 test limousine 2009 3 privat Angebot $4,350 control kleinwagen 2007 4 privat Angebot $1,350 test kombi 2003 5 privat Angebot $7,900 test bus 2006 6 privat Angebot $300 test limousine 1995 7 privat Angebot $1,990 control limousine 1998 8 privat Angebot $250 test NaN 2000 9 privat Angebot $590 control bus 1997 10 privat Angebot $999 test NaN 2017 11 privat Angebot $350 control NaN 2000 12 privat Angebot $5,299 control kleinwagen 2010 13 privat Angebot $1,350 control limousine 1999 14 privat Angebot $3,999 test kleinwagen 2007 15 privat Angebot $18,900 test coupe 1982 16 privat Angebot $350 test kombi 1999 17 privat Angebot $5,500 test coupe 1990 18 privat Angebot $300 control bus 1995 19 privat Angebot $4,150 control suv 2004 20 privat Angebot $3,500 test kombi 2003 21 privat Angebot $41,500 test cabrio 2004 22 privat Angebot $25,450 control cabrio 2015 23 privat Angebot $7,999 control bus 2010 24 privat Angebot $48,500 control limousine 2014 25 privat Angebot $90 control kombi 1996 26 privat Angebot $777 control kleinwagen 1992 27 privat Angebot $0 control NaN 2005 28 privat Angebot $5,250 control kleinwagen 2007 29 privat Angebot $4,999 test kombi 2004 ... ... ... ... ... ... ... 49970 privat Angebot $15,800 control bus 2010 49971 privat Angebot $950 test kleinwagen 2001 49972 privat Angebot $3,300 control bus 2004 49973 privat Angebot $6,000 control cabrio 2004 49974 privat Angebot $0 control cabrio 1983 49975 privat Angebot $9,700 control kleinwagen 2012 49976 privat Angebot $5,900 test kombi 1992 49977 privat Angebot $5,500 control limousine 2003 49978 privat Angebot $900 control limousine 1996 49979 privat Angebot $11,000 test kleinwagen 2011 49980 privat Angebot $400 control kombi 1995 49981 privat Angebot $2,000 control kombi 1998 49982 privat Angebot $1,950 control kleinwagen 2004 49983 privat Angebot $600 test kleinwagen 1999 49984 privat Angebot $0 test NaN 2000 49985 privat Angebot $1,000 control NaN 1995 49986 privat Angebot $15,900 control limousine 2010 49987 privat Angebot $21,990 control limousine 2013 49988 privat Angebot $9,550 control coupe 2001 49989 privat Angebot $150 test kleinwagen 1997 49990 privat Angebot $17,500 test limousine 2012 49991 privat Angebot $500 control NaN 2016 49992 privat Angebot $4,800 control kleinwagen 2009 49993 privat Angebot $1,650 control kleinwagen 1997 49994 privat Angebot $5,000 control kombi 2001 49995 privat Angebot $24,900 control limousine 2011 49996 privat Angebot $1,980 control cabrio 1996 49997 privat Angebot $13,200 test cabrio 2014 49998 privat Angebot $22,900 control kombi 2013 49999 privat Angebot $1,250 control limousine 1996 gearbox powerPS model odometer monthOfRegistration fuelType \ 0 manuell 158 andere 150,000km 3 lpg 1 automatik 286 7er 150,000km 6 benzin 2 manuell 102 golf 70,000km 7 benzin 3 automatik 71 fortwo 70,000km 6 benzin 4 manuell 0 focus 150,000km 7 benzin 5 automatik 150 voyager 150,000km 4 diesel 6 manuell 90 golf 150,000km 8 benzin 7 manuell 90 golf 150,000km 12 diesel 8 manuell 0 arosa 150,000km 10 NaN 9 manuell 90 megane 150,000km 7 benzin 10 manuell 90 NaN 150,000km 4 benzin 11 NaN 0 NaN 150,000km 0 benzin 12 automatik 71 fortwo 50,000km 9 benzin 13 manuell 101 a3 150,000km 11 benzin 14 manuell 75 clio 150,000km 9 benzin 15 automatik 203 NaN 80,000km 6 benzin 16 manuell 101 vectra 150,000km 5 benzin 17 manuell 205 scirocco 150,000km 6 benzin 18 manuell 150 3er 150,000km 0 benzin 19 manuell 124 andere 150,000km 2 lpg 20 manuell 131 a4 150,000km 5 diesel 21 manuell 320 911 150,000km 4 benzin 22 manuell 184 cooper 10,000km 1 benzin 23 manuell 120 NaN 150,000km 2 diesel 24 automatik 306 5er 30,000km 12 benzin 25 manuell 116 NaN 150,000km 4 benzin 26 manuell 54 polo 125,000km 2 benzin 27 NaN 0 NaN 150,000km 0 NaN 28 manuell 110 cooper 150,000km 7 diesel 29 automatik 204 e_klasse 150,000km 10 diesel ... ... ... ... ... ... ... 49970 automatik 136 c4 60,000km 4 diesel 49971 manuell 50 lupo 150,000km 4 benzin 49972 automatik 150 vito 150,000km 10 diesel 49973 manuell 163 slk 150,000km 11 benzin 49974 manuell 70 golf 150,000km 2 benzin 49975 automatik 88 jazz 100,000km 11 hybrid 49976 automatik 150 80 150,000km 12 benzin 49977 manuell 116 c_klasse 150,000km 2 diesel 49978 automatik 136 e_klasse 150,000km 9 benzin 49979 manuell 90 polo 70,000km 11 diesel 49980 manuell 105 escort 125,000km 3 benzin 49981 manuell 115 astra 150,000km 12 benzin 49982 manuell 0 fabia 90,000km 7 benzin 49983 manuell 101 focus 150,000km 4 benzin 49984 NaN 0 NaN 150,000km 0 NaN 49985 automatik 0 NaN 150,000km 0 benzin 49986 automatik 218 300c 125,000km 11 diesel 49987 manuell 150 a3 50,000km 11 diesel 49988 manuell 231 3er 150,000km 10 benzin 49989 manuell 0 polo 150,000km 5 benzin 49990 manuell 156 a_klasse 30,000km 12 benzin 49991 manuell 0 twingo 150,000km 0 benzin 49992 manuell 120 andere 125,000km 9 lpg 49993 manuell 0 NaN 150,000km 7 benzin 49994 automatik 299 a6 150,000km 1 benzin 49995 automatik 239 q5 100,000km 1 diesel 49996 manuell 75 astra 150,000km 5 benzin 49997 automatik 69 500 5,000km 11 benzin 49998 manuell 150 a3 40,000km 11 diesel 49999 manuell 101 vectra 150,000km 1 benzin brand notRepairedDamage dateCreated nrOfPictures \ 0 peugeot nein 2016-03-26 00:00:00 0 1 bmw nein 2016-04-04 00:00:00 0 2 volkswagen nein 2016-03-26 00:00:00 0 3 smart nein 2016-03-12 00:00:00 0 4 ford nein 2016-04-01 00:00:00 0 5 chrysler NaN 2016-03-21 00:00:00 0 6 volkswagen NaN 2016-03-20 00:00:00 0 7 volkswagen nein 2016-03-16 00:00:00 0 8 seat nein 2016-03-22 00:00:00 0 9 renault nein 2016-03-16 00:00:00 0 10 volkswagen nein 2016-03-14 00:00:00 0 11 mercedes_benz NaN 2016-03-16 00:00:00 0 12 smart nein 2016-03-31 00:00:00 0 13 audi nein 2016-03-23 00:00:00 0 14 renault NaN 2016-03-23 00:00:00 0 15 sonstige_autos nein 2016-04-01 00:00:00 0 16 opel nein 2016-03-16 00:00:00 0 17 volkswagen nein 2016-03-29 00:00:00 0 18 bmw NaN 2016-03-26 00:00:00 0 19 mazda nein 2016-03-17 00:00:00 0 20 audi NaN 2016-03-05 00:00:00 0 21 porsche nein 2016-03-06 00:00:00 0 22 mini nein 2016-03-28 00:00:00 0 23 peugeot nein 2016-03-10 00:00:00 0 24 bmw nein 2016-04-03 00:00:00 0 25 ford ja 2016-03-21 00:00:00 0 26 volkswagen nein 2016-04-03 00:00:00 0 27 ford NaN 2016-03-27 00:00:00 0 28 mini ja 2016-03-19 00:00:00 0 29 mercedes_benz nein 2016-04-02 00:00:00 0 ... ... ... ... ... 49970 citroen nein 2016-03-21 00:00:00 0 49971 volkswagen nein 2016-03-29 00:00:00 0 49972 mercedes_benz ja 2016-03-26 00:00:00 0 49973 mercedes_benz nein 2016-03-27 00:00:00 0 49974 volkswagen nein 2016-03-20 00:00:00 0 49975 honda nein 2016-03-27 00:00:00 0 49976 audi nein 2016-03-19 00:00:00 0 49977 mercedes_benz nein 2016-03-31 00:00:00 0 49978 mercedes_benz ja 2016-04-04 00:00:00 0 49979 volkswagen nein 2016-03-20 00:00:00 0 49980 ford NaN 2016-03-12 00:00:00 0 49981 opel nein 2016-03-15 00:00:00 0 49982 skoda NaN 2016-03-29 00:00:00 0 49983 ford NaN 2016-03-06 00:00:00 0 49984 sonstige_autos NaN 2016-03-31 00:00:00 0 49985 volkswagen NaN 2016-04-02 00:00:00 0 49986 chrysler nein 2016-04-04 00:00:00 0 49987 audi nein 2016-03-22 00:00:00 0 49988 bmw nein 2016-03-28 00:00:00 0 49989 volkswagen ja 2016-03-11 00:00:00 0 49990 mercedes_benz nein 2016-03-21 00:00:00 0 49991 renault NaN 2016-03-06 00:00:00 0 49992 fiat nein 2016-03-10 00:00:00 0 49993 audi NaN 2016-03-15 00:00:00 0 49994 audi nein 2016-03-22 00:00:00 0 49995 audi nein 2016-03-27 00:00:00 0 49996 opel nein 2016-03-28 00:00:00 0 49997 fiat nein 2016-04-02 00:00:00 0 49998 audi nein 2016-03-08 00:00:00 0 49999 opel nein 2016-03-13 00:00:00 0 postalCode lastSeen 0 79588 2016-04-06 06:45:54 1 71034 2016-04-06 14:45:08 2 35394 2016-04-06 20:15:37 3 33729 2016-03-15 03:16:28 4 39218 2016-04-01 14:38:50 5 22962 2016-04-06 09:45:21 6 31535 2016-03-23 02:48:59 7 53474 2016-04-07 03:17:32 8 7426 2016-03-26 18:18:10 9 15749 2016-04-06 10:46:35 10 86157 2016-04-07 03:16:21 11 17498 2016-03-16 18:45:34 12 34590 2016-04-06 14:17:52 13 12043 2016-04-01 14:17:13 14 81737 2016-04-01 15:46:47 15 61276 2016-04-02 21:10:48 16 57299 2016-03-18 05:29:37 17 74821 2016-04-05 20:46:26 18 54329 2016-04-02 12:16:41 19 40878 2016-03-17 14:45:58 20 53913 2016-03-07 05:46:46 21 65428 2016-04-05 23:46:19 22 44789 2016-04-01 06:45:30 23 30900 2016-03-17 08:45:17 24 22547 2016-04-07 13:16:50 25 27574 2016-04-01 05:16:49 26 38110 2016-04-05 23:46:48 27 66701 2016-03-27 18:45:01 28 15745 2016-04-07 14:58:48 29 47638 2016-04-02 12:45:44 ... ... ... 49970 14947 2016-04-07 04:17:34 49971 65197 2016-03-29 20:41:51 49972 65326 2016-03-28 11:28:18 49973 53567 2016-03-27 08:25:24 49974 8209 2016-03-27 19:48:16 49975 84385 2016-04-05 19:45:34 49976 36100 2016-04-07 06:16:44 49977 33739 2016-04-06 12:16:11 49978 24405 2016-04-06 12:44:20 49979 48455 2016-04-07 01:45:12 49980 56218 2016-04-06 17:16:49 49981 86859 2016-04-05 17:21:46 49982 45884 2016-03-29 18:51:08 49983 52477 2016-03-09 06:16:08 49984 12103 2016-04-02 19:44:53 49985 30900 2016-04-06 15:17:52 49986 73527 2016-04-06 23:16:00 49987 94362 2016-03-26 22:46:06 49988 83646 2016-04-07 02:17:40 49989 21244 2016-03-12 10:17:55 49990 58239 2016-04-06 22:46:57 49991 61350 2016-03-06 18:24:19 49992 68642 2016-03-13 01:44:51 49993 65203 2016-04-06 19:46:53 49994 46537 2016-04-06 08:16:39 49995 82131 2016-04-01 13:47:40 49996 44807 2016-04-02 14:18:02 49997 73430 2016-04-04 11:47:27 49998 35683 2016-04-05 16:45:07 49999 45897 2016-04-06 21:18:48 [50000 rows x 20 columns]
autos.info()
autos.head()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): dateCrawled 50000 non-null object name 50000 non-null object seller 50000 non-null object offerType 50000 non-null object price 50000 non-null object abtest 50000 non-null object vehicleType 44905 non-null object yearOfRegistration 50000 non-null int64 gearbox 47320 non-null object powerPS 50000 non-null int64 model 47242 non-null object odometer 50000 non-null object monthOfRegistration 50000 non-null int64 fuelType 45518 non-null object brand 50000 non-null object notRepairedDamage 40171 non-null object dateCreated 50000 non-null object nrOfPictures 50000 non-null int64 postalCode 50000 non-null int64 lastSeen 50000 non-null object dtypes: int64(5), object(15) memory usage: 7.6+ MB
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
Let's convert the column names from camelcase to snakecase and reword some of the column names based on the data dictionary to be more descriptive.
# Printing an array of the existing column names
print(autos.columns)
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model', 'odometer', 'monthOfRegistration', 'fuelType', 'brand', 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object')
# Copying the array and editing the columns names
autos_test = autos.copy()
autos_test.columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'ab_test',
'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
'odometer', 'registration_month', 'fuel_type', 'brand',
'unrepaired_damage', 'ad_created', 'num_photos', 'postal_code',
'last_seen']
autos.columns = autos_test.columns
autos.head()
date_crawled | name | seller | offer_type | price | ab_test | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | num_photos | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
Some of the columns were edited because of mispelled words and phases.
** Next, we will do some basic data exploration to determine what other cleaning tasks need to be done**
# Looking at descriptive statistics for all columns
autos.describe(include = 'all')
date_crawled | name | seller | offer_type | price | ab_test | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | num_photos | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-04-02 11:37:04 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
The seller and offer_type columns looks identical. The num_photos column need further investigation.
autos["num_photos"].value_counts()
0 50000 Name: num_photos, dtype: int64
The num_photos column has all of its entry as 0, therefore we need to drop this column along with the other two identical columns we found(seller and offer_type columns).
# Removing any non-numeric characters from price and
# odometer columns and converting to a numeric dtype
autos["price"] = autos["price"].str.replace("$","").str.replace(",","").astype(int)
autos["price"].head()
0 5000 1 8500 2 8990 3 4350 4 1350 Name: price, dtype: int64
autos["odometer"] = autos["odometer"].str.replace("km","").str.replace(",","").astype(int)
autos.rename({"odometer": "odometer_km"}, axis=1, inplace=True)
autos["odometer_km"].head()
0 150000 1 150000 2 70000 3 70000 4 150000 Name: odometer_km, dtype: int64
Analyzing the odometer_km and price columns using minimum and maximum values to look for unrealistically high or low outliers to remove
# Exploring the data to see how many unique values
# Exploring the data to view min/max/median/mean etc
autos["price"].unique().shape
autos["price"].describe()
count 5.000000e+04 mean 9.840044e+03 std 4.811044e+05 min 0.000000e+00 25% 1.100000e+03 50% 2.950000e+03 75% 7.200000e+03 max 1.000000e+08 Name: price, dtype: float64
autos["odometer_km"].unique().shape
autos["odometer_km"].describe()
count 50000.000000 mean 125732.700000 std 40042.211706 min 5000.000000 25% 125000.000000 50% 150000.000000 75% 150000.000000 max 150000.000000 Name: odometer_km, dtype: float64
autos["price"].value_counts().head(30)
0 1421 500 781 1500 734 2500 643 1000 639 1200 639 600 531 800 498 3500 498 2000 460 999 434 750 433 900 420 650 419 850 410 700 395 4500 394 300 384 2200 382 950 379 1100 376 1300 371 3000 365 550 356 1800 355 5500 340 1250 335 350 335 1600 327 1999 322 Name: price, dtype: int64
autos["odometer_km"].value_counts()
150000 32424 125000 5170 100000 2169 90000 1757 80000 1436 70000 1230 60000 1164 50000 1027 5000 967 40000 819 30000 789 20000 784 10000 264 Name: odometer_km, dtype: int64
autos["price"].value_counts().sort_index(ascending = True).head(30)
0 1421 1 156 2 3 3 1 5 2 8 1 9 1 10 7 11 2 12 3 13 2 14 1 15 2 17 3 18 1 20 4 25 5 29 1 30 7 35 1 40 6 45 4 47 1 49 4 50 49 55 2 59 1 60 9 65 5 66 1 Name: price, dtype: int64
autos["price"].value_counts().sort_index(ascending = False).head(30)
99999999 1 27322222 1 12345678 3 11111111 2 10000000 1 3890000 1 1300000 1 1234566 1 999999 2 999990 1 350000 1 345000 1 299000 1 295000 1 265000 1 259000 1 250000 1 220000 1 198000 1 197000 1 194000 1 190000 1 180000 1 175000 1 169999 1 169000 1 163991 1 163500 1 155000 1 151990 1 Name: price, dtype: int64
autos["odometer_km"].value_counts().sort_index(ascending = False).head(30)
150000 32424 125000 5170 100000 2169 90000 1757 80000 1436 70000 1230 60000 1164 50000 1027 40000 819 30000 789 20000 784 10000 264 5000 967 Name: odometer_km, dtype: int64
autos["odometer_km"].value_counts().sort_index(ascending = True).head(30)
5000 967 10000 264 20000 784 30000 789 40000 819 50000 1027 60000 1164 70000 1230 80000 1436 90000 1757 100000 2169 125000 5170 150000 32424 Name: odometer_km, dtype: int64
There are lot of number of listings below $50 and about 1500 listings at exactly $0
# Removing Outliers
autos = autos[autos["price"].between(1,350000)]
autos["price"].describe()
count 48565.000000 mean 5888.935591 std 9059.854754 min 1.000000 25% 1200.000000 50% 3000.000000 75% 7490.000000 max 350000.000000 Name: price, dtype: float64
Understanding The Date Range The Data Covers And The Date Columns
autos[['date_crawled','ad_created','last_seen']][0:5]
date_crawled | ad_created | last_seen | |
---|---|---|---|
0 | 2016-03-26 17:47:46 | 2016-03-26 00:00:00 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | 2016-04-04 00:00:00 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | 2016-03-26 00:00:00 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | 2016-03-12 00:00:00 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | 2016-04-01 00:00:00 | 2016-04-01 14:38:50 |
autos['date_crawled'].str[:10].value_counts(normalize=True, dropna=False).sort_index()
2016-03-05 0.025327 2016-03-06 0.014043 2016-03-07 0.036014 2016-03-08 0.033296 2016-03-09 0.033090 2016-03-10 0.032184 2016-03-11 0.032575 2016-03-12 0.036920 2016-03-13 0.015670 2016-03-14 0.036549 2016-03-15 0.034284 2016-03-16 0.029610 2016-03-17 0.031628 2016-03-18 0.012911 2016-03-19 0.034778 2016-03-20 0.037887 2016-03-21 0.037373 2016-03-22 0.032987 2016-03-23 0.032225 2016-03-24 0.029342 2016-03-25 0.031607 2016-03-26 0.032204 2016-03-27 0.031092 2016-03-28 0.034860 2016-03-29 0.034099 2016-03-30 0.033687 2016-03-31 0.031834 2016-04-01 0.033687 2016-04-02 0.035478 2016-04-03 0.038608 2016-04-04 0.036487 2016-04-05 0.013096 2016-04-06 0.003171 2016-04-07 0.001400 Name: date_crawled, dtype: float64
The distribution of listings crawled on each day is roughly uniform.
autos['ad_created'].str[:10].value_counts(normalize=True, dropna=False).sort_index()
2015-06-11 0.000021 2015-08-10 0.000021 2015-09-09 0.000021 2015-11-10 0.000021 2015-12-05 0.000021 2015-12-30 0.000021 2016-01-03 0.000021 2016-01-07 0.000021 2016-01-10 0.000041 2016-01-13 0.000021 2016-01-14 0.000021 2016-01-16 0.000021 2016-01-22 0.000021 2016-01-27 0.000062 2016-01-29 0.000021 2016-02-01 0.000021 2016-02-02 0.000041 2016-02-05 0.000041 2016-02-07 0.000021 2016-02-08 0.000021 2016-02-09 0.000021 2016-02-11 0.000021 2016-02-12 0.000041 2016-02-14 0.000041 2016-02-16 0.000021 2016-02-17 0.000021 2016-02-18 0.000041 2016-02-19 0.000062 2016-02-20 0.000041 2016-02-21 0.000062 ... 2016-03-09 0.033151 2016-03-10 0.031895 2016-03-11 0.032904 2016-03-12 0.036755 2016-03-13 0.017008 2016-03-14 0.035190 2016-03-15 0.034016 2016-03-16 0.030125 2016-03-17 0.031278 2016-03-18 0.013590 2016-03-19 0.033687 2016-03-20 0.037949 2016-03-21 0.037579 2016-03-22 0.032801 2016-03-23 0.032060 2016-03-24 0.029280 2016-03-25 0.031751 2016-03-26 0.032266 2016-03-27 0.030989 2016-03-28 0.034984 2016-03-29 0.034037 2016-03-30 0.033501 2016-03-31 0.031875 2016-04-01 0.033687 2016-04-02 0.035149 2016-04-03 0.038855 2016-04-04 0.036858 2016-04-05 0.011819 2016-04-06 0.003253 2016-04-07 0.001256 Name: ad_created, Length: 76, dtype: float64
autos['last_seen'].str[:10].value_counts(normalize=True, dropna=False).sort_index()
2016-03-05 0.001071 2016-03-06 0.004324 2016-03-07 0.005395 2016-03-08 0.007413 2016-03-09 0.009595 2016-03-10 0.010666 2016-03-11 0.012375 2016-03-12 0.023783 2016-03-13 0.008895 2016-03-14 0.012602 2016-03-15 0.015876 2016-03-16 0.016452 2016-03-17 0.028086 2016-03-18 0.007351 2016-03-19 0.015834 2016-03-20 0.020653 2016-03-21 0.020632 2016-03-22 0.021373 2016-03-23 0.018532 2016-03-24 0.019767 2016-03-25 0.019211 2016-03-26 0.016802 2016-03-27 0.015649 2016-03-28 0.020859 2016-03-29 0.022341 2016-03-30 0.024771 2016-03-31 0.023783 2016-04-01 0.022794 2016-04-02 0.024915 2016-04-03 0.025203 2016-04-04 0.024483 2016-04-05 0.124761 2016-04-06 0.221806 2016-04-07 0.131947 Name: last_seen, dtype: float64
autos["registration_year"].describe()
count 48565.000000 mean 2004.755421 std 88.643887 min 1000.000000 25% 1999.000000 50% 2004.000000 75% 2008.000000 max 9999.000000 Name: registration_year, dtype: float64
After exploring the registration_year column, we observed that it contains some odd values,where the minimum value is 1000 before cars were invented and the maximum value is 9999 which is like forever into the future.
** Counting The Number of Listing With Cars That Fall Outside 1900-2006 interval**
# Deciding What the highest and lowest
# acceptable values are for the registration_year column
(~autos["registration_year"].between(1900,2016)).sum() / autos.shape[0]
0.038793369710697
Since this is around 3% and less than 4%, we are going to remove the rows involved.
autos = autos[autos["registration_year"].between(1900,2016)]
autos["registration_year"].value_counts(normalize=True).head(10)
2000 0.067608 2005 0.062895 1999 0.062060 2004 0.057904 2003 0.057818 2006 0.057197 2001 0.056468 2002 0.053255 1998 0.050620 2007 0.048778 Name: registration_year, dtype: float64
We can observe that the first set of cars were registered two decades ago.
Aggregating The Brands Columns
# Exploring the values in the brand columns
top_brands = autos['brand'].value_counts(normalize = True)
significant_top_brands = top_brands[top_brands > 0.05].index
significant_top_brands
Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford'], dtype='object')
# Iterating over the significant_top_brands
brand_mean_price = {}
for brand in significant_top_brands:
selected_brand = autos[autos['brand'] == brand]
price_mean = selected_brand['price'].mean()
brand_mean_price[brand] = int(price_mean)
print(brand_mean_price)
{'mercedes_benz': 8628, 'audi': 9336, 'ford': 3749, 'volkswagen': 5402, 'bmw': 8332, 'opel': 2975}
On average, the expensive cars are audi,mercedes_benz and bmw while opel and ford are the less expensive ones.
Aggregating The Average Mileage of The Brands
brand_mean_mileage = {}
for brand in significant_top_brands:
selected_brand = autos[autos['brand'] == brand]
mileage_mean = selected_brand['odometer_km'].mean()
brand_mean_mileage[brand] = int(mileage_mean)
print(brand_mean_mileage)
{'mercedes_benz': 130788, 'audi': 129157, 'ford': 124266, 'volkswagen': 128707, 'bmw': 132572, 'opel': 129310}
# converting both dictionaries to series objects
# using the series constructor
mean_mileage = pd.Series(brand_mean_mileage).sort_values(ascending=False)
mean_prices = pd.Series(brand_mean_price).sort_values(ascending=False)
brand_info = pd.DataFrame(mean_mileage,columns=['mean_mileage'])
brand_info
mean_mileage | |
---|---|
bmw | 132572 |
mercedes_benz | 130788 |
opel | 129310 |
audi | 129157 |
volkswagen | 128707 |
ford | 124266 |
brand_info = pd.DataFrame(mean_prices,columns=['mean_prices'])
brand_info
mean_prices | |
---|---|
audi | 9336 |
mercedes_benz | 8628 |
bmw | 8332 |
volkswagen | 5402 |
ford | 3749 |
opel | 2975 |
# Assigning both series as a new column in this dtataframe
brand_info["mean_price"] = mean_prices
brand_info
mean_prices | mean_price | |
---|---|---|
audi | 9336 | 9336 |
mercedes_benz | 8628 | 8628 |
bmw | 8332 | 8332 |
volkswagen | 5402 | 5402 |
ford | 3749 | 3749 |
opel | 2975 | 2975 |