import pandas as pd
import numpy as np
autos=pd.read_csv('autos.csv',encoding='Latin-1')
print(autos)
dateCrawled name \ 0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD 1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik 2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United 3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... 4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... 5 2016-03-21 13:47:45 Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto... 6 2016-03-20 17:55:21 VW_Golf_III_GT_Special_Electronic_Green_Metall... 7 2016-03-16 18:55:19 Golf_IV_1.9_TDI_90PS 8 2016-03-22 16:51:34 Seat_Arosa 9 2016-03-16 13:47:02 Renault_Megane_Scenic_1.6e_RT_Klimaanlage 10 2016-03-15 01:41:36 VW_Golf_Tuning_in_siber/grau 11 2016-03-16 18:45:34 Mercedes_A140_Motorschaden 12 2016-03-31 19:48:22 Smart_smart_fortwo_coupe_softouch_pure_MHD_Pan... 13 2016-03-23 10:48:32 Audi_A3_1.6_tuning 14 2016-03-23 11:50:46 Renault_Clio_3__Dynamique_1.2__16_V;_viele_Ver... 15 2016-04-01 12:06:20 Corvette_C3_Coupe_T_Top_Crossfire_Injection 16 2016-03-16 14:59:02 Opel_Vectra_B_Kombi 17 2016-03-29 11:46:22 Volkswagen_Scirocco_2_G60 18 2016-03-26 19:57:44 Verkaufen_mein_bmw_e36_320_i_touring 19 2016-03-17 13:36:21 mazda_tribute_2.0_mit_gas_und_tuev_neu_2018 20 2016-03-05 19:57:31 Audi_A4_Avant_1.9_TDI_*6_Gang*AHK*Klimatronik*... 21 2016-03-06 19:07:10 Porsche_911_Carrera_4S_Cabrio 22 2016-03-28 20:50:54 MINI_Cooper_S_Cabrio 23 2016-03-10 19:55:34 Peugeot_Boxer_2_2_HDi_120_Ps_9_Sitzer_inkl_Klima 24 2016-04-03 11:57:02 BMW_535i_xDrive_Sport_Aut. 25 2016-03-21 21:56:18 Ford_escort_kombi_an_bastler_mit_ghia_ausstattung 26 2016-04-03 22:46:28 Volkswagen_Polo_Fox 27 2016-03-27 18:45:01 Hat_einer_Ahnung_mit_Ford_Galaxy_HILFE 28 2016-03-19 21:56:19 MINI_Cooper_D 29 2016-04-02 12:45:44 Mercedes_Benz_E_320_T_CDI_Avantgarde_DPF7_Sitz... ... ... ... 49970 2016-03-21 22:47:37 c4_Grand_Picasso_mit_Automatik_Leder_Navi_Temp... 49971 2016-03-29 14:54:12 W.Lupo_1.0 49972 2016-03-26 22:25:23 Mercedes_Benz_Vito_115_CDI_Extralang_Aut. 49973 2016-03-27 05:32:39 Mercedes_Benz_SLK_200_Kompressor 49974 2016-03-20 10:52:31 Golf_1_Cabrio_Tuev_Neu_viele_Extras_alles_eing... 49975 2016-03-27 20:51:39 Honda_Jazz_1.3_DSi_i_VTEC_IMA_CVT_Comfort 49976 2016-03-19 18:56:05 Audi_80_Avant_2.6_E__Vollausstattung!!_Einziga... 49977 2016-03-31 18:37:18 Mercedes_Benz_C200_Cdi_W203 49978 2016-04-04 10:37:14 Mercedes_Benz_E_200_Classic 49979 2016-03-20 18:38:40 Volkswagen_Polo_1.6_TDI_Style 49980 2016-03-12 10:55:54 Ford_Escort_Turnier_16V 49981 2016-03-15 09:38:21 Opel_Astra_Kombi_mit_Anhaengerkupplung 49982 2016-03-29 18:51:08 Skoda_Fabia_4_Tuerer_Bj:2004__85.000Tkm 49983 2016-03-06 12:43:04 Ford_focus_99 49984 2016-03-31 22:48:48 Student_sucht_ein__Anfaengerauto___ab_2000_BJ_... 49985 2016-04-02 16:38:23 Verkaufe_meinen_vw_vento! 49986 2016-04-04 20:46:02 Chrysler_300C_3.0_CRD_DPF_Automatik_Voll_Ausst... 49987 2016-03-22 20:47:27 Audi_A3_Limousine_2.0_TDI_DPF_Ambition__NAVI__... 49988 2016-03-28 19:49:51 BMW_330_Ci 49989 2016-03-11 19:50:37 VW_Polo_zum_Ausschlachten_oder_Wiederaufbau 49990 2016-03-21 19:54:19 Mercedes_Benz_A_200__BlueEFFICIENCY__Urban 49991 2016-03-06 15:25:19 Kleinwagen 49992 2016-03-10 19:37:38 Fiat_Grande_Punto_1.4_T_Jet_16V_Sport 49993 2016-03-15 18:47:35 Audi_A3__1_8l__Silber;_schoenes_Fahrzeug 49994 2016-03-22 17:36:42 Audi_A6__S6__Avant_4.2_quattro_eventuell_Tausc... 49995 2016-03-27 14:38:19 Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon 49996 2016-03-28 10:50:25 Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... 49997 2016-04-02 14:44:48 Fiat_500_C_1.2_Dualogic_Lounge 49998 2016-03-08 19:25:42 Audi_A3_2.0_TDI_Sportback_Ambition 49999 2016-03-14 00:42:12 Opel_Vectra_1.6_16V seller offerType price abtest vehicleType yearOfRegistration \ 0 privat Angebot $5,000 control bus 2004 1 privat Angebot $8,500 control limousine 1997 2 privat Angebot $8,990 test limousine 2009 3 privat Angebot $4,350 control kleinwagen 2007 4 privat Angebot $1,350 test kombi 2003 5 privat Angebot $7,900 test bus 2006 6 privat Angebot $300 test limousine 1995 7 privat Angebot $1,990 control limousine 1998 8 privat Angebot $250 test NaN 2000 9 privat Angebot $590 control bus 1997 10 privat Angebot $999 test NaN 2017 11 privat Angebot $350 control NaN 2000 12 privat Angebot $5,299 control kleinwagen 2010 13 privat Angebot $1,350 control limousine 1999 14 privat Angebot $3,999 test kleinwagen 2007 15 privat Angebot $18,900 test coupe 1982 16 privat Angebot $350 test kombi 1999 17 privat Angebot $5,500 test coupe 1990 18 privat Angebot $300 control bus 1995 19 privat Angebot $4,150 control suv 2004 20 privat Angebot $3,500 test kombi 2003 21 privat Angebot $41,500 test cabrio 2004 22 privat Angebot $25,450 control cabrio 2015 23 privat Angebot $7,999 control bus 2010 24 privat Angebot $48,500 control limousine 2014 25 privat Angebot $90 control kombi 1996 26 privat Angebot $777 control kleinwagen 1992 27 privat Angebot $0 control NaN 2005 28 privat Angebot $5,250 control kleinwagen 2007 29 privat Angebot $4,999 test kombi 2004 ... ... ... ... ... ... ... 49970 privat Angebot $15,800 control bus 2010 49971 privat Angebot $950 test kleinwagen 2001 49972 privat Angebot $3,300 control bus 2004 49973 privat Angebot $6,000 control cabrio 2004 49974 privat Angebot $0 control cabrio 1983 49975 privat Angebot $9,700 control kleinwagen 2012 49976 privat Angebot $5,900 test kombi 1992 49977 privat Angebot $5,500 control limousine 2003 49978 privat Angebot $900 control limousine 1996 49979 privat Angebot $11,000 test kleinwagen 2011 49980 privat Angebot $400 control kombi 1995 49981 privat Angebot $2,000 control kombi 1998 49982 privat Angebot $1,950 control kleinwagen 2004 49983 privat Angebot $600 test kleinwagen 1999 49984 privat Angebot $0 test NaN 2000 49985 privat Angebot $1,000 control NaN 1995 49986 privat Angebot $15,900 control limousine 2010 49987 privat Angebot $21,990 control limousine 2013 49988 privat Angebot $9,550 control coupe 2001 49989 privat Angebot $150 test kleinwagen 1997 49990 privat Angebot $17,500 test limousine 2012 49991 privat Angebot $500 control NaN 2016 49992 privat Angebot $4,800 control kleinwagen 2009 49993 privat Angebot $1,650 control kleinwagen 1997 49994 privat Angebot $5,000 control kombi 2001 49995 privat Angebot $24,900 control limousine 2011 49996 privat Angebot $1,980 control cabrio 1996 49997 privat Angebot $13,200 test cabrio 2014 49998 privat Angebot $22,900 control kombi 2013 49999 privat Angebot $1,250 control limousine 1996 gearbox powerPS model odometer monthOfRegistration fuelType \ 0 manuell 158 andere 150,000km 3 lpg 1 automatik 286 7er 150,000km 6 benzin 2 manuell 102 golf 70,000km 7 benzin 3 automatik 71 fortwo 70,000km 6 benzin 4 manuell 0 focus 150,000km 7 benzin 5 automatik 150 voyager 150,000km 4 diesel 6 manuell 90 golf 150,000km 8 benzin 7 manuell 90 golf 150,000km 12 diesel 8 manuell 0 arosa 150,000km 10 NaN 9 manuell 90 megane 150,000km 7 benzin 10 manuell 90 NaN 150,000km 4 benzin 11 NaN 0 NaN 150,000km 0 benzin 12 automatik 71 fortwo 50,000km 9 benzin 13 manuell 101 a3 150,000km 11 benzin 14 manuell 75 clio 150,000km 9 benzin 15 automatik 203 NaN 80,000km 6 benzin 16 manuell 101 vectra 150,000km 5 benzin 17 manuell 205 scirocco 150,000km 6 benzin 18 manuell 150 3er 150,000km 0 benzin 19 manuell 124 andere 150,000km 2 lpg 20 manuell 131 a4 150,000km 5 diesel 21 manuell 320 911 150,000km 4 benzin 22 manuell 184 cooper 10,000km 1 benzin 23 manuell 120 NaN 150,000km 2 diesel 24 automatik 306 5er 30,000km 12 benzin 25 manuell 116 NaN 150,000km 4 benzin 26 manuell 54 polo 125,000km 2 benzin 27 NaN 0 NaN 150,000km 0 NaN 28 manuell 110 cooper 150,000km 7 diesel 29 automatik 204 e_klasse 150,000km 10 diesel ... ... ... ... ... ... ... 49970 automatik 136 c4 60,000km 4 diesel 49971 manuell 50 lupo 150,000km 4 benzin 49972 automatik 150 vito 150,000km 10 diesel 49973 manuell 163 slk 150,000km 11 benzin 49974 manuell 70 golf 150,000km 2 benzin 49975 automatik 88 jazz 100,000km 11 hybrid 49976 automatik 150 80 150,000km 12 benzin 49977 manuell 116 c_klasse 150,000km 2 diesel 49978 automatik 136 e_klasse 150,000km 9 benzin 49979 manuell 90 polo 70,000km 11 diesel 49980 manuell 105 escort 125,000km 3 benzin 49981 manuell 115 astra 150,000km 12 benzin 49982 manuell 0 fabia 90,000km 7 benzin 49983 manuell 101 focus 150,000km 4 benzin 49984 NaN 0 NaN 150,000km 0 NaN 49985 automatik 0 NaN 150,000km 0 benzin 49986 automatik 218 300c 125,000km 11 diesel 49987 manuell 150 a3 50,000km 11 diesel 49988 manuell 231 3er 150,000km 10 benzin 49989 manuell 0 polo 150,000km 5 benzin 49990 manuell 156 a_klasse 30,000km 12 benzin 49991 manuell 0 twingo 150,000km 0 benzin 49992 manuell 120 andere 125,000km 9 lpg 49993 manuell 0 NaN 150,000km 7 benzin 49994 automatik 299 a6 150,000km 1 benzin 49995 automatik 239 q5 100,000km 1 diesel 49996 manuell 75 astra 150,000km 5 benzin 49997 automatik 69 500 5,000km 11 benzin 49998 manuell 150 a3 40,000km 11 diesel 49999 manuell 101 vectra 150,000km 1 benzin brand notRepairedDamage dateCreated nrOfPictures \ 0 peugeot nein 2016-03-26 00:00:00 0 1 bmw nein 2016-04-04 00:00:00 0 2 volkswagen nein 2016-03-26 00:00:00 0 3 smart nein 2016-03-12 00:00:00 0 4 ford nein 2016-04-01 00:00:00 0 5 chrysler NaN 2016-03-21 00:00:00 0 6 volkswagen NaN 2016-03-20 00:00:00 0 7 volkswagen nein 2016-03-16 00:00:00 0 8 seat nein 2016-03-22 00:00:00 0 9 renault nein 2016-03-16 00:00:00 0 10 volkswagen nein 2016-03-14 00:00:00 0 11 mercedes_benz NaN 2016-03-16 00:00:00 0 12 smart nein 2016-03-31 00:00:00 0 13 audi nein 2016-03-23 00:00:00 0 14 renault NaN 2016-03-23 00:00:00 0 15 sonstige_autos nein 2016-04-01 00:00:00 0 16 opel nein 2016-03-16 00:00:00 0 17 volkswagen nein 2016-03-29 00:00:00 0 18 bmw NaN 2016-03-26 00:00:00 0 19 mazda nein 2016-03-17 00:00:00 0 20 audi NaN 2016-03-05 00:00:00 0 21 porsche nein 2016-03-06 00:00:00 0 22 mini nein 2016-03-28 00:00:00 0 23 peugeot nein 2016-03-10 00:00:00 0 24 bmw nein 2016-04-03 00:00:00 0 25 ford ja 2016-03-21 00:00:00 0 26 volkswagen nein 2016-04-03 00:00:00 0 27 ford NaN 2016-03-27 00:00:00 0 28 mini ja 2016-03-19 00:00:00 0 29 mercedes_benz nein 2016-04-02 00:00:00 0 ... ... ... ... ... 49970 citroen nein 2016-03-21 00:00:00 0 49971 volkswagen nein 2016-03-29 00:00:00 0 49972 mercedes_benz ja 2016-03-26 00:00:00 0 49973 mercedes_benz nein 2016-03-27 00:00:00 0 49974 volkswagen nein 2016-03-20 00:00:00 0 49975 honda nein 2016-03-27 00:00:00 0 49976 audi nein 2016-03-19 00:00:00 0 49977 mercedes_benz nein 2016-03-31 00:00:00 0 49978 mercedes_benz ja 2016-04-04 00:00:00 0 49979 volkswagen nein 2016-03-20 00:00:00 0 49980 ford NaN 2016-03-12 00:00:00 0 49981 opel nein 2016-03-15 00:00:00 0 49982 skoda NaN 2016-03-29 00:00:00 0 49983 ford NaN 2016-03-06 00:00:00 0 49984 sonstige_autos NaN 2016-03-31 00:00:00 0 49985 volkswagen NaN 2016-04-02 00:00:00 0 49986 chrysler nein 2016-04-04 00:00:00 0 49987 audi nein 2016-03-22 00:00:00 0 49988 bmw nein 2016-03-28 00:00:00 0 49989 volkswagen ja 2016-03-11 00:00:00 0 49990 mercedes_benz nein 2016-03-21 00:00:00 0 49991 renault NaN 2016-03-06 00:00:00 0 49992 fiat nein 2016-03-10 00:00:00 0 49993 audi NaN 2016-03-15 00:00:00 0 49994 audi nein 2016-03-22 00:00:00 0 49995 audi nein 2016-03-27 00:00:00 0 49996 opel nein 2016-03-28 00:00:00 0 49997 fiat nein 2016-04-02 00:00:00 0 49998 audi nein 2016-03-08 00:00:00 0 49999 opel nein 2016-03-13 00:00:00 0 postalCode lastSeen 0 79588 2016-04-06 06:45:54 1 71034 2016-04-06 14:45:08 2 35394 2016-04-06 20:15:37 3 33729 2016-03-15 03:16:28 4 39218 2016-04-01 14:38:50 5 22962 2016-04-06 09:45:21 6 31535 2016-03-23 02:48:59 7 53474 2016-04-07 03:17:32 8 7426 2016-03-26 18:18:10 9 15749 2016-04-06 10:46:35 10 86157 2016-04-07 03:16:21 11 17498 2016-03-16 18:45:34 12 34590 2016-04-06 14:17:52 13 12043 2016-04-01 14:17:13 14 81737 2016-04-01 15:46:47 15 61276 2016-04-02 21:10:48 16 57299 2016-03-18 05:29:37 17 74821 2016-04-05 20:46:26 18 54329 2016-04-02 12:16:41 19 40878 2016-03-17 14:45:58 20 53913 2016-03-07 05:46:46 21 65428 2016-04-05 23:46:19 22 44789 2016-04-01 06:45:30 23 30900 2016-03-17 08:45:17 24 22547 2016-04-07 13:16:50 25 27574 2016-04-01 05:16:49 26 38110 2016-04-05 23:46:48 27 66701 2016-03-27 18:45:01 28 15745 2016-04-07 14:58:48 29 47638 2016-04-02 12:45:44 ... ... ... 49970 14947 2016-04-07 04:17:34 49971 65197 2016-03-29 20:41:51 49972 65326 2016-03-28 11:28:18 49973 53567 2016-03-27 08:25:24 49974 8209 2016-03-27 19:48:16 49975 84385 2016-04-05 19:45:34 49976 36100 2016-04-07 06:16:44 49977 33739 2016-04-06 12:16:11 49978 24405 2016-04-06 12:44:20 49979 48455 2016-04-07 01:45:12 49980 56218 2016-04-06 17:16:49 49981 86859 2016-04-05 17:21:46 49982 45884 2016-03-29 18:51:08 49983 52477 2016-03-09 06:16:08 49984 12103 2016-04-02 19:44:53 49985 30900 2016-04-06 15:17:52 49986 73527 2016-04-06 23:16:00 49987 94362 2016-03-26 22:46:06 49988 83646 2016-04-07 02:17:40 49989 21244 2016-03-12 10:17:55 49990 58239 2016-04-06 22:46:57 49991 61350 2016-03-06 18:24:19 49992 68642 2016-03-13 01:44:51 49993 65203 2016-04-06 19:46:53 49994 46537 2016-04-06 08:16:39 49995 82131 2016-04-01 13:47:40 49996 44807 2016-04-02 14:18:02 49997 73430 2016-04-04 11:47:27 49998 35683 2016-04-05 16:45:07 49999 45897 2016-04-06 21:18:48 [50000 rows x 20 columns]
autos.info()
autos.head()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): dateCrawled 50000 non-null object name 50000 non-null object seller 50000 non-null object offerType 50000 non-null object price 50000 non-null object abtest 50000 non-null object vehicleType 44905 non-null object yearOfRegistration 50000 non-null int64 gearbox 47320 non-null object powerPS 50000 non-null int64 model 47242 non-null object odometer 50000 non-null object monthOfRegistration 50000 non-null int64 fuelType 45518 non-null object brand 50000 non-null object notRepairedDamage 40171 non-null object dateCreated 50000 non-null object nrOfPictures 50000 non-null int64 postalCode 50000 non-null int64 lastSeen 50000 non-null object dtypes: int64(5), object(15) memory usage: 7.6+ MB
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
The dataset contains 20 columns, most of which are strings. Some columns have null values, but none have more than ~20% null values. The column names use camelcase instead of Python's preferred snakecase, which means we can't just replace spaces with underscores.
autos.columns
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model', 'odometer', 'monthOfRegistration', 'fuelType', 'brand', 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object')
autos.rename(columns={'yearOfRegistration':'registration_year',
'monthOfRegistration':'registration_month',
'notRepairedDamage': 'unrepaired_damage',
'dateCreated':'ad_created'},inplace=True)
autos.head()
dateCrawled | name | seller | offerType | price | abtest | vehicleType | registration_year | gearbox | powerPS | model | odometer | registration_month | fuelType | brand | unrepaired_damage | ad_created | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
autos.describe(include='all')
dateCrawled | name | seller | offerType | price | abtest | vehicleType | registration_year | gearbox | powerPS | model | odometer | registration_month | fuelType | brand | unrepaired_damage | ad_created | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-03-08 10:40:35 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
1,Any columns that have mostly one value that are candidates to be dropped 2,Any columns that need more investigation. 3,Any examples of numeric data stored as text that needs to be cleaned. For example, Price and Oldmeter, needto rmove non-numeric dtype and change datatype
autos['price']=autos['price'].str.replace(',',"").str.replace('$','').astype(int)
Rmove '$' and ',', change datatype to int
autos['odometer'].value_counts()
#remove 'km'and ',' and change datatype to int
150,000km 32424 125,000km 5170 100,000km 2169 90,000km 1757 80,000km 1436 70,000km 1230 60,000km 1164 50,000km 1027 5,000km 967 40,000km 819 30,000km 789 20,000km 784 10,000km 264 Name: odometer, dtype: int64
autos['odometer']=autos['odometer'].str.replace(',','').str.replace('km','').astype(int)
autos.rename(columns={'odometer':'odometer_km'},inplace=True)
autos.columns
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'registration_year', 'gearbox', 'powerPS', 'model', 'odometer_km', 'registration_month', 'fuelType', 'brand', 'unrepaired_damage', 'ad_created', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object')
autos['price'].unique().shape
#see how many unique values
(2357,)
autos['price'].describe()
count 5.000000e+04 mean 9.840044e+03 std 4.811044e+05 min 0.000000e+00 25% 1.100000e+03 50% 2.950000e+03 75% 7.200000e+03 max 1.000000e+08 Name: price, dtype: float64
autos['price'].value_counts().sort_index(ascending=True)
#find out the max and the min so I can remove the outliner
0 1421 1 156 2 3 3 1 5 2 8 1 9 1 10 7 11 2 12 3 13 2 14 1 15 2 17 3 18 1 20 4 25 5 29 1 30 7 35 1 40 6 45 4 47 1 49 4 50 49 55 2 59 1 60 9 65 5 66 1 ... 151990 1 155000 1 163500 1 163991 1 169000 1 169999 1 175000 1 180000 1 190000 1 194000 1 197000 1 198000 1 220000 1 250000 1 259000 1 265000 1 295000 1 299000 1 345000 1 350000 1 999990 1 999999 2 1234566 1 1300000 1 3890000 1 10000000 1 11111111 2 12345678 3 27322222 1 99999999 1 Name: price, Length: 2357, dtype: int64
autos=autos[autos['price'].between(1,265000)]
#remove any number is bigger than 265000 and lower than 1
autos['price'].describe()
count 48561.000000 mean 5862.876732 std 8590.106661 min 1.000000 25% 1200.000000 50% 3000.000000 75% 7490.000000 max 265000.000000 Name: price, dtype: float64
autos['odometer_km'].unique().shape
(13,)
autos['odometer_km'].describe()
count 48561.000000 mean 125777.784642 std 39779.658015 min 5000.000000 25% 125000.000000 50% 150000.000000 75% 150000.000000 max 150000.000000 Name: odometer_km, dtype: float64
autos['odometer_km'].value_counts().sort_index()
#numbers look reasonalble, dont need to remove outliners
5000 834 10000 253 20000 761 30000 780 40000 815 50000 1012 60000 1155 70000 1217 80000 1415 90000 1734 100000 2114 125000 5057 150000 31414 Name: odometer_km, dtype: int64
Next step is to clean the date column. change datatype to numeric for date_crawled, last_seen, and ad_created columns
autos['lastSeen'].describe()
count 48561 unique 38471 top 2016-04-07 06:17:27 freq 8 Name: lastSeen, dtype: object
#Extract year-month-date only and save to autos
autos['dateCrawled']=autos['dateCrawled'] .str[:10]
autos['ad_created']=autos['ad_created'].str[:10]
autos['lastSeen']=autos['lastSeen'].str[:10]
autos['dateCrawled'].value_counts(normalize=True, dropna=False).sort_index()
2016-03-05 0.025329 2016-03-06 0.014044 2016-03-07 0.036017 2016-03-08 0.033298 2016-03-09 0.033092 2016-03-10 0.032186 2016-03-11 0.032578 2016-03-12 0.036923 2016-03-13 0.015671 2016-03-14 0.036552 2016-03-15 0.034287 2016-03-16 0.029612 2016-03-17 0.031630 2016-03-18 0.012912 2016-03-19 0.034781 2016-03-20 0.037890 2016-03-21 0.037376 2016-03-22 0.032989 2016-03-23 0.032207 2016-03-24 0.029345 2016-03-25 0.031610 2016-03-26 0.032207 2016-03-27 0.031074 2016-03-28 0.034863 2016-03-29 0.034101 2016-03-30 0.033669 2016-03-31 0.031836 2016-04-01 0.033690 2016-04-02 0.035481 2016-04-03 0.038591 2016-04-04 0.036490 2016-04-05 0.013097 2016-04-06 0.003171 2016-04-07 0.001400 Name: dateCrawled, dtype: float64
autos['ad_created'].value_counts(normalize=True, dropna=False).sort_index()
2015-06-11 0.000021 2015-08-10 0.000021 2015-09-09 0.000021 2015-11-10 0.000021 2015-12-05 0.000021 2015-12-30 0.000021 2016-01-03 0.000021 2016-01-07 0.000021 2016-01-10 0.000041 2016-01-13 0.000021 2016-01-14 0.000021 2016-01-16 0.000021 2016-01-22 0.000021 2016-01-27 0.000062 2016-01-29 0.000021 2016-02-01 0.000021 2016-02-02 0.000041 2016-02-05 0.000041 2016-02-07 0.000021 2016-02-08 0.000021 2016-02-09 0.000021 2016-02-11 0.000021 2016-02-12 0.000041 2016-02-14 0.000041 2016-02-16 0.000021 2016-02-17 0.000021 2016-02-18 0.000041 2016-02-19 0.000062 2016-02-20 0.000041 2016-02-21 0.000062 ... 2016-03-09 0.033154 2016-03-10 0.031898 2016-03-11 0.032907 2016-03-12 0.036758 2016-03-13 0.017010 2016-03-14 0.035193 2016-03-15 0.034019 2016-03-16 0.030127 2016-03-17 0.031280 2016-03-18 0.013591 2016-03-19 0.033690 2016-03-20 0.037952 2016-03-21 0.037582 2016-03-22 0.032804 2016-03-23 0.032042 2016-03-24 0.029283 2016-03-25 0.031754 2016-03-26 0.032269 2016-03-27 0.030971 2016-03-28 0.034987 2016-03-29 0.034040 2016-03-30 0.033484 2016-03-31 0.031877 2016-04-01 0.033690 2016-04-02 0.035152 2016-04-03 0.038838 2016-04-04 0.036861 2016-04-05 0.011820 2016-04-06 0.003254 2016-04-07 0.001256 Name: ad_created, Length: 76, dtype: float64
autos['ad_created'].value_counts(normalize=True, dropna=False).describe()
count 76.000000 mean 0.013158 std 0.015956 min 0.000021 25% 0.000021 50% 0.000144 75% 0.032099 max 0.038838 Name: ad_created, dtype: float64
autos['lastSeen'].value_counts(normalize=True,dropna=False).sort_index()
2016-03-05 0.001071 2016-03-06 0.004324 2016-03-07 0.005395 2016-03-08 0.007413 2016-03-09 0.009596 2016-03-10 0.010667 2016-03-11 0.012376 2016-03-12 0.023785 2016-03-13 0.008896 2016-03-14 0.012603 2016-03-15 0.015877 2016-03-16 0.016454 2016-03-17 0.028088 2016-03-18 0.007352 2016-03-19 0.015836 2016-03-20 0.020654 2016-03-21 0.020634 2016-03-22 0.021375 2016-03-23 0.018533 2016-03-24 0.019769 2016-03-25 0.019213 2016-03-26 0.016804 2016-03-27 0.015630 2016-03-28 0.020860 2016-03-29 0.022343 2016-03-30 0.024773 2016-03-31 0.023785 2016-04-01 0.022796 2016-04-02 0.024917 2016-04-03 0.025185 2016-04-04 0.024485 2016-04-05 0.124771 2016-04-06 0.221804 2016-04-07 0.131937 Name: lastSeen, dtype: float64
autos['registration_year'].describe()
count 48561.000000 mean 2004.755400 std 88.647415 min 1000.000000 25% 1999.000000 50% 2004.000000 75% 2008.000000 max 9999.000000 Name: registration_year, dtype: float64
autos['year']=autos['registration_year'][:4]
autos['registration_year'].value_counts().sort_index(ascending=False)
#highest yeas is 9999, lowest year is 1000
#remove the upper and the lower bounds. only keep data from 1958-2016
9999 3 9000 1 8888 1 6200 1 5911 1 5000 4 4800 1 4500 1 4100 1 2800 1 2019 2 2018 470 2017 1392 2016 1219 2015 391 2014 663 2013 803 2012 1309 2011 1623 2010 1589 2009 2085 2008 2215 2007 2277 2006 2670 2005 2936 2004 2703 2003 2699 2002 2486 2001 2636 2000 3156 ... 1964 12 1963 8 1962 4 1961 6 1960 23 1959 6 1958 4 1957 2 1956 4 1955 2 1954 2 1953 1 1952 1 1951 2 1950 3 1948 1 1943 1 1941 2 1939 1 1938 1 1937 4 1934 2 1931 1 1929 1 1927 1 1910 5 1800 2 1111 1 1001 1 1000 1 Name: registration_year, Length: 95, dtype: int64
autos=autos[autos['registration_year'].between(1958,2016)]
autos['registration_year'].value_counts(normalize=True)
#2000 has the most listings with cars, it has 6.77% and the second one is year 2005, has 6.30%
#The lowesr is year 2019 only has
2000 0.067667 2005 0.062950 1999 0.062114 2004 0.057955 2003 0.057869 2006 0.057247 2001 0.056518 2002 0.053302 1998 0.050665 2007 0.048821 2008 0.047491 2009 0.044704 1997 0.041831 2011 0.034798 2010 0.034069 1996 0.029438 2012 0.028066 1995 0.026308 2016 0.026136 2013 0.017217 2014 0.014215 1994 0.013486 1993 0.009112 2015 0.008383 1992 0.007933 1990 0.007440 1991 0.007268 1989 0.003731 1988 0.002895 1985 0.002037 1980 0.001822 1987 0.001544 1986 0.001544 1984 0.001093 1983 0.001093 1978 0.000943 1982 0.000879 1970 0.000815 1979 0.000729 1972 0.000708 1981 0.000600 1967 0.000557 1968 0.000557 1971 0.000557 1974 0.000515 1960 0.000493 1973 0.000493 1966 0.000472 1976 0.000450 1977 0.000450 1969 0.000407 1975 0.000386 1965 0.000364 1964 0.000257 1963 0.000172 1961 0.000129 1959 0.000129 1958 0.000086 1962 0.000086 Name: registration_year, dtype: float64
The next step is to explore variations across different car brands. We can see what are the top 20 brands for lisiting.
top_brands = autos['brand'].value_counts(normalize=True)
top_brands=top_brands[top_brands>0.05].index
average_price_brand={}
for brand in top_brands:
selected_rows=autos[autos['brand']==brand]
mean_of_selected_brand=selected_rows['price'].mean()
average_price_brand[brand]=mean_of_selected_brand
sort_average_price_brand=sorted(average_price_brand.items(),key=lambda x:x[1],reverse=True)
print(sort_average_price_brand)
[('audi', 9336.687453600594), ('mercedes_benz', 8577.71002890816), ('bmw', 8332.203855140187), ('volkswagen', 5398.709055876686), ('ford', 3720.318809450752), ('opel', 2974.688122758071)]
In order to understand what band hold the best value in the used car listing,we analyzed average price for each brand by taking brands having over 5% of listings.
We can conclude that Audi,Mercedes_benz and BMW are more expensive. Audi takes the first place,which is $758.98 more than the second place,Mercedes_benz. However, There are only $245.51 difference between the second place and the third place. Ford and Opel are less expensive and Volkswagen is in between.
For the top 6 brands, let's use aggregation to understand the average mileage for those cars and if there's any visible link with mean price.
average_mileage={}
for brand in top_brands:
selected_rows=autos[autos['brand']==brand]
mean_of_mileage=selected_rows['odometer_km'].mean()
average_mileage[brand]=mean_of_mileage
sort_average_mileage=sorted(average_mileage.items(),key=lambda x:x[1],reverse=True)
print(sort_average_mileage)
[('bmw', 132597.35202492212), ('mercedes_benz', 130886.14631976874), ('opel', 129342.3674770825), ('audi', 129157.38678544914), ('volkswagen', 128710.06997261941), ('ford', 124390.91745934336)]
avg_price=pd.Series(average_price_brand)
avg_price=pd.DataFrame(avg_price,columns=['Mean Price'])
print(avg_price)
Mean Price audi 9336.687454 bmw 8332.203855 ford 3720.318809 mercedes_benz 8577.710029 opel 2974.688123 volkswagen 5398.709056
avg_mil=pd.Series(average_mileage)
avg_mil=pd.DataFrame(avg_mil,columns=['Mean Mileage'])
avg_price['Mean Mileage']=avg_mil['Mean Mileage']# put above 2 dfs into 1 df
avg_price=avg_price.sort_values(by='Mean Price',ascending=False)
print(avg_price)
Mean Price Mean Mileage audi 9336.687454 129157.386785 mercedes_benz 8577.710029 130886.146320 bmw 8332.203855 132597.352025 volkswagen 5398.709056 128710.069973 ford 3720.318809 124390.917459 opel 2974.688123 129342.367477
Mileage does not impact price much on all 6 brands. Compared the hightest mean price brand,Audi,and the lowest mean price brand, Opel, mean price has over $6000 difference while mean mileage only has 185km difference. Audi has 185km less than Opel on average mean mileage. Even thought Mercedes and BMW have a higher mean mileage than Volkswagen,Ford and Opel, their mean prices are still higher than those three.