Okay, here we go with the project. I hope I can get it to work. We'll be importing a dataset from the German eBay kleinen ads for cars, do some cleanup and then a bit of analysis.
import numpy as np
import pandas as pd
autos = pd.read_csv("autos.csv", encoding = 'Latin-1' )
autos
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
5 | 2016-03-21 13:47:45 | Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto... | privat | Angebot | $7,900 | test | bus | 2006 | automatik | 150 | voyager | 150,000km | 4 | diesel | chrysler | NaN | 2016-03-21 00:00:00 | 0 | 22962 | 2016-04-06 09:45:21 |
6 | 2016-03-20 17:55:21 | VW_Golf_III_GT_Special_Electronic_Green_Metall... | privat | Angebot | $300 | test | limousine | 1995 | manuell | 90 | golf | 150,000km | 8 | benzin | volkswagen | NaN | 2016-03-20 00:00:00 | 0 | 31535 | 2016-03-23 02:48:59 |
7 | 2016-03-16 18:55:19 | Golf_IV_1.9_TDI_90PS | privat | Angebot | $1,990 | control | limousine | 1998 | manuell | 90 | golf | 150,000km | 12 | diesel | volkswagen | nein | 2016-03-16 00:00:00 | 0 | 53474 | 2016-04-07 03:17:32 |
8 | 2016-03-22 16:51:34 | Seat_Arosa | privat | Angebot | $250 | test | NaN | 2000 | manuell | 0 | arosa | 150,000km | 10 | NaN | seat | nein | 2016-03-22 00:00:00 | 0 | 7426 | 2016-03-26 18:18:10 |
9 | 2016-03-16 13:47:02 | Renault_Megane_Scenic_1.6e_RT_Klimaanlage | privat | Angebot | $590 | control | bus | 1997 | manuell | 90 | megane | 150,000km | 7 | benzin | renault | nein | 2016-03-16 00:00:00 | 0 | 15749 | 2016-04-06 10:46:35 |
10 | 2016-03-15 01:41:36 | VW_Golf_Tuning_in_siber/grau | privat | Angebot | $999 | test | NaN | 2017 | manuell | 90 | NaN | 150,000km | 4 | benzin | volkswagen | nein | 2016-03-14 00:00:00 | 0 | 86157 | 2016-04-07 03:16:21 |
11 | 2016-03-16 18:45:34 | Mercedes_A140_Motorschaden | privat | Angebot | $350 | control | NaN | 2000 | NaN | 0 | NaN | 150,000km | 0 | benzin | mercedes_benz | NaN | 2016-03-16 00:00:00 | 0 | 17498 | 2016-03-16 18:45:34 |
12 | 2016-03-31 19:48:22 | Smart_smart_fortwo_coupe_softouch_pure_MHD_Pan... | privat | Angebot | $5,299 | control | kleinwagen | 2010 | automatik | 71 | fortwo | 50,000km | 9 | benzin | smart | nein | 2016-03-31 00:00:00 | 0 | 34590 | 2016-04-06 14:17:52 |
13 | 2016-03-23 10:48:32 | Audi_A3_1.6_tuning | privat | Angebot | $1,350 | control | limousine | 1999 | manuell | 101 | a3 | 150,000km | 11 | benzin | audi | nein | 2016-03-23 00:00:00 | 0 | 12043 | 2016-04-01 14:17:13 |
14 | 2016-03-23 11:50:46 | Renault_Clio_3__Dynamique_1.2__16_V;_viele_Ver... | privat | Angebot | $3,999 | test | kleinwagen | 2007 | manuell | 75 | clio | 150,000km | 9 | benzin | renault | NaN | 2016-03-23 00:00:00 | 0 | 81737 | 2016-04-01 15:46:47 |
15 | 2016-04-01 12:06:20 | Corvette_C3_Coupe_T_Top_Crossfire_Injection | privat | Angebot | $18,900 | test | coupe | 1982 | automatik | 203 | NaN | 80,000km | 6 | benzin | sonstige_autos | nein | 2016-04-01 00:00:00 | 0 | 61276 | 2016-04-02 21:10:48 |
16 | 2016-03-16 14:59:02 | Opel_Vectra_B_Kombi | privat | Angebot | $350 | test | kombi | 1999 | manuell | 101 | vectra | 150,000km | 5 | benzin | opel | nein | 2016-03-16 00:00:00 | 0 | 57299 | 2016-03-18 05:29:37 |
17 | 2016-03-29 11:46:22 | Volkswagen_Scirocco_2_G60 | privat | Angebot | $5,500 | test | coupe | 1990 | manuell | 205 | scirocco | 150,000km | 6 | benzin | volkswagen | nein | 2016-03-29 00:00:00 | 0 | 74821 | 2016-04-05 20:46:26 |
18 | 2016-03-26 19:57:44 | Verkaufen_mein_bmw_e36_320_i_touring | privat | Angebot | $300 | control | bus | 1995 | manuell | 150 | 3er | 150,000km | 0 | benzin | bmw | NaN | 2016-03-26 00:00:00 | 0 | 54329 | 2016-04-02 12:16:41 |
19 | 2016-03-17 13:36:21 | mazda_tribute_2.0_mit_gas_und_tuev_neu_2018 | privat | Angebot | $4,150 | control | suv | 2004 | manuell | 124 | andere | 150,000km | 2 | lpg | mazda | nein | 2016-03-17 00:00:00 | 0 | 40878 | 2016-03-17 14:45:58 |
20 | 2016-03-05 19:57:31 | Audi_A4_Avant_1.9_TDI_*6_Gang*AHK*Klimatronik*... | privat | Angebot | $3,500 | test | kombi | 2003 | manuell | 131 | a4 | 150,000km | 5 | diesel | audi | NaN | 2016-03-05 00:00:00 | 0 | 53913 | 2016-03-07 05:46:46 |
21 | 2016-03-06 19:07:10 | Porsche_911_Carrera_4S_Cabrio | privat | Angebot | $41,500 | test | cabrio | 2004 | manuell | 320 | 911 | 150,000km | 4 | benzin | porsche | nein | 2016-03-06 00:00:00 | 0 | 65428 | 2016-04-05 23:46:19 |
22 | 2016-03-28 20:50:54 | MINI_Cooper_S_Cabrio | privat | Angebot | $25,450 | control | cabrio | 2015 | manuell | 184 | cooper | 10,000km | 1 | benzin | mini | nein | 2016-03-28 00:00:00 | 0 | 44789 | 2016-04-01 06:45:30 |
23 | 2016-03-10 19:55:34 | Peugeot_Boxer_2_2_HDi_120_Ps_9_Sitzer_inkl_Klima | privat | Angebot | $7,999 | control | bus | 2010 | manuell | 120 | NaN | 150,000km | 2 | diesel | peugeot | nein | 2016-03-10 00:00:00 | 0 | 30900 | 2016-03-17 08:45:17 |
24 | 2016-04-03 11:57:02 | BMW_535i_xDrive_Sport_Aut. | privat | Angebot | $48,500 | control | limousine | 2014 | automatik | 306 | 5er | 30,000km | 12 | benzin | bmw | nein | 2016-04-03 00:00:00 | 0 | 22547 | 2016-04-07 13:16:50 |
25 | 2016-03-21 21:56:18 | Ford_escort_kombi_an_bastler_mit_ghia_ausstattung | privat | Angebot | $90 | control | kombi | 1996 | manuell | 116 | NaN | 150,000km | 4 | benzin | ford | ja | 2016-03-21 00:00:00 | 0 | 27574 | 2016-04-01 05:16:49 |
26 | 2016-04-03 22:46:28 | Volkswagen_Polo_Fox | privat | Angebot | $777 | control | kleinwagen | 1992 | manuell | 54 | polo | 125,000km | 2 | benzin | volkswagen | nein | 2016-04-03 00:00:00 | 0 | 38110 | 2016-04-05 23:46:48 |
27 | 2016-03-27 18:45:01 | Hat_einer_Ahnung_mit_Ford_Galaxy_HILFE | privat | Angebot | $0 | control | NaN | 2005 | NaN | 0 | NaN | 150,000km | 0 | NaN | ford | NaN | 2016-03-27 00:00:00 | 0 | 66701 | 2016-03-27 18:45:01 |
28 | 2016-03-19 21:56:19 | MINI_Cooper_D | privat | Angebot | $5,250 | control | kleinwagen | 2007 | manuell | 110 | cooper | 150,000km | 7 | diesel | mini | ja | 2016-03-19 00:00:00 | 0 | 15745 | 2016-04-07 14:58:48 |
29 | 2016-04-02 12:45:44 | Mercedes_Benz_E_320_T_CDI_Avantgarde_DPF7_Sitz... | privat | Angebot | $4,999 | test | kombi | 2004 | automatik | 204 | e_klasse | 150,000km | 10 | diesel | mercedes_benz | nein | 2016-04-02 00:00:00 | 0 | 47638 | 2016-04-02 12:45:44 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
49970 | 2016-03-21 22:47:37 | c4_Grand_Picasso_mit_Automatik_Leder_Navi_Temp... | privat | Angebot | $15,800 | control | bus | 2010 | automatik | 136 | c4 | 60,000km | 4 | diesel | citroen | nein | 2016-03-21 00:00:00 | 0 | 14947 | 2016-04-07 04:17:34 |
49971 | 2016-03-29 14:54:12 | W.Lupo_1.0 | privat | Angebot | $950 | test | kleinwagen | 2001 | manuell | 50 | lupo | 150,000km | 4 | benzin | volkswagen | nein | 2016-03-29 00:00:00 | 0 | 65197 | 2016-03-29 20:41:51 |
49972 | 2016-03-26 22:25:23 | Mercedes_Benz_Vito_115_CDI_Extralang_Aut. | privat | Angebot | $3,300 | control | bus | 2004 | automatik | 150 | vito | 150,000km | 10 | diesel | mercedes_benz | ja | 2016-03-26 00:00:00 | 0 | 65326 | 2016-03-28 11:28:18 |
49973 | 2016-03-27 05:32:39 | Mercedes_Benz_SLK_200_Kompressor | privat | Angebot | $6,000 | control | cabrio | 2004 | manuell | 163 | slk | 150,000km | 11 | benzin | mercedes_benz | nein | 2016-03-27 00:00:00 | 0 | 53567 | 2016-03-27 08:25:24 |
49974 | 2016-03-20 10:52:31 | Golf_1_Cabrio_Tuev_Neu_viele_Extras_alles_eing... | privat | Angebot | $0 | control | cabrio | 1983 | manuell | 70 | golf | 150,000km | 2 | benzin | volkswagen | nein | 2016-03-20 00:00:00 | 0 | 8209 | 2016-03-27 19:48:16 |
49975 | 2016-03-27 20:51:39 | Honda_Jazz_1.3_DSi_i_VTEC_IMA_CVT_Comfort | privat | Angebot | $9,700 | control | kleinwagen | 2012 | automatik | 88 | jazz | 100,000km | 11 | hybrid | honda | nein | 2016-03-27 00:00:00 | 0 | 84385 | 2016-04-05 19:45:34 |
49976 | 2016-03-19 18:56:05 | Audi_80_Avant_2.6_E__Vollausstattung!!_Einziga... | privat | Angebot | $5,900 | test | kombi | 1992 | automatik | 150 | 80 | 150,000km | 12 | benzin | audi | nein | 2016-03-19 00:00:00 | 0 | 36100 | 2016-04-07 06:16:44 |
49977 | 2016-03-31 18:37:18 | Mercedes_Benz_C200_Cdi_W203 | privat | Angebot | $5,500 | control | limousine | 2003 | manuell | 116 | c_klasse | 150,000km | 2 | diesel | mercedes_benz | nein | 2016-03-31 00:00:00 | 0 | 33739 | 2016-04-06 12:16:11 |
49978 | 2016-04-04 10:37:14 | Mercedes_Benz_E_200_Classic | privat | Angebot | $900 | control | limousine | 1996 | automatik | 136 | e_klasse | 150,000km | 9 | benzin | mercedes_benz | ja | 2016-04-04 00:00:00 | 0 | 24405 | 2016-04-06 12:44:20 |
49979 | 2016-03-20 18:38:40 | Volkswagen_Polo_1.6_TDI_Style | privat | Angebot | $11,000 | test | kleinwagen | 2011 | manuell | 90 | polo | 70,000km | 11 | diesel | volkswagen | nein | 2016-03-20 00:00:00 | 0 | 48455 | 2016-04-07 01:45:12 |
49980 | 2016-03-12 10:55:54 | Ford_Escort_Turnier_16V | privat | Angebot | $400 | control | kombi | 1995 | manuell | 105 | escort | 125,000km | 3 | benzin | ford | NaN | 2016-03-12 00:00:00 | 0 | 56218 | 2016-04-06 17:16:49 |
49981 | 2016-03-15 09:38:21 | Opel_Astra_Kombi_mit_Anhaengerkupplung | privat | Angebot | $2,000 | control | kombi | 1998 | manuell | 115 | astra | 150,000km | 12 | benzin | opel | nein | 2016-03-15 00:00:00 | 0 | 86859 | 2016-04-05 17:21:46 |
49982 | 2016-03-29 18:51:08 | Skoda_Fabia_4_Tuerer_Bj:2004__85.000Tkm | privat | Angebot | $1,950 | control | kleinwagen | 2004 | manuell | 0 | fabia | 90,000km | 7 | benzin | skoda | NaN | 2016-03-29 00:00:00 | 0 | 45884 | 2016-03-29 18:51:08 |
49983 | 2016-03-06 12:43:04 | Ford_focus_99 | privat | Angebot | $600 | test | kleinwagen | 1999 | manuell | 101 | focus | 150,000km | 4 | benzin | ford | NaN | 2016-03-06 00:00:00 | 0 | 52477 | 2016-03-09 06:16:08 |
49984 | 2016-03-31 22:48:48 | Student_sucht_ein__Anfaengerauto___ab_2000_BJ_... | privat | Angebot | $0 | test | NaN | 2000 | NaN | 0 | NaN | 150,000km | 0 | NaN | sonstige_autos | NaN | 2016-03-31 00:00:00 | 0 | 12103 | 2016-04-02 19:44:53 |
49985 | 2016-04-02 16:38:23 | Verkaufe_meinen_vw_vento! | privat | Angebot | $1,000 | control | NaN | 1995 | automatik | 0 | NaN | 150,000km | 0 | benzin | volkswagen | NaN | 2016-04-02 00:00:00 | 0 | 30900 | 2016-04-06 15:17:52 |
49986 | 2016-04-04 20:46:02 | Chrysler_300C_3.0_CRD_DPF_Automatik_Voll_Ausst... | privat | Angebot | $15,900 | control | limousine | 2010 | automatik | 218 | 300c | 125,000km | 11 | diesel | chrysler | nein | 2016-04-04 00:00:00 | 0 | 73527 | 2016-04-06 23:16:00 |
49987 | 2016-03-22 20:47:27 | Audi_A3_Limousine_2.0_TDI_DPF_Ambition__NAVI__... | privat | Angebot | $21,990 | control | limousine | 2013 | manuell | 150 | a3 | 50,000km | 11 | diesel | audi | nein | 2016-03-22 00:00:00 | 0 | 94362 | 2016-03-26 22:46:06 |
49988 | 2016-03-28 19:49:51 | BMW_330_Ci | privat | Angebot | $9,550 | control | coupe | 2001 | manuell | 231 | 3er | 150,000km | 10 | benzin | bmw | nein | 2016-03-28 00:00:00 | 0 | 83646 | 2016-04-07 02:17:40 |
49989 | 2016-03-11 19:50:37 | VW_Polo_zum_Ausschlachten_oder_Wiederaufbau | privat | Angebot | $150 | test | kleinwagen | 1997 | manuell | 0 | polo | 150,000km | 5 | benzin | volkswagen | ja | 2016-03-11 00:00:00 | 0 | 21244 | 2016-03-12 10:17:55 |
49990 | 2016-03-21 19:54:19 | Mercedes_Benz_A_200__BlueEFFICIENCY__Urban | privat | Angebot | $17,500 | test | limousine | 2012 | manuell | 156 | a_klasse | 30,000km | 12 | benzin | mercedes_benz | nein | 2016-03-21 00:00:00 | 0 | 58239 | 2016-04-06 22:46:57 |
49991 | 2016-03-06 15:25:19 | Kleinwagen | privat | Angebot | $500 | control | NaN | 2016 | manuell | 0 | twingo | 150,000km | 0 | benzin | renault | NaN | 2016-03-06 00:00:00 | 0 | 61350 | 2016-03-06 18:24:19 |
49992 | 2016-03-10 19:37:38 | Fiat_Grande_Punto_1.4_T_Jet_16V_Sport | privat | Angebot | $4,800 | control | kleinwagen | 2009 | manuell | 120 | andere | 125,000km | 9 | lpg | fiat | nein | 2016-03-10 00:00:00 | 0 | 68642 | 2016-03-13 01:44:51 |
49993 | 2016-03-15 18:47:35 | Audi_A3__1_8l__Silber;_schoenes_Fahrzeug | privat | Angebot | $1,650 | control | kleinwagen | 1997 | manuell | 0 | NaN | 150,000km | 7 | benzin | audi | NaN | 2016-03-15 00:00:00 | 0 | 65203 | 2016-04-06 19:46:53 |
49994 | 2016-03-22 17:36:42 | Audi_A6__S6__Avant_4.2_quattro_eventuell_Tausc... | privat | Angebot | $5,000 | control | kombi | 2001 | automatik | 299 | a6 | 150,000km | 1 | benzin | audi | nein | 2016-03-22 00:00:00 | 0 | 46537 | 2016-04-06 08:16:39 |
49995 | 2016-03-27 14:38:19 | Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon | privat | Angebot | $24,900 | control | limousine | 2011 | automatik | 239 | q5 | 100,000km | 1 | diesel | audi | nein | 2016-03-27 00:00:00 | 0 | 82131 | 2016-04-01 13:47:40 |
49996 | 2016-03-28 10:50:25 | Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... | privat | Angebot | $1,980 | control | cabrio | 1996 | manuell | 75 | astra | 150,000km | 5 | benzin | opel | nein | 2016-03-28 00:00:00 | 0 | 44807 | 2016-04-02 14:18:02 |
49997 | 2016-04-02 14:44:48 | Fiat_500_C_1.2_Dualogic_Lounge | privat | Angebot | $13,200 | test | cabrio | 2014 | automatik | 69 | 500 | 5,000km | 11 | benzin | fiat | nein | 2016-04-02 00:00:00 | 0 | 73430 | 2016-04-04 11:47:27 |
49998 | 2016-03-08 19:25:42 | Audi_A3_2.0_TDI_Sportback_Ambition | privat | Angebot | $22,900 | control | kombi | 2013 | manuell | 150 | a3 | 40,000km | 11 | diesel | audi | nein | 2016-03-08 00:00:00 | 0 | 35683 | 2016-04-05 16:45:07 |
49999 | 2016-03-14 00:42:12 | Opel_Vectra_1.6_16V | privat | Angebot | $1,250 | control | limousine | 1996 | manuell | 101 | vectra | 150,000km | 1 | benzin | opel | nein | 2016-03-13 00:00:00 | 0 | 45897 | 2016-04-06 21:18:48 |
50000 rows × 20 columns
autos.info()
autos.head(3)
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): dateCrawled 50000 non-null object name 50000 non-null object seller 50000 non-null object offerType 50000 non-null object price 50000 non-null object abtest 50000 non-null object vehicleType 44905 non-null object yearOfRegistration 50000 non-null int64 gearbox 47320 non-null object powerPS 50000 non-null int64 model 47242 non-null object odometer 50000 non-null object monthOfRegistration 50000 non-null int64 fuelType 45518 non-null object brand 50000 non-null object notRepairedDamage 40171 non-null object dateCreated 50000 non-null object nrOfPictures 50000 non-null int64 postalCode 50000 non-null int64 lastSeen 50000 non-null object dtypes: int64(5), object(15) memory usage: 7.6+ MB
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
We have 20 columns CamelCase instead of snake_case
date: maybe convert to day and get rid of the time? price: remove dollar sign, convert to integer, rename column vehicleType: missing some data gearbox: missing some data model: missing some data odometer: remove km, convert to integer, rename column fuelType: missing some data notRepairedDamage: missing data dateCreated: maybe convert to day and get rid of the time? lastSeen: maybe convert to day and get rid of the time?
autos.columns
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model', 'odometer', 'monthOfRegistration', 'fuelType', 'brand', 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object')
#first, let's convert from camelcase to snakecase
autos.columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price_usd', 'abtest',
'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
'odometer', 'registration_month', 'fuel_type', 'brand',
'unrepaired_damage', 'ad_created', 'number_pictures', 'postal_code',
'last_seen']
autos.head()
date_crawled | name | seller | offer_type | price_usd | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | number_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
autos.describe(include='all')
date_crawled | name | seller | offer_type | price_usd | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | number_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-03-27 22:55:05 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
It looks like we can drop number of pictures, seller, and offertype, since they provide almost no information. I think abtest probably refers to whether the advertisement is itself tested, which isn't relevant here. Is the crawl date, ad creation date, and last seen date relevant?
Maybe, if a car hasn't sold in a while there might be a reason. We'll look at cars that have been listed more recently.
autos = autos.drop(['seller', 'offer_type','number_pictures','abtest'], axis=1)
Some more clean up: Odometer needs to be a number, change column name Price needs to be a number -- we've already changed the column name.
#Cleaning up the price and odometer columns
autos["price_usd"] = autos["price_usd"].str.replace("$","")
autos["price_usd"] = autos["price_usd"].str.replace(",","").astype(int)
autos["odometer"] = autos["odometer"].str.replace("km","")
autos["odometer"] = autos["odometer"].str.replace(",","").astype(int)
autos.rename(columns = {'odometer':'odometer_km'}, inplace=True)
#Lets take a look to find weird data in odometers
autos['odometer_km'].shape
(50000,)
autos["odometer_km"].describe()
count 50000.000000 mean 125732.700000 std 40042.211706 min 5000.000000 25% 125000.000000 50% 150000.000000 75% 150000.000000 max 150000.000000 Name: odometer_km, dtype: float64
Well, that doesn't tell us much. It looks like the data tops out at 150k mileage, so who knows how many miles are actually on a car indicated with 150 (it's actually "above 150")
Let's see how many cars have 150k or more miles.
autos["odometer_km"].value_counts().sort_index(ascending=False).head(10)
150000 32424 125000 5170 100000 2169 90000 1757 80000 1436 70000 1230 60000 1164 50000 1027 40000 819 30000 789 Name: odometer_km, dtype: int64
Wow, that's a lot. Probably we'll filter out those high mileage cars later on.
Let's look at he price now
autos['price_usd'].shape
(50000,)
autos["price_usd"].describe()
count 5.000000e+04 mean 9.840044e+03 std 4.811044e+05 min 0.000000e+00 25% 1.100000e+03 50% 2.950000e+03 75% 7.200000e+03 max 1.000000e+08 Name: price_usd, dtype: float64
autos["price_usd"].value_counts().sort_index(ascending=False).head(50)
99999999 1 27322222 1 12345678 3 11111111 2 10000000 1 3890000 1 1300000 1 1234566 1 999999 2 999990 1 350000 1 345000 1 299000 1 295000 1 265000 1 259000 1 250000 1 220000 1 198000 1 197000 1 194000 1 190000 1 180000 1 175000 1 169999 1 169000 1 163991 1 163500 1 155000 1 151990 1 145000 1 139997 1 137999 1 135000 1 130000 1 129000 1 128000 1 120000 2 119900 1 119500 1 116000 1 115991 1 115000 1 114400 1 109999 1 105000 2 104900 1 99900 2 99000 2 98500 1 Name: price_usd, dtype: int64
Wow. There's a few cars that are crazy expensive. I'm guessing a few of those might be a typo, let's take a look...
autos[autos["price_usd"] > 150000].head(50)
date_crawled | name | price_usd | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
514 | 2016-03-17 09:53:08 | Ford_Focus_Turnier_1.6_16V_Style | 999999 | kombi | 2009 | manuell | 101 | focus | 125000 | 4 | benzin | ford | nein | 2016-03-17 00:00:00 | 12205 | 2016-04-06 07:17:35 |
2897 | 2016-03-12 21:50:57 | Escort_MK_1_Hundeknochen_zum_umbauen_auf_RS_2000 | 11111111 | limousine | 1973 | manuell | 48 | escort | 50000 | 3 | benzin | ford | nein | 2016-03-12 00:00:00 | 94469 | 2016-03-12 22:45:27 |
7814 | 2016-04-04 11:53:31 | Ferrari_F40 | 1300000 | coupe | 1992 | NaN | 0 | NaN | 50000 | 12 | NaN | sonstige_autos | nein | 2016-04-04 00:00:00 | 60598 | 2016-04-05 11:34:11 |
10500 | 2016-03-17 12:56:38 | Porsche_991 | 155000 | coupe | 2013 | NaN | 476 | 911 | 20000 | 11 | NaN | porsche | nein | 2016-03-17 00:00:00 | 90768 | 2016-03-26 23:16:41 |
11137 | 2016-03-29 23:52:57 | suche_maserati_3200_gt_Zustand_unwichtig_laufe... | 10000000 | coupe | 1960 | manuell | 368 | NaN | 100000 | 1 | benzin | sonstige_autos | nein | 2016-03-29 00:00:00 | 73033 | 2016-04-06 21:18:11 |
11433 | 2016-03-30 21:44:21 | Audi_R8_V10_plus_5.2_FSI_quattro_S_tronic_Akra... | 175000 | coupe | 2016 | automatik | 610 | andere | 5000 | 1 | benzin | audi | nein | 2016-03-30 00:00:00 | 94244 | 2016-04-05 11:46:33 |
12682 | 2016-03-28 22:48:01 | Porsche_GT3_RS__PCCB__Lift___grosser_Exklusiv_... | 265000 | coupe | 2016 | automatik | 500 | 911 | 5000 | 3 | benzin | porsche | nein | 2016-03-28 00:00:00 | 70193 | 2016-04-05 03:44:51 |
14715 | 2016-03-30 08:37:24 | Rolls_Royce_Phantom_Drophead_Coupe | 345000 | cabrio | 2012 | automatik | 460 | NaN | 20000 | 8 | benzin | sonstige_autos | nein | 2016-03-30 00:00:00 | 73525 | 2016-04-07 00:16:26 |
17140 | 2016-03-05 17:39:22 | Mercedes_Benz_SLS_AMG_Roadster_AMG_SPEEDSHIFT_DCT | 180000 | cabrio | 2012 | automatik | 571 | andere | 50000 | 2 | benzin | mercedes_benz | nein | 2016-03-05 00:00:00 | 14193 | 2016-03-20 19:44:36 |
18509 | 2016-03-27 18:43:02 | Porsche_991_GT3 | 169000 | coupe | 2013 | automatik | 476 | 911 | 20000 | 9 | benzin | porsche | nein | 2016-03-27 00:00:00 | 83629 | 2016-04-05 15:15:29 |
20351 | 2016-03-22 17:47:59 | BMW_Z8_roadster | 190000 | cabrio | 2000 | manuell | 400 | z_reihe | 60000 | 9 | benzin | bmw | nein | 2016-03-22 00:00:00 | 14473 | 2016-04-02 11:44:28 |
22673 | 2016-03-05 18:40:13 | Porsche_911_991_GT3_LED_Licht_Chrono_Navi_Club... | 163991 | coupe | 2014 | automatik | 476 | 911 | 10000 | 6 | benzin | porsche | nein | 2016-03-05 00:00:00 | 51702 | 2016-03-09 16:45:30 |
22947 | 2016-03-22 12:54:19 | Bmw_530d_zum_ausschlachten | 1234566 | kombi | 1999 | automatik | 190 | NaN | 150000 | 2 | diesel | bmw | NaN | 2016-03-22 00:00:00 | 17454 | 2016-04-02 03:17:32 |
24384 | 2016-03-21 13:57:51 | Schlachte_Golf_3_gt_tdi | 11111111 | NaN | 1995 | NaN | 0 | NaN | 150000 | 0 | NaN | volkswagen | NaN | 2016-03-21 00:00:00 | 18519 | 2016-03-21 14:40:18 |
27371 | 2016-03-09 15:45:47 | Fiat_Punto | 12345678 | NaN | 2017 | NaN | 95 | punto | 150000 | 0 | NaN | fiat | NaN | 2016-03-09 00:00:00 | 96110 | 2016-03-09 15:45:47 |
28090 | 2016-03-08 19:50:06 | Tesla_Model_X_P90D_Signature_Sondermodel__Neuw... | 194000 | suv | 2016 | automatik | 0 | NaN | 5000 | 3 | elektro | sonstige_autos | nein | 2016-03-08 00:00:00 | 97990 | 2016-03-16 04:15:31 |
32840 | 2016-03-22 08:55:27 | Mercedes_Benz_G_500_7G_TRONIC_Cabrio_Desingno_... | 169999 | cabrio | 2012 | automatik | 387 | g_klasse | 40000 | 3 | benzin | mercedes_benz | nein | 2016-03-22 00:00:00 | 42655 | 2016-04-05 20:47:17 |
33638 | 2016-03-24 12:54:19 | Porsche_911_GT3_RS | 151990 | coupe | 2007 | manuell | 415 | 911 | 20000 | 10 | benzin | porsche | nein | 2016-03-24 00:00:00 | 40479 | 2016-04-07 03:46:46 |
34723 | 2016-03-23 16:37:29 | Porsche_Porsche_911/930_Turbo_3.0__deutsche_Au... | 299000 | coupe | 1977 | manuell | 260 | 911 | 100000 | 7 | benzin | porsche | nein | 2016-03-23 00:00:00 | 61462 | 2016-04-06 16:44:50 |
35923 | 2016-04-03 07:56:23 | Porsche_911_Targa_Exclusive_Edition__1_von_15_... | 295000 | cabrio | 2015 | automatik | 400 | 911 | 5000 | 6 | benzin | porsche | nein | 2016-04-03 00:00:00 | 74078 | 2016-04-03 08:56:20 |
36818 | 2016-03-27 18:37:37 | Porsche_991 | 350000 | coupe | 2016 | manuell | 500 | 911 | 5000 | 3 | benzin | porsche | nein | 2016-03-27 00:00:00 | 70499 | 2016-03-27 18:37:37 |
37585 | 2016-03-29 11:38:54 | Volkswagen_Jetta_GT | 999990 | limousine | 1985 | manuell | 111 | jetta | 150000 | 12 | benzin | volkswagen | ja | 2016-03-29 00:00:00 | 50997 | 2016-03-29 11:38:54 |
37840 | 2016-03-21 10:50:12 | Porsche_997 | 220000 | coupe | 2008 | manuell | 415 | 911 | 30000 | 7 | benzin | porsche | nein | 2016-03-21 00:00:00 | 69198 | 2016-04-06 04:46:14 |
38299 | 2016-03-28 22:25:25 | Glas_BMW_mit_Wasser | 250000 | NaN | 2015 | NaN | 0 | x_reihe | 5000 | 0 | NaN | bmw | NaN | 2016-03-28 00:00:00 | 60489 | 2016-03-28 22:25:25 |
39377 | 2016-03-08 23:53:51 | Tausche_volvo_v40_gegen_van | 12345678 | NaN | 2018 | manuell | 95 | v40 | 150000 | 6 | NaN | volvo | nein | 2016-03-08 00:00:00 | 14542 | 2016-04-06 23:17:31 |
39705 | 2016-03-22 14:58:27 | Tausch_gegen_gleichwertiges | 99999999 | limousine | 1999 | automatik | 224 | s_klasse | 150000 | 9 | benzin | mercedes_benz | NaN | 2016-03-22 00:00:00 | 73525 | 2016-04-06 05:15:30 |
40918 | 2016-03-20 18:40:05 | Porsche_911_991_GT3_RS | 198000 | coupe | 2015 | automatik | 500 | 911 | 5000 | 6 | benzin | porsche | nein | 2016-03-20 00:00:00 | 51491 | 2016-03-21 21:46:36 |
42221 | 2016-03-08 20:39:05 | Leasinguebernahme | 27322222 | limousine | 2014 | manuell | 163 | c4 | 40000 | 2 | diesel | citroen | NaN | 2016-03-08 00:00:00 | 76532 | 2016-03-08 20:39:05 |
43049 | 2016-03-21 19:53:52 | 2_VW_Busse_T3 | 999999 | bus | 1981 | manuell | 70 | transporter | 150000 | 1 | benzin | volkswagen | NaN | 2016-03-21 00:00:00 | 99880 | 2016-03-28 17:18:28 |
43668 | 2016-03-16 18:47:26 | Porsche_993/911_Turbo_WLS_II_Exclusive_S_deuts... | 197000 | coupe | 1998 | manuell | 450 | 911 | 150000 | 3 | NaN | porsche | nein | 2016-03-16 00:00:00 | 46147 | 2016-04-07 02:44:47 |
45387 | 2016-03-18 16:41:22 | Porsche_911_GT3__Clubsp._Lift_nur_bis_25.03._v... | 163500 | coupe | 2015 | automatik | 476 | 911 | 10000 | 9 | benzin | porsche | nein | 2016-03-18 00:00:00 | 74078 | 2016-03-18 16:41:22 |
47337 | 2016-04-05 10:25:38 | BMW_Z8_roadster | 259000 | cabrio | 2001 | manuell | 400 | z_reihe | 20000 | 6 | benzin | bmw | nein | 2016-04-05 00:00:00 | 61462 | 2016-04-05 12:07:32 |
47598 | 2016-03-31 18:56:54 | Opel_Vectra_B_1_6i_16V_Facelift_Tuning_Showcar... | 12345678 | limousine | 2001 | manuell | 101 | vectra | 150000 | 3 | benzin | opel | nein | 2016-03-31 00:00:00 | 4356 | 2016-03-31 18:56:54 |
47634 | 2016-04-04 21:25:21 | Ferrari_FXX | 3890000 | coupe | 2006 | NaN | 799 | NaN | 5000 | 7 | NaN | sonstige_autos | nein | 2016-04-04 00:00:00 | 60313 | 2016-04-05 12:07:37 |
Looks like there's some typos in there. Let's get rid of all cars with a price greater than 200000 - there's only one legit entry (A ferrari), and we can't afford it anyway..
autos = autos[autos.price_usd < 200000]
There's also a bunch of autos that are way too cheap. We aren't in the market for a really bad car - let's get rid of those too.
autos = autos[autos.price_usd > 2500]
Now let's take a quick look at the data we have left. First make sure we got rid of the high prices.
autos[autos["price_usd"] > 150000].sort_index(ascending=False).head(50)
date_crawled | name | price_usd | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
45387 | 2016-03-18 16:41:22 | Porsche_911_GT3__Clubsp._Lift_nur_bis_25.03._v... | 163500 | coupe | 2015 | automatik | 476 | 911 | 10000 | 9 | benzin | porsche | nein | 2016-03-18 00:00:00 | 74078 | 2016-03-18 16:41:22 |
43668 | 2016-03-16 18:47:26 | Porsche_993/911_Turbo_WLS_II_Exclusive_S_deuts... | 197000 | coupe | 1998 | manuell | 450 | 911 | 150000 | 3 | NaN | porsche | nein | 2016-03-16 00:00:00 | 46147 | 2016-04-07 02:44:47 |
40918 | 2016-03-20 18:40:05 | Porsche_911_991_GT3_RS | 198000 | coupe | 2015 | automatik | 500 | 911 | 5000 | 6 | benzin | porsche | nein | 2016-03-20 00:00:00 | 51491 | 2016-03-21 21:46:36 |
33638 | 2016-03-24 12:54:19 | Porsche_911_GT3_RS | 151990 | coupe | 2007 | manuell | 415 | 911 | 20000 | 10 | benzin | porsche | nein | 2016-03-24 00:00:00 | 40479 | 2016-04-07 03:46:46 |
32840 | 2016-03-22 08:55:27 | Mercedes_Benz_G_500_7G_TRONIC_Cabrio_Desingno_... | 169999 | cabrio | 2012 | automatik | 387 | g_klasse | 40000 | 3 | benzin | mercedes_benz | nein | 2016-03-22 00:00:00 | 42655 | 2016-04-05 20:47:17 |
28090 | 2016-03-08 19:50:06 | Tesla_Model_X_P90D_Signature_Sondermodel__Neuw... | 194000 | suv | 2016 | automatik | 0 | NaN | 5000 | 3 | elektro | sonstige_autos | nein | 2016-03-08 00:00:00 | 97990 | 2016-03-16 04:15:31 |
22673 | 2016-03-05 18:40:13 | Porsche_911_991_GT3_LED_Licht_Chrono_Navi_Club... | 163991 | coupe | 2014 | automatik | 476 | 911 | 10000 | 6 | benzin | porsche | nein | 2016-03-05 00:00:00 | 51702 | 2016-03-09 16:45:30 |
20351 | 2016-03-22 17:47:59 | BMW_Z8_roadster | 190000 | cabrio | 2000 | manuell | 400 | z_reihe | 60000 | 9 | benzin | bmw | nein | 2016-03-22 00:00:00 | 14473 | 2016-04-02 11:44:28 |
18509 | 2016-03-27 18:43:02 | Porsche_991_GT3 | 169000 | coupe | 2013 | automatik | 476 | 911 | 20000 | 9 | benzin | porsche | nein | 2016-03-27 00:00:00 | 83629 | 2016-04-05 15:15:29 |
17140 | 2016-03-05 17:39:22 | Mercedes_Benz_SLS_AMG_Roadster_AMG_SPEEDSHIFT_DCT | 180000 | cabrio | 2012 | automatik | 571 | andere | 50000 | 2 | benzin | mercedes_benz | nein | 2016-03-05 00:00:00 | 14193 | 2016-03-20 19:44:36 |
11433 | 2016-03-30 21:44:21 | Audi_R8_V10_plus_5.2_FSI_quattro_S_tronic_Akra... | 175000 | coupe | 2016 | automatik | 610 | andere | 5000 | 1 | benzin | audi | nein | 2016-03-30 00:00:00 | 94244 | 2016-04-05 11:46:33 |
10500 | 2016-03-17 12:56:38 | Porsche_991 | 155000 | coupe | 2013 | NaN | 476 | 911 | 20000 | 11 | NaN | porsche | nein | 2016-03-17 00:00:00 | 90768 | 2016-03-26 23:16:41 |
Now we'll look to make sure we got rid of the low values
autos[autos["price_usd"] < 5000].sort_index(ascending=True).head(50)
date_crawled | name | price_usd | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | 4350 | kleinwagen | 2007 | automatik | 71 | fortwo | 70000 | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 33729 | 2016-03-15 03:16:28 |
14 | 2016-03-23 11:50:46 | Renault_Clio_3__Dynamique_1.2__16_V;_viele_Ver... | 3999 | kleinwagen | 2007 | manuell | 75 | clio | 150000 | 9 | benzin | renault | NaN | 2016-03-23 00:00:00 | 81737 | 2016-04-01 15:46:47 |
19 | 2016-03-17 13:36:21 | mazda_tribute_2.0_mit_gas_und_tuev_neu_2018 | 4150 | suv | 2004 | manuell | 124 | andere | 150000 | 2 | lpg | mazda | nein | 2016-03-17 00:00:00 | 40878 | 2016-03-17 14:45:58 |
20 | 2016-03-05 19:57:31 | Audi_A4_Avant_1.9_TDI_*6_Gang*AHK*Klimatronik*... | 3500 | kombi | 2003 | manuell | 131 | a4 | 150000 | 5 | diesel | audi | NaN | 2016-03-05 00:00:00 | 53913 | 2016-03-07 05:46:46 |
29 | 2016-04-02 12:45:44 | Mercedes_Benz_E_320_T_CDI_Avantgarde_DPF7_Sitz... | 4999 | kombi | 2004 | automatik | 204 | e_klasse | 150000 | 10 | diesel | mercedes_benz | nein | 2016-04-02 00:00:00 | 47638 | 2016-04-02 12:45:44 |
31 | 2016-03-14 16:53:09 | Noch_gut_erhaltenen_C_320 | 2850 | kombi | 2002 | automatik | 218 | c_klasse | 150000 | 7 | benzin | mercedes_benz | nein | 2016-03-14 00:00:00 | 41065 | 2016-03-16 07:19:04 |
33 | 2016-03-15 12:25:26 | Audi_A3_2.0_TDI_Sportback_DPF_Kupplung_defekt_... | 4800 | kombi | 2006 | manuell | 140 | a3 | 150000 | 8 | diesel | audi | nein | 2016-03-15 00:00:00 | 61169 | 2016-04-07 13:15:31 |
34 | 2016-04-05 07:55:07 | Bmw_530d__navi_tuv03/18 | 2600 | kombi | 2000 | automatik | 193 | 5er | 150000 | 6 | diesel | bmw | NaN | 2016-04-05 00:00:00 | 95111 | 2016-04-07 12:15:47 |
38 | 2016-03-21 15:51:10 | Volkswagen_Golf_1.4_Special | 2850 | limousine | 2002 | manuell | 75 | golf | 125000 | 2 | benzin | volkswagen | nein | 2016-03-21 00:00:00 | 63674 | 2016-03-28 12:16:06 |
40 | 2016-03-07 14:50:03 | VW_Golf__4_Cabrio_2.0_GTI_16V___Leder_MFA_Alus... | 3500 | cabrio | 1999 | manuell | 150 | golf | 150000 | 1 | benzin | volkswagen | nein | 2016-03-07 00:00:00 | 6780 | 2016-03-12 02:15:52 |
41 | 2016-03-10 10:46:08 | Passat_3b_1.9_TDI_Highline__angemeldet_mit_tuv... | 3200 | kombi | 2003 | manuell | 131 | NaN | 150000 | 7 | NaN | volkswagen | nein | 2016-03-10 00:00:00 | 28259 | 2016-04-06 20:19:08 |
47 | 2016-03-28 19:51:51 | Mercedes_Benz_A_150 | 3499 | limousine | 2008 | manuell | 95 | a_klasse | 150000 | 8 | benzin | mercedes_benz | nein | 2016-03-28 00:00:00 | 59557 | 2016-04-01 05:17:51 |
52 | 2016-03-25 18:50:03 | Senator_A_3.0E_Karosserie_restauriert_m._viele... | 3500 | limousine | 1985 | NaN | 0 | andere | 5000 | 0 | benzin | opel | nein | 2016-03-25 00:00:00 | 63500 | 2016-04-07 00:46:00 |
57 | 2016-03-12 11:46:56 | Volkswagen_Passat_Variant_2.0_TDI_DPF_Trendline | 4100 | kombi | 2007 | manuell | 140 | passat | 150000 | 2 | diesel | volkswagen | nein | 2016-03-12 00:00:00 | 27404 | 2016-03-20 09:49:08 |
58 | 2016-03-30 12:25:18 | Volkswagen_Golf_1.4 | 4200 | limousine | 2004 | manuell | 75 | golf | 125000 | 11 | benzin | volkswagen | nein | 2016-03-30 00:00:00 | 56073 | 2016-04-01 08:29:34 |
67 | 2016-03-29 20:50:41 | Jeep_Wrangler_YJ | 4800 | suv | 1992 | manuell | 121 | wrangler | 150000 | 5 | benzin | jeep | NaN | 2016-03-29 00:00:00 | 46325 | 2016-04-06 07:17:12 |
73 | 2016-03-18 23:41:21 | VW_T3_Doka_1_7D_TÜV_07/2017 | 2700 | bus | 1986 | manuell | 57 | transporter | 125000 | 8 | diesel | volkswagen | NaN | 2016-03-18 00:00:00 | 82205 | 2016-04-05 23:16:13 |
75 | 2016-04-02 18:52:03 | Volvo_960/964_Limousine_6_Zylinder | 3800 | limousine | 1990 | automatik | 204 | andere | 150000 | 8 | benzin | volvo | nein | 2016-04-02 00:00:00 | 92721 | 2016-04-06 18:47:41 |
82 | 2016-03-09 20:18:38 | BMW_320i_M_Paket_&_NEUWERTIGER_Motor | 4700 | limousine | 2001 | manuell | 170 | 3er | 150000 | 8 | benzin | bmw | nein | 2016-03-08 00:00:00 | 60439 | 2016-03-10 13:29:35 |
100 | 2016-03-12 18:57:48 | Mazda_5_2.0_CD_DPF_Exclusive | 4500 | bus | 2007 | manuell | 143 | 5_reihe | 90000 | 7 | diesel | mazda | ja | 2016-03-12 00:00:00 | 51375 | 2016-04-07 12:47:06 |
103 | 2016-03-11 22:55:39 | Nissan_Micra_1.2_visia_1._Hand_Bj._2010 | 3799 | kleinwagen | 2004 | manuell | 65 | micra | 70000 | 2 | benzin | nissan | nein | 2016-03-11 00:00:00 | 12157 | 2016-03-17 23:44:25 |
108 | 2016-03-27 19:38:15 | Mercedes_Benz_Vito_111_CDI_Kompakt | 3500 | bus | 2003 | manuell | 109 | vito | 150000 | 10 | diesel | mercedes_benz | nein | 2016-03-27 00:00:00 | 23556 | 2016-03-29 09:16:26 |
109 | 2016-03-21 18:37:36 | Seat_Cordoba | 3399 | limousine | 2006 | manuell | 75 | cordoba | 125000 | 6 | benzin | seat | nein | 2016-03-21 00:00:00 | 48301 | 2016-04-06 19:47:12 |
117 | 2016-04-01 12:51:37 | Audi_A4_2.5_TDI | 4700 | limousine | 2005 | manuell | 163 | a4 | 150000 | 3 | diesel | audi | nein | 2016-04-01 00:00:00 | 57078 | 2016-04-07 09:17:53 |
119 | 2016-03-17 13:58:19 | Peugeot_206_CC_110_Quiksilver | 2999 | cabrio | 2004 | manuell | 109 | 2_reihe | 150000 | 9 | benzin | peugeot | nein | 2016-03-17 00:00:00 | 92242 | 2016-04-06 19:46:59 |
124 | 2016-03-16 14:51:35 | Passat_2.0_TDI_DPF_Comfortline | 3100 | limousine | 2006 | NaN | 0 | passat | 150000 | 8 | diesel | volkswagen | nein | 2016-03-16 00:00:00 | 83071 | 2016-03-16 14:51:35 |
125 | 2016-03-17 17:56:46 | Seat_Altea_1_6 | 4000 | bus | 2005 | manuell | 105 | altea | 100000 | 11 | benzin | seat | nein | 2016-03-17 00:00:00 | 34474 | 2016-04-06 02:45:09 |
132 | 2016-04-03 13:57:28 | Skoda_Superb_2_5_TDI_Elegance_Voll_Ausst_Motor... | 2900 | limousine | 2004 | manuell | 163 | superb | 150000 | 4 | diesel | skoda | nein | 2016-04-03 00:00:00 | 13355 | 2016-04-05 13:18:25 |
141 | 2016-04-01 09:54:58 | Mercedes_Benz_C_200_Kompressor_tuev/_01.2017 | 2600 | limousine | 2002 | manuell | 163 | c_klasse | 150000 | 2 | benzin | mercedes_benz | nein | 2016-04-01 00:00:00 | 45896 | 2016-04-01 09:54:58 |
151 | 2016-03-27 09:55:44 | Volkswagen_Golf_Cabrio_2.0_Generation | 3250 | cabrio | 2000 | manuell | 116 | golf | 150000 | 6 | benzin | volkswagen | nein | 2016-03-27 00:00:00 | 14169 | 2016-04-07 04:46:03 |
152 | 2016-03-20 12:45:01 | Ssanyong_Rexton_2.7 | 4499 | suv | 2005 | automatik | 163 | NaN | 150000 | 9 | diesel | sonstige_autos | nein | 2016-03-20 00:00:00 | 73312 | 2016-03-25 11:45:09 |
154 | 2016-04-01 13:56:00 | Volkswagen_Transporter_T4_TDI_LKW/Wohmo | 3950 | bus | 1998 | manuell | 102 | transporter | 150000 | 6 | diesel | volkswagen | nein | 2016-04-01 00:00:00 | 23554 | 2016-04-05 08:51:08 |
162 | 2016-03-28 10:54:23 | Fiat_Grande_Punto_1.4_16V_Sport | 4900 | kleinwagen | 2009 | manuell | 95 | andere | 125000 | 9 | benzin | fiat | nein | 2016-03-28 00:00:00 | 50374 | 2016-04-06 13:44:51 |
164 | 2016-03-13 20:39:16 | Opel_Meriva__nur_76000_Km__unfallfrei__scheckh... | 4800 | NaN | 2018 | manuell | 0 | meriva | 80000 | 4 | benzin | opel | nein | 2016-03-13 00:00:00 | 37627 | 2016-04-04 16:48:02 |
178 | 2016-03-26 21:25:23 | Volkswagen_Touran_2.0_TDI_Highline | 4699 | bus | 2003 | manuell | 136 | touran | 150000 | 11 | diesel | volkswagen | nein | 2016-03-26 00:00:00 | 60488 | 2016-03-30 09:18:29 |
179 | 2016-03-13 22:47:34 | Mitsubishi_Colt_CZ3_1.1 | 2600 | limousine | 2005 | manuell | 75 | colt | 150000 | 11 | benzin | mitsubishi | nein | 2016-03-13 00:00:00 | 18609 | 2016-04-06 19:17:49 |
184 | 2016-04-04 23:56:50 | Peugeot_207_75_Urban_Move__Panoramadach__Klima... | 4490 | kleinwagen | 2009 | manuell | 73 | 2_reihe | 100000 | 3 | benzin | peugeot | nein | 2016-04-04 00:00:00 | 51647 | 2016-04-07 09:15:23 |
193 | 2016-03-07 17:45:19 | SEHR_GEPFLEGTER_FIAT_STILO | 2650 | kleinwagen | 2003 | manuell | 103 | NaN | 125000 | 11 | benzin | fiat | nein | 2016-03-07 00:00:00 | 30179 | 2016-03-12 12:15:23 |
195 | 2016-04-03 11:43:27 | Fiat_Grande_Punto_1.4_16V_Sport | 3300 | kleinwagen | 2006 | manuell | 95 | andere | 125000 | 12 | benzin | fiat | nein | 2016-04-03 00:00:00 | 40599 | 2016-04-07 12:45:00 |
198 | 2016-03-21 11:06:18 | Fiat_Ducato_Maxi_Lang_L5.___3_Liter_Motor_158_ps | 4200 | andere | 2007 | manuell | 160 | ducato | 150000 | 3 | diesel | fiat | ja | 2016-03-21 00:00:00 | 65824 | 2016-03-21 22:40:48 |
205 | 2016-03-30 12:43:45 | Opel_Tigra_Twin_Top_1.4_Easytronic__Enjoy_ | 4800 | cabrio | 2004 | automatik | 90 | tigra | 80000 | 12 | benzin | opel | nein | 2016-03-30 00:00:00 | 17033 | 2016-04-07 03:15:55 |
206 | 2016-03-29 13:49:04 | Nissan_Micra_1.2 | 3400 | kleinwagen | 2004 | manuell | 65 | micra | 125000 | 6 | benzin | nissan | nein | 2016-03-29 00:00:00 | 26899 | 2016-03-29 13:49:04 |
207 | 2016-04-03 10:56:10 | Peugeot_107_70_2_Tronic_Filou | 3000 | kleinwagen | 2007 | automatik | 68 | 1_reihe | 150000 | 11 | benzin | peugeot | nein | 2016-04-03 00:00:00 | 15345 | 2016-04-03 11:42:04 |
213 | 2016-04-06 15:06:36 | Opel_Signum_2.2_DTI | 2950 | kombi | 2003 | manuell | 125 | signum | 150000 | 5 | diesel | opel | nein | 2016-04-06 00:00:00 | 60386 | 2016-04-06 15:25:17 |
220 | 2016-04-01 10:36:44 | Ford_Ka_1.2 | 4500 | kleinwagen | 2009 | NaN | 0 | ka | 80000 | 12 | benzin | ford | nein | 2016-04-01 00:00:00 | 73054 | 2016-04-07 08:16:46 |
223 | 2016-03-28 19:48:02 | Ford_Fiesta_1.4_TDCI | 3150 | kleinwagen | 2005 | manuell | 68 | fiesta | 150000 | 9 | diesel | ford | nein | 2016-03-28 00:00:00 | 30900 | 2016-04-07 02:16:50 |
225 | 2016-03-15 09:57:37 | Toyota_RAV_4 | 2900 | suv | 1997 | manuell | 128 | rav | 150000 | 5 | benzin | toyota | nein | 2016-03-15 00:00:00 | 54636 | 2016-03-16 05:18:46 |
237 | 2016-03-15 23:57:30 | VW_Touran_1.9_tdi_TueV_10.2016 | 3699 | bus | 2003 | manuell | 0 | touran | 150000 | 11 | diesel | volkswagen | nein | 2016-03-15 00:00:00 | 28219 | 2016-03-20 06:16:13 |
239 | 2016-03-21 11:53:34 | Mercedes_benz_c_220_cdi | 3500 | limousine | 2000 | manuell | 143 | c_klasse | 150000 | 9 | diesel | mercedes_benz | NaN | 2016-03-21 00:00:00 | 20359 | 2016-04-06 06:46:57 |
247 | 2016-03-24 15:38:30 | Volkswagen_Golf_1.6_Champ | 3200 | limousine | 2002 | manuell | 105 | golf | 90000 | 11 | benzin | volkswagen | nein | 2016-03-24 00:00:00 | 93158 | 2016-03-24 15:38:30 |
Now let's take a look at where we are...
autos.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 26820 entries, 0 to 49998 Data columns (total 16 columns): date_crawled 26820 non-null object name 26820 non-null object price_usd 26820 non-null int64 vehicle_type 25348 non-null object registration_year 26820 non-null int64 gearbox 26007 non-null object power_ps 26820 non-null int64 model 25858 non-null object odometer_km 26820 non-null int64 registration_month 26820 non-null int64 fuel_type 25589 non-null object brand 26820 non-null object unrepaired_damage 23634 non-null object ad_created 26820 non-null object postal_code 26820 non-null int64 last_seen 26820 non-null object dtypes: int64(6), object(10) memory usage: 3.5+ MB
Let's see if there's any interesting info in the dates. First, let's grab a sample of the data.
autos[['date_crawled','ad_created','last_seen']][0:5]
date_crawled | ad_created | last_seen | |
---|---|---|---|
0 | 2016-03-26 17:47:46 | 2016-03-26 00:00:00 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | 2016-04-04 00:00:00 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | 2016-03-26 00:00:00 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | 2016-03-12 00:00:00 | 2016-03-15 03:16:28 |
5 | 2016-03-21 13:47:45 | 2016-03-21 00:00:00 | 2016-04-06 09:45:21 |
Hm. Well, the timestamp down to the second isn't very meaningful. Let's strip out the time and see what we are left with.
autos['date_crawled'].str[:10].value_counts(normalize=True, dropna=False).sort_index(ascending=True)
2016-03-05 0.026025 2016-03-06 0.014206 2016-03-07 0.034564 2016-03-08 0.031394 2016-03-09 0.032662 2016-03-10 0.031879 2016-03-11 0.032774 2016-03-12 0.037435 2016-03-13 0.017114 2016-03-14 0.037062 2016-03-15 0.033110 2016-03-16 0.029456 2016-03-17 0.030239 2016-03-18 0.012603 2016-03-19 0.035459 2016-03-20 0.038367 2016-03-21 0.035906 2016-03-22 0.031357 2016-03-23 0.032364 2016-03-24 0.028822 2016-03-25 0.030089 2016-03-26 0.033072 2016-03-27 0.032662 2016-03-28 0.036540 2016-03-29 0.033333 2016-03-30 0.033669 2016-03-31 0.031171 2016-04-01 0.035608 2016-04-02 0.036465 2016-04-03 0.039374 2016-04-04 0.037286 2016-04-05 0.013572 2016-04-06 0.002796 2016-04-07 0.001566 Name: date_crawled, dtype: float64
Looks like a pretty even distribution, with about 3% of the total for each day. there are some exceptions, seems like sundays don't have as many crawl dates. Is this meaningful? Probably not. Moving on.
autos['ad_created'].str[:10].value_counts(normalize=True, dropna=False).sort_index(ascending=True)
2015-06-11 0.000037 2015-08-10 0.000037 2015-09-09 0.000037 2015-11-10 0.000037 2016-01-03 0.000037 2016-01-10 0.000075 2016-01-13 0.000037 2016-01-14 0.000037 2016-01-16 0.000037 2016-01-22 0.000037 2016-01-27 0.000112 2016-01-29 0.000037 2016-02-01 0.000037 2016-02-02 0.000037 2016-02-05 0.000075 2016-02-07 0.000037 2016-02-09 0.000037 2016-02-12 0.000075 2016-02-14 0.000075 2016-02-17 0.000037 2016-02-18 0.000075 2016-02-19 0.000075 2016-02-20 0.000037 2016-02-21 0.000075 2016-02-22 0.000037 2016-02-23 0.000149 2016-02-24 0.000075 2016-02-25 0.000112 2016-02-26 0.000037 2016-02-27 0.000112 ... 2016-03-09 0.032513 2016-03-10 0.031805 2016-03-11 0.033147 2016-03-12 0.037062 2016-03-13 0.018941 2016-03-14 0.035123 2016-03-15 0.033072 2016-03-16 0.030089 2016-03-17 0.029791 2016-03-18 0.012901 2016-03-19 0.034452 2016-03-20 0.038330 2016-03-21 0.036503 2016-03-22 0.030947 2016-03-23 0.031805 2016-03-24 0.029232 2016-03-25 0.030015 2016-03-26 0.033371 2016-03-27 0.032401 2016-03-28 0.036764 2016-03-29 0.033221 2016-03-30 0.033482 2016-03-31 0.031544 2016-04-01 0.035272 2016-04-02 0.036242 2016-04-03 0.039709 2016-04-04 0.037584 2016-04-05 0.012081 2016-04-06 0.002908 2016-04-07 0.001380 Name: ad_created, Length: 70, dtype: float64
Looks like the most recent dates have the most created. We'll be sure and filter out cars advertised more than a month or so...
autos['last_seen'].str[:10].value_counts(normalize=True, dropna=False).sort_index(ascending=True)
2016-03-05 0.001044 2016-03-06 0.002908 2016-03-07 0.003840 2016-03-08 0.004922 2016-03-09 0.007942 2016-03-10 0.008837 2016-03-11 0.010216 2016-03-12 0.019463 2016-03-13 0.007308 2016-03-14 0.011559 2016-03-15 0.013572 2016-03-16 0.014094 2016-03-17 0.023117 2016-03-18 0.006935 2016-03-19 0.013497 2016-03-20 0.017711 2016-03-21 0.017040 2016-03-22 0.019090 2016-03-23 0.017040 2016-03-24 0.017077 2016-03-25 0.016219 2016-03-26 0.014952 2016-03-27 0.012714 2016-03-28 0.017934 2016-03-29 0.019426 2016-03-30 0.022073 2016-03-31 0.021439 2016-04-01 0.022521 2016-04-02 0.023192 2016-04-03 0.023826 2016-04-04 0.021887 2016-04-05 0.138628 2016-04-06 0.252535 2016-04-07 0.155444 Name: last_seen, dtype: float64
Looks like people check in reasonably frequently. Let's take a closer look at how old these cars are...
autos['registration_year'].describe()
count 26820.000000 mean 2007.464243 std 101.162141 min 1001.000000 25% 2004.000000 50% 2007.000000 75% 2010.000000 max 9999.000000 Name: registration_year, dtype: float64
We have some outliers in reg year -- let's update the data to remove too old or in-the-future registrations. We're only looking for cars less than six years old or so, so we'll start with that (remembering we only have data up to 2016)
autos = autos[autos.registration_year >= 2010]
Okay, now let's see where we are...'
autos['registration_year'].describe()
count 7408.000000 mean 2018.372975 std 191.381871 min 2010.000000 25% 2011.000000 50% 2012.000000 75% 2014.000000 max 9999.000000 Name: registration_year, dtype: float64
Need to get rid of the high values...
autos = autos[autos.registration_year <= 2018]
Another look ...
autos['registration_year'].describe()
count 7400.000000 mean 2012.514865 std 2.312414 min 2010.000000 25% 2011.000000 50% 2012.000000 75% 2014.000000 max 2018.000000 Name: registration_year, dtype: float64
Most cares are late 90s to 2010 or so
Okay, now we can take a closer look at some cars we might want to buy. First thing, let's get rid of any cars with damage...'
autos['unrepaired_damage'].describe()
count 6634 unique 2 top nein freq 6407 Name: unrepaired_damage, dtype: object
autos['unrepaired_damage'].unique()
array(['nein', nan, 'ja'], dtype=object)
Let's get rid of any cars with unrepaired damage'
autos = autos[autos.unrepaired_damage != 'ja']
Let's take a look at what we're left with
autos.describe()
price_usd | registration_year | power_ps | odometer_km | registration_month | postal_code | |
---|---|---|---|---|---|---|
count | 7173.000000 | 7173.000000 | 7173.000000 | 7173.000000 | 7173.000000 | 7173.000000 |
mean | 15625.960128 | 2012.518751 | 143.144152 | 80396.626237 | 5.937544 | 53250.257912 |
std | 12656.158518 | 2.313445 | 186.258926 | 46793.363619 | 3.483849 | 25955.019904 |
min | 2550.000000 | 2010.000000 | 0.000000 | 5000.000000 | 0.000000 | 1067.000000 |
25% | 7900.000000 | 2011.000000 | 97.000000 | 40000.000000 | 3.000000 | 33014.000000 |
50% | 12900.000000 | 2012.000000 | 136.000000 | 70000.000000 | 6.000000 | 53119.000000 |
75% | 19800.000000 | 2014.000000 | 170.000000 | 125000.000000 | 9.000000 | 74731.000000 |
max | 198000.000000 | 2018.000000 | 14009.000000 | 150000.000000 | 12.000000 | 99976.000000 |
Okay, we're down to about 7000 possibilities. Let's look at the distribution of brands
autos['brand'].value_counts(normalize=True)
volkswagen 0.235188 audi 0.115712 bmw 0.103304 mercedes_benz 0.089084 opel 0.068451 ford 0.054649 renault 0.031228 skoda 0.030949 seat 0.026628 fiat 0.024258 mini 0.022445 hyundai 0.022166 smart 0.018960 peugeot 0.017426 nissan 0.016590 citroen 0.015614 toyota 0.013662 kia 0.013384 mazda 0.010874 chevrolet 0.010316 dacia 0.009898 porsche 0.008086 sonstige_autos 0.007249 suzuki 0.005995 honda 0.005158 volvo 0.005019 mitsubishi 0.004879 land_rover 0.003346 jeep 0.002928 alfa_romeo 0.002370 jaguar 0.001255 chrysler 0.000836 subaru 0.000697 lada 0.000418 lancia 0.000418 daihatsu 0.000279 daewoo 0.000139 saab 0.000139 Name: brand, dtype: float64
We want the biggest selection of cars, but we don't want to spent too much time sorting through the long tail. Instead, we'll take any brand with less than 2% of the overall share and aggregate them as "other"
brand_quantity = autos['brand'].value_counts(normalize=True)
brands_to_aggregate = brand_quantity[brand_quantity < .02].index
print(brands_to_aggregate)
Index(['smart', 'peugeot', 'nissan', 'citroen', 'toyota', 'kia', 'mazda', 'chevrolet', 'dacia', 'porsche', 'sonstige_autos', 'suzuki', 'honda', 'volvo', 'mitsubishi', 'land_rover', 'jeep', 'alfa_romeo', 'jaguar', 'chrysler', 'subaru', 'lada', 'lancia', 'daihatsu', 'daewoo', 'saab'], dtype='object')
for brand in brands_to_aggregate:
autos.loc[autos['brand'] == brand, 'brand'] = 'other'
Now take a look at what we have
autos['brand'].value_counts(normalize=True)
volkswagen 0.235188 other 0.175938 audi 0.115712 bmw 0.103304 mercedes_benz 0.089084 opel 0.068451 ford 0.054649 renault 0.031228 skoda 0.030949 seat 0.026628 fiat 0.024258 mini 0.022445 hyundai 0.022166 Name: brand, dtype: float64
Now that we have aggregated our brands, let's take a look at price and mileage. First, we'll get an array of the brands so that we can calculate mean prices in a loop.
brand_explore = autos['brand'].unique()
brand_explore
array(['other', 'mini', 'bmw', 'seat', 'volkswagen', 'audi', 'mercedes_benz', 'opel', 'ford', 'hyundai', 'fiat', 'renault', 'skoda'], dtype=object)
Now we calculate mean price
mean_price = {}
for brand in brand_explore:
branddf = autos[autos['brand'] == brand]
mean = branddf['price_usd'].mean()
mean_price[brand] = int(mean)
mean_price
{'audi': 21911, 'bmw': 20622, 'fiat': 7587, 'ford': 11418, 'hyundai': 10145, 'mercedes_benz': 21886, 'mini': 15749, 'opel': 10317, 'other': 15078, 'renault': 8716, 'seat': 10736, 'skoda': 11857, 'volkswagen': 14201}
Let's do the same thing for mileage
mean_odometer = {}
for brand in brand_explore:
branddf = autos[autos['brand'] == brand]
odometer = branddf['odometer_km'].mean()
mean_odometer[brand] = int(odometer)
mean_odometer
{'audi': 82861, 'bmw': 94453, 'fiat': 65977, 'ford': 76645, 'hyundai': 70849, 'mercedes_benz': 90837, 'mini': 60186, 'opel': 78869, 'other': 67983, 'renault': 74107, 'seat': 77486, 'skoda': 72725, 'volkswagen': 86147}
Here we have the remaining data set
mean_odom = pd.Series(mean_odometer).sort_values(ascending=False)
mean_prices = pd.Series(mean_price).sort_values(ascending=False)
final_table = pd.DataFrame(mean_prices, columns=['mean_prices'])
final_table["mean_odometer"] = mean_odom
final_table
mean_prices | mean_odometer | |
---|---|---|
audi | 21911 | 82861 |
mercedes_benz | 21886 | 90837 |
bmw | 20622 | 94453 |
mini | 15749 | 60186 |
other | 15078 | 67983 |
volkswagen | 14201 | 86147 |
skoda | 11857 | 72725 |
ford | 11418 | 76645 |
seat | 10736 | 77486 |
opel | 10317 | 78869 |
hyundai | 10145 | 70849 |
renault | 8716 | 74107 |
fiat | 7587 | 65977 |
Cheapest and lowest mileage seem to be fiats
If you have a bit more money, look at hyundais
If you have even more money, look at minis and "other"
Audis, BMWs and Mercedes are all most expensive and have among the highest mileage. Be cautious.