The given datasaet contains about 50,000 entries of deatils of used cars put up for sales. We will do some analysis on this dataset.
import pandas as pd
import numpy as np
autos=pd.read_csv('autos.csv',encoding = 'Latin-1')
autos
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
5 | 2016-03-21 13:47:45 | Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto... | privat | Angebot | $7,900 | test | bus | 2006 | automatik | 150 | voyager | 150,000km | 4 | diesel | chrysler | NaN | 2016-03-21 00:00:00 | 0 | 22962 | 2016-04-06 09:45:21 |
6 | 2016-03-20 17:55:21 | VW_Golf_III_GT_Special_Electronic_Green_Metall... | privat | Angebot | $300 | test | limousine | 1995 | manuell | 90 | golf | 150,000km | 8 | benzin | volkswagen | NaN | 2016-03-20 00:00:00 | 0 | 31535 | 2016-03-23 02:48:59 |
7 | 2016-03-16 18:55:19 | Golf_IV_1.9_TDI_90PS | privat | Angebot | $1,990 | control | limousine | 1998 | manuell | 90 | golf | 150,000km | 12 | diesel | volkswagen | nein | 2016-03-16 00:00:00 | 0 | 53474 | 2016-04-07 03:17:32 |
8 | 2016-03-22 16:51:34 | Seat_Arosa | privat | Angebot | $250 | test | NaN | 2000 | manuell | 0 | arosa | 150,000km | 10 | NaN | seat | nein | 2016-03-22 00:00:00 | 0 | 7426 | 2016-03-26 18:18:10 |
9 | 2016-03-16 13:47:02 | Renault_Megane_Scenic_1.6e_RT_Klimaanlage | privat | Angebot | $590 | control | bus | 1997 | manuell | 90 | megane | 150,000km | 7 | benzin | renault | nein | 2016-03-16 00:00:00 | 0 | 15749 | 2016-04-06 10:46:35 |
10 | 2016-03-15 01:41:36 | VW_Golf_Tuning_in_siber/grau | privat | Angebot | $999 | test | NaN | 2017 | manuell | 90 | NaN | 150,000km | 4 | benzin | volkswagen | nein | 2016-03-14 00:00:00 | 0 | 86157 | 2016-04-07 03:16:21 |
11 | 2016-03-16 18:45:34 | Mercedes_A140_Motorschaden | privat | Angebot | $350 | control | NaN | 2000 | NaN | 0 | NaN | 150,000km | 0 | benzin | mercedes_benz | NaN | 2016-03-16 00:00:00 | 0 | 17498 | 2016-03-16 18:45:34 |
12 | 2016-03-31 19:48:22 | Smart_smart_fortwo_coupe_softouch_pure_MHD_Pan... | privat | Angebot | $5,299 | control | kleinwagen | 2010 | automatik | 71 | fortwo | 50,000km | 9 | benzin | smart | nein | 2016-03-31 00:00:00 | 0 | 34590 | 2016-04-06 14:17:52 |
13 | 2016-03-23 10:48:32 | Audi_A3_1.6_tuning | privat | Angebot | $1,350 | control | limousine | 1999 | manuell | 101 | a3 | 150,000km | 11 | benzin | audi | nein | 2016-03-23 00:00:00 | 0 | 12043 | 2016-04-01 14:17:13 |
14 | 2016-03-23 11:50:46 | Renault_Clio_3__Dynamique_1.2__16_V;_viele_Ver... | privat | Angebot | $3,999 | test | kleinwagen | 2007 | manuell | 75 | clio | 150,000km | 9 | benzin | renault | NaN | 2016-03-23 00:00:00 | 0 | 81737 | 2016-04-01 15:46:47 |
15 | 2016-04-01 12:06:20 | Corvette_C3_Coupe_T_Top_Crossfire_Injection | privat | Angebot | $18,900 | test | coupe | 1982 | automatik | 203 | NaN | 80,000km | 6 | benzin | sonstige_autos | nein | 2016-04-01 00:00:00 | 0 | 61276 | 2016-04-02 21:10:48 |
16 | 2016-03-16 14:59:02 | Opel_Vectra_B_Kombi | privat | Angebot | $350 | test | kombi | 1999 | manuell | 101 | vectra | 150,000km | 5 | benzin | opel | nein | 2016-03-16 00:00:00 | 0 | 57299 | 2016-03-18 05:29:37 |
17 | 2016-03-29 11:46:22 | Volkswagen_Scirocco_2_G60 | privat | Angebot | $5,500 | test | coupe | 1990 | manuell | 205 | scirocco | 150,000km | 6 | benzin | volkswagen | nein | 2016-03-29 00:00:00 | 0 | 74821 | 2016-04-05 20:46:26 |
18 | 2016-03-26 19:57:44 | Verkaufen_mein_bmw_e36_320_i_touring | privat | Angebot | $300 | control | bus | 1995 | manuell | 150 | 3er | 150,000km | 0 | benzin | bmw | NaN | 2016-03-26 00:00:00 | 0 | 54329 | 2016-04-02 12:16:41 |
19 | 2016-03-17 13:36:21 | mazda_tribute_2.0_mit_gas_und_tuev_neu_2018 | privat | Angebot | $4,150 | control | suv | 2004 | manuell | 124 | andere | 150,000km | 2 | lpg | mazda | nein | 2016-03-17 00:00:00 | 0 | 40878 | 2016-03-17 14:45:58 |
20 | 2016-03-05 19:57:31 | Audi_A4_Avant_1.9_TDI_*6_Gang*AHK*Klimatronik*... | privat | Angebot | $3,500 | test | kombi | 2003 | manuell | 131 | a4 | 150,000km | 5 | diesel | audi | NaN | 2016-03-05 00:00:00 | 0 | 53913 | 2016-03-07 05:46:46 |
21 | 2016-03-06 19:07:10 | Porsche_911_Carrera_4S_Cabrio | privat | Angebot | $41,500 | test | cabrio | 2004 | manuell | 320 | 911 | 150,000km | 4 | benzin | porsche | nein | 2016-03-06 00:00:00 | 0 | 65428 | 2016-04-05 23:46:19 |
22 | 2016-03-28 20:50:54 | MINI_Cooper_S_Cabrio | privat | Angebot | $25,450 | control | cabrio | 2015 | manuell | 184 | cooper | 10,000km | 1 | benzin | mini | nein | 2016-03-28 00:00:00 | 0 | 44789 | 2016-04-01 06:45:30 |
23 | 2016-03-10 19:55:34 | Peugeot_Boxer_2_2_HDi_120_Ps_9_Sitzer_inkl_Klima | privat | Angebot | $7,999 | control | bus | 2010 | manuell | 120 | NaN | 150,000km | 2 | diesel | peugeot | nein | 2016-03-10 00:00:00 | 0 | 30900 | 2016-03-17 08:45:17 |
24 | 2016-04-03 11:57:02 | BMW_535i_xDrive_Sport_Aut. | privat | Angebot | $48,500 | control | limousine | 2014 | automatik | 306 | 5er | 30,000km | 12 | benzin | bmw | nein | 2016-04-03 00:00:00 | 0 | 22547 | 2016-04-07 13:16:50 |
25 | 2016-03-21 21:56:18 | Ford_escort_kombi_an_bastler_mit_ghia_ausstattung | privat | Angebot | $90 | control | kombi | 1996 | manuell | 116 | NaN | 150,000km | 4 | benzin | ford | ja | 2016-03-21 00:00:00 | 0 | 27574 | 2016-04-01 05:16:49 |
26 | 2016-04-03 22:46:28 | Volkswagen_Polo_Fox | privat | Angebot | $777 | control | kleinwagen | 1992 | manuell | 54 | polo | 125,000km | 2 | benzin | volkswagen | nein | 2016-04-03 00:00:00 | 0 | 38110 | 2016-04-05 23:46:48 |
27 | 2016-03-27 18:45:01 | Hat_einer_Ahnung_mit_Ford_Galaxy_HILFE | privat | Angebot | $0 | control | NaN | 2005 | NaN | 0 | NaN | 150,000km | 0 | NaN | ford | NaN | 2016-03-27 00:00:00 | 0 | 66701 | 2016-03-27 18:45:01 |
28 | 2016-03-19 21:56:19 | MINI_Cooper_D | privat | Angebot | $5,250 | control | kleinwagen | 2007 | manuell | 110 | cooper | 150,000km | 7 | diesel | mini | ja | 2016-03-19 00:00:00 | 0 | 15745 | 2016-04-07 14:58:48 |
29 | 2016-04-02 12:45:44 | Mercedes_Benz_E_320_T_CDI_Avantgarde_DPF7_Sitz... | privat | Angebot | $4,999 | test | kombi | 2004 | automatik | 204 | e_klasse | 150,000km | 10 | diesel | mercedes_benz | nein | 2016-04-02 00:00:00 | 0 | 47638 | 2016-04-02 12:45:44 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
49970 | 2016-03-21 22:47:37 | c4_Grand_Picasso_mit_Automatik_Leder_Navi_Temp... | privat | Angebot | $15,800 | control | bus | 2010 | automatik | 136 | c4 | 60,000km | 4 | diesel | citroen | nein | 2016-03-21 00:00:00 | 0 | 14947 | 2016-04-07 04:17:34 |
49971 | 2016-03-29 14:54:12 | W.Lupo_1.0 | privat | Angebot | $950 | test | kleinwagen | 2001 | manuell | 50 | lupo | 150,000km | 4 | benzin | volkswagen | nein | 2016-03-29 00:00:00 | 0 | 65197 | 2016-03-29 20:41:51 |
49972 | 2016-03-26 22:25:23 | Mercedes_Benz_Vito_115_CDI_Extralang_Aut. | privat | Angebot | $3,300 | control | bus | 2004 | automatik | 150 | vito | 150,000km | 10 | diesel | mercedes_benz | ja | 2016-03-26 00:00:00 | 0 | 65326 | 2016-03-28 11:28:18 |
49973 | 2016-03-27 05:32:39 | Mercedes_Benz_SLK_200_Kompressor | privat | Angebot | $6,000 | control | cabrio | 2004 | manuell | 163 | slk | 150,000km | 11 | benzin | mercedes_benz | nein | 2016-03-27 00:00:00 | 0 | 53567 | 2016-03-27 08:25:24 |
49974 | 2016-03-20 10:52:31 | Golf_1_Cabrio_Tuev_Neu_viele_Extras_alles_eing... | privat | Angebot | $0 | control | cabrio | 1983 | manuell | 70 | golf | 150,000km | 2 | benzin | volkswagen | nein | 2016-03-20 00:00:00 | 0 | 8209 | 2016-03-27 19:48:16 |
49975 | 2016-03-27 20:51:39 | Honda_Jazz_1.3_DSi_i_VTEC_IMA_CVT_Comfort | privat | Angebot | $9,700 | control | kleinwagen | 2012 | automatik | 88 | jazz | 100,000km | 11 | hybrid | honda | nein | 2016-03-27 00:00:00 | 0 | 84385 | 2016-04-05 19:45:34 |
49976 | 2016-03-19 18:56:05 | Audi_80_Avant_2.6_E__Vollausstattung!!_Einziga... | privat | Angebot | $5,900 | test | kombi | 1992 | automatik | 150 | 80 | 150,000km | 12 | benzin | audi | nein | 2016-03-19 00:00:00 | 0 | 36100 | 2016-04-07 06:16:44 |
49977 | 2016-03-31 18:37:18 | Mercedes_Benz_C200_Cdi_W203 | privat | Angebot | $5,500 | control | limousine | 2003 | manuell | 116 | c_klasse | 150,000km | 2 | diesel | mercedes_benz | nein | 2016-03-31 00:00:00 | 0 | 33739 | 2016-04-06 12:16:11 |
49978 | 2016-04-04 10:37:14 | Mercedes_Benz_E_200_Classic | privat | Angebot | $900 | control | limousine | 1996 | automatik | 136 | e_klasse | 150,000km | 9 | benzin | mercedes_benz | ja | 2016-04-04 00:00:00 | 0 | 24405 | 2016-04-06 12:44:20 |
49979 | 2016-03-20 18:38:40 | Volkswagen_Polo_1.6_TDI_Style | privat | Angebot | $11,000 | test | kleinwagen | 2011 | manuell | 90 | polo | 70,000km | 11 | diesel | volkswagen | nein | 2016-03-20 00:00:00 | 0 | 48455 | 2016-04-07 01:45:12 |
49980 | 2016-03-12 10:55:54 | Ford_Escort_Turnier_16V | privat | Angebot | $400 | control | kombi | 1995 | manuell | 105 | escort | 125,000km | 3 | benzin | ford | NaN | 2016-03-12 00:00:00 | 0 | 56218 | 2016-04-06 17:16:49 |
49981 | 2016-03-15 09:38:21 | Opel_Astra_Kombi_mit_Anhaengerkupplung | privat | Angebot | $2,000 | control | kombi | 1998 | manuell | 115 | astra | 150,000km | 12 | benzin | opel | nein | 2016-03-15 00:00:00 | 0 | 86859 | 2016-04-05 17:21:46 |
49982 | 2016-03-29 18:51:08 | Skoda_Fabia_4_Tuerer_Bj:2004__85.000Tkm | privat | Angebot | $1,950 | control | kleinwagen | 2004 | manuell | 0 | fabia | 90,000km | 7 | benzin | skoda | NaN | 2016-03-29 00:00:00 | 0 | 45884 | 2016-03-29 18:51:08 |
49983 | 2016-03-06 12:43:04 | Ford_focus_99 | privat | Angebot | $600 | test | kleinwagen | 1999 | manuell | 101 | focus | 150,000km | 4 | benzin | ford | NaN | 2016-03-06 00:00:00 | 0 | 52477 | 2016-03-09 06:16:08 |
49984 | 2016-03-31 22:48:48 | Student_sucht_ein__Anfaengerauto___ab_2000_BJ_... | privat | Angebot | $0 | test | NaN | 2000 | NaN | 0 | NaN | 150,000km | 0 | NaN | sonstige_autos | NaN | 2016-03-31 00:00:00 | 0 | 12103 | 2016-04-02 19:44:53 |
49985 | 2016-04-02 16:38:23 | Verkaufe_meinen_vw_vento! | privat | Angebot | $1,000 | control | NaN | 1995 | automatik | 0 | NaN | 150,000km | 0 | benzin | volkswagen | NaN | 2016-04-02 00:00:00 | 0 | 30900 | 2016-04-06 15:17:52 |
49986 | 2016-04-04 20:46:02 | Chrysler_300C_3.0_CRD_DPF_Automatik_Voll_Ausst... | privat | Angebot | $15,900 | control | limousine | 2010 | automatik | 218 | 300c | 125,000km | 11 | diesel | chrysler | nein | 2016-04-04 00:00:00 | 0 | 73527 | 2016-04-06 23:16:00 |
49987 | 2016-03-22 20:47:27 | Audi_A3_Limousine_2.0_TDI_DPF_Ambition__NAVI__... | privat | Angebot | $21,990 | control | limousine | 2013 | manuell | 150 | a3 | 50,000km | 11 | diesel | audi | nein | 2016-03-22 00:00:00 | 0 | 94362 | 2016-03-26 22:46:06 |
49988 | 2016-03-28 19:49:51 | BMW_330_Ci | privat | Angebot | $9,550 | control | coupe | 2001 | manuell | 231 | 3er | 150,000km | 10 | benzin | bmw | nein | 2016-03-28 00:00:00 | 0 | 83646 | 2016-04-07 02:17:40 |
49989 | 2016-03-11 19:50:37 | VW_Polo_zum_Ausschlachten_oder_Wiederaufbau | privat | Angebot | $150 | test | kleinwagen | 1997 | manuell | 0 | polo | 150,000km | 5 | benzin | volkswagen | ja | 2016-03-11 00:00:00 | 0 | 21244 | 2016-03-12 10:17:55 |
49990 | 2016-03-21 19:54:19 | Mercedes_Benz_A_200__BlueEFFICIENCY__Urban | privat | Angebot | $17,500 | test | limousine | 2012 | manuell | 156 | a_klasse | 30,000km | 12 | benzin | mercedes_benz | nein | 2016-03-21 00:00:00 | 0 | 58239 | 2016-04-06 22:46:57 |
49991 | 2016-03-06 15:25:19 | Kleinwagen | privat | Angebot | $500 | control | NaN | 2016 | manuell | 0 | twingo | 150,000km | 0 | benzin | renault | NaN | 2016-03-06 00:00:00 | 0 | 61350 | 2016-03-06 18:24:19 |
49992 | 2016-03-10 19:37:38 | Fiat_Grande_Punto_1.4_T_Jet_16V_Sport | privat | Angebot | $4,800 | control | kleinwagen | 2009 | manuell | 120 | andere | 125,000km | 9 | lpg | fiat | nein | 2016-03-10 00:00:00 | 0 | 68642 | 2016-03-13 01:44:51 |
49993 | 2016-03-15 18:47:35 | Audi_A3__1_8l__Silber;_schoenes_Fahrzeug | privat | Angebot | $1,650 | control | kleinwagen | 1997 | manuell | 0 | NaN | 150,000km | 7 | benzin | audi | NaN | 2016-03-15 00:00:00 | 0 | 65203 | 2016-04-06 19:46:53 |
49994 | 2016-03-22 17:36:42 | Audi_A6__S6__Avant_4.2_quattro_eventuell_Tausc... | privat | Angebot | $5,000 | control | kombi | 2001 | automatik | 299 | a6 | 150,000km | 1 | benzin | audi | nein | 2016-03-22 00:00:00 | 0 | 46537 | 2016-04-06 08:16:39 |
49995 | 2016-03-27 14:38:19 | Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon | privat | Angebot | $24,900 | control | limousine | 2011 | automatik | 239 | q5 | 100,000km | 1 | diesel | audi | nein | 2016-03-27 00:00:00 | 0 | 82131 | 2016-04-01 13:47:40 |
49996 | 2016-03-28 10:50:25 | Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... | privat | Angebot | $1,980 | control | cabrio | 1996 | manuell | 75 | astra | 150,000km | 5 | benzin | opel | nein | 2016-03-28 00:00:00 | 0 | 44807 | 2016-04-02 14:18:02 |
49997 | 2016-04-02 14:44:48 | Fiat_500_C_1.2_Dualogic_Lounge | privat | Angebot | $13,200 | test | cabrio | 2014 | automatik | 69 | 500 | 5,000km | 11 | benzin | fiat | nein | 2016-04-02 00:00:00 | 0 | 73430 | 2016-04-04 11:47:27 |
49998 | 2016-03-08 19:25:42 | Audi_A3_2.0_TDI_Sportback_Ambition | privat | Angebot | $22,900 | control | kombi | 2013 | manuell | 150 | a3 | 40,000km | 11 | diesel | audi | nein | 2016-03-08 00:00:00 | 0 | 35683 | 2016-04-05 16:45:07 |
49999 | 2016-03-14 00:42:12 | Opel_Vectra_1.6_16V | privat | Angebot | $1,250 | control | limousine | 1996 | manuell | 101 | vectra | 150,000km | 1 | benzin | opel | nein | 2016-03-13 00:00:00 | 0 | 45897 | 2016-04-06 21:18:48 |
50000 rows × 20 columns
autos.info()
autos.head(4)
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): dateCrawled 50000 non-null object name 50000 non-null object seller 50000 non-null object offerType 50000 non-null object price 50000 non-null object abtest 50000 non-null object vehicleType 44905 non-null object yearOfRegistration 50000 non-null int64 gearbox 47320 non-null object powerPS 50000 non-null int64 model 47242 non-null object odometer 50000 non-null object monthOfRegistration 50000 non-null int64 fuelType 45518 non-null object brand 50000 non-null object notRepairedDamage 40171 non-null object dateCreated 50000 non-null object nrOfPictures 50000 non-null int64 postalCode 50000 non-null int64 lastSeen 50000 non-null object dtypes: int64(5), object(15) memory usage: 7.6+ MB
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
Columns "gearbox", "model", "vehicleType", "fuelType" and "notRepairedDamage" have null values for some entries. 5 columns have int type entries. While rest of them are object.
def clean_string(char):
word=""
for a in char:
if a.isupper():
new_a="_"+ a.lower()
a=new_a
word+=a
word=word.strip("_")
return word
auto_copy=autos
print(auto_copy.columns)
label_list=[]
for name in auto_copy.columns:
if name=="yearOfRegistration":
name="registration_year"
elif name=="monthOfRegistration":
name="registration_month"
elif name=="notRepairedDamage":
name="unrepaired_damage"
elif name=="dateCreated":
name="ad_created"
else:
name=clean_string(name)
label_list.append(name)
auto_copy.columns=label_list
print(auto_copy.columns)
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model', 'odometer', 'monthOfRegistration', 'fuelType', 'brand', 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object') Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest', 'vehicle_type', 'registration_year', 'gearbox', 'power_p_s', 'model', 'odometer', 'registration_month', 'fuel_type', 'brand', 'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code', 'last_seen'], dtype='object')
The column names have been made more readable.
print(auto_copy.describe())
registration_year power_p_s registration_month nr_of_pictures \ count 50000.000000 50000.000000 50000.000000 50000.0 mean 2005.073280 116.355920 5.723360 0.0 std 105.712813 209.216627 3.711984 0.0 min 1000.000000 0.000000 0.000000 0.0 25% 1999.000000 70.000000 3.000000 0.0 50% 2003.000000 105.000000 6.000000 0.0 75% 2008.000000 150.000000 9.000000 0.0 max 9999.000000 17700.000000 12.000000 0.0 postal_code count 50000.000000 mean 50813.627300 std 25779.747957 min 1067.000000 25% 30451.000000 50% 49577.000000 75% 71540.000000 max 99998.000000
1. "nr_of_pictures" seems to have no data. This column can be deleted.
2. "price" and "odometer" have been stored as a string. This needs to be converted into a numeric type data.
3. To calculate how old the car is, "registration year" could be converted into an datetime object. Also the "ad_created" column needs to be converted into a datetime object. With reference to date of ad creation, we can get an idea of how old the car is.
print(auto_copy.loc[:,"odometer"].head(2))
0 150,000km 1 150,000km Name: odometer, dtype: object
auto_copy["odometer"]=auto_copy["odometer"].str.replace(",","").str.replace("km","")
auto_copy["odometer"]=auto_copy["odometer"].astype(float)
auto_copy.rename({"odometer":"odometer_km"},axis=1,inplace=True)
print(auto_copy["price"].head(2))
0 $5,000 1 $8,500 Name: price, dtype: object
auto_copy["price"]=auto_copy["price"].str.replace(",","").str.replace("$","")
auto_copy["price"]=auto_copy["price"].astype(float)
print(auto_copy["price"].describe())
count 5.000000e+04 mean 9.840044e+03 std 4.811044e+05 min 0.000000e+00 25% 1.100000e+03 50% 2.950000e+03 75% 7.200000e+03 max 1.000000e+08 Name: price, dtype: float64
print(auto_copy["price"].value_counts().sort_index().head(50))
0.0 1421 1.0 156 2.0 3 3.0 1 5.0 2 8.0 1 9.0 1 10.0 7 11.0 2 12.0 3 13.0 2 14.0 1 15.0 2 17.0 3 18.0 1 20.0 4 25.0 5 29.0 1 30.0 7 35.0 1 40.0 6 45.0 4 47.0 1 49.0 4 50.0 49 55.0 2 59.0 1 60.0 9 65.0 5 66.0 1 70.0 10 75.0 5 79.0 1 80.0 15 89.0 1 90.0 5 99.0 19 100.0 134 110.0 3 111.0 2 115.0 2 117.0 1 120.0 39 122.0 1 125.0 8 129.0 1 130.0 15 135.0 1 139.0 1 140.0 9 Name: price, dtype: int64
auto_copy=auto_copy[auto_copy["price"].between(100,4000000)]
auto_copy["price"].value_counts().sort_index(ascending=False)
3890000.0 1 1300000.0 1 1234566.0 1 999999.0 2 999990.0 1 350000.0 1 345000.0 1 299000.0 1 295000.0 1 265000.0 1 259000.0 1 250000.0 1 220000.0 1 198000.0 1 197000.0 1 194000.0 1 190000.0 1 180000.0 1 175000.0 1 169999.0 1 169000.0 1 163991.0 1 163500.0 1 155000.0 1 151990.0 1 145000.0 1 139997.0 1 137999.0 1 135000.0 1 130000.0 1 ... 199.0 41 198.0 1 195.0 2 193.0 1 190.0 16 188.0 1 185.0 1 180.0 35 179.0 1 175.0 12 173.0 1 170.0 7 160.0 8 156.0 2 150.0 224 149.0 7 145.0 2 140.0 9 139.0 1 135.0 1 130.0 15 129.0 1 125.0 8 122.0 1 120.0 39 117.0 1 115.0 2 111.0 2 110.0 3 100.0 134 Name: price, Length: 2315, dtype: int64
#date_crawled
a=auto_copy["date_crawled"].str[:10]
print(a.value_counts(normalize=True,dropna=False).sort_index(ascending=False))
#ad_created
b=auto_copy["ad_created"].str[:10]
print(b.value_counts(normalize=True,dropna=False).sort_index(ascending=False).head(50))
#date_crawled
c=auto_copy["last_seen"].str[:10]
print(c.value_counts(normalize=True,dropna=False).sort_index(ascending=False))
2016-04-07 0.001389 2016-04-06 0.003172 2016-04-05 0.013062 2016-04-04 0.036575 2016-04-03 0.038607 2016-04-02 0.035600 2016-04-01 0.033693 2016-03-31 0.031847 2016-03-30 0.033734 2016-03-29 0.034128 2016-03-28 0.034957 2016-03-27 0.031122 2016-03-26 0.032304 2016-03-25 0.031495 2016-03-24 0.029442 2016-03-23 0.032283 2016-03-22 0.032905 2016-03-21 0.037217 2016-03-20 0.037798 2016-03-19 0.034729 2016-03-18 0.012897 2016-03-17 0.031516 2016-03-16 0.029463 2016-03-15 0.034315 2016-03-14 0.036658 2016-03-13 0.015675 2016-03-12 0.036906 2016-03-11 0.032594 2016-03-10 0.032283 2016-03-09 0.033009 2016-03-08 0.033174 2016-03-07 0.036056 2016-03-06 0.014037 2016-03-05 0.025358 Name: date_crawled, dtype: float64 2016-04-07 0.001244 2016-04-06 0.003255 2016-04-05 0.011798 2016-04-04 0.036927 2016-04-03 0.038855 2016-04-02 0.035289 2016-04-01 0.033672 2016-03-31 0.031889 2016-03-30 0.033548 2016-03-29 0.034087 2016-03-28 0.035061 2016-03-27 0.031039 2016-03-26 0.032366 2016-03-25 0.031619 2016-03-24 0.029380 2016-03-23 0.032117 2016-03-22 0.032718 2016-03-21 0.037446 2016-03-20 0.037860 2016-03-19 0.033610 2016-03-18 0.013581 2016-03-17 0.031184 2016-03-16 0.029961 2016-03-15 0.034045 2016-03-14 0.035289 2016-03-13 0.017043 2016-03-12 0.036741 2016-03-11 0.032905 2016-03-10 0.031993 2016-03-09 0.033091 2016-03-08 0.033174 2016-03-07 0.034792 2016-03-06 0.015302 2016-03-05 0.022911 2016-03-04 0.001493 2016-03-03 0.000871 2016-03-02 0.000104 2016-03-01 0.000104 2016-02-29 0.000166 2016-02-28 0.000207 2016-02-27 0.000124 2016-02-26 0.000041 2016-02-25 0.000062 2016-02-24 0.000041 2016-02-23 0.000083 2016-02-22 0.000021 2016-02-21 0.000062 2016-02-20 0.000041 2016-02-19 0.000062 2016-02-18 0.000041 Name: ad_created, dtype: float64 2016-04-07 0.132138 2016-04-06 0.221957 2016-04-05 0.125088 2016-04-04 0.024528 2016-04-03 0.025130 2016-04-02 0.024902 2016-04-01 0.022849 2016-03-31 0.023823 2016-03-30 0.024694 2016-03-29 0.022310 2016-03-28 0.020858 2016-03-27 0.015550 2016-03-26 0.016670 2016-03-25 0.019096 2016-03-24 0.019759 2016-03-23 0.018578 2016-03-22 0.021356 2016-03-21 0.020547 2016-03-20 0.020651 2016-03-19 0.015758 2016-03-18 0.007319 2016-03-17 0.028095 2016-03-16 0.016442 2016-03-15 0.015861 2016-03-14 0.012627 2016-03-13 0.008874 2016-03-12 0.023782 2016-03-11 0.012399 2016-03-10 0.010637 2016-03-09 0.009579 2016-03-08 0.007319 2016-03-07 0.005432 2016-03-06 0.004313 2016-03-05 0.001078 Name: last_seen, dtype: float64
auto_copy["registration_year"].describe()
count 48230.000000 mean 2004.729795 std 87.892058 min 1000.000000 25% 1999.000000 50% 2004.000000 75% 2008.000000 max 9999.000000 Name: registration_year, dtype: float64
bool_reg=auto_copy["registration_year"].between (1950,2016)
corrected_auto_copy=auto_copy[bool_reg]
corrected_auto_copy["registration_year"].value_counts().sort_index()
1950 1 1951 2 1952 1 1953 1 1954 2 1955 2 1956 4 1957 2 1958 4 1959 6 1960 22 1961 6 1962 4 1963 8 1964 12 1965 17 1966 22 1967 26 1968 26 1969 19 1970 37 1971 26 1972 33 1973 23 1974 24 1975 18 1976 21 1977 22 1978 42 1979 34 ... 1987 72 1988 133 1989 171 1990 332 1991 338 1992 368 1993 420 1994 626 1995 1193 1996 1355 1997 1925 1998 2340 1999 2880 2000 3104 2001 2629 2002 2477 2003 2693 2004 2699 2005 2911 2006 2669 2007 2273 2008 2210 2009 2081 2010 1587 2011 1618 2012 1308 2013 801 2014 662 2015 380 2016 1202 Name: registration_year, Length: 67, dtype: int64
auto_copy=corrected_auto_copy
print(auto_copy["brand"].value_counts(normalize=True))
volkswagen 0.211476 bmw 0.110205 opel 0.107227 mercedes_benz 0.096631 audi 0.086791 ford 0.069809 renault 0.047064 peugeot 0.029866 fiat 0.025614 seat 0.018256 skoda 0.016422 nissan 0.015343 mazda 0.015235 smart 0.014199 citroen 0.014048 toyota 0.012796 hyundai 0.010013 sonstige_autos 0.009430 volvo 0.009128 mini 0.008804 mitsubishi 0.008179 honda 0.007876 kia 0.007078 alfa_romeo 0.006668 porsche 0.006021 suzuki 0.005934 chevrolet 0.005675 chrysler 0.003517 dacia 0.002654 daihatsu 0.002503 jeep 0.002287 land_rover 0.002115 subaru 0.002115 saab 0.001662 jaguar 0.001532 daewoo 0.001489 trabant 0.001359 rover 0.001338 lancia 0.001057 lada 0.000583 Name: brand, dtype: float64
brand_dictionary_price={}
brand_dictionary_miles={}
brand_dictionary_counts={}
brands=auto_copy["brand"].value_counts(normalize=True).head(17)
for brand in brands.index:
bool_brand=auto_copy["brand"]==brand
selected_rows= auto_copy[bool_brand]
mean_price=selected_rows["price"].mean()
brand_dictionary_price[brand]=mean_price
mean_miles=selected_rows["odometer_km"].mean()
brand_dictionary_miles[brand]=mean_miles
number_of_data=selected_rows["brand"].value_counts()
brand_dictionary_counts[brand]=number_of_data.iloc[0]
#sorted_brand=sorted(brand_dictionary.items(),key=lambda x:x[1],reverse=True)
#$sorted_brand_cnt=sorted(brand_dictionary_counts.items(),key=lambda x:x[1],reverse=True)
final_table=pd.Series(brand_dictionary_price)
final_table=pd.DataFrame(final_table,columns=["mean_price"])
series_miles=pd.Series(brand_dictionary_miles)
series_counts=pd.Series(brand_dictionary_counts)
final_table['mean_miles']=series_miles
final_table['data_counts']=series_counts
print(final_table)
mean_price mean_miles data_counts audi 9380.718548 129245.400298 4022 bmw 8621.165459 132723.712551 5107 citroen 3796.262673 119554.531490 651 fiat 2836.873631 116950.294861 1187 ford 4074.959196 124386.398764 3235 hyundai 5411.075431 106885.775862 464 mazda 4129.774788 124553.824363 706 mercedes_benz 8669.448638 131069.673962 4478 nissan 4756.659634 118326.300985 711 opel 3005.930972 129405.312940 4969 peugeot 3113.860549 127127.890173 1384 renault 2496.940394 128337.918386 2181 seat 4433.419622 121536.643026 846 skoda 6409.609724 110906.701708 761 smart 3596.402736 99734.042553 658 toyota 5167.091062 115944.350759 593 volkswagen 5636.200306 128807.142857 9800
auto_copy.head(2)
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_p_s | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | 5000.0 | control | bus | 2004 | manuell | 158 | andere | 150000.0 | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | 8500.0 | control | limousine | 1997 | automatik | 286 | 7er | 150000.0 | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
auto_copy["unrepaired_damage"].value_counts()
nein 33769 ja 4477 Name: unrepaired_damage, dtype: int64
auto_copy["unrepaired_damage"]=auto_copy["unrepaired_damage"].str.replace("nein","no")
auto_copy["unrepaired_damage"]=auto_copy["unrepaired_damage"].str.replace("ja","yes")
print(auto_copy["unrepaired_damage"].unique())
['no' nan 'yes']
print(auto_copy["model"].value_counts().describe())
print(auto_copy["brand"].value_counts().describe())
count 244.000000 mean 181.209016 std 430.101310 min 1.000000 25% 24.750000 50% 51.500000 75% 119.500000 max 3684.000000 Name: model, dtype: float64 count 40.000000 mean 1158.525000 std 1980.509427 min 27.000000 25% 113.500000 50% 393.500000 75% 782.250000 max 9800.000000 Name: brand, dtype: float64
print(auto_copy["model"].value_counts().head(5))
print(auto_copy["brand"].value_counts().head(5))
golf 3684 andere 3340 3er 2602 polo 1592 corsa 1567 Name: model, dtype: int64 volkswagen 9800 bmw 5107 opel 4969 mercedes_benz 4478 audi 4022 Name: brand, dtype: int64
There are 244 models and 40 brands. We will see most common models for first five brands.
brand_names=["volkswagen","bmw","opel","mercedes_benz","audi"]
for brand in brand_names:
bool_brand=auto_copy["brand"]==brand
selected_rows=auto_copy[bool_brand]
common_models=selected_rows["model"].value_counts()
common_model=common_models.index[0]
print("Common model for {} is {} .".format(brand,common_model))
Common model for volkswagen is golf . Common model for bmw is 3er . Common model for opel is corsa . Common model for mercedes_benz is c_klasse . Common model for audi is a4 .