The dataset used in this project was slightly modified and reduced in size from the original. The intent is to facilitate data handling in this host environment and also provide "data cleaning" experience.
The data relates to auto sales information and was acquired from eBay Kleinanzeigen, a classifieds section of the German eBay website. The original dataset can be found here.
Often times, large data files are loaded with errors, inconsistencies, repetition, missing data and so on. The quality and accuracy of data analysis output and conclusions are highly dependent upon the quality and accuracy of the data itself. For that reason, data cleaning is an extremely important part of the data analysis process.
This particular project has various questions posed throughout to:
# import, read the data file and observe the number of rows and columns.
# this particular data file needed encoding to be specified, "Latin-1",
# in order to be able to read it.
import pandas as pd
autos = pd.read_csv("autos.csv", encoding = "Latin-1")
print(autos.shape)
print('\n')
print("There are 50,000 rows and 20 columns in this data file.")
(50000, 20) There are 50,000 rows and 20 columns in this data file.
# observe the raw form of the table of information.
autos
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
5 | 2016-03-21 13:47:45 | Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto... | privat | Angebot | $7,900 | test | bus | 2006 | automatik | 150 | voyager | 150,000km | 4 | diesel | chrysler | NaN | 2016-03-21 00:00:00 | 0 | 22962 | 2016-04-06 09:45:21 |
6 | 2016-03-20 17:55:21 | VW_Golf_III_GT_Special_Electronic_Green_Metall... | privat | Angebot | $300 | test | limousine | 1995 | manuell | 90 | golf | 150,000km | 8 | benzin | volkswagen | NaN | 2016-03-20 00:00:00 | 0 | 31535 | 2016-03-23 02:48:59 |
7 | 2016-03-16 18:55:19 | Golf_IV_1.9_TDI_90PS | privat | Angebot | $1,990 | control | limousine | 1998 | manuell | 90 | golf | 150,000km | 12 | diesel | volkswagen | nein | 2016-03-16 00:00:00 | 0 | 53474 | 2016-04-07 03:17:32 |
8 | 2016-03-22 16:51:34 | Seat_Arosa | privat | Angebot | $250 | test | NaN | 2000 | manuell | 0 | arosa | 150,000km | 10 | NaN | seat | nein | 2016-03-22 00:00:00 | 0 | 7426 | 2016-03-26 18:18:10 |
9 | 2016-03-16 13:47:02 | Renault_Megane_Scenic_1.6e_RT_Klimaanlage | privat | Angebot | $590 | control | bus | 1997 | manuell | 90 | megane | 150,000km | 7 | benzin | renault | nein | 2016-03-16 00:00:00 | 0 | 15749 | 2016-04-06 10:46:35 |
10 | 2016-03-15 01:41:36 | VW_Golf_Tuning_in_siber/grau | privat | Angebot | $999 | test | NaN | 2017 | manuell | 90 | NaN | 150,000km | 4 | benzin | volkswagen | nein | 2016-03-14 00:00:00 | 0 | 86157 | 2016-04-07 03:16:21 |
11 | 2016-03-16 18:45:34 | Mercedes_A140_Motorschaden | privat | Angebot | $350 | control | NaN | 2000 | NaN | 0 | NaN | 150,000km | 0 | benzin | mercedes_benz | NaN | 2016-03-16 00:00:00 | 0 | 17498 | 2016-03-16 18:45:34 |
12 | 2016-03-31 19:48:22 | Smart_smart_fortwo_coupe_softouch_pure_MHD_Pan... | privat | Angebot | $5,299 | control | kleinwagen | 2010 | automatik | 71 | fortwo | 50,000km | 9 | benzin | smart | nein | 2016-03-31 00:00:00 | 0 | 34590 | 2016-04-06 14:17:52 |
13 | 2016-03-23 10:48:32 | Audi_A3_1.6_tuning | privat | Angebot | $1,350 | control | limousine | 1999 | manuell | 101 | a3 | 150,000km | 11 | benzin | audi | nein | 2016-03-23 00:00:00 | 0 | 12043 | 2016-04-01 14:17:13 |
14 | 2016-03-23 11:50:46 | Renault_Clio_3__Dynamique_1.2__16_V;_viele_Ver... | privat | Angebot | $3,999 | test | kleinwagen | 2007 | manuell | 75 | clio | 150,000km | 9 | benzin | renault | NaN | 2016-03-23 00:00:00 | 0 | 81737 | 2016-04-01 15:46:47 |
15 | 2016-04-01 12:06:20 | Corvette_C3_Coupe_T_Top_Crossfire_Injection | privat | Angebot | $18,900 | test | coupe | 1982 | automatik | 203 | NaN | 80,000km | 6 | benzin | sonstige_autos | nein | 2016-04-01 00:00:00 | 0 | 61276 | 2016-04-02 21:10:48 |
16 | 2016-03-16 14:59:02 | Opel_Vectra_B_Kombi | privat | Angebot | $350 | test | kombi | 1999 | manuell | 101 | vectra | 150,000km | 5 | benzin | opel | nein | 2016-03-16 00:00:00 | 0 | 57299 | 2016-03-18 05:29:37 |
17 | 2016-03-29 11:46:22 | Volkswagen_Scirocco_2_G60 | privat | Angebot | $5,500 | test | coupe | 1990 | manuell | 205 | scirocco | 150,000km | 6 | benzin | volkswagen | nein | 2016-03-29 00:00:00 | 0 | 74821 | 2016-04-05 20:46:26 |
18 | 2016-03-26 19:57:44 | Verkaufen_mein_bmw_e36_320_i_touring | privat | Angebot | $300 | control | bus | 1995 | manuell | 150 | 3er | 150,000km | 0 | benzin | bmw | NaN | 2016-03-26 00:00:00 | 0 | 54329 | 2016-04-02 12:16:41 |
19 | 2016-03-17 13:36:21 | mazda_tribute_2.0_mit_gas_und_tuev_neu_2018 | privat | Angebot | $4,150 | control | suv | 2004 | manuell | 124 | andere | 150,000km | 2 | lpg | mazda | nein | 2016-03-17 00:00:00 | 0 | 40878 | 2016-03-17 14:45:58 |
20 | 2016-03-05 19:57:31 | Audi_A4_Avant_1.9_TDI_*6_Gang*AHK*Klimatronik*... | privat | Angebot | $3,500 | test | kombi | 2003 | manuell | 131 | a4 | 150,000km | 5 | diesel | audi | NaN | 2016-03-05 00:00:00 | 0 | 53913 | 2016-03-07 05:46:46 |
21 | 2016-03-06 19:07:10 | Porsche_911_Carrera_4S_Cabrio | privat | Angebot | $41,500 | test | cabrio | 2004 | manuell | 320 | 911 | 150,000km | 4 | benzin | porsche | nein | 2016-03-06 00:00:00 | 0 | 65428 | 2016-04-05 23:46:19 |
22 | 2016-03-28 20:50:54 | MINI_Cooper_S_Cabrio | privat | Angebot | $25,450 | control | cabrio | 2015 | manuell | 184 | cooper | 10,000km | 1 | benzin | mini | nein | 2016-03-28 00:00:00 | 0 | 44789 | 2016-04-01 06:45:30 |
23 | 2016-03-10 19:55:34 | Peugeot_Boxer_2_2_HDi_120_Ps_9_Sitzer_inkl_Klima | privat | Angebot | $7,999 | control | bus | 2010 | manuell | 120 | NaN | 150,000km | 2 | diesel | peugeot | nein | 2016-03-10 00:00:00 | 0 | 30900 | 2016-03-17 08:45:17 |
24 | 2016-04-03 11:57:02 | BMW_535i_xDrive_Sport_Aut. | privat | Angebot | $48,500 | control | limousine | 2014 | automatik | 306 | 5er | 30,000km | 12 | benzin | bmw | nein | 2016-04-03 00:00:00 | 0 | 22547 | 2016-04-07 13:16:50 |
25 | 2016-03-21 21:56:18 | Ford_escort_kombi_an_bastler_mit_ghia_ausstattung | privat | Angebot | $90 | control | kombi | 1996 | manuell | 116 | NaN | 150,000km | 4 | benzin | ford | ja | 2016-03-21 00:00:00 | 0 | 27574 | 2016-04-01 05:16:49 |
26 | 2016-04-03 22:46:28 | Volkswagen_Polo_Fox | privat | Angebot | $777 | control | kleinwagen | 1992 | manuell | 54 | polo | 125,000km | 2 | benzin | volkswagen | nein | 2016-04-03 00:00:00 | 0 | 38110 | 2016-04-05 23:46:48 |
27 | 2016-03-27 18:45:01 | Hat_einer_Ahnung_mit_Ford_Galaxy_HILFE | privat | Angebot | $0 | control | NaN | 2005 | NaN | 0 | NaN | 150,000km | 0 | NaN | ford | NaN | 2016-03-27 00:00:00 | 0 | 66701 | 2016-03-27 18:45:01 |
28 | 2016-03-19 21:56:19 | MINI_Cooper_D | privat | Angebot | $5,250 | control | kleinwagen | 2007 | manuell | 110 | cooper | 150,000km | 7 | diesel | mini | ja | 2016-03-19 00:00:00 | 0 | 15745 | 2016-04-07 14:58:48 |
29 | 2016-04-02 12:45:44 | Mercedes_Benz_E_320_T_CDI_Avantgarde_DPF7_Sitz... | privat | Angebot | $4,999 | test | kombi | 2004 | automatik | 204 | e_klasse | 150,000km | 10 | diesel | mercedes_benz | nein | 2016-04-02 00:00:00 | 0 | 47638 | 2016-04-02 12:45:44 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
49970 | 2016-03-21 22:47:37 | c4_Grand_Picasso_mit_Automatik_Leder_Navi_Temp... | privat | Angebot | $15,800 | control | bus | 2010 | automatik | 136 | c4 | 60,000km | 4 | diesel | citroen | nein | 2016-03-21 00:00:00 | 0 | 14947 | 2016-04-07 04:17:34 |
49971 | 2016-03-29 14:54:12 | W.Lupo_1.0 | privat | Angebot | $950 | test | kleinwagen | 2001 | manuell | 50 | lupo | 150,000km | 4 | benzin | volkswagen | nein | 2016-03-29 00:00:00 | 0 | 65197 | 2016-03-29 20:41:51 |
49972 | 2016-03-26 22:25:23 | Mercedes_Benz_Vito_115_CDI_Extralang_Aut. | privat | Angebot | $3,300 | control | bus | 2004 | automatik | 150 | vito | 150,000km | 10 | diesel | mercedes_benz | ja | 2016-03-26 00:00:00 | 0 | 65326 | 2016-03-28 11:28:18 |
49973 | 2016-03-27 05:32:39 | Mercedes_Benz_SLK_200_Kompressor | privat | Angebot | $6,000 | control | cabrio | 2004 | manuell | 163 | slk | 150,000km | 11 | benzin | mercedes_benz | nein | 2016-03-27 00:00:00 | 0 | 53567 | 2016-03-27 08:25:24 |
49974 | 2016-03-20 10:52:31 | Golf_1_Cabrio_Tuev_Neu_viele_Extras_alles_eing... | privat | Angebot | $0 | control | cabrio | 1983 | manuell | 70 | golf | 150,000km | 2 | benzin | volkswagen | nein | 2016-03-20 00:00:00 | 0 | 8209 | 2016-03-27 19:48:16 |
49975 | 2016-03-27 20:51:39 | Honda_Jazz_1.3_DSi_i_VTEC_IMA_CVT_Comfort | privat | Angebot | $9,700 | control | kleinwagen | 2012 | automatik | 88 | jazz | 100,000km | 11 | hybrid | honda | nein | 2016-03-27 00:00:00 | 0 | 84385 | 2016-04-05 19:45:34 |
49976 | 2016-03-19 18:56:05 | Audi_80_Avant_2.6_E__Vollausstattung!!_Einziga... | privat | Angebot | $5,900 | test | kombi | 1992 | automatik | 150 | 80 | 150,000km | 12 | benzin | audi | nein | 2016-03-19 00:00:00 | 0 | 36100 | 2016-04-07 06:16:44 |
49977 | 2016-03-31 18:37:18 | Mercedes_Benz_C200_Cdi_W203 | privat | Angebot | $5,500 | control | limousine | 2003 | manuell | 116 | c_klasse | 150,000km | 2 | diesel | mercedes_benz | nein | 2016-03-31 00:00:00 | 0 | 33739 | 2016-04-06 12:16:11 |
49978 | 2016-04-04 10:37:14 | Mercedes_Benz_E_200_Classic | privat | Angebot | $900 | control | limousine | 1996 | automatik | 136 | e_klasse | 150,000km | 9 | benzin | mercedes_benz | ja | 2016-04-04 00:00:00 | 0 | 24405 | 2016-04-06 12:44:20 |
49979 | 2016-03-20 18:38:40 | Volkswagen_Polo_1.6_TDI_Style | privat | Angebot | $11,000 | test | kleinwagen | 2011 | manuell | 90 | polo | 70,000km | 11 | diesel | volkswagen | nein | 2016-03-20 00:00:00 | 0 | 48455 | 2016-04-07 01:45:12 |
49980 | 2016-03-12 10:55:54 | Ford_Escort_Turnier_16V | privat | Angebot | $400 | control | kombi | 1995 | manuell | 105 | escort | 125,000km | 3 | benzin | ford | NaN | 2016-03-12 00:00:00 | 0 | 56218 | 2016-04-06 17:16:49 |
49981 | 2016-03-15 09:38:21 | Opel_Astra_Kombi_mit_Anhaengerkupplung | privat | Angebot | $2,000 | control | kombi | 1998 | manuell | 115 | astra | 150,000km | 12 | benzin | opel | nein | 2016-03-15 00:00:00 | 0 | 86859 | 2016-04-05 17:21:46 |
49982 | 2016-03-29 18:51:08 | Skoda_Fabia_4_Tuerer_Bj:2004__85.000Tkm | privat | Angebot | $1,950 | control | kleinwagen | 2004 | manuell | 0 | fabia | 90,000km | 7 | benzin | skoda | NaN | 2016-03-29 00:00:00 | 0 | 45884 | 2016-03-29 18:51:08 |
49983 | 2016-03-06 12:43:04 | Ford_focus_99 | privat | Angebot | $600 | test | kleinwagen | 1999 | manuell | 101 | focus | 150,000km | 4 | benzin | ford | NaN | 2016-03-06 00:00:00 | 0 | 52477 | 2016-03-09 06:16:08 |
49984 | 2016-03-31 22:48:48 | Student_sucht_ein__Anfaengerauto___ab_2000_BJ_... | privat | Angebot | $0 | test | NaN | 2000 | NaN | 0 | NaN | 150,000km | 0 | NaN | sonstige_autos | NaN | 2016-03-31 00:00:00 | 0 | 12103 | 2016-04-02 19:44:53 |
49985 | 2016-04-02 16:38:23 | Verkaufe_meinen_vw_vento! | privat | Angebot | $1,000 | control | NaN | 1995 | automatik | 0 | NaN | 150,000km | 0 | benzin | volkswagen | NaN | 2016-04-02 00:00:00 | 0 | 30900 | 2016-04-06 15:17:52 |
49986 | 2016-04-04 20:46:02 | Chrysler_300C_3.0_CRD_DPF_Automatik_Voll_Ausst... | privat | Angebot | $15,900 | control | limousine | 2010 | automatik | 218 | 300c | 125,000km | 11 | diesel | chrysler | nein | 2016-04-04 00:00:00 | 0 | 73527 | 2016-04-06 23:16:00 |
49987 | 2016-03-22 20:47:27 | Audi_A3_Limousine_2.0_TDI_DPF_Ambition__NAVI__... | privat | Angebot | $21,990 | control | limousine | 2013 | manuell | 150 | a3 | 50,000km | 11 | diesel | audi | nein | 2016-03-22 00:00:00 | 0 | 94362 | 2016-03-26 22:46:06 |
49988 | 2016-03-28 19:49:51 | BMW_330_Ci | privat | Angebot | $9,550 | control | coupe | 2001 | manuell | 231 | 3er | 150,000km | 10 | benzin | bmw | nein | 2016-03-28 00:00:00 | 0 | 83646 | 2016-04-07 02:17:40 |
49989 | 2016-03-11 19:50:37 | VW_Polo_zum_Ausschlachten_oder_Wiederaufbau | privat | Angebot | $150 | test | kleinwagen | 1997 | manuell | 0 | polo | 150,000km | 5 | benzin | volkswagen | ja | 2016-03-11 00:00:00 | 0 | 21244 | 2016-03-12 10:17:55 |
49990 | 2016-03-21 19:54:19 | Mercedes_Benz_A_200__BlueEFFICIENCY__Urban | privat | Angebot | $17,500 | test | limousine | 2012 | manuell | 156 | a_klasse | 30,000km | 12 | benzin | mercedes_benz | nein | 2016-03-21 00:00:00 | 0 | 58239 | 2016-04-06 22:46:57 |
49991 | 2016-03-06 15:25:19 | Kleinwagen | privat | Angebot | $500 | control | NaN | 2016 | manuell | 0 | twingo | 150,000km | 0 | benzin | renault | NaN | 2016-03-06 00:00:00 | 0 | 61350 | 2016-03-06 18:24:19 |
49992 | 2016-03-10 19:37:38 | Fiat_Grande_Punto_1.4_T_Jet_16V_Sport | privat | Angebot | $4,800 | control | kleinwagen | 2009 | manuell | 120 | andere | 125,000km | 9 | lpg | fiat | nein | 2016-03-10 00:00:00 | 0 | 68642 | 2016-03-13 01:44:51 |
49993 | 2016-03-15 18:47:35 | Audi_A3__1_8l__Silber;_schoenes_Fahrzeug | privat | Angebot | $1,650 | control | kleinwagen | 1997 | manuell | 0 | NaN | 150,000km | 7 | benzin | audi | NaN | 2016-03-15 00:00:00 | 0 | 65203 | 2016-04-06 19:46:53 |
49994 | 2016-03-22 17:36:42 | Audi_A6__S6__Avant_4.2_quattro_eventuell_Tausc... | privat | Angebot | $5,000 | control | kombi | 2001 | automatik | 299 | a6 | 150,000km | 1 | benzin | audi | nein | 2016-03-22 00:00:00 | 0 | 46537 | 2016-04-06 08:16:39 |
49995 | 2016-03-27 14:38:19 | Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon | privat | Angebot | $24,900 | control | limousine | 2011 | automatik | 239 | q5 | 100,000km | 1 | diesel | audi | nein | 2016-03-27 00:00:00 | 0 | 82131 | 2016-04-01 13:47:40 |
49996 | 2016-03-28 10:50:25 | Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... | privat | Angebot | $1,980 | control | cabrio | 1996 | manuell | 75 | astra | 150,000km | 5 | benzin | opel | nein | 2016-03-28 00:00:00 | 0 | 44807 | 2016-04-02 14:18:02 |
49997 | 2016-04-02 14:44:48 | Fiat_500_C_1.2_Dualogic_Lounge | privat | Angebot | $13,200 | test | cabrio | 2014 | automatik | 69 | 500 | 5,000km | 11 | benzin | fiat | nein | 2016-04-02 00:00:00 | 0 | 73430 | 2016-04-04 11:47:27 |
49998 | 2016-03-08 19:25:42 | Audi_A3_2.0_TDI_Sportback_Ambition | privat | Angebot | $22,900 | control | kombi | 2013 | manuell | 150 | a3 | 40,000km | 11 | diesel | audi | nein | 2016-03-08 00:00:00 | 0 | 35683 | 2016-04-05 16:45:07 |
49999 | 2016-03-14 00:42:12 | Opel_Vectra_1.6_16V | privat | Angebot | $1,250 | control | limousine | 1996 | manuell | 101 | vectra | 150,000km | 1 | benzin | opel | nein | 2016-03-13 00:00:00 | 0 | 45897 | 2016-04-06 21:18:48 |
50000 rows × 20 columns
# create data file output that reveals characteristics as follows:
# column titles/format
# number of values for each column, excluding blanks
# type of data in each column - numeric or non-numeric
autos.info()
autos.head()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): dateCrawled 50000 non-null object name 50000 non-null object seller 50000 non-null object offerType 50000 non-null object price 50000 non-null object abtest 50000 non-null object vehicleType 44905 non-null object yearOfRegistration 50000 non-null int64 gearbox 47320 non-null object powerPS 50000 non-null int64 model 47242 non-null object odometer 50000 non-null object monthOfRegistration 50000 non-null int64 fuelType 45518 non-null object brand 50000 non-null object notRepairedDamage 40171 non-null object dateCreated 50000 non-null object nrOfPictures 50000 non-null int64 postalCode 50000 non-null int64 lastSeen 50000 non-null object dtypes: int64(5), object(15) memory usage: 7.6+ MB
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
There are inconsistencies among titles in terms of lower case and upper case letters, abbreviations and so on. Titles with more than one word and no space between are in camelcase format.
# execute command to reveal format of each column title.
autos.columns
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model', 'odometer', 'monthOfRegistration', 'fuelType', 'brand', 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object')
# Use the replace function as shown below to modify title format accordingly.
autos_mod = autos.copy()
def clean_col(col):
col = col.replace("dateCrawled", "date_crawled")
col = col.replace("offerType","offer_type")
col = col.replace("abtest","ab_test")
col = col.replace("vehicleType","vehicle_type")
col = col.replace("yearOfRegistration","registration_year")
col = col.replace("gearbox","gear_box")
col = col.replace("powerPS","power_PS")
col = col.replace("monthOfRegistration","registration_month")
col = col.replace("fuelType","fuel_type")
col = col.replace("notRepairedDamage","unrepaired_damage")
col = col.replace("dateCreated","ad_created")
col = col.replace("nrOfPictures","nbr_of_pictures")
col = col.replace("postalCode","postal_code")
col = col.replace("lastSeen","last_seen")
return col
new_columns = []
for label in autos_mod.columns:
clean_label = clean_col(label)
new_columns.append(clean_label)
autos.columns = new_columns
print(autos.columns)
Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'ab_test', 'vehicle_type', 'registration_year', 'gear_box', 'power_PS', 'model', 'odometer', 'registration_month', 'fuel_type', 'brand', 'unrepaired_damage', 'ad_created', 'nbr_of_pictures', 'postal_code', 'last_seen'], dtype='object')
The column heading titles were modified to a standard snakecase format as shown above in the output for consistency of appearance. This should make it a little easier to understand what the information in each column relates to.
The values under column heading "price" are classified as object because of the presence of the "$" sign and inclusion of a comma. If we want to make comparisons between vehicle prices, then we'll need to remove these characters and convert the price data from string to numeric as either int or float.
The values under column heading "odometer" are classified as object because of the presence of "km" in addition to the number values being in string format. Again, if vehicle distance (km) is needed in analysis comparison, then we would need to remove these characters and convert the string values to numeric as int.
# use replace function to remove string type characters from price values.
# use astype function to change price from string format to numeric int.
print("Raw form of price data before cleaning:")
print('\n')
print(autos["price"].unique())
print('\n')
autos["price"] = autos["price"].str.replace("$","")
autos["price"] = autos["price"].str.replace(",","")
autos["price"] = autos["price"].astype(int)
print("Cleaned and converted form of price data as numbers:")
print('\n')
print(autos["price"].unique())
print('\n')
print("Number of Unique Rows = ", autos["price"].unique().shape)
print('\n')
print(autos["price"].describe())
print('\n')
print("NOTE the min. 0 and max. 100000000 values of price.")
print("We will need to identify the unreasonable price range values and remove.")
print('\n')
print("Raw form of odometer data before cleaning:")
print('\n')
print(autos["odometer"].unique())
print('\n')
# use replace function to remove string type characters from odometer values.
# use astype function to change price from string format to numeric int.
autos["odometer"] = autos["odometer"].str.replace("km","")
autos["odometer"] = autos["odometer"].str.replace(",","")
autos["odometer"] = autos["odometer"].astype(int)
# change odometer column heading name
def clean_col(col):
col = col.replace("odometer", "odometer_km")
return col
new_columns = []
for c in autos.columns:
clean_c = clean_col(c)
new_columns.append(clean_c)
autos.columns = new_columns
print(autos.columns)
print('\n')
print("Cleaned and converted form of odometer data as numbers:")
print('\n')
print(autos["odometer_km"].unique())
print('\n')
print(autos["odometer_km"].describe())
print('\n')
print("NOTE that min. 5000 and max. 150000 odometer values appear to be reasonable.")
print("No need to further modify odometer values.")
Raw form of price data before cleaning: ['$5,000' '$8,500' '$8,990' ... '$385' '$22,200' '$16,995'] Cleaned and converted form of price data as numbers: [ 5000 8500 8990 ... 385 22200 16995] Number of Unique Rows = (2357,) count 5.000000e+04 mean 9.840044e+03 std 4.811044e+05 min 0.000000e+00 25% 1.100000e+03 50% 2.950000e+03 75% 7.200000e+03 max 1.000000e+08 Name: price, dtype: float64 NOTE the min. 0 and max. 100000000 values of price. We will need to identify the unreasonable price range values and remove. Raw form of odometer data before cleaning: ['150,000km' '70,000km' '50,000km' '80,000km' '10,000km' '30,000km' '125,000km' '90,000km' '20,000km' '60,000km' '5,000km' '100,000km' '40,000km'] Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'ab_test', 'vehicle_type', 'registration_year', 'gear_box', 'power_PS', 'model', 'odometer_km', 'registration_month', 'fuel_type', 'brand', 'unrepaired_damage', 'ad_created', 'nbr_of_pictures', 'postal_code', 'last_seen'], dtype='object') Cleaned and converted form of odometer data as numbers: [150000 70000 50000 80000 10000 30000 125000 90000 20000 60000 5000 100000 40000] count 50000.000000 mean 125732.700000 std 40042.211706 min 5000.000000 25% 125000.000000 50% 150000.000000 75% 150000.000000 max 150000.000000 Name: odometer_km, dtype: float64 NOTE that min. 5000 and max. 150000 odometer values appear to be reasonable. No need to further modify odometer values.
Price data now is in numeric integer format, however, looking at the min. (0) and max. values (10000000), we need to identify unreasonable price values and remove from the dataset.
The minimum odometer reading of 5,000 km and 150,000 km respectively are reasonable. No need to exclude any odometer values.
# sort price values both descending and ascending.
# observe the top 15 price values and lowest 15 values.
sorted_price = autos.sort_values("price",ascending=False)
top_15 = sorted_price.iloc[:,4].head(15)
print('\n')
print(top_15)
sorted_price = autos.sort_values("price")
lowest_15 = sorted_price.iloc[:,4].head(15)
print('\n')
print(lowest_15)
print('\n')
# obtain more information on price range and decide on reasonable range for prices.
zero_price = autos[(autos["price"] == 0)]
zero_price_percent = len(zero_price) *100 / len(autos)
print("There are ", len(zero_price), " cars with no price or", zero_price_percent, "% of the total")
under_200 = autos[(autos["price"] < 200)]
under_200_percent = len(under_200) *100 / len(autos)
print("There are ", len(under_200), " cars under $200 or", under_200_percent, "% of the total")
over_350000 = autos[(autos["price"] > 350000)]
over_350000_percent = len(over_350000) *100 / len(autos)
print("There are ", len(over_350000), " cars over $350,000 or", over_350000_percent, "% of the total")
print('\n')
price_mod = (autos[autos["price"].between(200, 950000)])
price_cleaned = price_mod["price"]
autos["price_cleaned"] = price_cleaned
autos_clean = (autos[autos["price_cleaned"].notnull()])
print("Number of Cleaned Unique Rows = ", autos["price_cleaned"].unique().shape)
print('\n')
print(autos_clean["price_cleaned"].describe())
print('\n')
autos_clean.info()
print('\n')
price_notnull = autos[autos["price_cleaned"].notnull()]
price_notnull.info()
price_no = autos["price_cleaned"].dropna()
print(len(price_no))
percent_removed = ((len(autos) - len(autos_clean["price_cleaned"])) / len(autos)) * 100
print("Total percentage of cars excluded from analysis = ", percent_removed, "%")
39705 99999999 42221 27322222 27371 12345678 39377 12345678 47598 12345678 2897 11111111 24384 11111111 11137 10000000 47634 3890000 7814 1300000 22947 1234566 43049 999999 514 999999 37585 999990 36818 350000 Name: price, dtype: int64 18089 0 43923 0 15225 0 29499 0 8445 0 38832 0 43925 0 8438 0 35821 0 31332 0 12819 0 33620 0 15208 0 2481 0 4266 0 Name: price, dtype: int64 There are 1421 cars with no price or 2.842 % of the total There are 2341 cars under $200 or 4.682 % of the total There are 14 cars over $350,000 or 0.028 % of the total Number of Cleaned Unique Rows = (2281,) count 47645.000000 mean 6000.707273 std 9110.783444 min 200.000000 25% 1300.000000 50% 3190.000000 75% 7500.000000 max 350000.000000 Name: price_cleaned, dtype: float64 <class 'pandas.core.frame.DataFrame'> Int64Index: 47645 entries, 0 to 49999 Data columns (total 21 columns): date_crawled 47645 non-null object name 47645 non-null object seller 47645 non-null object offer_type 47645 non-null object price 47645 non-null int64 ab_test 47645 non-null object vehicle_type 43396 non-null object registration_year 47645 non-null int64 gear_box 45539 non-null object power_PS 47645 non-null int64 model 45314 non-null object odometer_km 47645 non-null int64 registration_month 47645 non-null int64 fuel_type 43914 non-null object brand 47645 non-null object unrepaired_damage 39041 non-null object ad_created 47645 non-null object nbr_of_pictures 47645 non-null int64 postal_code 47645 non-null int64 last_seen 47645 non-null object price_cleaned 47645 non-null float64 dtypes: float64(1), int64(7), object(13) memory usage: 8.0+ MB <class 'pandas.core.frame.DataFrame'> Int64Index: 47645 entries, 0 to 49999 Data columns (total 21 columns): date_crawled 47645 non-null object name 47645 non-null object seller 47645 non-null object offer_type 47645 non-null object price 47645 non-null int64 ab_test 47645 non-null object vehicle_type 43396 non-null object registration_year 47645 non-null int64 gear_box 45539 non-null object power_PS 47645 non-null int64 model 45314 non-null object odometer_km 47645 non-null int64 registration_month 47645 non-null int64 fuel_type 43914 non-null object brand 47645 non-null object unrepaired_damage 39041 non-null object ad_created 47645 non-null object nbr_of_pictures 47645 non-null int64 postal_code 47645 non-null int64 last_seen 47645 non-null object price_cleaned 47645 non-null float64 dtypes: float64(1), int64(7), object(13) memory usage: 8.0+ MB 47645 Total percentage of cars excluded from analysis = 4.71 %
autos.describe(include='all')
date_crawled | name | seller | offer_type | price | ab_test | vehicle_type | registration_year | gear_box | power_PS | ... | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nbr_of_pictures | postal_code | last_seen | price_cleaned | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 5.000000e+04 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | ... | 50000.000000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 | 47645.000000 |
unique | 48213 | 38754 | 2 | 2 | NaN | 2 | 8 | NaN | 2 | NaN | ... | NaN | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 | NaN |
top | 2016-04-04 16:40:33 | Ford_Fiesta | privat | Angebot | NaN | test | limousine | NaN | manuell | NaN | ... | NaN | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 | NaN |
freq | 3 | 78 | 49999 | 49999 | NaN | 25756 | 12859 | NaN | 36993 | NaN | ... | NaN | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 | NaN |
mean | NaN | NaN | NaN | NaN | 9.840044e+03 | NaN | NaN | 2005.073280 | NaN | 116.355920 | ... | 125732.700000 | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN | 6000.707273 |
std | NaN | NaN | NaN | NaN | 4.811044e+05 | NaN | NaN | 105.712813 | NaN | 209.216627 | ... | 40042.211706 | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN | 9110.783444 |
min | NaN | NaN | NaN | NaN | 0.000000e+00 | NaN | NaN | 1000.000000 | NaN | 0.000000 | ... | 5000.000000 | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN | 200.000000 |
25% | NaN | NaN | NaN | NaN | 1.100000e+03 | NaN | NaN | 1999.000000 | NaN | 70.000000 | ... | 125000.000000 | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN | 1300.000000 |
50% | NaN | NaN | NaN | NaN | 2.950000e+03 | NaN | NaN | 2003.000000 | NaN | 105.000000 | ... | 150000.000000 | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN | 3190.000000 |
75% | NaN | NaN | NaN | NaN | 7.200000e+03 | NaN | NaN | 2008.000000 | NaN | 150.000000 | ... | 150000.000000 | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN | 7500.000000 |
max | NaN | NaN | NaN | NaN | 1.000000e+08 | NaN | NaN | 9999.000000 | NaN | 17700.000000 | ... | 150000.000000 | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN | 350000.000000 |
11 rows × 21 columns
# take a cursory look at each date output format.
print(autos[["date_crawled", "ad_created", "last_seen"]][:5])
date_crawled ad_created last_seen 0 2016-03-26 17:47:46 2016-03-26 00:00:00 2016-04-06 06:45:54 1 2016-04-04 13:38:56 2016-04-04 00:00:00 2016-04-06 14:45:08 2 2016-03-26 18:57:24 2016-03-26 00:00:00 2016-04-06 20:15:37 3 2016-03-12 16:58:10 2016-03-12 00:00:00 2016-03-15 03:16:28 4 2016-04-01 14:38:50 2016-04-01 00:00:00 2016-04-01 14:38:50
# remove time from date, calculate percentage of total each date crawled
# and order dates in sequence.
(autos["date_crawled"].str[:10].value_counts(normalize=True, dropna=False).sort_index())
2016-03-05 0.02538 2016-03-06 0.01394 2016-03-07 0.03596 2016-03-08 0.03330 2016-03-09 0.03322 2016-03-10 0.03212 2016-03-11 0.03248 2016-03-12 0.03678 2016-03-13 0.01556 2016-03-14 0.03662 2016-03-15 0.03398 2016-03-16 0.02950 2016-03-17 0.03152 2016-03-18 0.01306 2016-03-19 0.03490 2016-03-20 0.03782 2016-03-21 0.03752 2016-03-22 0.03294 2016-03-23 0.03238 2016-03-24 0.02910 2016-03-25 0.03174 2016-03-26 0.03248 2016-03-27 0.03104 2016-03-28 0.03484 2016-03-29 0.03418 2016-03-30 0.03362 2016-03-31 0.03192 2016-04-01 0.03380 2016-04-02 0.03540 2016-04-03 0.03868 2016-04-04 0.03652 2016-04-05 0.01310 2016-04-06 0.00318 2016-04-07 0.00142 Name: date_crawled, dtype: float64
There is very little variation in percentage of each day relative to total number of days in the data set. Average percentage looks to be about 3.3. This perhaps shouldn't be surprising since the number of days is about 30 (one month) and each day is 3.3 percent of one month. So, the frequency of times used per day is about the same every day.
(autos["ad_created"].str[:10].value_counts(normalize=True, dropna=False).sort_index())
2015-06-11 0.00002 2015-08-10 0.00002 2015-09-09 0.00002 2015-11-10 0.00002 2015-12-05 0.00002 2015-12-30 0.00002 2016-01-03 0.00002 2016-01-07 0.00002 2016-01-10 0.00004 2016-01-13 0.00002 2016-01-14 0.00002 2016-01-16 0.00002 2016-01-22 0.00002 2016-01-27 0.00006 2016-01-29 0.00002 2016-02-01 0.00002 2016-02-02 0.00004 2016-02-05 0.00004 2016-02-07 0.00002 2016-02-08 0.00002 2016-02-09 0.00004 2016-02-11 0.00002 2016-02-12 0.00006 2016-02-14 0.00004 2016-02-16 0.00002 2016-02-17 0.00002 2016-02-18 0.00004 2016-02-19 0.00006 2016-02-20 0.00004 2016-02-21 0.00006 ... 2016-03-09 0.03324 2016-03-10 0.03186 2016-03-11 0.03278 2016-03-12 0.03662 2016-03-13 0.01692 2016-03-14 0.03522 2016-03-15 0.03374 2016-03-16 0.03000 2016-03-17 0.03120 2016-03-18 0.01372 2016-03-19 0.03384 2016-03-20 0.03786 2016-03-21 0.03772 2016-03-22 0.03280 2016-03-23 0.03218 2016-03-24 0.02908 2016-03-25 0.03188 2016-03-26 0.03256 2016-03-27 0.03090 2016-03-28 0.03496 2016-03-29 0.03414 2016-03-30 0.03344 2016-03-31 0.03192 2016-04-01 0.03380 2016-04-02 0.03508 2016-04-03 0.03892 2016-04-04 0.03688 2016-04-05 0.01184 2016-04-06 0.00326 2016-04-07 0.00128 Name: ad_created, Length: 76, dtype: float64
Results above show nothing relevant prior to month of March in 2016 which is the date range for crawled dates. Ad_created percentages for each day in March is much the same as for crawled dates.
(autos["last_seen"].str[:10].value_counts(normalize=True, dropna=False).sort_index())
2016-03-05 0.00108 2016-03-06 0.00442 2016-03-07 0.00536 2016-03-08 0.00760 2016-03-09 0.00986 2016-03-10 0.01076 2016-03-11 0.01252 2016-03-12 0.02382 2016-03-13 0.00898 2016-03-14 0.01280 2016-03-15 0.01588 2016-03-16 0.01644 2016-03-17 0.02792 2016-03-18 0.00742 2016-03-19 0.01574 2016-03-20 0.02070 2016-03-21 0.02074 2016-03-22 0.02158 2016-03-23 0.01858 2016-03-24 0.01956 2016-03-25 0.01920 2016-03-26 0.01696 2016-03-27 0.01602 2016-03-28 0.02086 2016-03-29 0.02234 2016-03-30 0.02484 2016-03-31 0.02384 2016-04-01 0.02310 2016-04-02 0.02490 2016-04-03 0.02536 2016-04-04 0.02462 2016-04-05 0.12428 2016-04-06 0.22100 2016-04-07 0.13092 Name: last_seen, dtype: float64
Last_seen date relates to when the crawler saw this ad last online. For the data output above, I really don't have anything to comment on or surmise.
# use describe command to show basic statistics for registration_year.
autos["registration_year"].describe()
count 50000.000000 mean 2005.073280 std 105.712813 min 1000.000000 25% 1999.000000 50% 2003.000000 75% 2008.000000 max 9999.000000 Name: registration_year, dtype: float64
Minimum registration year 1000 and maximum 9999 are not reasonable.
# determine percentage of values outside of years 1900 and 2016.
registration_outliers = (autos["registration_year"][(autos["registration_year"] > 2016) | (autos["registration_year"] < 1900)])
print("Qty.", len(registration_outliers), "cars outside of 1900 and 2016 year range")
print("Percentage outside of 1900 and 2016", len(registration_outliers)*100 / len(autos_mod), "%")
print('\n')
registration_outliers2 = (autos["registration_year"][(autos["registration_year"] > 2016) | (autos["registration_year"] < 1950)])
# determine percentage of values outside of years 1950 and 2016.
print("Qty.", len(registration_outliers2), "cars outside of 1950 and 2016 year range")
print("Percentage outside of 1950 and 2016", len(registration_outliers2)*100 / len(autos_mod), "%")
print('\n')
# establish reasonable range of years 1950 to 2016 included.
registration_mod = (autos[autos["registration_year"].between(1950, 2016)])
print("Qty.", len(registration_mod), "cars between the years 1950 and 2016")
reg_yrs_cleaned = registration_mod["registration_year"]
autos["reg_yrs_cleaned"] = reg_yrs_cleaned
print('\n')
print(reg_yrs_cleaned.value_counts(normalize=True))
print('\n')
autos.info
price_no = autos["price_cleaned"].dropna()
print(len(price_no))
Qty. 1972 cars outside of 1900 and 2016 year range Percentage outside of 1900 and 2016 3.944 % Qty. 1996 cars outside of 1950 and 2016 year range Percentage outside of 1950 and 2016 3.992 % Qty. 48004 cars between the years 1950 and 2016 2000 0.069869 2005 0.062807 1999 0.062495 2004 0.057016 2003 0.056808 2006 0.056412 2001 0.056308 2002 0.052766 1998 0.051100 2007 0.047996 2008 0.046475 2009 0.043705 1997 0.042246 2011 0.034039 2010 0.033268 1996 0.030081 2012 0.027560 2016 0.027414 1995 0.027352 2013 0.016790 2014 0.013874 1994 0.013749 1993 0.009270 2015 0.008312 1990 0.008228 1992 0.008145 1991 0.007416 1989 0.003771 1988 0.002958 1985 0.002187 ... 1982 0.000896 1972 0.000729 1979 0.000729 1960 0.000708 1981 0.000646 1967 0.000562 1976 0.000562 1971 0.000562 1968 0.000542 1973 0.000542 1974 0.000500 1966 0.000458 1977 0.000458 1969 0.000396 1975 0.000396 1965 0.000354 1964 0.000250 1963 0.000187 1959 0.000146 1961 0.000125 1956 0.000104 1962 0.000083 1958 0.000083 1950 0.000062 1955 0.000042 1957 0.000042 1954 0.000042 1951 0.000042 1953 0.000021 1952 0.000021 Name: registration_year, Length: 67, dtype: float64 47645
The percentage of data to be excluded outside of the years 1900 and 2016 is very small at 3.94%. By including years before 1950 as to be excluded, the percentage barely increased to 3.99%. On that basis, I chose to go with a reasonable year range of 1950 to 2016 inclusive.
# create a smaller DataFrame to work with.
cols = ["brand", "price_cleaned", "odometer_km", "reg_yrs_cleaned", "power_PS"]
autos_brand = autos[cols]
print('\n')
print(autos_brand.head())
print(len(autos_brand))
print('\n')
print(len(autos_brand.dropna()))
print('\n')
autos_final = autos_brand.dropna()
print(autos_final.info())
print('\n')
# determine qty of price values for each brand and sequence
# brands from highest volume of price values to lowest.
print(autos["brand"].value_counts())
print('\n')
autos["brand"].unique()
print('\n')
# determine top brands and qty to focus remainder of analysis on.
top8 = 10687+5461+5429+4734+4283+3479+2404+1456
print(top8, "= total cars of top 10 brands")
top8_brands_percent = 100*(8/40)
print('\n')
Percent_of_total_cars = 100*(top8/50000)
print("The top 8 brands represent",top8_brands_percent, "% of all the brands")
print('\n')
print("The top 8 brands account for", Percent_of_total_cars, "% of the total cars in the dataset")
brand price_cleaned odometer_km reg_yrs_cleaned power_PS 0 peugeot 5000.0 150000 2004.0 158 1 bmw 8500.0 150000 1997.0 286 2 volkswagen 8990.0 70000 2009.0 102 3 smart 4350.0 70000 2007.0 71 4 ford 1350.0 150000 2003.0 0 50000 45769 <class 'pandas.core.frame.DataFrame'> Int64Index: 45769 entries, 0 to 49999 Data columns (total 5 columns): brand 45769 non-null object price_cleaned 45769 non-null float64 odometer_km 45769 non-null int64 reg_yrs_cleaned 45769 non-null float64 power_PS 45769 non-null int64 dtypes: float64(2), int64(2), object(1) memory usage: 2.1+ MB None volkswagen 10687 opel 5461 bmw 5429 mercedes_benz 4734 audi 4283 ford 3479 renault 2404 peugeot 1456 fiat 1308 seat 941 skoda 786 mazda 757 nissan 754 smart 701 citroen 701 toyota 617 sonstige_autos 546 hyundai 488 volvo 457 mini 424 mitsubishi 406 honda 399 kia 356 alfa_romeo 329 porsche 294 suzuki 293 chevrolet 283 chrysler 181 dacia 129 daihatsu 128 jeep 110 subaru 109 land_rover 99 saab 80 daewoo 79 trabant 78 jaguar 77 rover 69 lancia 57 lada 31 Name: brand, dtype: int64 37933 = total cars of top 10 brands The top 8 brands represent 20.0 % of all the brands The top 8 brands account for 75.866 % of the total cars in the dataset
When faced with the challenge of deciding what to focus on whether prioritizing or performing root cause analysis, the Pareto Principle is often used. Oftentimes 20% of a list of categories accounts for 80% of the total values associated with the categories. That is what the Pareto Principle is based on.
In our case here, the category is "brand" and the associated values is "price". It turns out that 20% of the brands (8/40) accounts for 76 percent of the list of prices: close to following the Pareto Principle.
Normally I would choose to follow the Pareto Principle (choose 8 brands), but I will follow the Guided Project suggestion of looking at the top 6.
avg_price_by_brand = {}
# calculate average price for each of the top 8 brands.
brand = autos_final["brand"].unique()
print(len(autos_final))
for b in brand:
selected_rows = autos_final[autos_final["brand"] == b]
if b == "volkswagen":
mean = selected_rows["price_cleaned"].mean()
avg_price_by_brand[b] = int(mean)
elif b == "opel":
mean = selected_rows["price_cleaned"].mean()
avg_price_by_brand[b] = int(mean)
elif b == "bmw":
mean = selected_rows["price_cleaned"].mean()
avg_price_by_brand[b] = int(mean)
elif b == "mercedes_benz":
mean = selected_rows["price_cleaned"].mean()
avg_price_by_brand[b] = int(mean)
elif b == "audi":
mean = selected_rows["price_cleaned"].mean()
avg_price_by_brand[b] = int(mean)
elif b == "ford":
mean = selected_rows["price_cleaned"].mean()
avg_price_by_brand[b] = int(mean)
else:
data = 0
avg_price_by_brand
45769
{'audi': 9406, 'bmw': 8402, 'ford': 3870, 'mercedes_benz': 8688, 'opel': 3078, 'volkswagen': 5502}
Brand | Avg_Price_Dollars |
---|---|
audi | 9406 |
mercedes_benz | 8688 |
bmw | 8402 |
volkswagen | 5502 |
ford | 3870 |
opel | 3078 |
Regarding average price by brand as shown in the table above, there appears to be three groups of price ranges:
We need to investigate other factors to try and determine reasons for significant difference in average prices between brands.
avg_km_by_brand = {}
# calculate average odometer readings in km for each of the top 8 brands.
brand = autos_final["brand"].unique()
print(len(autos_final))
for b in brand:
selected_rows = autos_final[autos_final["brand"] == b]
if b == "volkswagen":
mean = selected_rows["odometer_km"].mean()
avg_km_by_brand[b] = int(mean)
elif b == "opel":
mean = selected_rows["odometer_km"].mean()
avg_km_by_brand[b] = int(mean)
elif b == "bmw":
mean = selected_rows["odometer_km"].mean()
avg_km_by_brand[b] = int(mean)
elif b == "mercedes_benz":
mean = selected_rows["odometer_km"].mean()
avg_km_by_brand[b] = int(mean)
elif b == "audi":
mean = selected_rows["odometer_km"].mean()
avg_km_by_brand[b] = int(mean)
elif b == "ford":
mean = selected_rows["odometer_km"].mean()
avg_km_by_brand[b] = int(mean)
else:
data = 0
avg_km_by_brand
45769
{'audi': 129260, 'bmw': 132817, 'ford': 124208, 'mercedes_benz': 131135, 'opel': 129253, 'volkswagen': 128777}
Brand | Avg_kms |
---|---|
bmw | 132817 |
mercedes_benz | 131135 |
audi | 129260 |
opel | 129253 |
volkswagen | 128777 |
ford | 124208 |
As we see in the table above, there's not much difference in average odometer_km between brands.
The next step is to create a table with both average price and average km by brand and see if there is an obvious relationship between the two.
# convert dictionary to series format.
mean_price_by_brand = pd.Series(avg_price_by_brand)
mean_km_by_brand = pd.Series(avg_km_by_brand).sort_values(ascending=False)
Top6_Table = pd.DataFrame(mean_km_by_brand, columns=["mean_km"])
# add mean_price data as second column to existing series mean_km.
Top6_Table["mean_price"] = mean_price_by_brand
print(Top6_Table)
mean_km mean_price bmw 132817 8402 mercedes_benz 131135 8688 audi 129260 9406 opel 129253 3078 volkswagen 128777 5502 ford 124208 3870
Brand | Avg_kms | Avg_Price |
---|---|---|
bmw | 132817 | 8402 |
mercedes_benz | 131135 | 8688 |
audi | 129260 | 9406 |
opel | 129253 | 3078 |
volkswagen | 128777 | 5502 |
ford | 124208 | 3870 |
There is no obvious relationship between average price and average km driven for each brand to explain price differences. The magnitude of differences in average km by brand is too small to establish a trend if there really is one.
There are many other factors that contribute to car price, one of which may be engine power rating. In this dataset, that information is in the column heading "power_PS". PS stands for Pferdestrke in German (literally, 'horse strength').
Let's see if there's a relationship between brand engine power and price.
# take a look at number of unique power_PS values and some of its statistics.
print(autos_final["power_PS"].unique())
print('\n')
print("Number of Unique Rows = ", autos_final["power_PS"].unique().shape)
print('\n')
print(autos_final["power_PS"].describe())
sorted_power_PS = autos_final.sort_values("power_PS",ascending=False)
top_15 = sorted_power_PS.iloc[:,4].head(15)
print('\n')
print(top_15)
sorted_power_PS = autos_final.sort_values("power_PS")
lowest_15 = sorted_power_PS.iloc[:,4].head(15)
print('\n')
print(lowest_15)
print('\n')
# establish a reasonable range of power_PS values to include in the analysis.
autos_final = (autos_final[autos_final["power_PS"].between(300, 2000)]).reset_index(drop=True)
print("Number of Cleaned Unique Rows = ", autos_final["power_PS"].unique().shape)
print('\n')
print(autos_final["power_PS"].describe())
[ 158 286 102 71 0 150 90 101 75 203 205 124 131 320 184 120 306 54 110 204 218 140 193 122 60 86 43 95 435 84 116 58 80 121 179 57 136 108 65 170 143 113 177 163 385 63 200 151 85 109 145 127 250 105 50 126 160 183 81 174 125 141 224 239 560 55 88 344 232 135 45 73 129 103 436 53 68 211 245 128 241 115 231 325 235 197 133 111 260 82 83 185 173 92 61 190 74 100 44 70 147 225 107 69 64 517 230 315 1 370 360 155 258 326 33 94 192 156 52 130 180 117 275 114 256 299 207 104 118 87 394 220 272 175 76 280 420 98 333 99 77 165 89 148 106 400 188 171 313 247 59 26 67 387 265 330 367 401 345 48 381 328 223 72 310 6512 382 40 146 169 182 5 408 210 132 10 186 79 123 166 41 194 1003 279 196 139 11 176 340 300 284 264 209 380 1988 295 206 234 269 349 900 96 525 47 37 409 350 144 354 355 152 78 292 271 321 212 334 34 226 6045 97 480 18 91 240 62 164 277 1781 1001 343 154 178 222 201 1793 213 398 305 215 323 39 134 112 66 187 612 56 500 276 268 287 476 46 290 119 329 15 282 233 228 270 138 514 30 259 431 137 551 238 1779 507 142 199 610 252 237 365 450 317 371 198 368 640 515 273 51 4 16 457 1016 42 281 405 309 2 850 379 449 348 208 555 93 1300 181 6226 219 311 3500 27 544 296 167 168 579 285 35 14 161 1400 262 304 1090 487 242 254 4400 460 29 25 23 455 19 454 153 9 559 6 162 202 14009 630 564 571 1055 1082 157 521 1103 432 502 1202 149 1056 396 189 227 298 195 754 243 322 266 15001 489 2729 375 341 682 216 7511 356 485 24 1796 439 20 49 172 441 1398 374 1780 253 510 1800 3 678 347 600 530 191 386 1000 490 421 1367 415 430 404 950 362 587 16312 388 8 1011 17700 1771 249 407 236 318 1753 952 740 585 8404 21 5867 473 504 289 214 1405 696 1704 337 923 442 1401 650 1998 38] Number of Unique Rows = (434,) count 45769.000000 mean 119.204527 std 185.487337 min 0.000000 25% 75.000000 50% 109.000000 75% 150.000000 max 17700.000000 Name: power_PS, dtype: float64 36421 17700 35039 16312 22592 15001 16743 14009 41172 8404 24943 7511 1699 6512 11009 6226 3753 6045 41673 5867 14608 4400 11311 3500 23660 2729 49263 1998 2670 1988 Name: power_PS, dtype: int64 31073 0 45606 0 31052 0 31065 0 8532 0 13545 0 31082 0 3217 0 13539 0 42398 0 42401 0 3222 0 31091 0 13530 0 3225 0 Name: power_PS, dtype: int64 Number of Cleaned Unique Rows = (151,) count 952.000000 mean 403.539916 std 212.113620 min 300.000000 25% 313.000000 50% 340.500000 75% 405.000000 max 1998.000000 Name: power_PS, dtype: float64
avg_power_by_brand = {}
# calculate average power_PS for each of the top 8 brands.
brand = autos_final["brand"].unique()
for b in brand:
selected_rows = autos_final[autos_final["brand"] == b]
if b == "volkswagen":
mean = selected_rows["power_PS"].mean()
avg_power_by_brand[b] = int(mean)
elif b == "opel":
mean = selected_rows["power_PS"].mean()
avg_power_by_brand[b] = int(mean)
elif b == "bmw":
mean = selected_rows["power_PS"].mean()
avg_power_by_brand[b] = int(mean)
elif b == "mercedes_benz":
mean = selected_rows["power_PS"].mean()
avg_power_by_brand[b] = int(mean)
elif b == "audi":
mean = selected_rows["power_PS"].mean()
avg_power_by_brand[b] = int(mean)
elif b == "ford":
mean = selected_rows["power_PS"].mean()
avg_power_by_brand[b] = int(mean)
else:
data = 0
avg_power_by_brand
{'audi': 368, 'bmw': 349, 'ford': 516, 'mercedes_benz': 403, 'opel': 774, 'volkswagen': 508}
Brand | Avg_Price_Dollars | Avg_Engine_Power |
---|---|---|
audi | 9406 | 368 |
mercedes_benz | 8688 | 403 |
bmw | 8402 | 349 |
volkswagen | 5502 | 508 |
ford | 3870 | 516 |
opel | 3078 | 774 |
The relationship between average price and average engine power by brand as shown in the table above seems to be opposite to what I would expect. I would expect higher price for higher engine power.
However, it's possible that the units of engine power may differ by brand. So, nothing conclusive here regarding avg price and avg engine power.
I would expect as km used by car increases, price would reduce. So, as km increases, the ratio of km/price would also increase, since the numerator increases while the denominator reduces.
Let's investigate the relationship between avg km/price and avg km. We will analyze the relationship within one specific brand - Volkswagen.
# create a smaller DataFrame.
selection = ["brand", "price_cleaned", "odometer_km"]
autos_final = autos_final[selection]
km_price_ratio = autos_final["odometer_km"] / autos_final["price_cleaned"]
autos_final["km_price_ratio"] = km_price_ratio
print(autos_final.head())
print('\n')
avg_km_by_ratio = {}
odometer = autos_final["odometer_km"].unique()
brand = autos_final["brand"].unique()
for b in brand:
selected_rows = autos_final[autos_final["brand"] == b]
if b == "volkswagen":
for o in odometer:
selected_odo = autos_final[autos_final["odometer_km"] == o]
mean = selected_odo["km_price_ratio"].mean()
avg_km_by_ratio[o] = int(mean)
else:
data = 0
avg_km_by_ratio
mean_km_by_ratio = pd.Series(avg_km_by_ratio)
Correlation_Table = pd.DataFrame(mean_km_by_ratio, columns = ["km/price"])
print("The table below relates to Volkswagen Brand Auto")
print("The first column is odometer reading in km")
print("The second column is km divided by price")
print('\n')
print(Correlation_Table)
brand price_cleaned odometer_km km_price_ratio 0 porsche 41500.0 150000 3.614458 1 bmw 48500.0 30000 0.618557 2 mercedes_benz 56000.0 60000 1.071429 3 jaguar 36000.0 40000 1.111111 4 bmw 50500.0 100000 1.980198 The table below relates to Volkswagen Brand Auto The first column is odometer reading in km The second column is km divided by price km/price 5000 1 10000 0 20000 1 30000 0 40000 0 50000 1 60000 2 70000 3 80000 3 90000 3 100000 5 125000 9 150000 24
Avg km | km price ratio |
---|---|
5000 | 4 |
10000 | 2 |
20000 | 5 |
30000 | 5 |
40000 | 6 |
50000 | 8 |
60000 | 13 |
70000 | 17 |
80000 | 22 |
90000 | 32 |
100000 | 39 |
125000 | 59 |
150000 | 112 |
As expected, we see in the table above that there is a strong relationship between avg km and avg km/price for Volkswagen. It may not be a linear relationship but most likely a curved one. This suggests that as car km usage increases the magnitude of price drop increases non-linearly.
# use replace function to convert some string data from German to English.
print("German information:")
print('\n')
print(autos["gear_box"].unique())
print('\n')
autos["gear_box"] = autos["gear_box"].str.replace("manuell","manual")
autos["gear_box"] = autos["gear_box"].str.replace("automatik","automatic")
print("English information:")
print('\n')
print(autos["gear_box"].unique())
German information: ['manuell' 'automatik' nan] English information: ['manual' 'automatic' nan]
print(autos["date_crawled"][:5])
print('\n')
# use replace and astype functions to convert dates from string to numeric integer.
autos["date_crawled"] = autos["date_crawled"].str[:10]
autos["date_crawled"] = autos["date_crawled"].str.replace("-","")
autos["date_crawled"] = autos["date_crawled"].astype(int)
date_integer = autos["date_crawled"]
print(date_integer[:5])
0 2016-03-26 17:47:46 1 2016-04-04 13:38:56 2 2016-03-26 18:57:24 3 2016-03-12 16:58:10 4 2016-04-01 14:38:50 Name: date_crawled, dtype: object 0 20160326 1 20160404 2 20160326 3 20160312 4 20160401 Name: date_crawled, dtype: int64
We'll focus on determining the top 3 model cars within each of the top 3 brands: Volkswagen, Opel and BMW.
# remove any empty cells from the columns of interest.
cols = ["brand", "model"]
autos_combo = autos[cols]
print(autos_combo.head())
print(len(autos_combo))
print('\n')
print(len(autos_combo.dropna()))
print('\n')
# determine the qty of specific models within each brand chosen.
selected_rows = autos_combo[autos_combo["brand"] == "volkswagen"]
print(selected_rows["model"].value_counts())
print('\n')
selected_rows = autos_combo[autos_combo["brand"] == "opel"]
print(selected_rows["model"].value_counts())
print('\n')
selected_rows = autos_combo[autos_combo["brand"] == "bmw"]
print(selected_rows["model"].value_counts())
brand model 0 peugeot andere 1 bmw 7er 2 volkswagen golf 3 smart fortwo 4 ford focus 50000 47242 golf 4024 polo 1757 passat 1425 transporter 704 touran 468 lupo 349 sharan 233 caddy 225 beetle 132 tiguan 121 bora 105 touareg 98 andere 98 fox 90 scirocco 89 eos 67 kaefer 63 up 52 jetta 39 phaeton 32 cc 20 amarok 7 Name: model, dtype: int64 corsa 1735 astra 1454 vectra 574 zafira 423 omega 202 andere 157 meriva 147 tigra 96 insignia 88 signum 69 agila 62 vivaro 57 combo 56 kadett 53 calibra 26 antara 22 Name: model, dtype: int64 3er 2761 5er 1183 1er 538 x_reihe 302 7er 135 z_reihe 121 m_reihe 47 andere 42 6er 32 i3 1 Name: model, dtype: int64
Based on the distribution of models within each brand shown above, it looks like each case may follow the Pareto Principle.
We'll summarize the top three models for each brand in the table below.
Brand | Model | Qty |
---|---|---|
Volkswagen | Golf | 4024 |
Volkswagen | Polo | 1757 |
Volkswagen | Passat | 1425 |
Opel | Corsa | 1735 |
Opel | Astra | 1454 |
Opel | Vectra | 574 |
BMW | 3er | 2761 |
BMW | 5er | 1183 |
BMW | 1er | 538 |
The top 3 Volkswagen models account for 71 percent of the total Volkswagen cars.
The top 3 Opel models account for 72 percent of the total Opel cars.
The top 3 BMW models account for 87 percent of the total BMW cars.
# perform some clean up on unrepaired_damage data.
print(autos["unrepaired_damage"].unique())
print(autos["unrepaired_damage"].value_counts())
print('\n')
autos["unrepaired_damage"] = autos["unrepaired_damage"].str.replace("nein","no")
autos["unrepaired_damage"] = autos["unrepaired_damage"].str.replace("ja","yes")
print(autos["unrepaired_damage"].value_counts())
cols = ["unrepaired_damage", "price_cleaned"]
autos_compare = autos[cols]
print(autos_compare.head())
print(len(autos_compare))
print('\n')
print(len(autos_compare.dropna()))
print('\n')
avg_price_by_repair = {}
repair = autos_compare["unrepaired_damage"].unique()
print(len(autos_compare))
print('\n')
# calculate mean price for cars repaired and cars not repaired.
for r in repair:
selected_rows = autos_compare[autos_compare["unrepaired_damage"] == r]
if r == "yes":
mean = selected_rows["price_cleaned"].mean()
avg_price_by_repair[r] = int(mean)
elif r == "no":
mean = selected_rows["price_cleaned"].mean()
avg_price_by_repair[r] = int(mean)
avg_price_by_repair
['nein' nan 'ja'] nein 35232 ja 4939 Name: unrepaired_damage, dtype: int64 no 35232 yes 4939 Name: unrepaired_damage, dtype: int64 unrepaired_damage price_cleaned 0 no 5000.0 1 no 8500.0 2 no 8990.0 3 no 4350.0 4 no 1350.0 50000 39041 50000
{'no': 7120, 'yes': 2343}
With "Yes" meaning damage was NOT repaired and "No" meaning any damage was repaired, it makes sense that the average price for an unrepaired car (2,343) is a lot less than a repaired car (7,120).
Perhaps the sales people make significant more money by repairing the cars than not repairing them in that I don't think repair cost on average is (7,120 - 2,343 = 4,777).
Various questions were posed throughout this project regarding the car sales dataset. This provided much opportunity to explore the information and become familiar with the process of cleaning data in order to answer the questions.
Based on the data provided regarding car brand, odometer readings in km and sales price, there was nothing conclusive about any relationship between these three. There weren't enough variables (car size, included features, etc.) included in the dataset to conclude what the top factors were that created significant difference in sales prices for the top six models listed: Volkswagen, Opel, BMW, Mercedes_Benz, Audi and Ford.