As a data analyst in MBOA company, I have been appointed to explore a dataset of used cars from eBay Kleinanzeigen, containing information about different types of entertainment published in the advert section of the German eBay website. The purpose of this project is to clean the data and analyze the included used car listings.
Each advert in the dataset provides information as follow :
dateCrawled
: When this ad was first crawled. All field-values are taken from this date.name
: Name of the car.seller
: Whether the seller is private or a dealer.offertype
: The type of listing.price
: The price on the ad to sell the car.abtest
: Whether the listing is included in an A/B test.vehicletype
: The motor-car type.yearOfRegistration
: The year in which the car was first registered.gearbox
: The transmission type.powerPs
: The power of the vehicle in PS.model
: The motor-car model name.kilometer
: How many kilometers the car has driven.monthOfRegistration
: The month in which the car was first registered.fuelType
: What type of fuel the car uses.brand
: The brand of the car.notRepairedDamage
: If the car has a damage which is not yet fixed.dateCreated
: The date on which the eBay listing was created.nrOfPictures
: The number of picutres in the ad.postalCode
: The postal code for the location of the vehicle.lastSeenOnline
: When the crawler saw this ad last online.Firstly, let's read our dataset named autos.csv and store it to a variable named autos
.
#Importing the pandas and Numpy libraries.
import numpy as np
'''
Library : Numpy
Alias : np
'''
import pandas as pd
'''
Library : Pandas
Alias : pd
'''
autos = pd.read_csv("autos.csv", encoding = "Latin-1")
autos
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
5 | 2016-03-21 13:47:45 | Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto... | privat | Angebot | $7,900 | test | bus | 2006 | automatik | 150 | voyager | 150,000km | 4 | diesel | chrysler | NaN | 2016-03-21 00:00:00 | 0 | 22962 | 2016-04-06 09:45:21 |
6 | 2016-03-20 17:55:21 | VW_Golf_III_GT_Special_Electronic_Green_Metall... | privat | Angebot | $300 | test | limousine | 1995 | manuell | 90 | golf | 150,000km | 8 | benzin | volkswagen | NaN | 2016-03-20 00:00:00 | 0 | 31535 | 2016-03-23 02:48:59 |
7 | 2016-03-16 18:55:19 | Golf_IV_1.9_TDI_90PS | privat | Angebot | $1,990 | control | limousine | 1998 | manuell | 90 | golf | 150,000km | 12 | diesel | volkswagen | nein | 2016-03-16 00:00:00 | 0 | 53474 | 2016-04-07 03:17:32 |
8 | 2016-03-22 16:51:34 | Seat_Arosa | privat | Angebot | $250 | test | NaN | 2000 | manuell | 0 | arosa | 150,000km | 10 | NaN | seat | nein | 2016-03-22 00:00:00 | 0 | 7426 | 2016-03-26 18:18:10 |
9 | 2016-03-16 13:47:02 | Renault_Megane_Scenic_1.6e_RT_Klimaanlage | privat | Angebot | $590 | control | bus | 1997 | manuell | 90 | megane | 150,000km | 7 | benzin | renault | nein | 2016-03-16 00:00:00 | 0 | 15749 | 2016-04-06 10:46:35 |
10 | 2016-03-15 01:41:36 | VW_Golf_Tuning_in_siber/grau | privat | Angebot | $999 | test | NaN | 2017 | manuell | 90 | NaN | 150,000km | 4 | benzin | volkswagen | nein | 2016-03-14 00:00:00 | 0 | 86157 | 2016-04-07 03:16:21 |
11 | 2016-03-16 18:45:34 | Mercedes_A140_Motorschaden | privat | Angebot | $350 | control | NaN | 2000 | NaN | 0 | NaN | 150,000km | 0 | benzin | mercedes_benz | NaN | 2016-03-16 00:00:00 | 0 | 17498 | 2016-03-16 18:45:34 |
12 | 2016-03-31 19:48:22 | Smart_smart_fortwo_coupe_softouch_pure_MHD_Pan... | privat | Angebot | $5,299 | control | kleinwagen | 2010 | automatik | 71 | fortwo | 50,000km | 9 | benzin | smart | nein | 2016-03-31 00:00:00 | 0 | 34590 | 2016-04-06 14:17:52 |
13 | 2016-03-23 10:48:32 | Audi_A3_1.6_tuning | privat | Angebot | $1,350 | control | limousine | 1999 | manuell | 101 | a3 | 150,000km | 11 | benzin | audi | nein | 2016-03-23 00:00:00 | 0 | 12043 | 2016-04-01 14:17:13 |
14 | 2016-03-23 11:50:46 | Renault_Clio_3__Dynamique_1.2__16_V;_viele_Ver... | privat | Angebot | $3,999 | test | kleinwagen | 2007 | manuell | 75 | clio | 150,000km | 9 | benzin | renault | NaN | 2016-03-23 00:00:00 | 0 | 81737 | 2016-04-01 15:46:47 |
15 | 2016-04-01 12:06:20 | Corvette_C3_Coupe_T_Top_Crossfire_Injection | privat | Angebot | $18,900 | test | coupe | 1982 | automatik | 203 | NaN | 80,000km | 6 | benzin | sonstige_autos | nein | 2016-04-01 00:00:00 | 0 | 61276 | 2016-04-02 21:10:48 |
16 | 2016-03-16 14:59:02 | Opel_Vectra_B_Kombi | privat | Angebot | $350 | test | kombi | 1999 | manuell | 101 | vectra | 150,000km | 5 | benzin | opel | nein | 2016-03-16 00:00:00 | 0 | 57299 | 2016-03-18 05:29:37 |
17 | 2016-03-29 11:46:22 | Volkswagen_Scirocco_2_G60 | privat | Angebot | $5,500 | test | coupe | 1990 | manuell | 205 | scirocco | 150,000km | 6 | benzin | volkswagen | nein | 2016-03-29 00:00:00 | 0 | 74821 | 2016-04-05 20:46:26 |
18 | 2016-03-26 19:57:44 | Verkaufen_mein_bmw_e36_320_i_touring | privat | Angebot | $300 | control | bus | 1995 | manuell | 150 | 3er | 150,000km | 0 | benzin | bmw | NaN | 2016-03-26 00:00:00 | 0 | 54329 | 2016-04-02 12:16:41 |
19 | 2016-03-17 13:36:21 | mazda_tribute_2.0_mit_gas_und_tuev_neu_2018 | privat | Angebot | $4,150 | control | suv | 2004 | manuell | 124 | andere | 150,000km | 2 | lpg | mazda | nein | 2016-03-17 00:00:00 | 0 | 40878 | 2016-03-17 14:45:58 |
20 | 2016-03-05 19:57:31 | Audi_A4_Avant_1.9_TDI_*6_Gang*AHK*Klimatronik*... | privat | Angebot | $3,500 | test | kombi | 2003 | manuell | 131 | a4 | 150,000km | 5 | diesel | audi | NaN | 2016-03-05 00:00:00 | 0 | 53913 | 2016-03-07 05:46:46 |
21 | 2016-03-06 19:07:10 | Porsche_911_Carrera_4S_Cabrio | privat | Angebot | $41,500 | test | cabrio | 2004 | manuell | 320 | 911 | 150,000km | 4 | benzin | porsche | nein | 2016-03-06 00:00:00 | 0 | 65428 | 2016-04-05 23:46:19 |
22 | 2016-03-28 20:50:54 | MINI_Cooper_S_Cabrio | privat | Angebot | $25,450 | control | cabrio | 2015 | manuell | 184 | cooper | 10,000km | 1 | benzin | mini | nein | 2016-03-28 00:00:00 | 0 | 44789 | 2016-04-01 06:45:30 |
23 | 2016-03-10 19:55:34 | Peugeot_Boxer_2_2_HDi_120_Ps_9_Sitzer_inkl_Klima | privat | Angebot | $7,999 | control | bus | 2010 | manuell | 120 | NaN | 150,000km | 2 | diesel | peugeot | nein | 2016-03-10 00:00:00 | 0 | 30900 | 2016-03-17 08:45:17 |
24 | 2016-04-03 11:57:02 | BMW_535i_xDrive_Sport_Aut. | privat | Angebot | $48,500 | control | limousine | 2014 | automatik | 306 | 5er | 30,000km | 12 | benzin | bmw | nein | 2016-04-03 00:00:00 | 0 | 22547 | 2016-04-07 13:16:50 |
25 | 2016-03-21 21:56:18 | Ford_escort_kombi_an_bastler_mit_ghia_ausstattung | privat | Angebot | $90 | control | kombi | 1996 | manuell | 116 | NaN | 150,000km | 4 | benzin | ford | ja | 2016-03-21 00:00:00 | 0 | 27574 | 2016-04-01 05:16:49 |
26 | 2016-04-03 22:46:28 | Volkswagen_Polo_Fox | privat | Angebot | $777 | control | kleinwagen | 1992 | manuell | 54 | polo | 125,000km | 2 | benzin | volkswagen | nein | 2016-04-03 00:00:00 | 0 | 38110 | 2016-04-05 23:46:48 |
27 | 2016-03-27 18:45:01 | Hat_einer_Ahnung_mit_Ford_Galaxy_HILFE | privat | Angebot | $0 | control | NaN | 2005 | NaN | 0 | NaN | 150,000km | 0 | NaN | ford | NaN | 2016-03-27 00:00:00 | 0 | 66701 | 2016-03-27 18:45:01 |
28 | 2016-03-19 21:56:19 | MINI_Cooper_D | privat | Angebot | $5,250 | control | kleinwagen | 2007 | manuell | 110 | cooper | 150,000km | 7 | diesel | mini | ja | 2016-03-19 00:00:00 | 0 | 15745 | 2016-04-07 14:58:48 |
29 | 2016-04-02 12:45:44 | Mercedes_Benz_E_320_T_CDI_Avantgarde_DPF7_Sitz... | privat | Angebot | $4,999 | test | kombi | 2004 | automatik | 204 | e_klasse | 150,000km | 10 | diesel | mercedes_benz | nein | 2016-04-02 00:00:00 | 0 | 47638 | 2016-04-02 12:45:44 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
49970 | 2016-03-21 22:47:37 | c4_Grand_Picasso_mit_Automatik_Leder_Navi_Temp... | privat | Angebot | $15,800 | control | bus | 2010 | automatik | 136 | c4 | 60,000km | 4 | diesel | citroen | nein | 2016-03-21 00:00:00 | 0 | 14947 | 2016-04-07 04:17:34 |
49971 | 2016-03-29 14:54:12 | W.Lupo_1.0 | privat | Angebot | $950 | test | kleinwagen | 2001 | manuell | 50 | lupo | 150,000km | 4 | benzin | volkswagen | nein | 2016-03-29 00:00:00 | 0 | 65197 | 2016-03-29 20:41:51 |
49972 | 2016-03-26 22:25:23 | Mercedes_Benz_Vito_115_CDI_Extralang_Aut. | privat | Angebot | $3,300 | control | bus | 2004 | automatik | 150 | vito | 150,000km | 10 | diesel | mercedes_benz | ja | 2016-03-26 00:00:00 | 0 | 65326 | 2016-03-28 11:28:18 |
49973 | 2016-03-27 05:32:39 | Mercedes_Benz_SLK_200_Kompressor | privat | Angebot | $6,000 | control | cabrio | 2004 | manuell | 163 | slk | 150,000km | 11 | benzin | mercedes_benz | nein | 2016-03-27 00:00:00 | 0 | 53567 | 2016-03-27 08:25:24 |
49974 | 2016-03-20 10:52:31 | Golf_1_Cabrio_Tuev_Neu_viele_Extras_alles_eing... | privat | Angebot | $0 | control | cabrio | 1983 | manuell | 70 | golf | 150,000km | 2 | benzin | volkswagen | nein | 2016-03-20 00:00:00 | 0 | 8209 | 2016-03-27 19:48:16 |
49975 | 2016-03-27 20:51:39 | Honda_Jazz_1.3_DSi_i_VTEC_IMA_CVT_Comfort | privat | Angebot | $9,700 | control | kleinwagen | 2012 | automatik | 88 | jazz | 100,000km | 11 | hybrid | honda | nein | 2016-03-27 00:00:00 | 0 | 84385 | 2016-04-05 19:45:34 |
49976 | 2016-03-19 18:56:05 | Audi_80_Avant_2.6_E__Vollausstattung!!_Einziga... | privat | Angebot | $5,900 | test | kombi | 1992 | automatik | 150 | 80 | 150,000km | 12 | benzin | audi | nein | 2016-03-19 00:00:00 | 0 | 36100 | 2016-04-07 06:16:44 |
49977 | 2016-03-31 18:37:18 | Mercedes_Benz_C200_Cdi_W203 | privat | Angebot | $5,500 | control | limousine | 2003 | manuell | 116 | c_klasse | 150,000km | 2 | diesel | mercedes_benz | nein | 2016-03-31 00:00:00 | 0 | 33739 | 2016-04-06 12:16:11 |
49978 | 2016-04-04 10:37:14 | Mercedes_Benz_E_200_Classic | privat | Angebot | $900 | control | limousine | 1996 | automatik | 136 | e_klasse | 150,000km | 9 | benzin | mercedes_benz | ja | 2016-04-04 00:00:00 | 0 | 24405 | 2016-04-06 12:44:20 |
49979 | 2016-03-20 18:38:40 | Volkswagen_Polo_1.6_TDI_Style | privat | Angebot | $11,000 | test | kleinwagen | 2011 | manuell | 90 | polo | 70,000km | 11 | diesel | volkswagen | nein | 2016-03-20 00:00:00 | 0 | 48455 | 2016-04-07 01:45:12 |
49980 | 2016-03-12 10:55:54 | Ford_Escort_Turnier_16V | privat | Angebot | $400 | control | kombi | 1995 | manuell | 105 | escort | 125,000km | 3 | benzin | ford | NaN | 2016-03-12 00:00:00 | 0 | 56218 | 2016-04-06 17:16:49 |
49981 | 2016-03-15 09:38:21 | Opel_Astra_Kombi_mit_Anhaengerkupplung | privat | Angebot | $2,000 | control | kombi | 1998 | manuell | 115 | astra | 150,000km | 12 | benzin | opel | nein | 2016-03-15 00:00:00 | 0 | 86859 | 2016-04-05 17:21:46 |
49982 | 2016-03-29 18:51:08 | Skoda_Fabia_4_Tuerer_Bj:2004__85.000Tkm | privat | Angebot | $1,950 | control | kleinwagen | 2004 | manuell | 0 | fabia | 90,000km | 7 | benzin | skoda | NaN | 2016-03-29 00:00:00 | 0 | 45884 | 2016-03-29 18:51:08 |
49983 | 2016-03-06 12:43:04 | Ford_focus_99 | privat | Angebot | $600 | test | kleinwagen | 1999 | manuell | 101 | focus | 150,000km | 4 | benzin | ford | NaN | 2016-03-06 00:00:00 | 0 | 52477 | 2016-03-09 06:16:08 |
49984 | 2016-03-31 22:48:48 | Student_sucht_ein__Anfaengerauto___ab_2000_BJ_... | privat | Angebot | $0 | test | NaN | 2000 | NaN | 0 | NaN | 150,000km | 0 | NaN | sonstige_autos | NaN | 2016-03-31 00:00:00 | 0 | 12103 | 2016-04-02 19:44:53 |
49985 | 2016-04-02 16:38:23 | Verkaufe_meinen_vw_vento! | privat | Angebot | $1,000 | control | NaN | 1995 | automatik | 0 | NaN | 150,000km | 0 | benzin | volkswagen | NaN | 2016-04-02 00:00:00 | 0 | 30900 | 2016-04-06 15:17:52 |
49986 | 2016-04-04 20:46:02 | Chrysler_300C_3.0_CRD_DPF_Automatik_Voll_Ausst... | privat | Angebot | $15,900 | control | limousine | 2010 | automatik | 218 | 300c | 125,000km | 11 | diesel | chrysler | nein | 2016-04-04 00:00:00 | 0 | 73527 | 2016-04-06 23:16:00 |
49987 | 2016-03-22 20:47:27 | Audi_A3_Limousine_2.0_TDI_DPF_Ambition__NAVI__... | privat | Angebot | $21,990 | control | limousine | 2013 | manuell | 150 | a3 | 50,000km | 11 | diesel | audi | nein | 2016-03-22 00:00:00 | 0 | 94362 | 2016-03-26 22:46:06 |
49988 | 2016-03-28 19:49:51 | BMW_330_Ci | privat | Angebot | $9,550 | control | coupe | 2001 | manuell | 231 | 3er | 150,000km | 10 | benzin | bmw | nein | 2016-03-28 00:00:00 | 0 | 83646 | 2016-04-07 02:17:40 |
49989 | 2016-03-11 19:50:37 | VW_Polo_zum_Ausschlachten_oder_Wiederaufbau | privat | Angebot | $150 | test | kleinwagen | 1997 | manuell | 0 | polo | 150,000km | 5 | benzin | volkswagen | ja | 2016-03-11 00:00:00 | 0 | 21244 | 2016-03-12 10:17:55 |
49990 | 2016-03-21 19:54:19 | Mercedes_Benz_A_200__BlueEFFICIENCY__Urban | privat | Angebot | $17,500 | test | limousine | 2012 | manuell | 156 | a_klasse | 30,000km | 12 | benzin | mercedes_benz | nein | 2016-03-21 00:00:00 | 0 | 58239 | 2016-04-06 22:46:57 |
49991 | 2016-03-06 15:25:19 | Kleinwagen | privat | Angebot | $500 | control | NaN | 2016 | manuell | 0 | twingo | 150,000km | 0 | benzin | renault | NaN | 2016-03-06 00:00:00 | 0 | 61350 | 2016-03-06 18:24:19 |
49992 | 2016-03-10 19:37:38 | Fiat_Grande_Punto_1.4_T_Jet_16V_Sport | privat | Angebot | $4,800 | control | kleinwagen | 2009 | manuell | 120 | andere | 125,000km | 9 | lpg | fiat | nein | 2016-03-10 00:00:00 | 0 | 68642 | 2016-03-13 01:44:51 |
49993 | 2016-03-15 18:47:35 | Audi_A3__1_8l__Silber;_schoenes_Fahrzeug | privat | Angebot | $1,650 | control | kleinwagen | 1997 | manuell | 0 | NaN | 150,000km | 7 | benzin | audi | NaN | 2016-03-15 00:00:00 | 0 | 65203 | 2016-04-06 19:46:53 |
49994 | 2016-03-22 17:36:42 | Audi_A6__S6__Avant_4.2_quattro_eventuell_Tausc... | privat | Angebot | $5,000 | control | kombi | 2001 | automatik | 299 | a6 | 150,000km | 1 | benzin | audi | nein | 2016-03-22 00:00:00 | 0 | 46537 | 2016-04-06 08:16:39 |
49995 | 2016-03-27 14:38:19 | Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon | privat | Angebot | $24,900 | control | limousine | 2011 | automatik | 239 | q5 | 100,000km | 1 | diesel | audi | nein | 2016-03-27 00:00:00 | 0 | 82131 | 2016-04-01 13:47:40 |
49996 | 2016-03-28 10:50:25 | Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... | privat | Angebot | $1,980 | control | cabrio | 1996 | manuell | 75 | astra | 150,000km | 5 | benzin | opel | nein | 2016-03-28 00:00:00 | 0 | 44807 | 2016-04-02 14:18:02 |
49997 | 2016-04-02 14:44:48 | Fiat_500_C_1.2_Dualogic_Lounge | privat | Angebot | $13,200 | test | cabrio | 2014 | automatik | 69 | 500 | 5,000km | 11 | benzin | fiat | nein | 2016-04-02 00:00:00 | 0 | 73430 | 2016-04-04 11:47:27 |
49998 | 2016-03-08 19:25:42 | Audi_A3_2.0_TDI_Sportback_Ambition | privat | Angebot | $22,900 | control | kombi | 2013 | manuell | 150 | a3 | 40,000km | 11 | diesel | audi | nein | 2016-03-08 00:00:00 | 0 | 35683 | 2016-04-05 16:45:07 |
49999 | 2016-03-14 00:42:12 | Opel_Vectra_1.6_16V | privat | Angebot | $1,250 | control | limousine | 1996 | manuell | 101 | vectra | 150,000km | 1 | benzin | opel | nein | 2016-03-13 00:00:00 | 0 | 45897 | 2016-04-06 21:18:48 |
50000 rows × 20 columns
#Displaying the description about the autos dataframe"
autos.info()
#displaying the first five rows in autos
autos.head(2)
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): dateCrawled 50000 non-null object name 50000 non-null object seller 50000 non-null object offerType 50000 non-null object price 50000 non-null object abtest 50000 non-null object vehicleType 44905 non-null object yearOfRegistration 50000 non-null int64 gearbox 47320 non-null object powerPS 50000 non-null int64 model 47242 non-null object odometer 50000 non-null object monthOfRegistration 50000 non-null int64 fuelType 45518 non-null object brand 50000 non-null object notRepairedDamage 40171 non-null object dateCreated 50000 non-null object nrOfPictures 50000 non-null int64 postalCode 50000 non-null int64 lastSeen 50000 non-null object dtypes: int64(5), object(15) memory usage: 7.6+ MB
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
At a glance, the autos
dataframe contains listings related to several famous brand in the automobile industry including Ford, BMW, and Volkswagen.
Each Column name has been writed in camelcase, a practical writing style in which each word or abbreviation in the middle of the phrase begins with a capital letter. Instead, we will convert them in Python's conventional snakecase
. This refers to the style of writing in which each space is replaced by an underscore(_)
character, and the first letter of each word written in lowercace.
As a result, our column names will be more quickly recongize than in camelcase during our analysis.
print( autos.columns ) #an array containing our column names
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model', 'odometer', 'monthOfRegistration', 'fuelType', 'brand', 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object')
#Converting from camelcase to Python snakecase
def clean_col( c ) :
c = c.replace( "yearOfRegistration", "registration_year")
c = c.replace( "monthOfRegistration", "registration_month")
c = c.replace( "notRepairedDamage", "unrepaired_damage")
c = c.replace( "dateCreated", "ad_created")
c = c.replace( "datecrawled", "date_crawled")
c = c.replace( "lastseen", "last_seen")
#Convert to lowercase
c = c.lower()
return c
clean_columns = []
#Iterating over columns :
for c in autos.columns :
c = clean_col( c )
clean_columns.append( c )
#Assigning the clean column names back to autos.columns attribute
autos.columns = clean_columns
autos.head(2)
date_crawled | name | seller | offertype | price | abtest | vehicletype | registration_year | gearbox | powerps | model | odometer | registration_month | fueltype | brand | unrepaired_damage | ad_created | nrofpictures | postalcode | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
We start our exploration by look at descriptive statistics for all columns. Perhaps, there are some columns whose values could be dropped as they are not helpful for our analysis.
autos.describe( include = "all")
date_crawled | name | seller | offertype | price | abtest | vehicletype | registration_year | gearbox | powerps | model | odometer | registration_month | fueltype | brand | unrepaired_damage | ad_created | nrofpictures | postalcode | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-03-27 22:55:05 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
Based on our first analysis, nrofpictures
found not interesting for our analysis as all of its value are null. Therefore it will be dropped. Futhermore, there are several columns which need more investigation because their total numbers of entries is not equal to 50000; the total listings saved in the dataframe. That includes, vehicletype
, gearbox
, model
, fueltype
, unrepaired_damage
columns. Finally, odometer
and prices
columns aside must be cleaned as there have been stored as numeric values, but contain text($ and km) around their values.
Let us cleaning price
and odometer
columns.
#Removing any non-numeric character
autos["price"] = autos["price"].str.replace( "$","")
autos["price"] = autos["price"].str.replace(",","")
autos["odometer"] = autos["odometer"].str.replace("km", "")
autos["odometer"] = autos["odometer"].str.replace(",", "")
#Converting to numeric dtype
autos["price"] = autos["price"].astype( float)
autos["odometer"] = autos["odometer"].astype( float)
#rename odometer column
autos.rename( { "odometer" : "odometer_km"}, axis = 1, inplace = True)
autos.describe()
price | registration_year | powerps | odometer_km | registration_month | nrofpictures | postalcode | |
---|---|---|---|---|---|---|---|
count | 5.000000e+04 | 50000.000000 | 50000.000000 | 50000.000000 | 50000.000000 | 50000.0 | 50000.000000 |
mean | 9.840044e+03 | 2005.073280 | 116.355920 | 125732.700000 | 5.723360 | 0.0 | 50813.627300 |
std | 4.811044e+05 | 105.712813 | 209.216627 | 40042.211706 | 3.711984 | 0.0 | 25779.747957 |
min | 0.000000e+00 | 1000.000000 | 0.000000 | 5000.000000 | 0.000000 | 0.0 | 1067.000000 |
25% | 1.100000e+03 | 1999.000000 | 70.000000 | 125000.000000 | 3.000000 | 0.0 | 30451.000000 |
50% | 2.950000e+03 | 2003.000000 | 105.000000 | 150000.000000 | 6.000000 | 0.0 | 49577.000000 |
75% | 7.200000e+03 | 2008.000000 | 150.000000 | 150000.000000 | 9.000000 | 0.0 | 71540.000000 |
max | 1.000000e+08 | 9999.000000 | 17700.000000 | 150000.000000 | 12.000000 | 0.0 | 99998.000000 |
We continue examing the data by looking for outliers in price
and odometer
columns. An Outlier is a data point significantly differs from others data points in a data set. As long as we have found them in both columns, we might decide to remove or keep outliers.
Finding outliers out in price
and Odometer
is a process describe as follow:
1.Arrange all the data points in a column from lowest to highest values.
2. Compute the median of the values in column.
3. Calculate lower quartile( LQ) and the upper quartile( UQ )
4. Compute the interquartile range.
5. Compute the inner fences and outer fences.
6. Determine whether there are minor or major outliers in the data set.
print("Column name : Price")
nb= autos["price"].unique().shape #number of unique values
print("Number of unique values in price column:", nb[0])
autos["price"].value_counts().sort_index()
Column name : Price Number of unique values in price column: 2357
0.0 1421 1.0 156 2.0 3 3.0 1 5.0 2 8.0 1 9.0 1 10.0 7 11.0 2 12.0 3 13.0 2 14.0 1 15.0 2 17.0 3 18.0 1 20.0 4 25.0 5 29.0 1 30.0 7 35.0 1 40.0 6 45.0 4 47.0 1 49.0 4 50.0 49 55.0 2 59.0 1 60.0 9 65.0 5 66.0 1 ... 151990.0 1 155000.0 1 163500.0 1 163991.0 1 169000.0 1 169999.0 1 175000.0 1 180000.0 1 190000.0 1 194000.0 1 197000.0 1 198000.0 1 220000.0 1 250000.0 1 259000.0 1 265000.0 1 295000.0 1 299000.0 1 345000.0 1 350000.0 1 999990.0 1 999999.0 2 1234566.0 1 1300000.0 1 3890000.0 1 10000000.0 1 11111111.0 2 12345678.0 3 27322222.0 1 99999999.0 1 Name: price, Length: 2357, dtype: int64
As we can view above, the minimun and maximum values in price
column are respectively 10 and 50,000. Let us move on to the next step, and compute the median.
Fortunately, the median in price
column is already available by executing Series.describe()
function.
#Statistics
print("Column name : Price")
print( autos["price"].describe() )
Column name : Price count 5.000000e+04 mean 9.840044e+03 std 4.811044e+05 min 0.000000e+00 25% 1.100000e+03 50% 2.950000e+03 75% 7.200000e+03 max 1.000000e+08 Name: price, dtype: float64
The median is the value standing in the middle(50%
) of a series. Therefore, our median is 2950
.
The lower quartile (LQ)
is the value at the 25th percentile or between the minumun and the median of a series. In our Series.describe()
, LQ refers to 25%.
The upper quartile(UQ)
is the value at the 75th percentile or between the median and the maximum of a series. In Series.describes()
, 75% is the upper quartile.
print("Column name : Price")
print("Lower quartile : ", autos["price"].describe()["25%"] )
print("Upper quartile : ", autos["price"].describe()["75%"] )
Column name : Price Lower quartile : 1100.0 Upper quartile : 7200.0
The interquartile range is the distance between the lower and the upper quartiles. To compute it, we just need to substract LQ from UQ.
price_lq = autos["price"].describe()["25%"]
price_uq = autos["price"].describe()["75%"]
price_int_q = price_uq - price_lq
print("Column name : Price")
print("Lower quartile : ",price_lq )
print("Upper quartile : ",price_uq )
print("Interquartile range : ",price_int_q )
Column name : Price Lower quartile : 1100.0 Upper quartile : 7200.0 Interquartile range : 6100.0
Outliers are identified by assssing whether or not they fall within a set of numerical boundaries called inner fences
and outer fences
.
To find out the inner fences for our data set, firstly, we multiply the interquartile range
by 1.5. Then, we add the result to UQ
and substract it from LQ
. The two resulting values are boundaries of our data set's inner fences.
To find out the outer fences for out data set, we multiply the interquartile range
by 3. Then, we add the result to UQ
and substract it from LQ
.
#Inner fences
inner_fence = []
product = price_int_q * 1.5
first_boundary = price_lq - product
second_boundary = price_uq + product
inner_fence.append( first_boundary )
inner_fence.append( second_boundary )
#Outer fences
outer_fence = []
product = price_int_q * 3
first_boundary = price_lq - product
second_boundary = price_uq + product
outer_fence.append( first_boundary )
outer_fence.append( second_boundary )
print("Column name : Price")
print("Lower quartile : ", price_lq)
print("Upper quartile: ", price_uq)
print("Interquartile range : ",price_int_q )
print("Inner fences : ",inner_fence[0]," and ", inner_fence[1] )
print("Outer fences : ",outer_fence[0]," and ", outer_fence[1] )
Column name : Price Lower quartile : 1100.0 Upper quartile: 7200.0 Interquartile range : 6100.0 Inner fences : -8050.0 and 16350.0 Outer fences : -17200.0 and 25500.0
A minor outlier is a data point which falls outside the data set's inner fences, whereas major outlier falls outside the outer fences.
In our analysis, we will throw out major outliers
as they may due to human errors like error typing or skewing.
#Removing major outliers out of [0, 25500]
autos["price"] = autos.loc[ autos["price"].between(0, 25500), "price" ]
autos["price"].describe()
count 48653.000000 mean 4767.383800 std 5163.186786 min 0.000000 25% 1100.000000 50% 2800.000000 75% 6750.000000 max 25500.000000 Name: price, dtype: float64
The same process is applying for odemeter
column. Just like price
column, we only keep vehicles with odometer between 0 and 50,000 km.
odometer_lq = autos["odometer_km"].describe()["25%"] #Lower quartile
odometer_uq = autos["odometer_km"].describe()["75%"] # Upper quartile
odometer_int_q = odometer_uq - odometer_lq #interquartile
#Inner fences
inner_fence = []
product = odometer_int_q * 1.5
first_boundary = odometer_lq - product
second_boundary = odometer_uq + product
inner_fence.append( first_boundary )
inner_fence.append( second_boundary )
#Outer fences
outer_fence = []
product = odometer_int_q * 3
first_boundary = odometer_lq - product
second_boundary = odometer_uq + product
outer_fence.append( first_boundary )
outer_fence.append( second_boundary )
print("Column name : Odometer")
print("Lower quartile : ",odometer_lq )
print("Upper quartile : ",odometer_uq )
print("Interquartile range : ",odometer_int_q )
print("Inner fences : ",inner_fence[0]," and ", inner_fence[1] )
print("Outer fences : ",outer_fence[0]," and ", outer_fence[1] )
#Removing major outliers out of outer fences
autos["odometer_km"] = autos.loc[ autos["odometer_km"].between(50000, 225000), "odometer_km" ]
Column name : Odometer Lower quartile : 125000.0 Upper quartile : 150000.0 Interquartile range : 25000.0 Inner fences : 87500.0 and 187500.0 Outer fences : 50000.0 and 225000.0
As we have removed outliers in odometer column, let us display its statitic description.
autos["odometer_km"].describe()
count 46377.000000 mean 133838.971904 std 28425.815462 min 50000.000000 25% 125000.000000 50% 150000.000000 75% 150000.000000 max 150000.000000 Name: odometer_km, dtype: float64
Let's now move on to the date columns(date_crawled
, last_seen
, ad_created
) and understand the date range the data covers.
#date_crawled column
date_crawled = autos["date_crawled"].str[:10] # first 10 characters
#Calculating the distribution
distribution = date_crawled.value_counts( normalize = True, dropna= False)
print("Column Name: Date Crawled ")
print( distribution.sort_index() )
Column Name: Date Crawled 2016-03-05 0.02538 2016-03-06 0.01394 2016-03-07 0.03596 2016-03-08 0.03330 2016-03-09 0.03322 2016-03-10 0.03212 2016-03-11 0.03248 2016-03-12 0.03678 2016-03-13 0.01556 2016-03-14 0.03662 2016-03-15 0.03398 2016-03-16 0.02950 2016-03-17 0.03152 2016-03-18 0.01306 2016-03-19 0.03490 2016-03-20 0.03782 2016-03-21 0.03752 2016-03-22 0.03294 2016-03-23 0.03238 2016-03-24 0.02910 2016-03-25 0.03174 2016-03-26 0.03248 2016-03-27 0.03104 2016-03-28 0.03484 2016-03-29 0.03418 2016-03-30 0.03362 2016-03-31 0.03192 2016-04-01 0.03380 2016-04-02 0.03540 2016-04-03 0.03868 2016-04-04 0.03652 2016-04-05 0.01310 2016-04-06 0.00318 2016-04-07 0.00142 Name: date_crawled, dtype: float64
At a glance, most ad were added by the crawled from March to April in 2016. Futhermore, the date with the highest percentage( 3.86%)
is April 03/2016 and the one with the lowest percentage(1%) is April 07/2016.
#ad_created column
ad_created = autos["ad_created"].str[:10] # first 10 characters
#Calculating the distribution
distribution = ad_created.value_counts( normalize = True, dropna= False)
print("Column Name: ad created ")
print( distribution.sort_index() )
Column Name: ad created 2015-06-11 0.00002 2015-08-10 0.00002 2015-09-09 0.00002 2015-11-10 0.00002 2015-12-05 0.00002 2015-12-30 0.00002 2016-01-03 0.00002 2016-01-07 0.00002 2016-01-10 0.00004 2016-01-13 0.00002 2016-01-14 0.00002 2016-01-16 0.00002 2016-01-22 0.00002 2016-01-27 0.00006 2016-01-29 0.00002 2016-02-01 0.00002 2016-02-02 0.00004 2016-02-05 0.00004 2016-02-07 0.00002 2016-02-08 0.00002 2016-02-09 0.00004 2016-02-11 0.00002 2016-02-12 0.00006 2016-02-14 0.00004 2016-02-16 0.00002 2016-02-17 0.00002 2016-02-18 0.00004 2016-02-19 0.00006 2016-02-20 0.00004 2016-02-21 0.00006 ... 2016-03-09 0.03324 2016-03-10 0.03186 2016-03-11 0.03278 2016-03-12 0.03662 2016-03-13 0.01692 2016-03-14 0.03522 2016-03-15 0.03374 2016-03-16 0.03000 2016-03-17 0.03120 2016-03-18 0.01372 2016-03-19 0.03384 2016-03-20 0.03786 2016-03-21 0.03772 2016-03-22 0.03280 2016-03-23 0.03218 2016-03-24 0.02908 2016-03-25 0.03188 2016-03-26 0.03256 2016-03-27 0.03090 2016-03-28 0.03496 2016-03-29 0.03414 2016-03-30 0.03344 2016-03-31 0.03192 2016-04-01 0.03380 2016-04-02 0.03508 2016-04-03 0.03892 2016-04-04 0.03688 2016-04-05 0.01184 2016-04-06 0.00326 2016-04-07 0.00128 Name: ad_created, Length: 76, dtype: float64
In ad_created
column, adverts were created from the website between 06/2015
and 04/2016
. In addition, a few ads came from the website itself as the distribution from 06/2015 and 02/2016 is almost equal to 0%.
#last_seen column
last_seen = autos["last_seen"].str[:10] # first 10 characters
#Calculating the distribution
distribution = last_seen.value_counts( normalize = True, dropna= False)
print("Column Name: Last seen ")
print( distribution.sort_index() )
Column Name: Last seen 2016-03-05 0.00108 2016-03-06 0.00442 2016-03-07 0.00536 2016-03-08 0.00760 2016-03-09 0.00986 2016-03-10 0.01076 2016-03-11 0.01252 2016-03-12 0.02382 2016-03-13 0.00898 2016-03-14 0.01280 2016-03-15 0.01588 2016-03-16 0.01644 2016-03-17 0.02792 2016-03-18 0.00742 2016-03-19 0.01574 2016-03-20 0.02070 2016-03-21 0.02074 2016-03-22 0.02158 2016-03-23 0.01858 2016-03-24 0.01956 2016-03-25 0.01920 2016-03-26 0.01696 2016-03-27 0.01602 2016-03-28 0.02086 2016-03-29 0.02234 2016-03-30 0.02484 2016-03-31 0.02384 2016-04-01 0.02310 2016-04-02 0.02490 2016-04-03 0.02536 2016-04-04 0.02462 2016-04-05 0.12428 2016-04-06 0.22100 2016-04-07 0.13092 Name: last_seen, dtype: float64
According to the last_seen
column's distribution, the majority of ads have been seen in 03/2016
.
Now, let us have a look at the registration_year
to understand its distribution.
autos["registration_year"].describe()
count 50000.000000 mean 2005.073280 std 105.712813 min 1000.000000 25% 1999.000000 50% 2003.000000 75% 2008.000000 max 9999.000000 Name: registration_year, dtype: float64
The distribution above stated that the average registration year is in 2005. However, some ads have been registered in 1000 and in 9999; that is not accurate as the motor-cars were not yet invented in 1000 and the year 9999 means the future time. Therefore, there are definitely human errors in registration_year column.
As registration year
column contains odd values, we have to decide which interval with acceptable values to keep. According to Wikipedia, the first automobile were built in the 1800s. As a result we will remove rows with registration year before 1900.
Moreover, because a car can't be first registered after the listing was seen, any vehicle with a registration year above 2016 is definitely incaccurate, so the will be remove too.
#Number of cars outside 1900-2016
(~autos['registration_year'].between(1900,2016)).sum()
1972
We can observed than 1972 cars have not been registered from 1900 to 2016.
#Removing those values in the column
autos['registration_year'] = autos.loc[ autos["registration_year"].between(1900, 2016), "registration_year"]
autos["registration_year"].value_counts(normalize = True).sort_index()
1910.0 0.000187 1927.0 0.000021 1929.0 0.000021 1931.0 0.000021 1934.0 0.000042 1937.0 0.000083 1938.0 0.000021 1939.0 0.000021 1941.0 0.000042 1943.0 0.000021 1948.0 0.000021 1950.0 0.000062 1951.0 0.000042 1952.0 0.000021 1953.0 0.000021 1954.0 0.000042 1955.0 0.000042 1956.0 0.000104 1957.0 0.000042 1958.0 0.000083 1959.0 0.000146 1960.0 0.000708 1961.0 0.000125 1962.0 0.000083 1963.0 0.000187 1964.0 0.000250 1965.0 0.000354 1966.0 0.000458 1967.0 0.000562 1968.0 0.000541 ... 1987.0 0.001562 1988.0 0.002957 1989.0 0.003769 1990.0 0.008224 1991.0 0.007412 1992.0 0.008141 1993.0 0.009265 1994.0 0.013742 1995.0 0.027338 1996.0 0.030066 1997.0 0.042225 1998.0 0.051074 1999.0 0.062464 2000.0 0.069834 2001.0 0.056280 2002.0 0.052740 2003.0 0.056779 2004.0 0.056988 2005.0 0.062776 2006.0 0.056384 2007.0 0.047972 2008.0 0.046452 2009.0 0.043683 2010.0 0.033251 2011.0 0.034022 2012.0 0.027546 2013.0 0.016782 2014.0 0.013867 2015.0 0.008308 2016.0 0.027401 Name: registration_year, Length: 78, dtype: float64
Let us explore the brand
column and understand the price by brand.
#Fetching unique values
brands = autos["brand"].value_counts()
print( brands)
volkswagen 10687 opel 5461 bmw 5429 mercedes_benz 4734 audi 4283 ford 3479 renault 2404 peugeot 1456 fiat 1308 seat 941 skoda 786 mazda 757 nissan 754 smart 701 citroen 701 toyota 617 sonstige_autos 546 hyundai 488 volvo 457 mini 424 mitsubishi 406 honda 399 kia 356 alfa_romeo 329 porsche 294 suzuki 293 chevrolet 283 chrysler 181 dacia 129 daihatsu 128 jeep 110 subaru 109 land_rover 99 saab 80 daewoo 79 trabant 78 jaguar 77 rover 69 lancia 57 lada 31 Name: brand, dtype: int64
We have determined to aggregrate on brands with more than 400 cars available in our data set as they are the most popular in our data set.
#top brands with more than 400 cars
top_brands = brands.index[0:20]
#empty dictionary
brand_mean_price = {}
max_price = 0
#Looping over unique_brands
for b in top_brands :
selected_brands = autos[ autos["brand"]== b ]
brand_mean_price[ b ] = selected_brands["price"].mean()
if( brand_mean_price[ b ] > max_price ) :
max_price = brand_mean_price[ b ]
brand = b
print( "Our Top brands Dictionary :" )
print("Brand : ", b, "| Highest mean price : ", max_price )
print( brand_mean_price )
Our Top brands Dictionary : Brand : mini | Highest mean price : 10081.237410071943 {'mini': 10081.237410071943, 'mercedes_benz': 6613.993269015033, 'fiat': 2697.6771231828616, 'citroen': 3644.7210300429183, 'toyota': 4913.425774877651, 'smart': 3482.971469329529, 'sonstige_autos': 6149.29898989899, 'bmw': 6744.7171736619175, 'nissan': 4418.849132176235, 'volkswagen': 4721.5591817316845, 'hyundai': 5266.0718685831625, 'peugeot': 3010.8688186813188, 'ford': 3336.5371014492753, 'volvo': 4479.25, 'renault': 2313.578859758635, 'audi': 7033.351615326822, 'mazda': 3641.474598930481, 'opel': 2812.0144874381076, 'skoda': 6030.061855670103, 'seat': 4141.988272921109}
Based on the information above, the mini
has the highest mean price of $10081 in our dictionary.
For the top 6 brands( Audi
,BMW
,Mercedes Benz
,Ford
,Opel
and Volkswagen
), Let's use aggragation to understand the average mileage for those cars and if there's any visible link with mean price.
#Calculating mileage and mean price
brand_mean_price = {}
brand_mean_odometer = {}
for b in top_brands :
selected_brands = autos[ autos["brand"]== b ]
brand_mean_price[ b ] = selected_brands["price"].mean()
brand_mean_odometer[ b ] = selected_brands["odometer_km"].mean()
#Converting dictionaries to series
bmp_series = pd.Series( brand_mean_price )
bmo_series = pd.Series( brand_mean_odometer )
#Converting series to dataframe
df = pd.DataFrame( bmp_series, columns = ["mean_price"])
df["mean_odometer"] = bmo_series
print( df.sort_values(by ="mean_odometer", ascending = False) )
mean_price mean_odometer volvo 4479.250000 141853.932584 bmw 6744.717174 138323.330106 audi 7033.351615 137987.405542 mercedes_benz 6613.993269 137149.720670 volkswagen 4721.559182 136581.136614 opel 2812.014487 136170.750681 renault 2313.578860 134944.763588 mazda 3641.474599 132744.680851 peugeot 3010.868819 132262.589928 ford 3336.537101 131840.234858 seat 4141.988273 130305.651672 fiat 2697.677123 127357.930450 citroen 3644.721030 127184.615385 nissan 4418.849132 127161.383285 toyota 4913.425775 124671.403197 skoda 6030.061856 123620.437956 sonstige_autos 6149.298990 120737.265416 hyundai 5266.071869 118025.700935 smart 3482.971469 110642.276423 mini 10081.237410 103410.404624
Our data set above showed that, whereas mini
gets the highest mean price, it is also the brand with the lowest mileage among the top brands. Futhermore, the more expensive brands like BMW
and Mecredes Benz
have the same mileage of about 130000 km with the cheapest brands despite the gap of mean price among them.