Exploring Ebay Car Sales Data
The aim of this project is to clean the data and analyze the included used car listings on a classified section of the German ebay website.
The source of the dataset was obtained from kaggle. after being scraped and uploaded to its website.
The following are the data dictionary provided from the dataset:
dateCrawled - When this ad was first crawled. All field-values are taken from this date.
name - Name of the car.
seller - Whether the seller is private or a dealer.
offerType - The type of listing
price - The price on the ad to sell the car.
abtest - Whether the listing is included in an A/B test.
vehicleType - The vehicle Type.
yearOfRegistration - The year in which the car was first registered.
gearbox - The transmission type.
powerPS - The power of the car in PS.
model - The car model name.
kilometer - How many kilometers the car has driven.
monthOfRegistration - The month in which the car was first registered.
fuelType - What type of fuel the car uses.
brand - The brand of the car.
notRepairedDamage - If the car has a damage which is not yet repaired.
dateCreated - The date on which the eBay listing was created.
nrOfPictures - The number of pictures in the ad.
postalCode - The postal code for the location of the vehicle.
lastSeenOnline - When the crawler saw this ad last online.
import pandas as pd
import numpy as np
#Read `autos.csv` file into pandas
autos = pd.read_csv("autos.csv", encoding="Latin-1")
autos
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
5 | 2016-03-21 13:47:45 | Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto... | privat | Angebot | $7,900 | test | bus | 2006 | automatik | 150 | voyager | 150,000km | 4 | diesel | chrysler | NaN | 2016-03-21 00:00:00 | 0 | 22962 | 2016-04-06 09:45:21 |
6 | 2016-03-20 17:55:21 | VW_Golf_III_GT_Special_Electronic_Green_Metall... | privat | Angebot | $300 | test | limousine | 1995 | manuell | 90 | golf | 150,000km | 8 | benzin | volkswagen | NaN | 2016-03-20 00:00:00 | 0 | 31535 | 2016-03-23 02:48:59 |
7 | 2016-03-16 18:55:19 | Golf_IV_1.9_TDI_90PS | privat | Angebot | $1,990 | control | limousine | 1998 | manuell | 90 | golf | 150,000km | 12 | diesel | volkswagen | nein | 2016-03-16 00:00:00 | 0 | 53474 | 2016-04-07 03:17:32 |
8 | 2016-03-22 16:51:34 | Seat_Arosa | privat | Angebot | $250 | test | NaN | 2000 | manuell | 0 | arosa | 150,000km | 10 | NaN | seat | nein | 2016-03-22 00:00:00 | 0 | 7426 | 2016-03-26 18:18:10 |
9 | 2016-03-16 13:47:02 | Renault_Megane_Scenic_1.6e_RT_Klimaanlage | privat | Angebot | $590 | control | bus | 1997 | manuell | 90 | megane | 150,000km | 7 | benzin | renault | nein | 2016-03-16 00:00:00 | 0 | 15749 | 2016-04-06 10:46:35 |
10 | 2016-03-15 01:41:36 | VW_Golf_Tuning_in_siber/grau | privat | Angebot | $999 | test | NaN | 2017 | manuell | 90 | NaN | 150,000km | 4 | benzin | volkswagen | nein | 2016-03-14 00:00:00 | 0 | 86157 | 2016-04-07 03:16:21 |
11 | 2016-03-16 18:45:34 | Mercedes_A140_Motorschaden | privat | Angebot | $350 | control | NaN | 2000 | NaN | 0 | NaN | 150,000km | 0 | benzin | mercedes_benz | NaN | 2016-03-16 00:00:00 | 0 | 17498 | 2016-03-16 18:45:34 |
12 | 2016-03-31 19:48:22 | Smart_smart_fortwo_coupe_softouch_pure_MHD_Pan... | privat | Angebot | $5,299 | control | kleinwagen | 2010 | automatik | 71 | fortwo | 50,000km | 9 | benzin | smart | nein | 2016-03-31 00:00:00 | 0 | 34590 | 2016-04-06 14:17:52 |
13 | 2016-03-23 10:48:32 | Audi_A3_1.6_tuning | privat | Angebot | $1,350 | control | limousine | 1999 | manuell | 101 | a3 | 150,000km | 11 | benzin | audi | nein | 2016-03-23 00:00:00 | 0 | 12043 | 2016-04-01 14:17:13 |
14 | 2016-03-23 11:50:46 | Renault_Clio_3__Dynamique_1.2__16_V;_viele_Ver... | privat | Angebot | $3,999 | test | kleinwagen | 2007 | manuell | 75 | clio | 150,000km | 9 | benzin | renault | NaN | 2016-03-23 00:00:00 | 0 | 81737 | 2016-04-01 15:46:47 |
15 | 2016-04-01 12:06:20 | Corvette_C3_Coupe_T_Top_Crossfire_Injection | privat | Angebot | $18,900 | test | coupe | 1982 | automatik | 203 | NaN | 80,000km | 6 | benzin | sonstige_autos | nein | 2016-04-01 00:00:00 | 0 | 61276 | 2016-04-02 21:10:48 |
16 | 2016-03-16 14:59:02 | Opel_Vectra_B_Kombi | privat | Angebot | $350 | test | kombi | 1999 | manuell | 101 | vectra | 150,000km | 5 | benzin | opel | nein | 2016-03-16 00:00:00 | 0 | 57299 | 2016-03-18 05:29:37 |
17 | 2016-03-29 11:46:22 | Volkswagen_Scirocco_2_G60 | privat | Angebot | $5,500 | test | coupe | 1990 | manuell | 205 | scirocco | 150,000km | 6 | benzin | volkswagen | nein | 2016-03-29 00:00:00 | 0 | 74821 | 2016-04-05 20:46:26 |
18 | 2016-03-26 19:57:44 | Verkaufen_mein_bmw_e36_320_i_touring | privat | Angebot | $300 | control | bus | 1995 | manuell | 150 | 3er | 150,000km | 0 | benzin | bmw | NaN | 2016-03-26 00:00:00 | 0 | 54329 | 2016-04-02 12:16:41 |
19 | 2016-03-17 13:36:21 | mazda_tribute_2.0_mit_gas_und_tuev_neu_2018 | privat | Angebot | $4,150 | control | suv | 2004 | manuell | 124 | andere | 150,000km | 2 | lpg | mazda | nein | 2016-03-17 00:00:00 | 0 | 40878 | 2016-03-17 14:45:58 |
20 | 2016-03-05 19:57:31 | Audi_A4_Avant_1.9_TDI_*6_Gang*AHK*Klimatronik*... | privat | Angebot | $3,500 | test | kombi | 2003 | manuell | 131 | a4 | 150,000km | 5 | diesel | audi | NaN | 2016-03-05 00:00:00 | 0 | 53913 | 2016-03-07 05:46:46 |
21 | 2016-03-06 19:07:10 | Porsche_911_Carrera_4S_Cabrio | privat | Angebot | $41,500 | test | cabrio | 2004 | manuell | 320 | 911 | 150,000km | 4 | benzin | porsche | nein | 2016-03-06 00:00:00 | 0 | 65428 | 2016-04-05 23:46:19 |
22 | 2016-03-28 20:50:54 | MINI_Cooper_S_Cabrio | privat | Angebot | $25,450 | control | cabrio | 2015 | manuell | 184 | cooper | 10,000km | 1 | benzin | mini | nein | 2016-03-28 00:00:00 | 0 | 44789 | 2016-04-01 06:45:30 |
23 | 2016-03-10 19:55:34 | Peugeot_Boxer_2_2_HDi_120_Ps_9_Sitzer_inkl_Klima | privat | Angebot | $7,999 | control | bus | 2010 | manuell | 120 | NaN | 150,000km | 2 | diesel | peugeot | nein | 2016-03-10 00:00:00 | 0 | 30900 | 2016-03-17 08:45:17 |
24 | 2016-04-03 11:57:02 | BMW_535i_xDrive_Sport_Aut. | privat | Angebot | $48,500 | control | limousine | 2014 | automatik | 306 | 5er | 30,000km | 12 | benzin | bmw | nein | 2016-04-03 00:00:00 | 0 | 22547 | 2016-04-07 13:16:50 |
25 | 2016-03-21 21:56:18 | Ford_escort_kombi_an_bastler_mit_ghia_ausstattung | privat | Angebot | $90 | control | kombi | 1996 | manuell | 116 | NaN | 150,000km | 4 | benzin | ford | ja | 2016-03-21 00:00:00 | 0 | 27574 | 2016-04-01 05:16:49 |
26 | 2016-04-03 22:46:28 | Volkswagen_Polo_Fox | privat | Angebot | $777 | control | kleinwagen | 1992 | manuell | 54 | polo | 125,000km | 2 | benzin | volkswagen | nein | 2016-04-03 00:00:00 | 0 | 38110 | 2016-04-05 23:46:48 |
27 | 2016-03-27 18:45:01 | Hat_einer_Ahnung_mit_Ford_Galaxy_HILFE | privat | Angebot | $0 | control | NaN | 2005 | NaN | 0 | NaN | 150,000km | 0 | NaN | ford | NaN | 2016-03-27 00:00:00 | 0 | 66701 | 2016-03-27 18:45:01 |
28 | 2016-03-19 21:56:19 | MINI_Cooper_D | privat | Angebot | $5,250 | control | kleinwagen | 2007 | manuell | 110 | cooper | 150,000km | 7 | diesel | mini | ja | 2016-03-19 00:00:00 | 0 | 15745 | 2016-04-07 14:58:48 |
29 | 2016-04-02 12:45:44 | Mercedes_Benz_E_320_T_CDI_Avantgarde_DPF7_Sitz... | privat | Angebot | $4,999 | test | kombi | 2004 | automatik | 204 | e_klasse | 150,000km | 10 | diesel | mercedes_benz | nein | 2016-04-02 00:00:00 | 0 | 47638 | 2016-04-02 12:45:44 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
49970 | 2016-03-21 22:47:37 | c4_Grand_Picasso_mit_Automatik_Leder_Navi_Temp... | privat | Angebot | $15,800 | control | bus | 2010 | automatik | 136 | c4 | 60,000km | 4 | diesel | citroen | nein | 2016-03-21 00:00:00 | 0 | 14947 | 2016-04-07 04:17:34 |
49971 | 2016-03-29 14:54:12 | W.Lupo_1.0 | privat | Angebot | $950 | test | kleinwagen | 2001 | manuell | 50 | lupo | 150,000km | 4 | benzin | volkswagen | nein | 2016-03-29 00:00:00 | 0 | 65197 | 2016-03-29 20:41:51 |
49972 | 2016-03-26 22:25:23 | Mercedes_Benz_Vito_115_CDI_Extralang_Aut. | privat | Angebot | $3,300 | control | bus | 2004 | automatik | 150 | vito | 150,000km | 10 | diesel | mercedes_benz | ja | 2016-03-26 00:00:00 | 0 | 65326 | 2016-03-28 11:28:18 |
49973 | 2016-03-27 05:32:39 | Mercedes_Benz_SLK_200_Kompressor | privat | Angebot | $6,000 | control | cabrio | 2004 | manuell | 163 | slk | 150,000km | 11 | benzin | mercedes_benz | nein | 2016-03-27 00:00:00 | 0 | 53567 | 2016-03-27 08:25:24 |
49974 | 2016-03-20 10:52:31 | Golf_1_Cabrio_Tuev_Neu_viele_Extras_alles_eing... | privat | Angebot | $0 | control | cabrio | 1983 | manuell | 70 | golf | 150,000km | 2 | benzin | volkswagen | nein | 2016-03-20 00:00:00 | 0 | 8209 | 2016-03-27 19:48:16 |
49975 | 2016-03-27 20:51:39 | Honda_Jazz_1.3_DSi_i_VTEC_IMA_CVT_Comfort | privat | Angebot | $9,700 | control | kleinwagen | 2012 | automatik | 88 | jazz | 100,000km | 11 | hybrid | honda | nein | 2016-03-27 00:00:00 | 0 | 84385 | 2016-04-05 19:45:34 |
49976 | 2016-03-19 18:56:05 | Audi_80_Avant_2.6_E__Vollausstattung!!_Einziga... | privat | Angebot | $5,900 | test | kombi | 1992 | automatik | 150 | 80 | 150,000km | 12 | benzin | audi | nein | 2016-03-19 00:00:00 | 0 | 36100 | 2016-04-07 06:16:44 |
49977 | 2016-03-31 18:37:18 | Mercedes_Benz_C200_Cdi_W203 | privat | Angebot | $5,500 | control | limousine | 2003 | manuell | 116 | c_klasse | 150,000km | 2 | diesel | mercedes_benz | nein | 2016-03-31 00:00:00 | 0 | 33739 | 2016-04-06 12:16:11 |
49978 | 2016-04-04 10:37:14 | Mercedes_Benz_E_200_Classic | privat | Angebot | $900 | control | limousine | 1996 | automatik | 136 | e_klasse | 150,000km | 9 | benzin | mercedes_benz | ja | 2016-04-04 00:00:00 | 0 | 24405 | 2016-04-06 12:44:20 |
49979 | 2016-03-20 18:38:40 | Volkswagen_Polo_1.6_TDI_Style | privat | Angebot | $11,000 | test | kleinwagen | 2011 | manuell | 90 | polo | 70,000km | 11 | diesel | volkswagen | nein | 2016-03-20 00:00:00 | 0 | 48455 | 2016-04-07 01:45:12 |
49980 | 2016-03-12 10:55:54 | Ford_Escort_Turnier_16V | privat | Angebot | $400 | control | kombi | 1995 | manuell | 105 | escort | 125,000km | 3 | benzin | ford | NaN | 2016-03-12 00:00:00 | 0 | 56218 | 2016-04-06 17:16:49 |
49981 | 2016-03-15 09:38:21 | Opel_Astra_Kombi_mit_Anhaengerkupplung | privat | Angebot | $2,000 | control | kombi | 1998 | manuell | 115 | astra | 150,000km | 12 | benzin | opel | nein | 2016-03-15 00:00:00 | 0 | 86859 | 2016-04-05 17:21:46 |
49982 | 2016-03-29 18:51:08 | Skoda_Fabia_4_Tuerer_Bj:2004__85.000Tkm | privat | Angebot | $1,950 | control | kleinwagen | 2004 | manuell | 0 | fabia | 90,000km | 7 | benzin | skoda | NaN | 2016-03-29 00:00:00 | 0 | 45884 | 2016-03-29 18:51:08 |
49983 | 2016-03-06 12:43:04 | Ford_focus_99 | privat | Angebot | $600 | test | kleinwagen | 1999 | manuell | 101 | focus | 150,000km | 4 | benzin | ford | NaN | 2016-03-06 00:00:00 | 0 | 52477 | 2016-03-09 06:16:08 |
49984 | 2016-03-31 22:48:48 | Student_sucht_ein__Anfaengerauto___ab_2000_BJ_... | privat | Angebot | $0 | test | NaN | 2000 | NaN | 0 | NaN | 150,000km | 0 | NaN | sonstige_autos | NaN | 2016-03-31 00:00:00 | 0 | 12103 | 2016-04-02 19:44:53 |
49985 | 2016-04-02 16:38:23 | Verkaufe_meinen_vw_vento! | privat | Angebot | $1,000 | control | NaN | 1995 | automatik | 0 | NaN | 150,000km | 0 | benzin | volkswagen | NaN | 2016-04-02 00:00:00 | 0 | 30900 | 2016-04-06 15:17:52 |
49986 | 2016-04-04 20:46:02 | Chrysler_300C_3.0_CRD_DPF_Automatik_Voll_Ausst... | privat | Angebot | $15,900 | control | limousine | 2010 | automatik | 218 | 300c | 125,000km | 11 | diesel | chrysler | nein | 2016-04-04 00:00:00 | 0 | 73527 | 2016-04-06 23:16:00 |
49987 | 2016-03-22 20:47:27 | Audi_A3_Limousine_2.0_TDI_DPF_Ambition__NAVI__... | privat | Angebot | $21,990 | control | limousine | 2013 | manuell | 150 | a3 | 50,000km | 11 | diesel | audi | nein | 2016-03-22 00:00:00 | 0 | 94362 | 2016-03-26 22:46:06 |
49988 | 2016-03-28 19:49:51 | BMW_330_Ci | privat | Angebot | $9,550 | control | coupe | 2001 | manuell | 231 | 3er | 150,000km | 10 | benzin | bmw | nein | 2016-03-28 00:00:00 | 0 | 83646 | 2016-04-07 02:17:40 |
49989 | 2016-03-11 19:50:37 | VW_Polo_zum_Ausschlachten_oder_Wiederaufbau | privat | Angebot | $150 | test | kleinwagen | 1997 | manuell | 0 | polo | 150,000km | 5 | benzin | volkswagen | ja | 2016-03-11 00:00:00 | 0 | 21244 | 2016-03-12 10:17:55 |
49990 | 2016-03-21 19:54:19 | Mercedes_Benz_A_200__BlueEFFICIENCY__Urban | privat | Angebot | $17,500 | test | limousine | 2012 | manuell | 156 | a_klasse | 30,000km | 12 | benzin | mercedes_benz | nein | 2016-03-21 00:00:00 | 0 | 58239 | 2016-04-06 22:46:57 |
49991 | 2016-03-06 15:25:19 | Kleinwagen | privat | Angebot | $500 | control | NaN | 2016 | manuell | 0 | twingo | 150,000km | 0 | benzin | renault | NaN | 2016-03-06 00:00:00 | 0 | 61350 | 2016-03-06 18:24:19 |
49992 | 2016-03-10 19:37:38 | Fiat_Grande_Punto_1.4_T_Jet_16V_Sport | privat | Angebot | $4,800 | control | kleinwagen | 2009 | manuell | 120 | andere | 125,000km | 9 | lpg | fiat | nein | 2016-03-10 00:00:00 | 0 | 68642 | 2016-03-13 01:44:51 |
49993 | 2016-03-15 18:47:35 | Audi_A3__1_8l__Silber;_schoenes_Fahrzeug | privat | Angebot | $1,650 | control | kleinwagen | 1997 | manuell | 0 | NaN | 150,000km | 7 | benzin | audi | NaN | 2016-03-15 00:00:00 | 0 | 65203 | 2016-04-06 19:46:53 |
49994 | 2016-03-22 17:36:42 | Audi_A6__S6__Avant_4.2_quattro_eventuell_Tausc... | privat | Angebot | $5,000 | control | kombi | 2001 | automatik | 299 | a6 | 150,000km | 1 | benzin | audi | nein | 2016-03-22 00:00:00 | 0 | 46537 | 2016-04-06 08:16:39 |
49995 | 2016-03-27 14:38:19 | Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon | privat | Angebot | $24,900 | control | limousine | 2011 | automatik | 239 | q5 | 100,000km | 1 | diesel | audi | nein | 2016-03-27 00:00:00 | 0 | 82131 | 2016-04-01 13:47:40 |
49996 | 2016-03-28 10:50:25 | Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... | privat | Angebot | $1,980 | control | cabrio | 1996 | manuell | 75 | astra | 150,000km | 5 | benzin | opel | nein | 2016-03-28 00:00:00 | 0 | 44807 | 2016-04-02 14:18:02 |
49997 | 2016-04-02 14:44:48 | Fiat_500_C_1.2_Dualogic_Lounge | privat | Angebot | $13,200 | test | cabrio | 2014 | automatik | 69 | 500 | 5,000km | 11 | benzin | fiat | nein | 2016-04-02 00:00:00 | 0 | 73430 | 2016-04-04 11:47:27 |
49998 | 2016-03-08 19:25:42 | Audi_A3_2.0_TDI_Sportback_Ambition | privat | Angebot | $22,900 | control | kombi | 2013 | manuell | 150 | a3 | 40,000km | 11 | diesel | audi | nein | 2016-03-08 00:00:00 | 0 | 35683 | 2016-04-05 16:45:07 |
49999 | 2016-03-14 00:42:12 | Opel_Vectra_1.6_16V | privat | Angebot | $1,250 | control | limousine | 1996 | manuell | 101 | vectra | 150,000km | 1 | benzin | opel | nein | 2016-03-13 00:00:00 | 0 | 45897 | 2016-04-06 21:18:48 |
50000 rows × 20 columns
#information about `autos.csv` file
autos.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): dateCrawled 50000 non-null object name 50000 non-null object seller 50000 non-null object offerType 50000 non-null object price 50000 non-null object abtest 50000 non-null object vehicleType 44905 non-null object yearOfRegistration 50000 non-null int64 gearbox 47320 non-null object powerPS 50000 non-null int64 model 47242 non-null object odometer 50000 non-null object monthOfRegistration 50000 non-null int64 fuelType 45518 non-null object brand 50000 non-null object notRepairedDamage 40171 non-null object dateCreated 50000 non-null object nrOfPictures 50000 non-null int64 postalCode 50000 non-null int64 lastSeen 50000 non-null object dtypes: int64(5), object(15) memory usage: 7.6+ MB
#First five rows of `autos` dataframe
autos.head()
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
The autos
dataframe contains 50000 entries ranging from 0-49999 rows or index numbers underneath 20 column labels. It consists of 5 integer and 15 object datatypes.
Some of the columns contain null values.
Camelcase instead of snakecase (Python"s preferred) are used for column label.
So we are going to rename the column labels by changing from camelcase to snakecase. Before we do that, let us print the columns to have an overview.
print(autos.columns)
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model', 'odometer', 'monthOfRegistration', 'fuelType', 'brand', 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object')
To make these changes:
We will use the Dataframe.colums
attribute to print out an array of the existing column names.
Create an empty list and assign new_columns
to it.
Create a clean_col()
function that takes in an argument that enables us to use the str.replace()
function to edit the camelcase column names to snakecase'.
We will loop through the Dataframe.colums
and pass in the iterable variable each_colum
into our clean_col()
function and assign the result to renamed_col
Append the result to new_columns
Finally assign new_columns
to Dataframe.colums
The reason for these changes is to make our data dictionary to be more descriptive.
#create a new column for the renamed column labels
new_columns = []
#define a function to help edit column names
def rename_col(col):
col = col.replace("yearOfRegistration","registration_year")
col = col.replace("monthOfRegistration","registration_month")
col = col.replace("notRepairedDamage","unrepaired_damage")
col = col.replace("dateCreated","ad_created")
col = col.replace("dateCrawled","date_crawled")
col = col.replace("offerType","offer_type")
col = col.replace("vehicleType","vehicle_type")
col = col.replace("fuelType","fuel_type")
col = col.replace("powerPS","power_ps")
col = col.replace("nrOfPictures","pictures_nr")
col = col.replace("postalCode","postal_code")
col = col.replace("lastSeen","last_seen")
return col
#iterate over the `Dataframe.columns`
for each_column in autos.columns:
renamed_col = rename_col(each_column)
new_columns.append(renamed_col)
#Assign the modified column names
#back to the `Dataframe.columns`
autos.columns = new_columns
#current state of our first five `autos` dataframe
autos.head()
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | pictures_nr | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
We will do some data exploration to determiine what data we need to clean. Datas that are mixed: both numeric and text data shall be cleaned and converted while some text data shall be dropped as they may not have useful information.
We will be using the following methods for exploring our dataframe:
Dataframe.describe(include="all")
, Series.value_counts()
and Series.head()
#Descriptive statistics for all column
autos.describe(include='all')
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | pictures_nr | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-04-02 11:37:04 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
From the descriptive statistics the picture_nr
column (number of pictures in the advert) need to be dropped because it shows null values.
Both seller and offer_type data are identical.
The price and odometer columns need more investigation as they contain both numeric and text data.
So we will clean the data for each column by:
Exploring the columns more using the Series.unique() function to look out for pattern and special cases.
Removing the non-numeric characters
Converting the column to a numeric dtype
Renmaing the column inline with the non-numeric characters using Dataframe.rename()
I Price Column
#unique values for price
autos["price"].unique()
#Remove $ sign and covert to numeric dtype
autos["price"] = autos["price"].str.replace("$", "").str.replace(",", "").astype(int)
#Rename coulmn
autos.rename({"price": "price_USD"}, axis=1, inplace=1)
II Odometer Column
#unique values for odometer
autos["odometer"].unique()
#Remove `km` and covert to numeric dtype
autos["odometer"] = autos["odometer"].str.replace('km','').str.replace(',', '').astype(int)
#Rename column
autos.rename({'odometer': 'odometer_km'}, axis=1, inplace=1)
#lets print out the first five rows to confirm the changes made
autos.head()
date_crawled | name | seller | offer_type | price_USD | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | pictures_nr | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | 5000 | control | bus | 2004 | manuell | 158 | andere | 150000 | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | 8500 | control | limousine | 1997 | automatik | 286 | 7er | 150000 | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | 8990 | test | limousine | 2009 | manuell | 102 | golf | 70000 | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | 4350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70000 | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | 1350 | test | kombi | 2003 | manuell | 0 | focus | 150000 | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
We will contiune ro analyse both odometer and price data further to specifically look out for data that are look wrong b doing the folowimg:
We will anlayse the column using maximum and minimum values for any high or low (outliers) values that are unreal so we can remove them.
We will use:
Series.unique().shape to see how many unique values are there.
Series.describe() to view min/max/median/mean etc
Series.value_counts(), with some variations:
chained to .head() if there are lots of values.
Because Series.value_counts() returns a series, we can use Series.sort_index() with ascending= True or False to view the highest and lowest values with their counts (can also chain to head() here).
When removing outliers, we can do df[(df["col"] > x ) & (df["col"] < y )], but it's more readable to use df[df["col"].between(x,y)]
#To see how many unique values in `odometer_km`
autos['odometer_km'].unique().shape
(13,)
#To see descriptive statistics of the data
autos['odometer_km'].describe()
count 50000.000000 mean 125732.700000 std 40042.211706 min 5000.000000 25% 125000.000000 50% 150000.000000 75% 150000.000000 max 150000.000000 Name: odometer_km, dtype: float64
#To find the count of each distance and for any variation
autos['odometer_km'].value_counts()
150000 32424 125000 5170 100000 2169 90000 1757 80000 1436 70000 1230 60000 1164 50000 1027 5000 967 40000 819 30000 789 20000 784 10000 264 Name: odometer_km, dtype: int64
#To sort out the distance from highest to lowest
autos['odometer_km'].sort_index(ascending=True).head()
0 150000 1 150000 2 70000 3 70000 4 150000 Name: odometer_km, dtype: int64
#To see how many unique values in `price_USD`
autos['price_USD'].unique().shape
(2357,)
#To see descriptive statistics of the data
autos['price_USD'].describe()
count 5.000000e+04 mean 9.840044e+03 std 4.811044e+05 min 0.000000e+00 25% 1.100000e+03 50% 2.950000e+03 75% 7.200000e+03 max 1.000000e+08 Name: price_USD, dtype: float64
#To find the count of each price value and for any variation
autos['price_USD'].value_counts().sort_index(ascending=True).head()
0 1421 1 156 2 3 3 1 5 2 Name: price_USD, dtype: int64
#To sort out the price from highest to lowest
autos['price_USD'].sort_index(ascending=False).head()
49999 1250 49998 22900 49997 13200 49996 1980 49995 24900 Name: price_USD, dtype: int64
Outliers that are considered low, fall below [25% - 1.5*(75% - 25%]
While outliers considered high, fall above [75% + 1.5(75% - 25%)
I Removing Outliers from odometer_km data.
Using the decsriptive statistics for odometer_km where 25% = 125000 and 75% = 150000
By calculation, the low outliers fall between 0-87499. So these range of values will be removed.
There is no high outlier. Since by the above calculation, its value is above 187500 whereas the maximum value for odometer_km is 150000
So we will use
autos = autos.loc[autos['price'].between(x, y), :] to filter the dataset and select value ranges from 87500-150000
#Removes low Outliers
autos = autos.loc[autos['odometer_km'].between(87500, 150000), :]
#Descriptive Statistics after removing outliers
autos['odometer_km'].describe()
count 41520.000000 mean 141736.030829 std 17102.004255 min 90000.000000 25% 150000.000000 50% 150000.000000 75% 150000.000000 max 150000.000000 Name: odometer_km, dtype: float64
#Value counts after removing outliers
autos['odometer_km'].value_counts()
150000 32424 125000 5170 100000 2169 90000 1757 Name: odometer_km, dtype: int64
From the above descriptive statistics, it shows that the number of the dataset has reduced to 41520 as well as the mean value
II Removing Outliers from price_USD data.
Using price_USD descriptive statistics, 25% = 1100 and 75% = 7200.
By calculation, there are no low outliers whereas the high outliers range between 16351-99999999. So these values will be removed from the price_USD data.
So we will use
autos = autos.loc[autos['price'].between(x, y), :] to filter the dataset and select value ranges from 0-16350
#Removes high Outliers
autos = autos.loc[autos['price_USD'].between(0, 16350), :]
#Descriptive Statistics
autos['price_USD'].describe()
count 39964.000000 mean 3558.803899 std 3536.038723 min 0.000000 25% 990.000000 50% 2299.000000 75% 4999.000000 max 16350.000000 Name: price_USD, dtype: float64
#Value Counts
autos['price_USD'].value_counts().head()
0 1190 500 738 1500 682 1200 610 1000 598 Name: price_USD, dtype: int64
After the high outliers have been removed from the data, the total number value has dropped to 39964 out of 50000 as well themean and other statistics.
As we can see data dictionary below, there are 5 columns that represent date values. Some were created from crawler while others come from its website.
date_crawled
: added by the crawlerlast_seen
: added by the crawlerad_created
: from the websiteregistration_month
: from the websiteregistration_year
: from the websiteThe first three columns have string values while the other two are represented in numeric values. So we need to convert them to numeric values so we can understand it quantitatively.
lets us see the format of these three string columns.
autos[['date_crawled', 'ad_created', 'last_seen']][0:5]
date_crawled | ad_created | last_seen | |
---|---|---|---|
0 | 2016-03-26 17:47:46 | 2016-03-26 00:00:00 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | 2016-04-04 00:00:00 | 2016-04-06 14:45:08 |
4 | 2016-04-01 14:38:50 | 2016-04-01 00:00:00 | 2016-04-01 14:38:50 |
5 | 2016-03-21 13:47:45 | 2016-03-21 00:00:00 | 2016-04-06 09:45:21 |
6 | 2016-03-20 17:55:21 | 2016-03-20 00:00:00 | 2016-03-23 02:48:59 |
From the above result, we can see that the first 10 characters represent the day (2016-04-01).
To understand the date range, we will first select only the first 10 characters in each column, we will use Series.str[:10]
Then use Series.values_counts(normalize=True,dropna=False) to generate a distribution as percentages including for missing numbers (NAN) for all string columns and then sort by the index using Series.sort_index()
I Exploring date_crawled column
#Extracts first ten characters
date_crawled = autos['date_crawled'].str[:10]
#All distribution values as percentages for `date_crawled`
date_crawled = date_crawled.value_counts(normalize=True, dropna=False)
#Sort from earliest to latest
date_crawled.sort_index(ascending=True)
2016-03-05 0.024947 2016-03-06 0.014063 2016-03-07 0.036283 2016-03-08 0.034231 2016-03-09 0.033805 2016-03-10 0.032579 2016-03-11 0.032079 2016-03-12 0.037384 2016-03-13 0.015014 2016-03-14 0.037158 2016-03-15 0.034006 2016-03-16 0.029852 2016-03-17 0.031954 2016-03-18 0.013187 2016-03-19 0.033580 2016-03-20 0.037309 2016-03-21 0.037534 2016-03-22 0.032504 2016-03-23 0.032654 2016-03-24 0.029001 2016-03-25 0.032629 2016-03-26 0.032454 2016-03-27 0.030928 2016-03-28 0.034631 2016-03-29 0.033830 2016-03-30 0.033981 2016-03-31 0.031779 2016-04-01 0.033055 2016-04-02 0.034706 2016-04-03 0.038184 2016-04-04 0.036733 2016-04-05 0.013362 2016-04-06 0.003453 2016-04-07 0.001151 Name: date_crawled, dtype: float64
autos['date_crawled'].describe()
count 39964 unique 38823 top 2016-03-09 11:54:38 freq 3 Name: date_crawled, dtype: object
From our observation, we can see that top date crawled for the cars is 2016-03
II Exploring ad_created column
autos['ad_created'].str[:10].value_counts(normalize=True, dropna=False).sort_index(ascending=True)
2015-12-05 0.000025 2015-12-30 0.000025 2016-01-03 0.000025 2016-01-07 0.000025 2016-01-10 0.000050 2016-01-13 0.000025 2016-01-16 0.000025 2016-01-22 0.000025 2016-01-27 0.000050 2016-01-29 0.000025 2016-02-01 0.000025 2016-02-02 0.000050 2016-02-05 0.000050 2016-02-08 0.000025 2016-02-09 0.000025 2016-02-11 0.000025 2016-02-12 0.000075 2016-02-16 0.000025 2016-02-18 0.000050 2016-02-19 0.000025 2016-02-20 0.000050 2016-02-21 0.000075 2016-02-22 0.000025 2016-02-23 0.000025 2016-02-24 0.000025 2016-02-25 0.000050 2016-02-26 0.000025 2016-02-27 0.000125 2016-02-28 0.000200 2016-02-29 0.000150 ... 2016-03-09 0.033680 2016-03-10 0.032404 2016-03-11 0.032379 2016-03-12 0.037234 2016-03-13 0.016465 2016-03-14 0.035607 2016-03-15 0.033805 2016-03-16 0.030302 2016-03-17 0.031653 2016-03-18 0.013887 2016-03-19 0.032579 2016-03-20 0.037309 2016-03-21 0.037659 2016-03-22 0.032579 2016-03-23 0.032504 2016-03-24 0.028926 2016-03-25 0.032729 2016-03-26 0.032429 2016-03-27 0.030853 2016-03-28 0.034606 2016-03-29 0.033931 2016-03-30 0.033780 2016-03-31 0.031754 2016-04-01 0.033180 2016-04-02 0.034306 2016-04-03 0.038460 2016-04-04 0.037259 2016-04-05 0.011961 2016-04-06 0.003478 2016-04-07 0.001051 Name: ad_created, Length: 68, dtype: float64
from our observation, we can see that the top year in which the advert was created is between 2015 and 2016-04
III Exploring last_seen column
autos['last_seen'].str[:10].value_counts(normalize=True,dropna=False).sort_index(ascending=True)
2016-03-05 0.001126 2016-03-06 0.004804 2016-03-07 0.006005 2016-03-08 0.008558 2016-03-09 0.010560 2016-03-10 0.011210 2016-03-11 0.013787 2016-03-12 0.025598 2016-03-13 0.009609 2016-03-14 0.013412 2016-03-15 0.016590 2016-03-16 0.017441 2016-03-17 0.029777 2016-03-18 0.007507 2016-03-19 0.016890 2016-03-20 0.022020 2016-03-21 0.021770 2016-03-22 0.022370 2016-03-23 0.019442 2016-03-24 0.020844 2016-03-25 0.020218 2016-03-26 0.017666 2016-03-27 0.016890 2016-03-28 0.022045 2016-03-29 0.023046 2016-03-30 0.025448 2016-03-31 0.024597 2016-04-01 0.024222 2016-04-02 0.025673 2016-04-03 0.025623 2016-04-04 0.025323 2016-04-05 0.118457 2016-04-06 0.209063 2016-04-07 0.122410 Name: last_seen, dtype: float64
The results show that the car were last seen in 2016.
autos['registration_year'].describe()
count 39964.000000 mean 2002.675083 std 35.644961 min 1910.000000 25% 1999.000000 50% 2002.000000 75% 2006.000000 max 9000.000000 Name: registration_year, dtype: float64
From our observation, the minimum value is 1910 and the maximum is 9000. The minimum value which seems to be the year the car was invented seems correct beacuse cars were invented in the few decades of 1900s while the maximum value which seems to be wrong talking about futuristic registration of cars. For cars to be last seen in 2016 means any registration year above that is wrong.
so the lowest acceptable value is 1910 and the highest acceptable value is 2016
#filters only acceptable values
autos = autos.loc[autos['registration_year'].between(1910, 2016),:]
#Calculates the distribution of the remaining values
autos['registration_year'].value_counts(normalize=True).sort_index(ascending=True)
1910 0.000052 1934 0.000026 1937 0.000026 1950 0.000026 1953 0.000026 1954 0.000052 1956 0.000052 1958 0.000052 1959 0.000026 1960 0.000392 1961 0.000052 1962 0.000026 1963 0.000078 1964 0.000052 1965 0.000183 1966 0.000183 1967 0.000183 1968 0.000235 1969 0.000157 1970 0.000418 1971 0.000183 1972 0.000444 1973 0.000418 1974 0.000235 1975 0.000235 1976 0.000392 1977 0.000314 1978 0.000601 1979 0.000549 1980 0.001594 ... 1987 0.001620 1988 0.003084 1989 0.004051 1990 0.008441 1991 0.008285 1992 0.009016 1993 0.010715 1994 0.015785 1995 0.031178 1996 0.035464 1997 0.050727 1998 0.060396 1999 0.074535 2000 0.081330 2001 0.066590 2002 0.062199 2003 0.066747 2004 0.066198 2005 0.070850 2006 0.061729 2007 0.049629 2008 0.043017 2009 0.032981 2010 0.019810 2011 0.013930 2012 0.006194 2013 0.001228 2014 0.000314 2015 0.000131 2016 0.030028 Name: registration_year, Length: 66, dtype: float64
autos.registration_year.describe()
count 38264.000000 mean 2001.845050 std 6.103401 min 1910.000000 25% 1998.000000 50% 2002.000000 75% 2006.000000 max 2016.000000 Name: registration_year, dtype: float64
We have been able to remove the upper and lower bounds. A total of 1700 values have been removed and from the descriptive statistics, more cars were registered from 1993-2016
#Explore unique values of `brand`
autos['brand'].unique()
array(['peugeot', 'bmw', 'ford', 'chrysler', 'volkswagen', 'seat', 'renault', 'mercedes_benz', 'audi', 'opel', 'mazda', 'mini', 'toyota', 'nissan', 'jeep', 'dacia', 'saab', 'volvo', 'mitsubishi', 'fiat', 'skoda', 'subaru', 'kia', 'sonstige_autos', 'citroen', 'smart', 'porsche', 'hyundai', 'chevrolet', 'honda', 'daewoo', 'suzuki', 'land_rover', 'jaguar', 'alfa_romeo', 'lada', 'rover', 'trabant', 'daihatsu', 'lancia'], dtype=object)
#To determine each `brand` valve count
autos['brand'].value_counts()
volkswagen 8457 opel 4494 bmw 4305 mercedes_benz 3593 audi 3212 ford 2718 renault 1963 peugeot 1198 fiat 951 seat 694 mazda 604 nissan 562 skoda 544 citroen 523 toyota 448 smart 420 volvo 395 mitsubishi 328 hyundai 311 honda 299 alfa_romeo 278 sonstige_autos 249 kia 237 mini 211 suzuki 200 chrysler 156 chevrolet 154 daihatsu 97 jeep 83 subaru 82 saab 74 porsche 70 dacia 59 rover 58 daewoo 58 land_rover 53 jaguar 52 lancia 42 trabant 18 lada 14 Name: brand, dtype: int64
The following methods will be employed to aggregate my chosen brand:
We might want to select the top 20, or we might want to select those that have over a certain percentage of the total values (e.g. > 5%).
Remember that Series.value_counts() produces a series with index labels, so we can use Series.index attribute to access the labels, should we wish.
Create an empty dictionary to hold my aggregate data.
Loop over my selected brands, and assign the mean price to the dictionary, with the brand name as the key.
Print my dictionary of aggregate data.
#Top 20 brands
autos['brand'].value_counts().sort_values(ascending=False).head(20)
volkswagen 8457 opel 4494 bmw 4305 mercedes_benz 3593 audi 3212 ford 2718 renault 1963 peugeot 1198 fiat 951 seat 694 mazda 604 nissan 562 skoda 544 citroen 523 toyota 448 smart 420 volvo 395 mitsubishi 328 hyundai 311 honda 299 Name: brand, dtype: int64
There are 40 different brand of auto cars being used for our data analysis. Each car has its number of times it appears on the dataset.
For the purpose of this project i have chosen to aggregate the top 20 brands from highest to lowest index values as shown above.
#create an empty list
brand_mean_prices = {}
#Selects top 20 brands
top_20_brands = autos['brand'].value_counts().sort_values(ascending=False).head(20).index
for each_car in top_20_brands:
selected_cars = autos[autos["brand"] == each_car] #Indexes only unique brands from dataframe
mean_price = selected_cars["price_USD"].mean() #Calculates mean prices for selected brands
brand_mean_prices[each_car] = mean_price #Assigns dictionary keys to corresponding values
print(brand_mean_prices)
{'skoda': 4753.044117647059, 'ford': 2389.4621044885944, 'nissan': 3049.814946619217, 'smart': 2473.2190476190476, 'citroen': 2804.1032504780114, 'opel': 2179.712950600801, 'honda': 2739.4414715719063, 'hyundai': 3620.524115755627, 'toyota': 3895.5691964285716, 'volvo': 3619.217721518987, 'seat': 2875.025936599424, 'mercedes_benz': 4969.493181185639, 'peugeot': 2358.8213689482473, 'volkswagen': 3587.68038311458, 'mitsubishi': 2333.137195121951, 'audi': 5220.4346201743465, 'fiat': 1873.4079915878024, 'bmw': 5326.838095238095, 'mazda': 2797.680463576159, 'renault': 1759.2878247580234}
from our aggregated brands, we observe that the price gap for the top 6 brands differ. Begining with BMW being the most expensive followed by AUDI and Mercedes Benz while the less expensive ones are Opel and Ford
We will like to use aggreagtion method to understand the average mileage and mean prices forthe top brands by visually combining two aggreagte series objects into a dataframe object using the pandas series constructor and pandas dataframe constructor.
To do this we will:
Use the loop method from the last screen to calculate the mean mileage and mean price for each of the top brands, storing the results in a dictionary.
Convert both dictionaries to series objects, using the series constructor.
Create a dataframe from the first series object using the dataframe constructor.
Assign the other series as a new column in this dataframe.
Print the dataframe
#To calculate mean prices for top 6 brands
#ceates an empty dictionary for mean prices for top brands
brand_mean_prices = {}
#Select the unique values for top 6 `brand`
top_6_brands = autos['brand'].value_counts().sort_values(ascending=False).head(6).index
#loop through `top_6_brands`
for each_brand in top_6_brands:
selected_brand = autos[autos['brand'] == each_brand] #indexes only top 6 brands
mean_price = int(round(selected_brand['price_USD'].mean())) #Calculates mean prices of each brand
brand_mean_prices[each_brand] = mean_price #Assigns mean_prices corresponding values to each brand as the dictionary keys
print(brand_mean_prices)
{'mercedes_benz': 4969, 'opel': 2180, 'audi': 5220, 'ford': 2389, 'volkswagen': 3588, 'bmw': 5327}
#To calculate mean mileage for top 6 brands
#Creates an empty dictionary for mean milage
brand_mean_mileages = {}
#loop through `top_6_brands`
for each_brand in top_6_brands:
selected_brand = autos[autos['brand'] == each_brand] #Indexes only top 6 brands
mean_mileage = int(round(selected_brand['odometer_km'].mean())) #Calculates mean mileages of each brand
brand_mean_mileages[each_brand] = mean_mileage #Assigns mean_mileages values to each brand as the dictionary keys
print(brand_mean_mileages)
{'mercedes_benz': 145001, 'opel': 142079, 'audi': 145926, 'ford': 140795, 'volkswagen': 143645, 'bmw': 144839}
#Converts both dictionaries into series objects using series constructor
bmp_series = pd.Series(brand_mean_prices)
bmm_series = pd.Series(brand_mean_mileages)
#Creates a Dataframe from the first series using Dataframe constructor
brand_df = pd.DataFrame(bmp_series, columns=['mean_price'])
brand_df
mean_price | |
---|---|
audi | 5220 |
bmw | 5327 |
ford | 2389 |
mercedes_benz | 4969 |
opel | 2180 |
volkswagen | 3588 |
brand_df['mean_mileage'] = bmm_series
brand_df
mean_price | mean_mileage | |
---|---|---|
audi | 5220 | 145926 |
bmw | 5327 | 144839 |
ford | 2389 | 140795 |
mercedes_benz | 4969 | 145001 |
opel | 2180 | 142079 |
volkswagen | 3588 | 143645 |
our new dataframe shows both mean_price and mean_mileage. The mean mileage of each brand seems to be very close in value while their mean price value varies much except those for Audi and bmw brands which seem very close.
In Conclusion:
We were able to read our dataset into pandas dataframe:
Explore the dataset for necessary information and found out some data were incorrect and column label Camel Cased instead of Snake cased
We removed non-numeric characters and renamed column label(Price and odometer) in-line with the non-numeric characters to aid understanding.
We explored and analysed both columns more to remove outliers
We explored and analysed the year of registration in relation to when the brands were last seen on advert and filtered the acceptable years
finally, we aggregated the mean price and mileage for our top 6 brands.
The price of a brand does not determine its mileage. from our dataframe, we can see that opel, whose mean price value is low compared to ford, has a higher mileage value. Similarly, audi has a higher mileage than bmw which has the highest mean value.