This project contains a dataset of used cars scrapped from eBay Kleinanzeigen, a classifieds section of the German eBay website.
Our sampled dataset consits of 50,000 datapoints extracted from the original dataset, which has been made dirtier in order replicate a real world scrapped dataset.
Purpose of the project
Our dataset consists of about 50,000 rows and 20 columns, below are the columns and its definition:
dateCrawled
- When this ad was first crawled. All field-values are taken from this date.name
- Name of the car.seller
- Whether the seller is private or a dealer.offerType
- The type of listing.price
- The price on the ad to sell the car.abtest
- Whether the listing is included in an A/B test.vehicleType
- The vehicle Type.yearOfRegistration
- The year in which the car was first registered.gearbox
- The transmission type.powerPS
- The power of the car in PS.model
- The car model name.kilometer
- How many kilometers the car has driven.monthOfRegistration
- The month in which the car was first registered.fuelType
- What type of fuel the car uses.brand
- The brand of the car.notRepairedDamage
- If the car has a damage which is not yer repaired.dateCreated
- The date on which the eBay listing was created.nrOfPictures
- The number of pictures in the ad.postalCode
- The postal code for the location of the vehiclelastSeenOnline
- When the crawler saw this ad last online.Here's a link to the clean original dataset: Kaggle
In order for us to begin our data cleaning process we have to be able to read the data set, therefore we use the numpy library to enable us convert the dataset into an n-darray and pandas to read our dataset.
# importing the necessary libraries
import numpy as np
import pandas as pd
autos = pd.read_csv('autos.csv', encoding='Latin-1') # reading in the dataset
autos
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
5 | 2016-03-21 13:47:45 | Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto... | privat | Angebot | $7,900 | test | bus | 2006 | automatik | 150 | voyager | 150,000km | 4 | diesel | chrysler | NaN | 2016-03-21 00:00:00 | 0 | 22962 | 2016-04-06 09:45:21 |
6 | 2016-03-20 17:55:21 | VW_Golf_III_GT_Special_Electronic_Green_Metall... | privat | Angebot | $300 | test | limousine | 1995 | manuell | 90 | golf | 150,000km | 8 | benzin | volkswagen | NaN | 2016-03-20 00:00:00 | 0 | 31535 | 2016-03-23 02:48:59 |
7 | 2016-03-16 18:55:19 | Golf_IV_1.9_TDI_90PS | privat | Angebot | $1,990 | control | limousine | 1998 | manuell | 90 | golf | 150,000km | 12 | diesel | volkswagen | nein | 2016-03-16 00:00:00 | 0 | 53474 | 2016-04-07 03:17:32 |
8 | 2016-03-22 16:51:34 | Seat_Arosa | privat | Angebot | $250 | test | NaN | 2000 | manuell | 0 | arosa | 150,000km | 10 | NaN | seat | nein | 2016-03-22 00:00:00 | 0 | 7426 | 2016-03-26 18:18:10 |
9 | 2016-03-16 13:47:02 | Renault_Megane_Scenic_1.6e_RT_Klimaanlage | privat | Angebot | $590 | control | bus | 1997 | manuell | 90 | megane | 150,000km | 7 | benzin | renault | nein | 2016-03-16 00:00:00 | 0 | 15749 | 2016-04-06 10:46:35 |
10 | 2016-03-15 01:41:36 | VW_Golf_Tuning_in_siber/grau | privat | Angebot | $999 | test | NaN | 2017 | manuell | 90 | NaN | 150,000km | 4 | benzin | volkswagen | nein | 2016-03-14 00:00:00 | 0 | 86157 | 2016-04-07 03:16:21 |
11 | 2016-03-16 18:45:34 | Mercedes_A140_Motorschaden | privat | Angebot | $350 | control | NaN | 2000 | NaN | 0 | NaN | 150,000km | 0 | benzin | mercedes_benz | NaN | 2016-03-16 00:00:00 | 0 | 17498 | 2016-03-16 18:45:34 |
12 | 2016-03-31 19:48:22 | Smart_smart_fortwo_coupe_softouch_pure_MHD_Pan... | privat | Angebot | $5,299 | control | kleinwagen | 2010 | automatik | 71 | fortwo | 50,000km | 9 | benzin | smart | nein | 2016-03-31 00:00:00 | 0 | 34590 | 2016-04-06 14:17:52 |
13 | 2016-03-23 10:48:32 | Audi_A3_1.6_tuning | privat | Angebot | $1,350 | control | limousine | 1999 | manuell | 101 | a3 | 150,000km | 11 | benzin | audi | nein | 2016-03-23 00:00:00 | 0 | 12043 | 2016-04-01 14:17:13 |
14 | 2016-03-23 11:50:46 | Renault_Clio_3__Dynamique_1.2__16_V;_viele_Ver... | privat | Angebot | $3,999 | test | kleinwagen | 2007 | manuell | 75 | clio | 150,000km | 9 | benzin | renault | NaN | 2016-03-23 00:00:00 | 0 | 81737 | 2016-04-01 15:46:47 |
15 | 2016-04-01 12:06:20 | Corvette_C3_Coupe_T_Top_Crossfire_Injection | privat | Angebot | $18,900 | test | coupe | 1982 | automatik | 203 | NaN | 80,000km | 6 | benzin | sonstige_autos | nein | 2016-04-01 00:00:00 | 0 | 61276 | 2016-04-02 21:10:48 |
16 | 2016-03-16 14:59:02 | Opel_Vectra_B_Kombi | privat | Angebot | $350 | test | kombi | 1999 | manuell | 101 | vectra | 150,000km | 5 | benzin | opel | nein | 2016-03-16 00:00:00 | 0 | 57299 | 2016-03-18 05:29:37 |
17 | 2016-03-29 11:46:22 | Volkswagen_Scirocco_2_G60 | privat | Angebot | $5,500 | test | coupe | 1990 | manuell | 205 | scirocco | 150,000km | 6 | benzin | volkswagen | nein | 2016-03-29 00:00:00 | 0 | 74821 | 2016-04-05 20:46:26 |
18 | 2016-03-26 19:57:44 | Verkaufen_mein_bmw_e36_320_i_touring | privat | Angebot | $300 | control | bus | 1995 | manuell | 150 | 3er | 150,000km | 0 | benzin | bmw | NaN | 2016-03-26 00:00:00 | 0 | 54329 | 2016-04-02 12:16:41 |
19 | 2016-03-17 13:36:21 | mazda_tribute_2.0_mit_gas_und_tuev_neu_2018 | privat | Angebot | $4,150 | control | suv | 2004 | manuell | 124 | andere | 150,000km | 2 | lpg | mazda | nein | 2016-03-17 00:00:00 | 0 | 40878 | 2016-03-17 14:45:58 |
20 | 2016-03-05 19:57:31 | Audi_A4_Avant_1.9_TDI_*6_Gang*AHK*Klimatronik*... | privat | Angebot | $3,500 | test | kombi | 2003 | manuell | 131 | a4 | 150,000km | 5 | diesel | audi | NaN | 2016-03-05 00:00:00 | 0 | 53913 | 2016-03-07 05:46:46 |
21 | 2016-03-06 19:07:10 | Porsche_911_Carrera_4S_Cabrio | privat | Angebot | $41,500 | test | cabrio | 2004 | manuell | 320 | 911 | 150,000km | 4 | benzin | porsche | nein | 2016-03-06 00:00:00 | 0 | 65428 | 2016-04-05 23:46:19 |
22 | 2016-03-28 20:50:54 | MINI_Cooper_S_Cabrio | privat | Angebot | $25,450 | control | cabrio | 2015 | manuell | 184 | cooper | 10,000km | 1 | benzin | mini | nein | 2016-03-28 00:00:00 | 0 | 44789 | 2016-04-01 06:45:30 |
23 | 2016-03-10 19:55:34 | Peugeot_Boxer_2_2_HDi_120_Ps_9_Sitzer_inkl_Klima | privat | Angebot | $7,999 | control | bus | 2010 | manuell | 120 | NaN | 150,000km | 2 | diesel | peugeot | nein | 2016-03-10 00:00:00 | 0 | 30900 | 2016-03-17 08:45:17 |
24 | 2016-04-03 11:57:02 | BMW_535i_xDrive_Sport_Aut. | privat | Angebot | $48,500 | control | limousine | 2014 | automatik | 306 | 5er | 30,000km | 12 | benzin | bmw | nein | 2016-04-03 00:00:00 | 0 | 22547 | 2016-04-07 13:16:50 |
25 | 2016-03-21 21:56:18 | Ford_escort_kombi_an_bastler_mit_ghia_ausstattung | privat | Angebot | $90 | control | kombi | 1996 | manuell | 116 | NaN | 150,000km | 4 | benzin | ford | ja | 2016-03-21 00:00:00 | 0 | 27574 | 2016-04-01 05:16:49 |
26 | 2016-04-03 22:46:28 | Volkswagen_Polo_Fox | privat | Angebot | $777 | control | kleinwagen | 1992 | manuell | 54 | polo | 125,000km | 2 | benzin | volkswagen | nein | 2016-04-03 00:00:00 | 0 | 38110 | 2016-04-05 23:46:48 |
27 | 2016-03-27 18:45:01 | Hat_einer_Ahnung_mit_Ford_Galaxy_HILFE | privat | Angebot | $0 | control | NaN | 2005 | NaN | 0 | NaN | 150,000km | 0 | NaN | ford | NaN | 2016-03-27 00:00:00 | 0 | 66701 | 2016-03-27 18:45:01 |
28 | 2016-03-19 21:56:19 | MINI_Cooper_D | privat | Angebot | $5,250 | control | kleinwagen | 2007 | manuell | 110 | cooper | 150,000km | 7 | diesel | mini | ja | 2016-03-19 00:00:00 | 0 | 15745 | 2016-04-07 14:58:48 |
29 | 2016-04-02 12:45:44 | Mercedes_Benz_E_320_T_CDI_Avantgarde_DPF7_Sitz... | privat | Angebot | $4,999 | test | kombi | 2004 | automatik | 204 | e_klasse | 150,000km | 10 | diesel | mercedes_benz | nein | 2016-04-02 00:00:00 | 0 | 47638 | 2016-04-02 12:45:44 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
49970 | 2016-03-21 22:47:37 | c4_Grand_Picasso_mit_Automatik_Leder_Navi_Temp... | privat | Angebot | $15,800 | control | bus | 2010 | automatik | 136 | c4 | 60,000km | 4 | diesel | citroen | nein | 2016-03-21 00:00:00 | 0 | 14947 | 2016-04-07 04:17:34 |
49971 | 2016-03-29 14:54:12 | W.Lupo_1.0 | privat | Angebot | $950 | test | kleinwagen | 2001 | manuell | 50 | lupo | 150,000km | 4 | benzin | volkswagen | nein | 2016-03-29 00:00:00 | 0 | 65197 | 2016-03-29 20:41:51 |
49972 | 2016-03-26 22:25:23 | Mercedes_Benz_Vito_115_CDI_Extralang_Aut. | privat | Angebot | $3,300 | control | bus | 2004 | automatik | 150 | vito | 150,000km | 10 | diesel | mercedes_benz | ja | 2016-03-26 00:00:00 | 0 | 65326 | 2016-03-28 11:28:18 |
49973 | 2016-03-27 05:32:39 | Mercedes_Benz_SLK_200_Kompressor | privat | Angebot | $6,000 | control | cabrio | 2004 | manuell | 163 | slk | 150,000km | 11 | benzin | mercedes_benz | nein | 2016-03-27 00:00:00 | 0 | 53567 | 2016-03-27 08:25:24 |
49974 | 2016-03-20 10:52:31 | Golf_1_Cabrio_Tuev_Neu_viele_Extras_alles_eing... | privat | Angebot | $0 | control | cabrio | 1983 | manuell | 70 | golf | 150,000km | 2 | benzin | volkswagen | nein | 2016-03-20 00:00:00 | 0 | 8209 | 2016-03-27 19:48:16 |
49975 | 2016-03-27 20:51:39 | Honda_Jazz_1.3_DSi_i_VTEC_IMA_CVT_Comfort | privat | Angebot | $9,700 | control | kleinwagen | 2012 | automatik | 88 | jazz | 100,000km | 11 | hybrid | honda | nein | 2016-03-27 00:00:00 | 0 | 84385 | 2016-04-05 19:45:34 |
49976 | 2016-03-19 18:56:05 | Audi_80_Avant_2.6_E__Vollausstattung!!_Einziga... | privat | Angebot | $5,900 | test | kombi | 1992 | automatik | 150 | 80 | 150,000km | 12 | benzin | audi | nein | 2016-03-19 00:00:00 | 0 | 36100 | 2016-04-07 06:16:44 |
49977 | 2016-03-31 18:37:18 | Mercedes_Benz_C200_Cdi_W203 | privat | Angebot | $5,500 | control | limousine | 2003 | manuell | 116 | c_klasse | 150,000km | 2 | diesel | mercedes_benz | nein | 2016-03-31 00:00:00 | 0 | 33739 | 2016-04-06 12:16:11 |
49978 | 2016-04-04 10:37:14 | Mercedes_Benz_E_200_Classic | privat | Angebot | $900 | control | limousine | 1996 | automatik | 136 | e_klasse | 150,000km | 9 | benzin | mercedes_benz | ja | 2016-04-04 00:00:00 | 0 | 24405 | 2016-04-06 12:44:20 |
49979 | 2016-03-20 18:38:40 | Volkswagen_Polo_1.6_TDI_Style | privat | Angebot | $11,000 | test | kleinwagen | 2011 | manuell | 90 | polo | 70,000km | 11 | diesel | volkswagen | nein | 2016-03-20 00:00:00 | 0 | 48455 | 2016-04-07 01:45:12 |
49980 | 2016-03-12 10:55:54 | Ford_Escort_Turnier_16V | privat | Angebot | $400 | control | kombi | 1995 | manuell | 105 | escort | 125,000km | 3 | benzin | ford | NaN | 2016-03-12 00:00:00 | 0 | 56218 | 2016-04-06 17:16:49 |
49981 | 2016-03-15 09:38:21 | Opel_Astra_Kombi_mit_Anhaengerkupplung | privat | Angebot | $2,000 | control | kombi | 1998 | manuell | 115 | astra | 150,000km | 12 | benzin | opel | nein | 2016-03-15 00:00:00 | 0 | 86859 | 2016-04-05 17:21:46 |
49982 | 2016-03-29 18:51:08 | Skoda_Fabia_4_Tuerer_Bj:2004__85.000Tkm | privat | Angebot | $1,950 | control | kleinwagen | 2004 | manuell | 0 | fabia | 90,000km | 7 | benzin | skoda | NaN | 2016-03-29 00:00:00 | 0 | 45884 | 2016-03-29 18:51:08 |
49983 | 2016-03-06 12:43:04 | Ford_focus_99 | privat | Angebot | $600 | test | kleinwagen | 1999 | manuell | 101 | focus | 150,000km | 4 | benzin | ford | NaN | 2016-03-06 00:00:00 | 0 | 52477 | 2016-03-09 06:16:08 |
49984 | 2016-03-31 22:48:48 | Student_sucht_ein__Anfaengerauto___ab_2000_BJ_... | privat | Angebot | $0 | test | NaN | 2000 | NaN | 0 | NaN | 150,000km | 0 | NaN | sonstige_autos | NaN | 2016-03-31 00:00:00 | 0 | 12103 | 2016-04-02 19:44:53 |
49985 | 2016-04-02 16:38:23 | Verkaufe_meinen_vw_vento! | privat | Angebot | $1,000 | control | NaN | 1995 | automatik | 0 | NaN | 150,000km | 0 | benzin | volkswagen | NaN | 2016-04-02 00:00:00 | 0 | 30900 | 2016-04-06 15:17:52 |
49986 | 2016-04-04 20:46:02 | Chrysler_300C_3.0_CRD_DPF_Automatik_Voll_Ausst... | privat | Angebot | $15,900 | control | limousine | 2010 | automatik | 218 | 300c | 125,000km | 11 | diesel | chrysler | nein | 2016-04-04 00:00:00 | 0 | 73527 | 2016-04-06 23:16:00 |
49987 | 2016-03-22 20:47:27 | Audi_A3_Limousine_2.0_TDI_DPF_Ambition__NAVI__... | privat | Angebot | $21,990 | control | limousine | 2013 | manuell | 150 | a3 | 50,000km | 11 | diesel | audi | nein | 2016-03-22 00:00:00 | 0 | 94362 | 2016-03-26 22:46:06 |
49988 | 2016-03-28 19:49:51 | BMW_330_Ci | privat | Angebot | $9,550 | control | coupe | 2001 | manuell | 231 | 3er | 150,000km | 10 | benzin | bmw | nein | 2016-03-28 00:00:00 | 0 | 83646 | 2016-04-07 02:17:40 |
49989 | 2016-03-11 19:50:37 | VW_Polo_zum_Ausschlachten_oder_Wiederaufbau | privat | Angebot | $150 | test | kleinwagen | 1997 | manuell | 0 | polo | 150,000km | 5 | benzin | volkswagen | ja | 2016-03-11 00:00:00 | 0 | 21244 | 2016-03-12 10:17:55 |
49990 | 2016-03-21 19:54:19 | Mercedes_Benz_A_200__BlueEFFICIENCY__Urban | privat | Angebot | $17,500 | test | limousine | 2012 | manuell | 156 | a_klasse | 30,000km | 12 | benzin | mercedes_benz | nein | 2016-03-21 00:00:00 | 0 | 58239 | 2016-04-06 22:46:57 |
49991 | 2016-03-06 15:25:19 | Kleinwagen | privat | Angebot | $500 | control | NaN | 2016 | manuell | 0 | twingo | 150,000km | 0 | benzin | renault | NaN | 2016-03-06 00:00:00 | 0 | 61350 | 2016-03-06 18:24:19 |
49992 | 2016-03-10 19:37:38 | Fiat_Grande_Punto_1.4_T_Jet_16V_Sport | privat | Angebot | $4,800 | control | kleinwagen | 2009 | manuell | 120 | andere | 125,000km | 9 | lpg | fiat | nein | 2016-03-10 00:00:00 | 0 | 68642 | 2016-03-13 01:44:51 |
49993 | 2016-03-15 18:47:35 | Audi_A3__1_8l__Silber;_schoenes_Fahrzeug | privat | Angebot | $1,650 | control | kleinwagen | 1997 | manuell | 0 | NaN | 150,000km | 7 | benzin | audi | NaN | 2016-03-15 00:00:00 | 0 | 65203 | 2016-04-06 19:46:53 |
49994 | 2016-03-22 17:36:42 | Audi_A6__S6__Avant_4.2_quattro_eventuell_Tausc... | privat | Angebot | $5,000 | control | kombi | 2001 | automatik | 299 | a6 | 150,000km | 1 | benzin | audi | nein | 2016-03-22 00:00:00 | 0 | 46537 | 2016-04-06 08:16:39 |
49995 | 2016-03-27 14:38:19 | Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon | privat | Angebot | $24,900 | control | limousine | 2011 | automatik | 239 | q5 | 100,000km | 1 | diesel | audi | nein | 2016-03-27 00:00:00 | 0 | 82131 | 2016-04-01 13:47:40 |
49996 | 2016-03-28 10:50:25 | Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... | privat | Angebot | $1,980 | control | cabrio | 1996 | manuell | 75 | astra | 150,000km | 5 | benzin | opel | nein | 2016-03-28 00:00:00 | 0 | 44807 | 2016-04-02 14:18:02 |
49997 | 2016-04-02 14:44:48 | Fiat_500_C_1.2_Dualogic_Lounge | privat | Angebot | $13,200 | test | cabrio | 2014 | automatik | 69 | 500 | 5,000km | 11 | benzin | fiat | nein | 2016-04-02 00:00:00 | 0 | 73430 | 2016-04-04 11:47:27 |
49998 | 2016-03-08 19:25:42 | Audi_A3_2.0_TDI_Sportback_Ambition | privat | Angebot | $22,900 | control | kombi | 2013 | manuell | 150 | a3 | 40,000km | 11 | diesel | audi | nein | 2016-03-08 00:00:00 | 0 | 35683 | 2016-04-05 16:45:07 |
49999 | 2016-03-14 00:42:12 | Opel_Vectra_1.6_16V | privat | Angebot | $1,250 | control | limousine | 1996 | manuell | 101 | vectra | 150,000km | 1 | benzin | opel | nein | 2016-03-13 00:00:00 | 0 | 45897 | 2016-04-06 21:18:48 |
50000 rows × 20 columns
autos.info() # prints out useful information of our dataset
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): dateCrawled 50000 non-null object name 50000 non-null object seller 50000 non-null object offerType 50000 non-null object price 50000 non-null object abtest 50000 non-null object vehicleType 44905 non-null object yearOfRegistration 50000 non-null int64 gearbox 47320 non-null object powerPS 50000 non-null int64 model 47242 non-null object odometer 50000 non-null object monthOfRegistration 50000 non-null int64 fuelType 45518 non-null object brand 50000 non-null object notRepairedDamage 40171 non-null object dateCreated 50000 non-null object nrOfPictures 50000 non-null int64 postalCode 50000 non-null int64 lastSeen 50000 non-null object dtypes: int64(5), object(15) memory usage: 7.6+ MB
To get concrete information about our dataset, such as: no. of rows, no. of cols, datatype of each column, no. of non-empty rows, e.t.c we used the DataFrame.info()
We notice that most of the columns in our dataset are contain string values, the are also columns with missing values and instead of Python's prefered snakecase for column names, the columns names are instead in camelcase
More specifically the columns with missing values are: vehicleType, gearbox, model, fuelType, notRepairDamage
autos.head() # displays the first 5 rows of our dataset
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
Looking at the first 5 elements of our dataset, we notice that the name columns contains "_" where there should be a space. E.g 'Volkswagen_Golf_1.6_United'.
Also our price and odometer columns contains a combination of str and int which could cause some problems when trying to anaylse those column values. E.g $5,000 and 150,000km respectively.
We notice that the column names of our dataset are written in camelcase which is not the appropriate Python recommendation so we need change them to snakecase
autos.columns # displaying the column names of our dataset
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model', 'odometer', 'monthOfRegistration', 'fuelType', 'brand', 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object')
# creating a list of edited column names to be passed
# to our dataset
corrected_columns = ['date_crawled', 'name', 'seller', 'offer_type',
'price', 'abtest', 'vehicle_type', 'registration_year',
'gearbox', 'power_ps', 'model', 'odometer', 'registration_month',
'fuel_type', 'brand', 'unrepaired_damage', 'ad_created',
'nr_of_pictures', 'postal_code', 'last_seen']
# passing our edited column names to our dataset
autos.columns = corrected_columns
autos.head()
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
NB: We changed our column names from camelcase to snakecase because, that Python's recommendations for naming variables, we also notice that we edited some column names E.g yearOfRegistration to registration_year in order to get a more intuitive description of that column when analysing our data
A very vital step, which helps us determine what other cleaning tasks need to be performed on our dataset before starting our analysis, such as:
Pandas uses the DataFrame.describe() to provide statistically useful information about our dataset, also Series.value_counts() to provides the count of unique values in our columns
# we passed the arguement include='all' to get statistical
# information for both numeric and non-numeric columns
autos.describe(include='all')
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-03-21 20:37:19 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
Based upon the dataframe.describe analysis the following columns need further analysis/action:
Firstly I will convert the price and odometer columns to numeric, this will then enable a more thorough check for any erroneous entries.
autos[['price', 'odometer']].head()
price | odometer | |
---|---|---|
0 | $5,000 | 150,000km |
1 | $8,500 | 150,000km |
2 | $8,990 | 70,000km |
3 | $4,350 | 70,000km |
4 | $1,350 | 150,000km |
Based on the outputs gotten above we notice:
1421
rows with the ad price of a car to be $0 which is definitely not correct.With the numeric values of price and odometer stored as strings it would be difficult to perform numerical operations on those columns, therefore we have to convert its values to integers.
# removing non-numeric characters using pandas vectorized
# replace method and converting columns to a numeric dtype
autos['price'] = (autos['price'].str.replace('$', '')
.str.replace(',', '')
.astype(int))
autos['odometer'] = (autos['odometer'].str.replace('km', '')
.str.replace(',', '')
.astype(int))
autos[['price', 'odometer']].head()
price | odometer | |
---|---|---|
0 | 5000 | 150000 |
1 | 8500 | 150000 |
2 | 8990 | 70000 |
3 | 4350 | 70000 |
4 | 1350 | 150000 |
Removing the string values in our some of columns names might lead to loss of useful information which may cause us to misinterprete that column during analysis.
So we rename such columns we feel are necessary E.g odometer to odometer_km which tells us the total distance travelled by a car.
# renaming our odometer column
autos.rename({'odometer': 'odometer_km'}, axis='columns', inplace=True)
autos.head(3)
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | 5000 | control | bus | 2004 | manuell | 158 | andere | 150000 | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | 8500 | control | limousine | 1997 | automatik | 286 | 7er | 150000 | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | 8990 | test | limousine | 2009 | manuell | 102 | golf | 70000 | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
We analyse our price and odometer_km columns specifically looking for data that doesn't look right. Below are the steps we'd take:
Series.unique().shape
- to see how many unique valuesSeries.describe()
- to view min/max/median/mean etcSeries.value_counts()
# unique values in odometer column
autos['odometer_km'].unique().shape
(13,)
The value above shows us we have only 13 rows of distinct values for odometer in our dataset
# displaying statistical info on odometer_km
autos['odometer_km'].describe()
count 50000.000000 mean 125732.700000 std 40042.211706 min 5000.000000 25% 125000.000000 50% 150000.000000 75% 150000.000000 max 150000.000000 Name: odometer_km, dtype: float64
At first glance we can assume there are no outliers. Since the min
distance for a relatively used car is 5000km and max
distance 150,000km
# getting the non-null unique values for odometer
autos['odometer_km'].value_counts().sort_index(ascending=False)
150000 32424 125000 5170 100000 2169 90000 1757 80000 1436 70000 1230 60000 1164 50000 1027 40000 819 30000 789 20000 784 10000 264 5000 967 Name: odometer_km, dtype: int64
The values shown describes there's an upper limit in the data for car distances > 150,000km and lower limit for car distances < 5000km.
We observe that majority(32424) of the cars on sale have been used extensively which reflect in their distance covered 150,000km.
While a most of the other fraction of cars left used moderately which is shown in their distance been between 50,000 - 125,000km.
Only a few cars on sale have been used minimally with their distances below 40,000km
In general, so far there are no outliers for the odometer_km column.
# unique row values for price column
autos['price'].unique().shape
(2357,)
There are 2357 rows with unique values for the price column
# statistical info on price column
autos['price'].describe()
count 5.000000e+04 mean 9.840044e+03 std 4.811044e+05 min 0.000000e+00 25% 1.100000e+03 50% 2.950000e+03 75% 7.200000e+03 max 1.000000e+08 Name: price, dtype: float64
Looking at the output, we observe the max value in price column is 100m and min value 0.
This obviously shows an outlier!
Therefore, we further explore this column to understand the reason for this.
autos['price'].value_counts()
0 1421 500 781 1500 734 2500 643 1000 639 1200 639 600 531 800 498 3500 498 2000 460 999 434 750 433 900 420 650 419 850 410 700 395 4500 394 300 384 2200 382 950 379 1100 376 1300 371 3000 365 550 356 1800 355 5500 340 1250 335 350 335 1600 327 1999 322 ... 46200 1 29600 1 13480 1 21700 1 7373 1 3279 1 4286 1 188 1 17830 1 9130 1 910 1 238 1 2671 1 69900 1 151990 1 2479 1 4510 1 86500 1 47499 1 16998 1 27299 1 41850 1 4780 1 686 1 6495 1 20790 1 8970 1 846 1 2895 1 33980 1 Name: price, Length: 2357, dtype: int64
We notice that we have 1421 rows with the price of 0 which is inaccurate, because the least price for a product on eBay, for sales is $1. So we can remove all the 0 rows.
autos['price'].value_counts().sort_index(ascending=False).head(7)
99999999 1 27322222 1 12345678 3 11111111 2 10000000 1 3890000 1 1300000 1 Name: price, dtype: int64
It would also be fair enough to remove the rows containing prices between 100m to 10m cause logically the seem too extreme.
# remvoing rows 0 < price > 1.0+e8
autos = autos[(autos['price'] >1) & (autos['price'] < 10000000)]
# checking to see if it worked
autos.shape
(48415, 20)
autos['price'].describe()
count 4.841500e+04 mean 6.101839e+03 std 2.282104e+04 min 2.000000e+00 25% 1.250000e+03 50% 3.000000e+03 75% 7.499000e+03 max 3.890000e+06 Name: price, dtype: float64
# checking the max price in our new df
autos['price'].max()
3890000
In summary, we have removed rows where 0 < price > 10m because these values are too extreme.
We could however, go more indept in our analysis by using the 25% and 75% percentile to narrow our price range.
In our dataset we've been provided 5 columns which contain data about date values. The data provided in this columns where created either by the crawler or the website as we would see in the data dictionary below.
date_crawled
: added by crawlerlast_seen
: added by crawlerad_created
: added by websiteregistration_month
: added by websiteregistration_year
: added by websiteThe first 3 columns above, date_crawled, last_seen and ad_created are provided in strings, while the rest are numerical.We would need to convert the string colums to numerical columns in order to properly anaylse the data in them.
# displaying the first 3 date columns
autos[['date_crawled', 'ad_created', 'last_seen']].head()
date_crawled | ad_created | last_seen | |
---|---|---|---|
0 | 2016-03-26 17:47:46 | 2016-03-26 00:00:00 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | 2016-04-04 00:00:00 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | 2016-03-26 00:00:00 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | 2016-03-12 00:00:00 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | 2016-04-01 00:00:00 | 2016-04-01 14:38:50 |
We notice that the values contained in each data point is in "%Y-%m-%d"
format. So we extract the first 10 characters(e.g 2016-03-12) of each data point and anaylse it.
autos['date_crawled'].describe()
count 48415 unique 46745 top 2016-04-02 11:37:04 freq 3 Name: date_crawled, dtype: object
Using our df.describe() we see that there's a total of 48415 rows of date entries for date_crawled, and we also have some rows with null values
# getting the number of unique dates for date_crawled
autos['date_crawled'].str[:10].unique().shape
(34,)
Out of 48415 rows, we see that there are only 34 unique date entries(e.g: 2016-04-05) that was extracted by the crawler, based car sales upload posts excluding the time of the post.(e.g: 2016-04-05 16:40:00)
# determining the all the values for the date_crawled column.
# setting (normalize=True) contains the relative frequencies of
# of the unique values. (dropna=False) including null columns
(autos['date_crawled'].str[:10].value_counts(normalize=True,
dropna=False).sort_index(ascending=True))
2016-03-05 0.025364 2016-03-06 0.014066 2016-03-07 0.036043 2016-03-08 0.033275 2016-03-09 0.033048 2016-03-10 0.032201 2016-03-11 0.032593 2016-03-12 0.036951 2016-03-13 0.015656 2016-03-14 0.036621 2016-03-15 0.034266 2016-03-16 0.029516 2016-03-17 0.031643 2016-03-18 0.012909 2016-03-19 0.034741 2016-03-20 0.037798 2016-03-21 0.037323 2016-03-22 0.032944 2016-03-23 0.032263 2016-03-24 0.029392 2016-03-25 0.031560 2016-03-26 0.032242 2016-03-27 0.031127 2016-03-28 0.034845 2016-03-29 0.034142 2016-03-30 0.033729 2016-03-31 0.031808 2016-04-01 0.033729 2016-04-02 0.035485 2016-04-03 0.038583 2016-04-04 0.036518 2016-04-05 0.013074 2016-04-06 0.003160 2016-04-07 0.001384 Name: date_crawled, dtype: float64
The dates displayed show, the data gotten from the uploaded posts for car sales between March 2016 - April 2016 on eBay site by the crawler.
There's a relatively even distribution of uploads for car sales made between March 2016 - April 2016
The highest number of car sales post extracted was on 3rd April, 2016, while the least number extracted was 7th April, 2016.
autos['ad_created'].describe()
count 48415 unique 76 top 2016-04-03 00:00:00 freq 1880 Name: ad_created, dtype: object
Using our df.describe() we see that there's a total of 48415 rows of date entries for ad_created, and we also have some rows with null values
# getting the number of unique dates for ad_created
autos['ad_created'].str[:10].unique().shape
(76,)
Out of the 48415 rows of date entries for ad_created, there are only 76 unique date entries(e.g: 2016-04-05) extracted by the crawler, based on the ad_created uploads for car sales excluding the time of the post. (e.g: 2016-04-05 16:40:00)
(autos['ad_created'].str[:10].value_counts(normalize=True, dropna=False)
.sort_index())
2015-06-11 0.000021 2015-08-10 0.000021 2015-09-09 0.000021 2015-11-10 0.000021 2015-12-05 0.000021 2015-12-30 0.000021 2016-01-03 0.000021 2016-01-07 0.000021 2016-01-10 0.000041 2016-01-13 0.000021 2016-01-14 0.000021 2016-01-16 0.000021 2016-01-22 0.000021 2016-01-27 0.000062 2016-01-29 0.000021 2016-02-01 0.000021 2016-02-02 0.000041 2016-02-05 0.000041 2016-02-07 0.000021 2016-02-08 0.000021 2016-02-09 0.000021 2016-02-11 0.000021 2016-02-12 0.000041 2016-02-14 0.000041 2016-02-16 0.000021 2016-02-17 0.000021 2016-02-18 0.000041 2016-02-19 0.000062 2016-02-20 0.000041 2016-02-21 0.000062 ... 2016-03-09 0.033110 2016-03-10 0.031912 2016-03-11 0.032924 2016-03-12 0.036765 2016-03-13 0.017020 2016-03-14 0.035258 2016-03-15 0.033998 2016-03-16 0.030032 2016-03-17 0.031292 2016-03-18 0.013591 2016-03-19 0.033626 2016-03-20 0.037860 2016-03-21 0.037550 2016-03-22 0.032758 2016-03-23 0.032097 2016-03-24 0.029330 2016-03-25 0.031684 2016-03-26 0.032304 2016-03-27 0.031044 2016-03-28 0.034969 2016-03-29 0.034080 2016-03-30 0.033543 2016-03-31 0.031850 2016-04-01 0.033709 2016-04-02 0.035175 2016-04-03 0.038831 2016-04-04 0.036889 2016-04-05 0.011794 2016-04-06 0.003243 2016-04-07 0.001239 Name: ad_created, Length: 76, dtype: float64
The cell above shows the data extracted by the crawler for uploaded car sales from 11th June, 2015 - 7th April, 2016 on eBay.
We notice that in the there's an inconsistency in the extraction process by the crawler in 2015, which also reflects in the amount of data collected. i.e Data was only collected for the months June, August, September, November and December.
However, in 2016 we notice an improvement in the extraction of data for uploaded car sales from March, with the largest extracted data on **3rd April, 2016
autos['last_seen'].describe()
count 48415 unique 38382 top 2016-04-07 06:17:27 freq 8 Name: last_seen, dtype: object
Using our df.describe() we see that there's a total of 48415 rows of date entries for last_seen, and we also have some rows with null values
autos['last_seen'].str[:10].unique().shape
(34,)
Out of 48415 rows, we see that there are only 34 unique date entries(e.g: 2016-04-05) that was extracted by the crawler, based on the last_seen for car sales upload posts excluding the time of the post.(e.g: 2016-04-05 16:40:00)
(autos['last_seen'].str[:10].value_counts(normalize=True,
dropna=False).sort_index(ascending=True))
2016-03-05 0.001074 2016-03-06 0.004337 2016-03-07 0.005412 2016-03-08 0.007374 2016-03-09 0.009625 2016-03-10 0.010617 2016-03-11 0.012372 2016-03-12 0.023794 2016-03-13 0.008861 2016-03-14 0.012620 2016-03-15 0.015863 2016-03-16 0.016441 2016-03-17 0.028070 2016-03-18 0.007332 2016-03-19 0.015822 2016-03-20 0.020634 2016-03-21 0.020613 2016-03-22 0.021378 2016-03-23 0.018589 2016-03-24 0.019746 2016-03-25 0.019188 2016-03-26 0.016813 2016-03-27 0.015615 2016-03-28 0.020903 2016-03-29 0.022348 2016-03-30 0.024744 2016-03-31 0.023836 2016-04-01 0.022865 2016-04-02 0.024868 2016-04-03 0.025199 2016-04-04 0.024497 2016-04-05 0.124961 2016-04-06 0.221584 2016-04-07 0.132005 Name: last_seen, dtype: float64
The cell above shows the data extracted by the crawler for uploaded car sales from 5th March, 2016 - 7th April, 2016 on eBay.
We notice that in the first few days of March, the data extracted by the crawler was quite low.
However, in April we notice an improvement in the last days of extraction of data for uploaded car sales, with the largest extracted data on 6th April, 2016
# displaying the contents of registration col
autos['registration_year'].head()
0 2004 1 1997 2 2009 3 2007 4 2003 Name: registration_year, dtype: int64
autos['registration_year'].describe()
count 48415.000000 mean 2004.773149 std 88.777913 min 1000.000000 25% 1999.000000 50% 2004.000000 75% 2008.000000 max 9999.000000 Name: registration_year, dtype: float64
The output above shows that, there are 48415 date entries extracted by the crawler. We can also notice an outlier in date values.
The max date value 9999(year) seems to be an invalid entry by the user on the site, which we would go further to analyse.
The min date value is 1000(year) which is also invalid.
autos['registration_year'].unique().shape
(95,)
Out of the 48415 date entries, there are only 95 distinct year entries
(autos['registration_year'].value_counts(normalize=True, dropna=False)
.sort_index(ascending=False)
.head(15))
9999 0.000062 9000 0.000021 8888 0.000021 6200 0.000021 5911 0.000021 5000 0.000083 4800 0.000021 4500 0.000021 4100 0.000021 2800 0.000021 2019 0.000041 2018 0.009708 2017 0.028627 2016 0.024972 2015 0.008076 Name: registration_year, dtype: float64
This shows, within the date entries for registration_year we have some invalid registration_year entries for car, created by users on the webiste.
A car with a registration year > 2020 is wrong because a car can't be listed before its creation.
The highest acceptable year is thus 2020.
We then look to examine the lower bound limit for registration_year
(autos['registration_year'].value_counts(normalize=True, dropna=False)
.sort_index()
.head(15))
1000 0.000021 1001 0.000021 1111 0.000021 1800 0.000041 1910 0.000083 1927 0.000021 1929 0.000021 1931 0.000021 1934 0.000041 1937 0.000083 1938 0.000021 1939 0.000021 1941 0.000041 1943 0.000021 1948 0.000021 Name: registration_year, dtype: float64
The lowest acceptable year is in the early 1900s (after cars were invented). The value I will choose is 1930, as this is when automobiles started to become more rampant.
Therefore, only the years between 1930 to atmost 2020 would be considered.
# keeping rows 1930 < registration_year < 2020.
# We do this by creating a new dataset for our autos df
print('Prior to the removal of outliers the rows in the dataFrame was {}'.format(autos['registration_year'].shape[0]))
autos = autos[(autos['registration_year'] > 1930) & (autos['registration_year'] < 2020)]
Prior to the removal of outliers the rows in the dataFrame was 48415
print('The updated dataFrame shape is now {}'.format(autos.shape))
The updated dataFrame shape is now (48389, 20)
# Calculating the distribution of the remaining values
autos['registration_year'].value_counts(normalize=True).head(10)
2000 0.064643 2005 0.060489 1999 0.059662 2004 0.055798 2003 0.055736 2006 0.055198 2001 0.054393 2002 0.051251 1998 0.048606 2007 0.046994 Name: registration_year, dtype: float64
From our analysis, we noticed that the bulk of cars have registrations between 1995-2017
In this section we'll look at the average price per car brand to determine if certain brands attract higher prices. The process of doing this will be:
# gives us a summary of the information on brand column
autos['brand'].describe()
count 48389 unique 40 top volkswagen freq 10305 Name: brand, dtype: object
We see we have 48389 entries, 40 unique entry values with the volkswagen brand been the highest
# displaying the non-null value counts for each unique value
# NB: I didn't use the .sort_index() cause it sorts only the
# the index values and not the actual values
autos['brand'].value_counts(normalize=True, dropna=False)
volkswagen 0.212962 bmw 0.108558 opel 0.108392 mercedes_benz 0.095766 audi 0.085867 ford 0.069747 renault 0.047862 peugeot 0.029470 fiat 0.026039 seat 0.018930 skoda 0.016078 nissan 0.015293 mazda 0.015272 smart 0.014342 citroen 0.014115 toyota 0.012627 hyundai 0.009961 sonstige_autos 0.009527 volvo 0.009031 mini 0.008638 mitsubishi 0.008142 honda 0.007998 kia 0.007130 alfa_romeo 0.006613 suzuki 0.005890 porsche 0.005890 chevrolet 0.005662 chrysler 0.003493 dacia 0.002666 daihatsu 0.002521 jeep 0.002211 subaru 0.002087 land_rover 0.002046 saab 0.001633 daewoo 0.001571 jaguar 0.001529 trabant 0.001364 rover 0.001343 lancia 0.001137 lada 0.000599 Name: brand, dtype: float64
I have decided to choose all brands with >.01% frequency.
This goes up to Toyota [15th on index].
# Assigning a brands with freq > .01 to a list
brand_counts = autos['brand'].value_counts(normalize=True, dropna=False)
brands = brand_counts[brand_counts > .01].index
print(brands)
Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford', 'renault', 'peugeot', 'fiat', 'seat', 'skoda', 'nissan', 'mazda', 'smart', 'citroen', 'toyota'], dtype='object')
# Calculating the mean price of our brands
# Empty dictionary for aggregate data
brand_price = {}
for b in brands:
brand = autos.loc[autos['brand'] == b, 'price'].mean()
# formating the display of brand. ie removing the decimals
brand = '{:,.0f}'.format(brand)
brand_price[b] = brand
brand_price
{'audi': '9,242', 'bmw': '8,527', 'citroen': '3,767', 'fiat': '2,798', 'ford': '4,028', 'mazda': '4,059', 'mercedes_benz': '8,556', 'nissan': '4,676', 'opel': '2,958', 'peugeot': '3,074', 'renault': '2,440', 'seat': '4,334', 'skoda': '6,370', 'smart': '3,518', 'toyota': '5,148', 'volkswagen': '5,541'}
# sorting the dictionary by values
brand_price_tuples = sorted(brand_price.items(), reverse=True, key=lambda x:x[1])
# printing out our sorted prices in a dictionary format
brand_price_tuples
for element in brand_price_tuples:
print(element[0], ":" ,element[1])
audi : 9,242 mercedes_benz : 8,556 bmw : 8,527 skoda : 6,370 volkswagen : 5,541 toyota : 5,148 nissan : 4,676 seat : 4,334 mazda : 4,059 ford : 4,028 citroen : 3,767 smart : 3,518 peugeot : 3,074 opel : 2,958 fiat : 2,798 renault : 2,440
Our top brands of expensive cars are: Audi, Mercedes_benz, BMW e.t.c. This is unsurprising as these brands are luxury cars.
The least expensive brands been Renault and Fiat, is also not surprising as the are cars made affordable for the middle class and the lower class.
While brands like Toyota, Volkswagen e.t.c are in between, cause the consists of both luxurious and affordable cars
To understand what drives the differences in price across brands it is important to also look at other features that could affect the price, such as: mileage, age etc.
For large amounts of data or multiple columns it is easier to display this data in a new dataframe, using the dataframe constructor.
# Calculating the mean mileage of our brands.
# Our selected brand was stored in the brands variable.
# stores each brand and its average mileage
brand_mileage = {}
for b in brands:
mileage = autos.loc[autos['brand'] == b, 'odometer_km'].mean()
# formating the output of mileage, therefore removing decimal points
mileage = '{:,.0f}'.format(mileage)
brand_mileage[b] = mileage
brand_mileage
{'audi': '129,568', 'bmw': '132,693', 'citroen': '120,073', 'fiat': '117,516', 'ford': '124,332', 'mazda': '124,871', 'mercedes_benz': '130,899', 'nissan': '118,750', 'opel': '129,462', 'peugeot': '127,311', 'renault': '128,184', 'seat': '121,676', 'skoda': '110,893', 'smart': '100,512', 'toyota': '116,219', 'volkswagen': '128,962'}
Now to convert both dictionaries into series objects using series constructor, and then add both series objects to a dataframe (with named column headings).
# converting brand mean price to a series object, which is
# done by passing in the dictionary we created for mean price
# of each brand above
bmp_series = pd.Series(brand_price)
bmp_series
audi 9,242 bmw 8,527 citroen 3,767 fiat 2,798 ford 4,028 mazda 4,059 mercedes_benz 8,556 nissan 4,676 opel 2,958 peugeot 3,074 renault 2,440 seat 4,334 skoda 6,370 smart 3,518 toyota 5,148 volkswagen 5,541 dtype: object
# converting average mileage of each brand to a series
amp_series = pd.Series(brand_mileage)
amp_series
audi 129,568 bmw 132,693 citroen 120,073 fiat 117,516 ford 124,332 mazda 124,871 mercedes_benz 130,899 nissan 118,750 opel 129,462 peugeot 127,311 renault 128,184 seat 121,676 skoda 110,893 smart 100,512 toyota 116,219 volkswagen 128,962 dtype: object
# Now turn them both into dataframes.
# NB. The dictionary keys became the index in the Series obj.
# This index can be used for the dataframe.
# NB: columns=['mean_price'] defines the name of our columns
# if omitted the column would be named '0'
# creating a dataframe out of our bmp_series
auto_brands = pd.DataFrame(bmp_series, columns=['mean_price'])
auto_brands
mean_price | |
---|---|
audi | 9,242 |
bmw | 8,527 |
citroen | 3,767 |
fiat | 2,798 |
ford | 4,028 |
mazda | 4,059 |
mercedes_benz | 8,556 |
nissan | 4,676 |
opel | 2,958 |
peugeot | 3,074 |
renault | 2,440 |
seat | 4,334 |
skoda | 6,370 |
smart | 3,518 |
toyota | 5,148 |
volkswagen | 5,541 |
# Adding the average mileage series to the auto_brand dataframe.
# NB: Don't use a datFrame constructor! Only use that to create
# the df object
auto_brands['mean_mileage'] = amp_series
auto_brands
mean_price | mean_mileage | |
---|---|---|
audi | 9,242 | 129,568 |
bmw | 8,527 | 132,693 |
citroen | 3,767 | 120,073 |
fiat | 2,798 | 117,516 |
ford | 4,028 | 124,332 |
mazda | 4,059 | 124,871 |
mercedes_benz | 8,556 | 130,899 |
nissan | 4,676 | 118,750 |
opel | 2,958 | 129,462 |
peugeot | 3,074 | 127,311 |
renault | 2,440 | 128,184 |
seat | 4,334 | 121,676 |
skoda | 6,370 | 110,893 |
smart | 3,518 | 100,512 |
toyota | 5,148 | 116,219 |
volkswagen | 5,541 | 128,962 |
# sorting the values by mean_price to make it easier to anaylse
auto_brands.sort_values(by='mean_price', ascending=False)
mean_price | mean_mileage | |
---|---|---|
audi | 9,242 | 129,568 |
mercedes_benz | 8,556 | 130,899 |
bmw | 8,527 | 132,693 |
skoda | 6,370 | 110,893 |
volkswagen | 5,541 | 128,962 |
toyota | 5,148 | 116,219 |
nissan | 4,676 | 118,750 |
seat | 4,334 | 121,676 |
mazda | 4,059 | 124,871 |
ford | 4,028 | 124,332 |
citroen | 3,767 | 120,073 |
smart | 3,518 | 100,512 |
peugeot | 3,074 | 127,311 |
opel | 2,958 | 129,462 |
fiat | 2,798 | 117,516 |
renault | 2,440 | 128,184 |
Based on the values, they don't seem to be any obvious connection between mean_price and mean_mileage. Most brands have an average mileage between 110-130k.
However, we notice that the top 3 brands(Audi, Mercedes_benz and BMW) have the highest mileage - this may be because they are well built cars and so can last long. So buyers may know this and be willing to pay a lot for any of those brands, even with the high mileage.
Data cleaning next steps:
Analysis next steps:
autos.head(50)
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | 5000 | control | bus | 2004 | manuell | 158 | andere | 150000 | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | 8500 | control | limousine | 1997 | automatik | 286 | 7er | 150000 | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | 8990 | test | limousine | 2009 | manuell | 102 | golf | 70000 | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | 4350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70000 | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | 1350 | test | kombi | 2003 | manuell | 0 | focus | 150000 | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
5 | 2016-03-21 13:47:45 | Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto... | privat | Angebot | 7900 | test | bus | 2006 | automatik | 150 | voyager | 150000 | 4 | diesel | chrysler | NaN | 2016-03-21 00:00:00 | 0 | 22962 | 2016-04-06 09:45:21 |
6 | 2016-03-20 17:55:21 | VW_Golf_III_GT_Special_Electronic_Green_Metall... | privat | Angebot | 300 | test | limousine | 1995 | manuell | 90 | golf | 150000 | 8 | benzin | volkswagen | NaN | 2016-03-20 00:00:00 | 0 | 31535 | 2016-03-23 02:48:59 |
7 | 2016-03-16 18:55:19 | Golf_IV_1.9_TDI_90PS | privat | Angebot | 1990 | control | limousine | 1998 | manuell | 90 | golf | 150000 | 12 | diesel | volkswagen | nein | 2016-03-16 00:00:00 | 0 | 53474 | 2016-04-07 03:17:32 |
8 | 2016-03-22 16:51:34 | Seat_Arosa | privat | Angebot | 250 | test | NaN | 2000 | manuell | 0 | arosa | 150000 | 10 | NaN | seat | nein | 2016-03-22 00:00:00 | 0 | 7426 | 2016-03-26 18:18:10 |
9 | 2016-03-16 13:47:02 | Renault_Megane_Scenic_1.6e_RT_Klimaanlage | privat | Angebot | 590 | control | bus | 1997 | manuell | 90 | megane | 150000 | 7 | benzin | renault | nein | 2016-03-16 00:00:00 | 0 | 15749 | 2016-04-06 10:46:35 |
10 | 2016-03-15 01:41:36 | VW_Golf_Tuning_in_siber/grau | privat | Angebot | 999 | test | NaN | 2017 | manuell | 90 | NaN | 150000 | 4 | benzin | volkswagen | nein | 2016-03-14 00:00:00 | 0 | 86157 | 2016-04-07 03:16:21 |
11 | 2016-03-16 18:45:34 | Mercedes_A140_Motorschaden | privat | Angebot | 350 | control | NaN | 2000 | NaN | 0 | NaN | 150000 | 0 | benzin | mercedes_benz | NaN | 2016-03-16 00:00:00 | 0 | 17498 | 2016-03-16 18:45:34 |
12 | 2016-03-31 19:48:22 | Smart_smart_fortwo_coupe_softouch_pure_MHD_Pan... | privat | Angebot | 5299 | control | kleinwagen | 2010 | automatik | 71 | fortwo | 50000 | 9 | benzin | smart | nein | 2016-03-31 00:00:00 | 0 | 34590 | 2016-04-06 14:17:52 |
13 | 2016-03-23 10:48:32 | Audi_A3_1.6_tuning | privat | Angebot | 1350 | control | limousine | 1999 | manuell | 101 | a3 | 150000 | 11 | benzin | audi | nein | 2016-03-23 00:00:00 | 0 | 12043 | 2016-04-01 14:17:13 |
14 | 2016-03-23 11:50:46 | Renault_Clio_3__Dynamique_1.2__16_V;_viele_Ver... | privat | Angebot | 3999 | test | kleinwagen | 2007 | manuell | 75 | clio | 150000 | 9 | benzin | renault | NaN | 2016-03-23 00:00:00 | 0 | 81737 | 2016-04-01 15:46:47 |
15 | 2016-04-01 12:06:20 | Corvette_C3_Coupe_T_Top_Crossfire_Injection | privat | Angebot | 18900 | test | coupe | 1982 | automatik | 203 | NaN | 80000 | 6 | benzin | sonstige_autos | nein | 2016-04-01 00:00:00 | 0 | 61276 | 2016-04-02 21:10:48 |
16 | 2016-03-16 14:59:02 | Opel_Vectra_B_Kombi | privat | Angebot | 350 | test | kombi | 1999 | manuell | 101 | vectra | 150000 | 5 | benzin | opel | nein | 2016-03-16 00:00:00 | 0 | 57299 | 2016-03-18 05:29:37 |
17 | 2016-03-29 11:46:22 | Volkswagen_Scirocco_2_G60 | privat | Angebot | 5500 | test | coupe | 1990 | manuell | 205 | scirocco | 150000 | 6 | benzin | volkswagen | nein | 2016-03-29 00:00:00 | 0 | 74821 | 2016-04-05 20:46:26 |
18 | 2016-03-26 19:57:44 | Verkaufen_mein_bmw_e36_320_i_touring | privat | Angebot | 300 | control | bus | 1995 | manuell | 150 | 3er | 150000 | 0 | benzin | bmw | NaN | 2016-03-26 00:00:00 | 0 | 54329 | 2016-04-02 12:16:41 |
19 | 2016-03-17 13:36:21 | mazda_tribute_2.0_mit_gas_und_tuev_neu_2018 | privat | Angebot | 4150 | control | suv | 2004 | manuell | 124 | andere | 150000 | 2 | lpg | mazda | nein | 2016-03-17 00:00:00 | 0 | 40878 | 2016-03-17 14:45:58 |
20 | 2016-03-05 19:57:31 | Audi_A4_Avant_1.9_TDI_*6_Gang*AHK*Klimatronik*... | privat | Angebot | 3500 | test | kombi | 2003 | manuell | 131 | a4 | 150000 | 5 | diesel | audi | NaN | 2016-03-05 00:00:00 | 0 | 53913 | 2016-03-07 05:46:46 |
21 | 2016-03-06 19:07:10 | Porsche_911_Carrera_4S_Cabrio | privat | Angebot | 41500 | test | cabrio | 2004 | manuell | 320 | 911 | 150000 | 4 | benzin | porsche | nein | 2016-03-06 00:00:00 | 0 | 65428 | 2016-04-05 23:46:19 |
22 | 2016-03-28 20:50:54 | MINI_Cooper_S_Cabrio | privat | Angebot | 25450 | control | cabrio | 2015 | manuell | 184 | cooper | 10000 | 1 | benzin | mini | nein | 2016-03-28 00:00:00 | 0 | 44789 | 2016-04-01 06:45:30 |
23 | 2016-03-10 19:55:34 | Peugeot_Boxer_2_2_HDi_120_Ps_9_Sitzer_inkl_Klima | privat | Angebot | 7999 | control | bus | 2010 | manuell | 120 | NaN | 150000 | 2 | diesel | peugeot | nein | 2016-03-10 00:00:00 | 0 | 30900 | 2016-03-17 08:45:17 |
24 | 2016-04-03 11:57:02 | BMW_535i_xDrive_Sport_Aut. | privat | Angebot | 48500 | control | limousine | 2014 | automatik | 306 | 5er | 30000 | 12 | benzin | bmw | nein | 2016-04-03 00:00:00 | 0 | 22547 | 2016-04-07 13:16:50 |
25 | 2016-03-21 21:56:18 | Ford_escort_kombi_an_bastler_mit_ghia_ausstattung | privat | Angebot | 90 | control | kombi | 1996 | manuell | 116 | NaN | 150000 | 4 | benzin | ford | ja | 2016-03-21 00:00:00 | 0 | 27574 | 2016-04-01 05:16:49 |
26 | 2016-04-03 22:46:28 | Volkswagen_Polo_Fox | privat | Angebot | 777 | control | kleinwagen | 1992 | manuell | 54 | polo | 125000 | 2 | benzin | volkswagen | nein | 2016-04-03 00:00:00 | 0 | 38110 | 2016-04-05 23:46:48 |
28 | 2016-03-19 21:56:19 | MINI_Cooper_D | privat | Angebot | 5250 | control | kleinwagen | 2007 | manuell | 110 | cooper | 150000 | 7 | diesel | mini | ja | 2016-03-19 00:00:00 | 0 | 15745 | 2016-04-07 14:58:48 |
29 | 2016-04-02 12:45:44 | Mercedes_Benz_E_320_T_CDI_Avantgarde_DPF7_Sitz... | privat | Angebot | 4999 | test | kombi | 2004 | automatik | 204 | e_klasse | 150000 | 10 | diesel | mercedes_benz | nein | 2016-04-02 00:00:00 | 0 | 47638 | 2016-04-02 12:45:44 |
30 | 2016-03-14 11:47:31 | Peugeot_206_Unfallfahrzeug | privat | Angebot | 80 | test | kleinwagen | 2002 | manuell | 60 | 2_reihe | 150000 | 6 | benzin | peugeot | ja | 2016-03-14 00:00:00 | 0 | 57076 | 2016-03-14 11:47:31 |
31 | 2016-03-14 16:53:09 | Noch_gut_erhaltenen_C_320 | privat | Angebot | 2850 | test | kombi | 2002 | automatik | 218 | c_klasse | 150000 | 7 | benzin | mercedes_benz | nein | 2016-03-14 00:00:00 | 0 | 41065 | 2016-03-16 07:19:04 |
32 | 2016-03-20 05:03:03 | Corsa_mit_TÜV_5.2016 | privat | Angebot | 350 | control | kleinwagen | 1999 | manuell | 0 | corsa | 150000 | 7 | benzin | opel | NaN | 2016-03-20 00:00:00 | 0 | 27619 | 2016-04-06 03:15:20 |
33 | 2016-03-15 12:25:26 | Audi_A3_2.0_TDI_Sportback_DPF_Kupplung_defekt_... | privat | Angebot | 4800 | test | kombi | 2006 | manuell | 140 | a3 | 150000 | 8 | diesel | audi | nein | 2016-03-15 00:00:00 | 0 | 61169 | 2016-04-07 13:15:31 |
34 | 2016-04-05 07:55:07 | Bmw_530d__navi_tuv03/18 | privat | Angebot | 2600 | test | kombi | 2000 | automatik | 193 | 5er | 150000 | 6 | diesel | bmw | NaN | 2016-04-05 00:00:00 | 0 | 95111 | 2016-04-07 12:15:47 |
35 | 2016-03-29 21:52:56 | Ford_Mondeo | privat | Angebot | 1200 | test | kombi | 1998 | manuell | 0 | mondeo | 150000 | 12 | benzin | ford | nein | 2016-03-29 00:00:00 | 0 | 52428 | 2016-04-06 08:45:35 |
36 | 2016-03-16 23:46:31 | Ford_Mondeo_2.0_Turnier_TDCi_DPF_Titanium | privat | Angebot | 7500 | test | kombi | 2007 | manuell | 131 | mondeo | 150000 | 11 | diesel | ford | nein | 2016-03-16 00:00:00 | 0 | 34246 | 2016-04-05 16:21:16 |
37 | 2016-03-23 16:47:46 | Seat_Altea_XL_2.0_TDI_CR_DPF_Style | privat | Angebot | 13800 | test | bus | 2012 | manuell | 140 | altea | 90000 | 2 | diesel | seat | nein | 2016-03-23 00:00:00 | 0 | 85113 | 2016-04-05 20:17:53 |
38 | 2016-03-21 15:51:10 | Volkswagen_Golf_1.4_Special | privat | Angebot | 2850 | control | limousine | 2002 | manuell | 75 | golf | 125000 | 2 | benzin | volkswagen | nein | 2016-03-21 00:00:00 | 0 | 63674 | 2016-03-28 12:16:06 |
39 | 2016-03-11 13:58:38 | Mercedes_C_180_Kombi__zu_verkaufen | privat | Angebot | 1450 | test | kombi | 1999 | automatik | 122 | c_klasse | 150000 | 6 | benzin | mercedes_benz | nein | 2016-03-11 00:00:00 | 0 | 12683 | 2016-04-06 15:17:52 |
40 | 2016-03-07 14:50:03 | VW_Golf__4_Cabrio_2.0_GTI_16V___Leder_MFA_Alus... | privat | Angebot | 3500 | control | cabrio | 1999 | manuell | 150 | golf | 150000 | 1 | benzin | volkswagen | nein | 2016-03-07 00:00:00 | 0 | 6780 | 2016-03-12 02:15:52 |
41 | 2016-03-10 10:46:08 | Passat_3b_1.9_TDI_Highline__angemeldet_mit_tuv... | privat | Angebot | 3200 | test | kombi | 2003 | manuell | 131 | NaN | 150000 | 7 | NaN | volkswagen | nein | 2016-03-10 00:00:00 | 0 | 28259 | 2016-04-06 20:19:08 |
42 | 2016-03-22 21:37:03 | Vw_Polo_l.0_/60_PS_Blue_Motion_Technologie_Son... | privat | Angebot | 11900 | control | kleinwagen | 2014 | manuell | 60 | polo | 20000 | 7 | benzin | volkswagen | NaN | 2016-03-22 00:00:00 | 0 | 26629 | 2016-04-06 18:44:39 |
43 | 2016-03-08 16:51:22 | Audi_A1_1.2_TFSI_Attraction_mit_Garantie | privat | Angebot | 11500 | test | kleinwagen | 2011 | manuell | 86 | a1 | 60000 | 9 | benzin | audi | nein | 2016-03-08 00:00:00 | 0 | 85049 | 2016-04-07 03:16:05 |
44 | 2016-03-12 10:38:06 | Zu_verkaufen | privat | Angebot | 900 | control | kleinwagen | 1998 | manuell | 43 | twingo | 150000 | 7 | benzin | renault | nein | 2016-03-12 00:00:00 | 0 | 79725 | 2016-04-06 01:17:27 |
45 | 2016-03-12 17:39:55 | Toyota_Starlet_1.3 | privat | Angebot | 1000 | test | kleinwagen | 1995 | manuell | 75 | andere | 150000 | 3 | benzin | toyota | nein | 2016-03-12 00:00:00 | 0 | 74597 | 2016-04-07 08:45:19 |
46 | 2016-03-31 10:53:28 | BMW_mit__Lpg | privat | Angebot | 9000 | control | NaN | 2005 | automatik | 0 | NaN | 150000 | 0 | NaN | bmw | NaN | 2016-03-31 00:00:00 | 0 | 12351 | 2016-04-06 03:44:41 |
47 | 2016-03-28 19:51:51 | Mercedes_Benz_A_150 | privat | Angebot | 3499 | control | limousine | 2008 | manuell | 95 | a_klasse | 150000 | 8 | benzin | mercedes_benz | nein | 2016-03-28 00:00:00 | 0 | 59557 | 2016-04-01 05:17:51 |
48 | 2016-03-20 12:41:50 | Mercedes_C180_Sport_1_8l | privat | Angebot | 1099 | test | limousine | 1998 | manuell | 122 | c_klasse | 150000 | 2 | benzin | mercedes_benz | nein | 2016-03-20 00:00:00 | 0 | 83024 | 2016-03-24 03:17:53 |
49 | 2016-03-27 15:36:42 | Mercedes_Benz_CL_500__BlueEFFICIENCY__7G_TRONIC | privat | Angebot | 56000 | control | coupe | 2012 | automatik | 435 | cl | 60000 | 2 | benzin | mercedes_benz | nein | 2016-03-27 00:00:00 | 0 | 34474 | 2016-04-05 11:47:52 |
50 | 2016-03-09 16:48:05 | ??_????????_??Top!!!_Peugeot_308_sw_Sport_?... | privat | Angebot | 5999 | test | kombi | 2008 | manuell | 150 | 3_reihe | 125000 | 8 | benzin | peugeot | nein | 2016-03-09 00:00:00 | 0 | 55569 | 2016-04-07 06:17:09 |
#Inspecting the values in the seller column
autos["seller"].value_counts()
privat 48388 gewerblich 1 Name: seller, dtype: int64
"""There's only one value: private seller. So this column
can be removed"""
#Now to check the offer_type column
autos["offer_type"].value_counts()
Angebot 48389 Name: offer_type, dtype: int64
"""Same result for the offer-type column, so it can be
dropped too."""
autos = autos.drop(['seller','offer_type'], axis=1)
print(autos.shape)
(48389, 18)
autos.describe(include='all')
date_crawled | name | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 48389 | 48389 | 4.838900e+04 | 48389 | 43894 | 48389.000000 | 46106 | 48389.000000 | 45964 | 48389.000000 | 48389.000000 | 44434 | 48389 | 39404 | 48389 | 48389.0 | 48389.000000 | 48389 |
unique | 46721 | 37308 | NaN | 2 | 8 | NaN | 2 | NaN | 245 | NaN | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 38362 |
top | 2016-03-12 16:06:22 | Ford_Fiesta | NaN | test | limousine | NaN | manuell | NaN | golf | NaN | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 76 | NaN | 24922 | 12573 | NaN | 36009 | NaN | 3888 | NaN | NaN | 29283 | 10305 | 34735 | 1879 | NaN | NaN | 8 |
mean | NaN | NaN | 6.101576e+03 | NaN | NaN | 2003.486082 | NaN | 117.332886 | NaN | 125845.543409 | 5.792081 | NaN | NaN | NaN | NaN | 0.0 | 50990.788981 | NaN |
std | NaN | NaN | 2.282611e+04 | NaN | NaN | 7.479853 | NaN | 200.668182 | NaN | 39662.381173 | 3.681683 | NaN | NaN | NaN | NaN | 0.0 | 25747.322845 | NaN |
min | NaN | NaN | 2.000000e+00 | NaN | NaN | 1931.000000 | NaN | 0.000000 | NaN | 5000.000000 | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | 1.250000e+03 | NaN | NaN | 1999.000000 | NaN | 72.000000 | NaN | 125000.000000 | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30659.000000 | NaN |
50% | NaN | NaN | 3.000000e+03 | NaN | NaN | 2004.000000 | NaN | 107.000000 | NaN | 150000.000000 | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49716.000000 | NaN |
75% | NaN | NaN | 7.498000e+03 | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | 150000.000000 | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71679.000000 | NaN |
max | NaN | NaN | 3.890000e+06 | NaN | NaN | 2019.000000 | NaN | 17700.000000 | NaN | 150000.000000 | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
gearbox and fuel_type both have German words to translate
autos['gearbox'].value_counts()
manuell 36009 automatik 10097 Name: gearbox, dtype: int64
The gearbox column contains German words(manuell and automatik) which we can translate to English words and insert them back to our dataset.
# converting gearbox German words to English
# Create a dictionary with the German words as the key and
# the English words as the values
gearbox_map = {'manuell': 'manual',
'automatik': 'automatic'}
The pandas method Series.map() is used to re-assign column values in our dataset.
# inserting the translated words back to our dataset.
autos['gearbox'] = autos['gearbox'].map(gearbox_map)
autos['gearbox'].value_counts()
manual 36009 automatic 10097 Name: gearbox, dtype: int64
autos['fuel_type'].value_counts(dropna=False)
benzin 29283 diesel 14339 NaN 3955 lpg 667 cng 73 hybrid 37 elektro 18 andere 17 Name: fuel_type, dtype: int64
Above are the German words that need translations including the Null values
#Inspecting the null values
autos[autos["fuel_type"].isnull()].head(10)
#The Null values seem to be missing data. For now they
#can be kept in as their ommission is not critical for
#this analysis
date_crawled | name | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
8 | 2016-03-22 16:51:34 | Seat_Arosa | 250 | test | NaN | 2000 | manual | 0 | arosa | 150000 | 10 | NaN | seat | nein | 2016-03-22 00:00:00 | 0 | 7426 | 2016-03-26 18:18:10 |
41 | 2016-03-10 10:46:08 | Passat_3b_1.9_TDI_Highline__angemeldet_mit_tuv... | 3200 | test | kombi | 2003 | manual | 131 | NaN | 150000 | 7 | NaN | volkswagen | nein | 2016-03-10 00:00:00 | 0 | 28259 | 2016-04-06 20:19:08 |
46 | 2016-03-31 10:53:28 | BMW_mit__Lpg | 9000 | control | NaN | 2005 | automatic | 0 | NaN | 150000 | 0 | NaN | bmw | NaN | 2016-03-31 00:00:00 | 0 | 12351 | 2016-04-06 03:44:41 |
64 | 2016-04-05 07:36:19 | Autotransport__Abschlepp_Schlepper | 40 | test | NaN | 2011 | NaN | 0 | 5er | 150000 | 5 | NaN | bmw | NaN | 2016-04-05 00:00:00 | 0 | 40591 | 2016-04-07 12:16:01 |
68 | 2016-04-03 17:36:59 | Mini_cooper_s_clubman_/vollausstattung_/_Navi/... | 10990 | test | NaN | 2017 | manual | 174 | clubman | 100000 | 0 | NaN | mini | nein | 2016-04-03 00:00:00 | 0 | 83135 | 2016-04-05 17:26:26 |
72 | 2016-03-14 07:55:36 | Audi_A6_2.7_TDI | 8100 | control | kombi | 2008 | automatic | 179 | a6 | 150000 | 5 | NaN | audi | nein | 2016-03-14 00:00:00 | 0 | 97453 | 2016-04-07 08:17:00 |
84 | 2016-03-27 19:52:54 | Renault_twingo | 900 | control | NaN | 2018 | NaN | 60 | twingo | 150000 | 0 | NaN | renault | NaN | 2016-03-27 00:00:00 | 0 | 40589 | 2016-04-05 18:46:49 |
85 | 2016-03-11 12:36:21 | Toyota_yaris_cool | 5000 | test | kleinwagen | 2009 | manual | 0 | yaris | 60000 | 12 | NaN | toyota | NaN | 2016-03-11 00:00:00 | 0 | 30916 | 2016-04-07 08:17:01 |
92 | 2016-03-25 19:47:32 | Zu_verkaufen | 1900 | test | NaN | 2000 | NaN | 0 | NaN | 150000 | 0 | NaN | bmw | NaN | 2016-03-25 00:00:00 | 0 | 72108 | 2016-04-07 02:16:01 |
101 | 2016-03-22 17:51:49 | Schnaepchen_in_einem_Jahr_OLDTEIMER_KENNZEICHEN | 600 | control | NaN | 2016 | NaN | 0 | corolla | 150000 | 0 | NaN | toyota | NaN | 2016-03-22 00:00:00 | 0 | 32825 | 2016-03-22 17:51:49 |
# converting fuel_type German words to English
fuel_type_map = {'benzin': 'petrol',
'diesel': 'diesel',
'lpg': 'lpg',
'cng': 'cng',
'hybrid': 'hybrid',
'elektro': 'electric',
'andere': 'other'}
autos['fuel_type'] = autos['fuel_type'].map(fuel_type_map)
autos['fuel_type'].value_counts(dropna=False)
petrol 29283 diesel 14339 NaN 3955 lpg 667 cng 73 hybrid 37 electric 18 other 17 Name: fuel_type, dtype: int64
Above shows the German words have been translated to English, excluding the 'NaN' values
# converting 'NaN values using Series.fillna() to 'unknown'
autos['fuel_type'] = autos['fuel_type'].fillna('unknown')
autos['fuel_type'].value_counts()
petrol 29283 diesel 14339 unknown 3955 lpg 667 cng 73 hybrid 37 electric 18 other 17 Name: fuel_type, dtype: int64
The 'NaN' values have been changed 'unknown'
# re-inspecting the dataFrame
autos.describe(include='all')
date_crawled | name | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 48389 | 48389 | 4.838900e+04 | 48389 | 43894 | 48389.000000 | 46106 | 48389.000000 | 45964 | 48389.000000 | 48389.000000 | 48389 | 48389 | 39404 | 48389 | 48389.0 | 48389.000000 | 48389 |
unique | 46721 | 37308 | NaN | 2 | 8 | NaN | 2 | NaN | 245 | NaN | NaN | 8 | 40 | 2 | 76 | NaN | NaN | 38362 |
top | 2016-03-12 16:06:22 | Ford_Fiesta | NaN | test | limousine | NaN | manual | NaN | golf | NaN | NaN | petrol | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 76 | NaN | 24922 | 12573 | NaN | 36009 | NaN | 3888 | NaN | NaN | 29283 | 10305 | 34735 | 1879 | NaN | NaN | 8 |
mean | NaN | NaN | 6.101576e+03 | NaN | NaN | 2003.486082 | NaN | 117.332886 | NaN | 125845.543409 | 5.792081 | NaN | NaN | NaN | NaN | 0.0 | 50990.788981 | NaN |
std | NaN | NaN | 2.282611e+04 | NaN | NaN | 7.479853 | NaN | 200.668182 | NaN | 39662.381173 | 3.681683 | NaN | NaN | NaN | NaN | 0.0 | 25747.322845 | NaN |
min | NaN | NaN | 2.000000e+00 | NaN | NaN | 1931.000000 | NaN | 0.000000 | NaN | 5000.000000 | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | 1.250000e+03 | NaN | NaN | 1999.000000 | NaN | 72.000000 | NaN | 125000.000000 | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30659.000000 | NaN |
50% | NaN | NaN | 3.000000e+03 | NaN | NaN | 2004.000000 | NaN | 107.000000 | NaN | 150000.000000 | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49716.000000 | NaN |
75% | NaN | NaN | 7.498000e+03 | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | 150000.000000 | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71679.000000 | NaN |
max | NaN | NaN | 3.890000e+06 | NaN | NaN | 2019.000000 | NaN | 17700.000000 | NaN | 150000.000000 | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
unrepaired_damage and vehicle_type could both be translated too. I'll save this for another time.
Columns of interest:
"""Could convert into a datetime object first"""
#autos["date_crawled"] = pd.to_datetime(autos["date_crawled"])
print('Could convert into a datetime object first')
Could convert into a datetime object first
# Or we could just convert it to string then to int
autos['date_crawled'] = autos['date_crawled'].str[:11].str.replace('-', '').astype(int)
autos['ad_created'] = autos['ad_created'].str[:11].str.replace('-','').astype(int)
autos['last_seen'] = autos['last_seen'].str[:11].str.replace('-', '').astype(int)
autos.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 48389 entries, 0 to 49999 Data columns (total 18 columns): date_crawled 48389 non-null int64 name 48389 non-null object price 48389 non-null int64 abtest 48389 non-null object vehicle_type 43894 non-null object registration_year 48389 non-null int64 gearbox 46106 non-null object power_ps 48389 non-null int64 model 45964 non-null object odometer_km 48389 non-null int64 registration_month 48389 non-null int64 fuel_type 48389 non-null object brand 48389 non-null object unrepaired_damage 39404 non-null object ad_created 48389 non-null int64 nr_of_pictures 48389 non-null int64 postal_code 48389 non-null int64 last_seen 48389 non-null int64 dtypes: int64(10), object(8) memory usage: 7.0+ MB
#Print first few rows to confirm it has come out right
autos.head(10)
date_crawled | name | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 20160326 | Peugeot_807_160_NAVTECH_ON_BOARD | 5000 | control | bus | 2004 | manual | 158 | andere | 150000 | 3 | lpg | peugeot | nein | 20160326 | 0 | 79588 | 20160406 |
1 | 20160404 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | 8500 | control | limousine | 1997 | automatic | 286 | 7er | 150000 | 6 | petrol | bmw | nein | 20160404 | 0 | 71034 | 20160406 |
2 | 20160326 | Volkswagen_Golf_1.6_United | 8990 | test | limousine | 2009 | manual | 102 | golf | 70000 | 7 | petrol | volkswagen | nein | 20160326 | 0 | 35394 | 20160406 |
3 | 20160312 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | 4350 | control | kleinwagen | 2007 | automatic | 71 | fortwo | 70000 | 6 | petrol | smart | nein | 20160312 | 0 | 33729 | 20160315 |
4 | 20160401 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | 1350 | test | kombi | 2003 | manual | 0 | focus | 150000 | 7 | petrol | ford | nein | 20160401 | 0 | 39218 | 20160401 |
5 | 20160321 | Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto... | 7900 | test | bus | 2006 | automatic | 150 | voyager | 150000 | 4 | diesel | chrysler | NaN | 20160321 | 0 | 22962 | 20160406 |
6 | 20160320 | VW_Golf_III_GT_Special_Electronic_Green_Metall... | 300 | test | limousine | 1995 | manual | 90 | golf | 150000 | 8 | petrol | volkswagen | NaN | 20160320 | 0 | 31535 | 20160323 |
7 | 20160316 | Golf_IV_1.9_TDI_90PS | 1990 | control | limousine | 1998 | manual | 90 | golf | 150000 | 12 | diesel | volkswagen | nein | 20160316 | 0 | 53474 | 20160407 |
8 | 20160322 | Seat_Arosa | 250 | test | NaN | 2000 | manual | 0 | arosa | 150000 | 10 | unknown | seat | nein | 20160322 | 0 | 7426 | 20160326 |
9 | 20160316 | Renault_Megane_Scenic_1.6e_RT_Klimaanlage | 590 | control | bus | 1997 | manual | 90 | megane | 150000 | 7 | petrol | renault | nein | 20160316 | 0 | 15749 | 20160406 |
autos['name'].unique()
array(['Peugeot_807_160_NAVTECH_ON_BOARD', 'BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik', 'Volkswagen_Golf_1.6_United', ..., 'Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon', 'Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+Reifen_neu_!!', 'Fiat_500_C_1.2_Dualogic_Lounge'], dtype=object)
autos['name'].head(10)
0 Peugeot_807_160_NAVTECH_ON_BOARD 1 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik 2 Volkswagen_Golf_1.6_United 3 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... 4 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... 5 Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto... 6 VW_Golf_III_GT_Special_Electronic_Green_Metall... 7 Golf_IV_1.9_TDI_90PS 8 Seat_Arosa 9 Renault_Megane_Scenic_1.6e_RT_Klimaanlage Name: name, dtype: object
# creating a series object from the name column of the df
s = pd.Series(autos['name'])
s.str.split(pat='_', n=2).str[:5].head(15)
0 [Peugeot, 807, 160_NAVTECH_ON_BOARD] 1 [BMW, 740i, 4_4_Liter_HAMANN_UMBAU_Mega_Optik] 2 [Volkswagen, Golf, 1.6_United] 3 [Smart, smart, fortwo_coupe_softouch/F1/Klima/... 4 [Ford, Focus, 1_6_Benzin_TÜV_neu_ist_sehr_gepf... 5 [Chrysler, Grand, Voyager_2.8_CRD_Aut.Limited_... 6 [VW, Golf, III_GT_Special_Electronic_Green_Met... 7 [Golf, IV, 1.9_TDI_90PS] 8 [Seat, Arosa] 9 [Renault, Megane, Scenic_1.6e_RT_Klimaanlage] 10 [VW, Golf, Tuning_in_siber/grau] 11 [Mercedes, A140, Motorschaden] 12 [Smart, smart, fortwo_coupe_softouch_pure_MHD_... 13 [Audi, A3, 1.6_tuning] 14 [Renault, Clio, 3__Dynamique_1.2__16_V;_viele_... Name: name, dtype: object
The name column can be split into Model name and some other specifications using the split
Finding the most common brand/model combinations
autos[['brand', 'model']].head()
brand | model | |
---|---|---|
0 | peugeot | andere |
1 | bmw | 7er |
2 | volkswagen | golf |
3 | smart | fortwo |
4 | ford | focus |
# grouping the dataset by model and brand
brand_mod_group = autos.groupby(["brand","model"])
# find the count and sort descending
brand_model = brand_mod_group["date_crawled"].count().sort_values(ascending=False)
print(brand_model[:10])
brand model volkswagen golf 3888 bmw 3er 2676 volkswagen polo 1682 opel corsa 1672 astra 1401 volkswagen passat 1380 audi a4 1253 mercedes_benz c_klasse 1161 bmw 5er 1144 mercedes_benz e_klasse 972 Name: date_crawled, dtype: int64
autos['unrepaired_damage'].value_counts(dropna=False)
nein 34735 NaN 8985 ja 4669 Name: unrepaired_damage, dtype: int64
The values of this column is displayed in German, but below are translated values:
unique_damage = autos['unrepaired_damage'].unique()
unique_damage
array(['nein', nan, 'ja'], dtype=object)
# Stores the mean price of cars with repaired_damage and
# unrepaired_damage
repaired_unrepaired_mean_price = {}
for value in unique_damage:
selected_value = autos.loc[autos['unrepaired_damage'] == value, 'price'].mean()
repaired_unrepaired_mean_price[value] = selected_value
repaired_unrepaired_mean_price
{nan: nan, 'nein': 7271.651792140492, 'ja': 2443.114371385736}
Recall that, 'nein': No 'ja': Yes 'NaN': No input(missing value)
Therefore, the price of damaged cars, are relatively low compared to undamaged cars which is a very reasonable logic. The fact is that a damaged car should cost less, reason been that you still have to replacements to make, while an undamaged car doesn't require any replacements
A lot of data cleaning processes have been carried out on our dataset,such as cleaning column names, removing null values, cleaning values for date columns etc
This processes have made our analysis easier, precise and more efficient, as we were able to compare various columns against each other to gain more valuable information.
We could see that the brands with the highest prices are Audi, Mercedes_benz and BMW