Our personal goal is to familiarize ourselves with various data cleaning tasks and workflows using numpy and pandas.
As for the project itself, we will be trying to clean the data set and do some preliminary analysis. The data we will be using for this project are ads for used cars from the classifieds section of the German eBay website.
The data dictionary for our data is as follows:
Variable Name | Description |
---|---|
dateCrawled | when this ad was first crawled (all field values are taken from this date) |
name | name of the car |
seller | whether the seller is private or a dealer |
offerType | the type of listing |
price | the price on the ad to sell the car |
abtest | whether the listing is included in an A/B test |
vehicleType | the vehicle type |
yearOfRegistration | the year in which the car was first registered |
gearbox | the transmission type |
powerPS | the power of the car in PS |
model | the car model name |
kilometer | how many kilometers the car has driven |
monthOfRegistration | the month in which the car was first registered |
fuelType | what type of fuel the car uses |
brand | the brand of the car |
notRepairedDamage | if the car has damage which is not yet repaired |
dateCreated | the date on which the eBay listing was created |
nrOfPictures | the number of pictures in the ad |
postalCode | the postal code for the location of the vehicle |
lastSeenOnline | when the crawler saw this ad last online |
We will start by importing the pandas and numpy libraries, as well as loading the data set.
# importing libraries
import pandas as pd
import numpy as np
# loading the file
autos = pd.read_csv('autos.csv', encoding='Latin-1')
# Checking what our data set looks like
# This cell also shows some features of Jupyter Notebook in displaying dataframes.
autos
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
5 | 2016-03-21 13:47:45 | Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto... | privat | Angebot | $7,900 | test | bus | 2006 | automatik | 150 | voyager | 150,000km | 4 | diesel | chrysler | NaN | 2016-03-21 00:00:00 | 0 | 22962 | 2016-04-06 09:45:21 |
6 | 2016-03-20 17:55:21 | VW_Golf_III_GT_Special_Electronic_Green_Metall... | privat | Angebot | $300 | test | limousine | 1995 | manuell | 90 | golf | 150,000km | 8 | benzin | volkswagen | NaN | 2016-03-20 00:00:00 | 0 | 31535 | 2016-03-23 02:48:59 |
7 | 2016-03-16 18:55:19 | Golf_IV_1.9_TDI_90PS | privat | Angebot | $1,990 | control | limousine | 1998 | manuell | 90 | golf | 150,000km | 12 | diesel | volkswagen | nein | 2016-03-16 00:00:00 | 0 | 53474 | 2016-04-07 03:17:32 |
8 | 2016-03-22 16:51:34 | Seat_Arosa | privat | Angebot | $250 | test | NaN | 2000 | manuell | 0 | arosa | 150,000km | 10 | NaN | seat | nein | 2016-03-22 00:00:00 | 0 | 7426 | 2016-03-26 18:18:10 |
9 | 2016-03-16 13:47:02 | Renault_Megane_Scenic_1.6e_RT_Klimaanlage | privat | Angebot | $590 | control | bus | 1997 | manuell | 90 | megane | 150,000km | 7 | benzin | renault | nein | 2016-03-16 00:00:00 | 0 | 15749 | 2016-04-06 10:46:35 |
10 | 2016-03-15 01:41:36 | VW_Golf_Tuning_in_siber/grau | privat | Angebot | $999 | test | NaN | 2017 | manuell | 90 | NaN | 150,000km | 4 | benzin | volkswagen | nein | 2016-03-14 00:00:00 | 0 | 86157 | 2016-04-07 03:16:21 |
11 | 2016-03-16 18:45:34 | Mercedes_A140_Motorschaden | privat | Angebot | $350 | control | NaN | 2000 | NaN | 0 | NaN | 150,000km | 0 | benzin | mercedes_benz | NaN | 2016-03-16 00:00:00 | 0 | 17498 | 2016-03-16 18:45:34 |
12 | 2016-03-31 19:48:22 | Smart_smart_fortwo_coupe_softouch_pure_MHD_Pan... | privat | Angebot | $5,299 | control | kleinwagen | 2010 | automatik | 71 | fortwo | 50,000km | 9 | benzin | smart | nein | 2016-03-31 00:00:00 | 0 | 34590 | 2016-04-06 14:17:52 |
13 | 2016-03-23 10:48:32 | Audi_A3_1.6_tuning | privat | Angebot | $1,350 | control | limousine | 1999 | manuell | 101 | a3 | 150,000km | 11 | benzin | audi | nein | 2016-03-23 00:00:00 | 0 | 12043 | 2016-04-01 14:17:13 |
14 | 2016-03-23 11:50:46 | Renault_Clio_3__Dynamique_1.2__16_V;_viele_Ver... | privat | Angebot | $3,999 | test | kleinwagen | 2007 | manuell | 75 | clio | 150,000km | 9 | benzin | renault | NaN | 2016-03-23 00:00:00 | 0 | 81737 | 2016-04-01 15:46:47 |
15 | 2016-04-01 12:06:20 | Corvette_C3_Coupe_T_Top_Crossfire_Injection | privat | Angebot | $18,900 | test | coupe | 1982 | automatik | 203 | NaN | 80,000km | 6 | benzin | sonstige_autos | nein | 2016-04-01 00:00:00 | 0 | 61276 | 2016-04-02 21:10:48 |
16 | 2016-03-16 14:59:02 | Opel_Vectra_B_Kombi | privat | Angebot | $350 | test | kombi | 1999 | manuell | 101 | vectra | 150,000km | 5 | benzin | opel | nein | 2016-03-16 00:00:00 | 0 | 57299 | 2016-03-18 05:29:37 |
17 | 2016-03-29 11:46:22 | Volkswagen_Scirocco_2_G60 | privat | Angebot | $5,500 | test | coupe | 1990 | manuell | 205 | scirocco | 150,000km | 6 | benzin | volkswagen | nein | 2016-03-29 00:00:00 | 0 | 74821 | 2016-04-05 20:46:26 |
18 | 2016-03-26 19:57:44 | Verkaufen_mein_bmw_e36_320_i_touring | privat | Angebot | $300 | control | bus | 1995 | manuell | 150 | 3er | 150,000km | 0 | benzin | bmw | NaN | 2016-03-26 00:00:00 | 0 | 54329 | 2016-04-02 12:16:41 |
19 | 2016-03-17 13:36:21 | mazda_tribute_2.0_mit_gas_und_tuev_neu_2018 | privat | Angebot | $4,150 | control | suv | 2004 | manuell | 124 | andere | 150,000km | 2 | lpg | mazda | nein | 2016-03-17 00:00:00 | 0 | 40878 | 2016-03-17 14:45:58 |
20 | 2016-03-05 19:57:31 | Audi_A4_Avant_1.9_TDI_*6_Gang*AHK*Klimatronik*... | privat | Angebot | $3,500 | test | kombi | 2003 | manuell | 131 | a4 | 150,000km | 5 | diesel | audi | NaN | 2016-03-05 00:00:00 | 0 | 53913 | 2016-03-07 05:46:46 |
21 | 2016-03-06 19:07:10 | Porsche_911_Carrera_4S_Cabrio | privat | Angebot | $41,500 | test | cabrio | 2004 | manuell | 320 | 911 | 150,000km | 4 | benzin | porsche | nein | 2016-03-06 00:00:00 | 0 | 65428 | 2016-04-05 23:46:19 |
22 | 2016-03-28 20:50:54 | MINI_Cooper_S_Cabrio | privat | Angebot | $25,450 | control | cabrio | 2015 | manuell | 184 | cooper | 10,000km | 1 | benzin | mini | nein | 2016-03-28 00:00:00 | 0 | 44789 | 2016-04-01 06:45:30 |
23 | 2016-03-10 19:55:34 | Peugeot_Boxer_2_2_HDi_120_Ps_9_Sitzer_inkl_Klima | privat | Angebot | $7,999 | control | bus | 2010 | manuell | 120 | NaN | 150,000km | 2 | diesel | peugeot | nein | 2016-03-10 00:00:00 | 0 | 30900 | 2016-03-17 08:45:17 |
24 | 2016-04-03 11:57:02 | BMW_535i_xDrive_Sport_Aut. | privat | Angebot | $48,500 | control | limousine | 2014 | automatik | 306 | 5er | 30,000km | 12 | benzin | bmw | nein | 2016-04-03 00:00:00 | 0 | 22547 | 2016-04-07 13:16:50 |
25 | 2016-03-21 21:56:18 | Ford_escort_kombi_an_bastler_mit_ghia_ausstattung | privat | Angebot | $90 | control | kombi | 1996 | manuell | 116 | NaN | 150,000km | 4 | benzin | ford | ja | 2016-03-21 00:00:00 | 0 | 27574 | 2016-04-01 05:16:49 |
26 | 2016-04-03 22:46:28 | Volkswagen_Polo_Fox | privat | Angebot | $777 | control | kleinwagen | 1992 | manuell | 54 | polo | 125,000km | 2 | benzin | volkswagen | nein | 2016-04-03 00:00:00 | 0 | 38110 | 2016-04-05 23:46:48 |
27 | 2016-03-27 18:45:01 | Hat_einer_Ahnung_mit_Ford_Galaxy_HILFE | privat | Angebot | $0 | control | NaN | 2005 | NaN | 0 | NaN | 150,000km | 0 | NaN | ford | NaN | 2016-03-27 00:00:00 | 0 | 66701 | 2016-03-27 18:45:01 |
28 | 2016-03-19 21:56:19 | MINI_Cooper_D | privat | Angebot | $5,250 | control | kleinwagen | 2007 | manuell | 110 | cooper | 150,000km | 7 | diesel | mini | ja | 2016-03-19 00:00:00 | 0 | 15745 | 2016-04-07 14:58:48 |
29 | 2016-04-02 12:45:44 | Mercedes_Benz_E_320_T_CDI_Avantgarde_DPF7_Sitz... | privat | Angebot | $4,999 | test | kombi | 2004 | automatik | 204 | e_klasse | 150,000km | 10 | diesel | mercedes_benz | nein | 2016-04-02 00:00:00 | 0 | 47638 | 2016-04-02 12:45:44 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
49970 | 2016-03-21 22:47:37 | c4_Grand_Picasso_mit_Automatik_Leder_Navi_Temp... | privat | Angebot | $15,800 | control | bus | 2010 | automatik | 136 | c4 | 60,000km | 4 | diesel | citroen | nein | 2016-03-21 00:00:00 | 0 | 14947 | 2016-04-07 04:17:34 |
49971 | 2016-03-29 14:54:12 | W.Lupo_1.0 | privat | Angebot | $950 | test | kleinwagen | 2001 | manuell | 50 | lupo | 150,000km | 4 | benzin | volkswagen | nein | 2016-03-29 00:00:00 | 0 | 65197 | 2016-03-29 20:41:51 |
49972 | 2016-03-26 22:25:23 | Mercedes_Benz_Vito_115_CDI_Extralang_Aut. | privat | Angebot | $3,300 | control | bus | 2004 | automatik | 150 | vito | 150,000km | 10 | diesel | mercedes_benz | ja | 2016-03-26 00:00:00 | 0 | 65326 | 2016-03-28 11:28:18 |
49973 | 2016-03-27 05:32:39 | Mercedes_Benz_SLK_200_Kompressor | privat | Angebot | $6,000 | control | cabrio | 2004 | manuell | 163 | slk | 150,000km | 11 | benzin | mercedes_benz | nein | 2016-03-27 00:00:00 | 0 | 53567 | 2016-03-27 08:25:24 |
49974 | 2016-03-20 10:52:31 | Golf_1_Cabrio_Tuev_Neu_viele_Extras_alles_eing... | privat | Angebot | $0 | control | cabrio | 1983 | manuell | 70 | golf | 150,000km | 2 | benzin | volkswagen | nein | 2016-03-20 00:00:00 | 0 | 8209 | 2016-03-27 19:48:16 |
49975 | 2016-03-27 20:51:39 | Honda_Jazz_1.3_DSi_i_VTEC_IMA_CVT_Comfort | privat | Angebot | $9,700 | control | kleinwagen | 2012 | automatik | 88 | jazz | 100,000km | 11 | hybrid | honda | nein | 2016-03-27 00:00:00 | 0 | 84385 | 2016-04-05 19:45:34 |
49976 | 2016-03-19 18:56:05 | Audi_80_Avant_2.6_E__Vollausstattung!!_Einziga... | privat | Angebot | $5,900 | test | kombi | 1992 | automatik | 150 | 80 | 150,000km | 12 | benzin | audi | nein | 2016-03-19 00:00:00 | 0 | 36100 | 2016-04-07 06:16:44 |
49977 | 2016-03-31 18:37:18 | Mercedes_Benz_C200_Cdi_W203 | privat | Angebot | $5,500 | control | limousine | 2003 | manuell | 116 | c_klasse | 150,000km | 2 | diesel | mercedes_benz | nein | 2016-03-31 00:00:00 | 0 | 33739 | 2016-04-06 12:16:11 |
49978 | 2016-04-04 10:37:14 | Mercedes_Benz_E_200_Classic | privat | Angebot | $900 | control | limousine | 1996 | automatik | 136 | e_klasse | 150,000km | 9 | benzin | mercedes_benz | ja | 2016-04-04 00:00:00 | 0 | 24405 | 2016-04-06 12:44:20 |
49979 | 2016-03-20 18:38:40 | Volkswagen_Polo_1.6_TDI_Style | privat | Angebot | $11,000 | test | kleinwagen | 2011 | manuell | 90 | polo | 70,000km | 11 | diesel | volkswagen | nein | 2016-03-20 00:00:00 | 0 | 48455 | 2016-04-07 01:45:12 |
49980 | 2016-03-12 10:55:54 | Ford_Escort_Turnier_16V | privat | Angebot | $400 | control | kombi | 1995 | manuell | 105 | escort | 125,000km | 3 | benzin | ford | NaN | 2016-03-12 00:00:00 | 0 | 56218 | 2016-04-06 17:16:49 |
49981 | 2016-03-15 09:38:21 | Opel_Astra_Kombi_mit_Anhaengerkupplung | privat | Angebot | $2,000 | control | kombi | 1998 | manuell | 115 | astra | 150,000km | 12 | benzin | opel | nein | 2016-03-15 00:00:00 | 0 | 86859 | 2016-04-05 17:21:46 |
49982 | 2016-03-29 18:51:08 | Skoda_Fabia_4_Tuerer_Bj:2004__85.000Tkm | privat | Angebot | $1,950 | control | kleinwagen | 2004 | manuell | 0 | fabia | 90,000km | 7 | benzin | skoda | NaN | 2016-03-29 00:00:00 | 0 | 45884 | 2016-03-29 18:51:08 |
49983 | 2016-03-06 12:43:04 | Ford_focus_99 | privat | Angebot | $600 | test | kleinwagen | 1999 | manuell | 101 | focus | 150,000km | 4 | benzin | ford | NaN | 2016-03-06 00:00:00 | 0 | 52477 | 2016-03-09 06:16:08 |
49984 | 2016-03-31 22:48:48 | Student_sucht_ein__Anfaengerauto___ab_2000_BJ_... | privat | Angebot | $0 | test | NaN | 2000 | NaN | 0 | NaN | 150,000km | 0 | NaN | sonstige_autos | NaN | 2016-03-31 00:00:00 | 0 | 12103 | 2016-04-02 19:44:53 |
49985 | 2016-04-02 16:38:23 | Verkaufe_meinen_vw_vento! | privat | Angebot | $1,000 | control | NaN | 1995 | automatik | 0 | NaN | 150,000km | 0 | benzin | volkswagen | NaN | 2016-04-02 00:00:00 | 0 | 30900 | 2016-04-06 15:17:52 |
49986 | 2016-04-04 20:46:02 | Chrysler_300C_3.0_CRD_DPF_Automatik_Voll_Ausst... | privat | Angebot | $15,900 | control | limousine | 2010 | automatik | 218 | 300c | 125,000km | 11 | diesel | chrysler | nein | 2016-04-04 00:00:00 | 0 | 73527 | 2016-04-06 23:16:00 |
49987 | 2016-03-22 20:47:27 | Audi_A3_Limousine_2.0_TDI_DPF_Ambition__NAVI__... | privat | Angebot | $21,990 | control | limousine | 2013 | manuell | 150 | a3 | 50,000km | 11 | diesel | audi | nein | 2016-03-22 00:00:00 | 0 | 94362 | 2016-03-26 22:46:06 |
49988 | 2016-03-28 19:49:51 | BMW_330_Ci | privat | Angebot | $9,550 | control | coupe | 2001 | manuell | 231 | 3er | 150,000km | 10 | benzin | bmw | nein | 2016-03-28 00:00:00 | 0 | 83646 | 2016-04-07 02:17:40 |
49989 | 2016-03-11 19:50:37 | VW_Polo_zum_Ausschlachten_oder_Wiederaufbau | privat | Angebot | $150 | test | kleinwagen | 1997 | manuell | 0 | polo | 150,000km | 5 | benzin | volkswagen | ja | 2016-03-11 00:00:00 | 0 | 21244 | 2016-03-12 10:17:55 |
49990 | 2016-03-21 19:54:19 | Mercedes_Benz_A_200__BlueEFFICIENCY__Urban | privat | Angebot | $17,500 | test | limousine | 2012 | manuell | 156 | a_klasse | 30,000km | 12 | benzin | mercedes_benz | nein | 2016-03-21 00:00:00 | 0 | 58239 | 2016-04-06 22:46:57 |
49991 | 2016-03-06 15:25:19 | Kleinwagen | privat | Angebot | $500 | control | NaN | 2016 | manuell | 0 | twingo | 150,000km | 0 | benzin | renault | NaN | 2016-03-06 00:00:00 | 0 | 61350 | 2016-03-06 18:24:19 |
49992 | 2016-03-10 19:37:38 | Fiat_Grande_Punto_1.4_T_Jet_16V_Sport | privat | Angebot | $4,800 | control | kleinwagen | 2009 | manuell | 120 | andere | 125,000km | 9 | lpg | fiat | nein | 2016-03-10 00:00:00 | 0 | 68642 | 2016-03-13 01:44:51 |
49993 | 2016-03-15 18:47:35 | Audi_A3__1_8l__Silber;_schoenes_Fahrzeug | privat | Angebot | $1,650 | control | kleinwagen | 1997 | manuell | 0 | NaN | 150,000km | 7 | benzin | audi | NaN | 2016-03-15 00:00:00 | 0 | 65203 | 2016-04-06 19:46:53 |
49994 | 2016-03-22 17:36:42 | Audi_A6__S6__Avant_4.2_quattro_eventuell_Tausc... | privat | Angebot | $5,000 | control | kombi | 2001 | automatik | 299 | a6 | 150,000km | 1 | benzin | audi | nein | 2016-03-22 00:00:00 | 0 | 46537 | 2016-04-06 08:16:39 |
49995 | 2016-03-27 14:38:19 | Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon | privat | Angebot | $24,900 | control | limousine | 2011 | automatik | 239 | q5 | 100,000km | 1 | diesel | audi | nein | 2016-03-27 00:00:00 | 0 | 82131 | 2016-04-01 13:47:40 |
49996 | 2016-03-28 10:50:25 | Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... | privat | Angebot | $1,980 | control | cabrio | 1996 | manuell | 75 | astra | 150,000km | 5 | benzin | opel | nein | 2016-03-28 00:00:00 | 0 | 44807 | 2016-04-02 14:18:02 |
49997 | 2016-04-02 14:44:48 | Fiat_500_C_1.2_Dualogic_Lounge | privat | Angebot | $13,200 | test | cabrio | 2014 | automatik | 69 | 500 | 5,000km | 11 | benzin | fiat | nein | 2016-04-02 00:00:00 | 0 | 73430 | 2016-04-04 11:47:27 |
49998 | 2016-03-08 19:25:42 | Audi_A3_2.0_TDI_Sportback_Ambition | privat | Angebot | $22,900 | control | kombi | 2013 | manuell | 150 | a3 | 40,000km | 11 | diesel | audi | nein | 2016-03-08 00:00:00 | 0 | 35683 | 2016-04-05 16:45:07 |
49999 | 2016-03-14 00:42:12 | Opel_Vectra_1.6_16V | privat | Angebot | $1,250 | control | limousine | 1996 | manuell | 101 | vectra | 150,000km | 1 | benzin | opel | nein | 2016-03-13 00:00:00 | 0 | 45897 | 2016-04-06 21:18:48 |
50000 rows × 20 columns
Let's also have a quick look at the shape and data types in our data set using the `info()` and `head()` methods.
autos.info()
autos.head(3)
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): dateCrawled 50000 non-null object name 50000 non-null object seller 50000 non-null object offerType 50000 non-null object price 50000 non-null object abtest 50000 non-null object vehicleType 44905 non-null object yearOfRegistration 50000 non-null int64 gearbox 47320 non-null object powerPS 50000 non-null int64 model 47242 non-null object odometer 50000 non-null object monthOfRegistration 50000 non-null int64 fuelType 45518 non-null object brand 50000 non-null object notRepairedDamage 40171 non-null object dateCreated 50000 non-null object nrOfPictures 50000 non-null int64 postalCode 50000 non-null int64 lastSeen 50000 non-null object dtypes: int64(5), object(15) memory usage: 7.6+ MB
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
Since the column names are currently in camelcase, let's rename them to follow the snake_case convention. Dataquest also suggests that we rename some specific columns, such as `notRepairedDamage` to `unrepaired_damage`, among others.
We'll do this manually since we're not yet experts in regular expressions and advanced manipulation of strings.
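For reference, a regex-based approach could automate most of the camelcase-to-snake_case conversion. The sketch below is hypothetical (the `to_snake_case` helper is our own, not part of the project), and it would not produce the custom names we actually want, such as `unrepaired_damage` or `registration_year`, which is one more reason to rename manually here:

```python
import re

# Hypothetical helper: insert an underscore before each capital letter
# that is not at the start of the string, then lowercase everything.
def to_snake_case(name):
    return re.sub(r'(?<!^)(?=[A-Z])', '_', name).lower()

# Example: 'notRepairedDamage' becomes 'not_repaired_damage'
print(to_snake_case('notRepairedDamage'))
print(to_snake_case('yearOfRegistration'))
```

Note that consecutive capitals (e.g. `powerPS`) would come out as `power_p_s`, so a purely mechanical conversion still needs manual touch-ups.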
# Checking the column names
print(autos.columns)
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model', 'odometer', 'monthOfRegistration', 'fuelType', 'brand', 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object')
# Renaming the columns
new_columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price',
'ab_test', 'vehicle_type', 'registration_year',
'gear_box', 'power_ps', 'model', 'odometer',
'registration_month', 'fuel_type', 'brand',
'unrepaired_damage', 'ad_created', 'nr_of_pictures',
'postal_code', 'last_seen'
]
autos.columns = new_columns
# Checking the new column names by printing the first few rows of our dataframe
autos.head(2)
date_crawled | name | seller | offer_type | price | ab_test | vehicle_type | registration_year | gear_box | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
Let's look at the characteristics of our data set so we can start figuring out what data cleaning tasks we need to do. We use the `describe()` method and specify the parameter `include='all'` so that non-numeric columns are included.
# Checking column descriptions/summaries
autos.describe(include='all')
date_crawled | name | seller | offer_type | price | ab_test | vehicle_type | registration_year | gear_box | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-04-02 15:49:30 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
Some quick observations here:

- Almost all values in the `seller` column are tagged as 'privat'.
- Almost all values in the `offer_type` column are tagged as 'Angebot'.
- All values in the `nr_of_pictures` column appear to be zero (we can check this by using the `value_counts()` method later).

We'll consider dropping these three columns, but let's check `nr_of_pictures` more closely first.
# Checking value counts of the three columns being considered for deletion.
# We're most interested in the the nr_of_pictures column
print(autos['nr_of_pictures'].value_counts())
print('\n')
print(autos['seller'].value_counts())
print('\n')
print(autos['offer_type'].value_counts())
0 50000 Name: nr_of_pictures, dtype: int64 privat 49999 gewerblich 1 Name: seller, dtype: int64 Angebot 49999 Gesuch 1 Name: offer_type, dtype: int64
The previous cell confirms that all values in the `nr_of_pictures` column are zero. Also, there isn't any significant variation in the values contained in the `seller` and `offer_type` columns, with only one observation in each having a different value. We'll drop those two as well.
# Dropping the three columns
autos.drop(['seller', 'offer_type', 'nr_of_pictures'], axis=1, inplace=True)
# Checking if they are really dropped
print(autos.info())
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 17 columns): date_crawled 50000 non-null object name 50000 non-null object price 50000 non-null object ab_test 50000 non-null object vehicle_type 44905 non-null object registration_year 50000 non-null int64 gear_box 47320 non-null object power_ps 50000 non-null int64 model 47242 non-null object odometer 50000 non-null object registration_month 50000 non-null int64 fuel_type 45518 non-null object brand 50000 non-null object unrepaired_damage 40171 non-null object ad_created 50000 non-null object postal_code 50000 non-null int64 last_seen 50000 non-null object dtypes: int64(4), object(13) memory usage: 6.5+ MB None
Some other observations from checking the info of our dataframe:

- `date_crawled`, `odometer`, `price`, `ad_created`, and `last_seen` are stored as strings (we may want to convert these to dates and integers, as appropriate).
- `registration_year` and `registration_month` are stored as integers (we may want to convert these to `datetime` format).

Hence, these are our tasks for further cleaning the data:

1. Convert `odometer` and `price` to numeric format.
2. Convert the `date_crawled`, `ad_created`, `last_seen`, `registration_year`, and `registration_month` columns to datetime format.

Since item no. 1 should be much easier than item no. 2, let's start with that.
# Checking what the values in price and odometer columns look like
print(autos['price'].unique())
print(autos['odometer'].unique())
['$5,000' '$8,500' '$8,990' ... '$385' '$22,200' '$16,995'] ['150,000km' '70,000km' '50,000km' '80,000km' '10,000km' '30,000km' '125,000km' '90,000km' '20,000km' '60,000km' '5,000km' '100,000km' '40,000km']
It appears that for the `price` column, we need to remove the leading $ sign and the commas. For `odometer`, we need to remove the trailing km indicator and also the commas.
# Converting the two columns to numeric.
# regex=False makes str.replace treat '$' as a literal character
# rather than a regular expression (where '$' means end-of-string).
autos['price'] = (autos['price'].str.replace(',', '', regex=False)
                                .str.replace('$', '', regex=False)
                                .astype(int)
                 )
autos['odometer'] = (autos['odometer'].str.replace(',', '', regex=False)
                                      .str.replace('km', '', regex=False)
                                      .astype(int)
                    )
# Renaming odometer column
autos.rename({'odometer':'odometer_km'}, axis=1, inplace=True)
After converting `price` and `odometer` to numeric, let's begin exploring these two columns.
Dataquest recommends the following steps:

- Use `Series.unique().shape` to see how many unique values there are.
- Use `Series.describe()` to view the min/max/median/mean etc.
- Use `Series.value_counts()` to see frequency distributions.

Since those are a lot of steps, let's define a function that will do all those things for us.
# Defining a function to make our examination of the series faster
def explore(series):
unique_values = series.unique().shape[0]
print('The series "{s}" has {u} unique values'
.format(s=series.name, u=unique_values))
print('\n')
print('Description')
print(series.describe())
print('\n')
print('Value counts top')
print(series.value_counts()
.sort_index(ascending=False)
.head(10))
print('\n')
print('Value counts bottom')
print(series.value_counts()
.sort_index(ascending=True)
.head(10))
# Using the explore function to check price and odometer_km columns
explore(autos['price'])
print('\n')
explore(autos['odometer_km'])
The series "price" has 2357 unique values Description count 5.000000e+04 mean 9.840044e+03 std 4.811044e+05 min 0.000000e+00 25% 1.100000e+03 50% 2.950000e+03 75% 7.200000e+03 max 1.000000e+08 Name: price, dtype: float64 Value counts top 99999999 1 27322222 1 12345678 3 11111111 2 10000000 1 3890000 1 1300000 1 1234566 1 999999 2 999990 1 Name: price, dtype: int64 Value counts bottom 0 1421 1 156 2 3 3 1 5 2 8 1 9 1 10 7 11 2 12 3 Name: price, dtype: int64 The series "odometer_km" has 13 unique values Description count 50000.000000 mean 125732.700000 std 40042.211706 min 5000.000000 25% 125000.000000 50% 150000.000000 75% 150000.000000 max 150000.000000 Name: odometer_km, dtype: float64 Value counts top 150000 32424 125000 5170 100000 2169 90000 1757 80000 1436 70000 1230 60000 1164 50000 1027 40000 819 30000 789 Name: odometer_km, dtype: int64 Value counts bottom 5000 967 10000 264 20000 784 30000 789 40000 819 50000 1027 60000 1164 70000 1230 80000 1436 90000 1757 Name: odometer_km, dtype: int64
For the `price` column, we remove those with zero prices (lower outliers) and also those that cost more than $900,000 (upper outliers).
Note that the cars priced above $900,000 may still be valid data points. However, it is likely that these very high prices are typographical errors or items that are merely being advertised but not really for sale. In any case, these extremely high values (and there are only a few of them) might disproportionately affect our observations, so we will drop them for now.
# Removing zero prices and overpriced observations
autos = autos[autos['price'].between(1,900000)]
#autos = autos.drop(autos[(autos['price'] == 0)
# | (autos['price'] > 900000)].index)
explore(autos['price'])
The series "price" has 2346 unique values Description count 48565.000000 mean 5888.935591 std 9059.854754 min 1.000000 25% 1200.000000 50% 3000.000000 75% 7490.000000 max 350000.000000 Name: price, dtype: float64 Value counts top 350000 1 345000 1 299000 1 295000 1 265000 1 259000 1 250000 1 220000 1 198000 1 197000 1 Name: price, dtype: int64 Value counts bottom 1 156 2 3 3 1 5 2 8 1 9 1 10 7 11 2 12 3 13 2 Name: price, dtype: int64
After dropping the outliers, the number of unique prices has dropped from 2,357 to 2,346. Our total number of observations also dropped from 50,000 to 48,565. We only lost a small number of observations, so this is acceptable.
For the `odometer_km` column, there are no obvious outliers since the minimum and maximum values appear reasonable enough. We won't be dropping anything for this column.
Recall that we have some columns that give us datetime-related information. These are the `date_crawled`, `registration_year`, `registration_month`, `ad_created`, and `last_seen` columns. Some of these columns are stored as strings. Let's check these columns again, clean them as necessary, and try out some preliminary analytical tasks.
date_columns = ['date_crawled', 'registration_year', 'registration_month', 'ad_created', 'last_seen']
print(autos[date_columns].info())
autos[date_columns][:4]
<class 'pandas.core.frame.DataFrame'> Int64Index: 48565 entries, 0 to 49999 Data columns (total 5 columns): date_crawled 48565 non-null object registration_year 48565 non-null int64 registration_month 48565 non-null int64 ad_created 48565 non-null object last_seen 48565 non-null object dtypes: int64(2), object(3) memory usage: 2.2+ MB None
date_crawled | registration_year | registration_month | ad_created | last_seen | |
---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | 2004 | 3 | 2016-03-26 00:00:00 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | 1997 | 6 | 2016-04-04 00:00:00 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | 2009 | 7 | 2016-03-26 00:00:00 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | 2007 | 6 | 2016-03-12 00:00:00 | 2016-03-15 03:16:28 |
The columns `registration_year` and `registration_month` are in `int64` format. For the purpose of our analysis, there is no need to convert these to `datetime` format. We may consider, however, combining the two into a single variable that contains month and year information if we want to study trends over several years.
For the other time-related columns, let's convert them to `datetime` format. While we're not interested in the exact time values contained in the columns, there may be a use for them in later analysis. What we'll do, instead, is create a new dataframe with columns that are of `datetime` format but contain only date (and not time) information. We create a new dataframe because adding new columns to our `autos` dataframe appears to give us a SettingWithCopyWarning.
# creating a function to convert date strings to datetime
# (pd.to_datetime handles the parsing; no extra imports needed)
date_format = '%Y-%m-%d'
def convert_dt(series):
date = series.str[:10]
date_dt = pd.to_datetime(date, format=date_format)
return date_dt
# Creating an empty dataframe
autos_copy = pd.DataFrame()
# Adding columns to our new dataframe
autos_copy['date_crawled_dt'] = convert_dt(autos['date_crawled'])
autos_copy['ad_created_dt'] = convert_dt(autos['ad_created'])
autos_copy['last_seen_dt'] = convert_dt(autos['last_seen'])
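As an aside, the manual string slicing in `convert_dt` can be replaced by pandas' own timestamp parsing plus `dt.normalize()`, which zeroes out the time of day. A sketch on a toy series (these timestamps are illustrative, not pulled from the data set):

```python
import pandas as pd

# toy stand-in for a column like 'date_crawled'
s = pd.Series(['2016-03-26 17:47:46', '2016-04-04 13:38:56'])

# parse the full timestamp, then drop the time-of-day component
dates = pd.to_datetime(s).dt.normalize()
print(dates.dt.strftime('%Y-%m-%d').tolist())  # → ['2016-03-26', '2016-04-04']
```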
Now we check the distributions of the three date columns, sorted by earliest date to latest date.
# Date crawled
print('date_crawled')
print(autos_copy['date_crawled_dt'].
value_counts(normalize=True, dropna=False).
sort_index()
)
print('\n')
# Ad created
print('ad_created')
print(autos_copy['ad_created_dt'].
value_counts(normalize=True, dropna=False).
sort_index()
)
print('\n')
# Last seen
print('last_seen')
print(autos_copy['last_seen_dt'].
value_counts(normalize=True, dropna=False).
sort_index()
)
date_crawled 2016-03-05 0.025327 2016-03-06 0.014043 2016-03-07 0.036014 2016-03-08 0.033296 2016-03-09 0.033090 2016-03-10 0.032184 2016-03-11 0.032575 2016-03-12 0.036920 2016-03-13 0.015670 2016-03-14 0.036549 2016-03-15 0.034284 2016-03-16 0.029610 2016-03-17 0.031628 2016-03-18 0.012911 2016-03-19 0.034778 2016-03-20 0.037887 2016-03-21 0.037373 2016-03-22 0.032987 2016-03-23 0.032225 2016-03-24 0.029342 2016-03-25 0.031607 2016-03-26 0.032204 2016-03-27 0.031092 2016-03-28 0.034860 2016-03-29 0.034099 2016-03-30 0.033687 2016-03-31 0.031834 2016-04-01 0.033687 2016-04-02 0.035478 2016-04-03 0.038608 2016-04-04 0.036487 2016-04-05 0.013096 2016-04-06 0.003171 2016-04-07 0.001400 Name: date_crawled_dt, dtype: float64 ad_created 2015-06-11 0.000021 2015-08-10 0.000021 2015-09-09 0.000021 2015-11-10 0.000021 2015-12-05 0.000021 2015-12-30 0.000021 2016-01-03 0.000021 2016-01-07 0.000021 2016-01-10 0.000041 2016-01-13 0.000021 2016-01-14 0.000021 2016-01-16 0.000021 2016-01-22 0.000021 2016-01-27 0.000062 2016-01-29 0.000021 2016-02-01 0.000021 2016-02-02 0.000041 2016-02-05 0.000041 2016-02-07 0.000021 2016-02-08 0.000021 2016-02-09 0.000021 2016-02-11 0.000021 2016-02-12 0.000041 2016-02-14 0.000041 2016-02-16 0.000021 2016-02-17 0.000021 2016-02-18 0.000041 2016-02-19 0.000062 2016-02-20 0.000041 2016-02-21 0.000062 ... 
2016-03-09 0.033151 2016-03-10 0.031895 2016-03-11 0.032904 2016-03-12 0.036755 2016-03-13 0.017008 2016-03-14 0.035190 2016-03-15 0.034016 2016-03-16 0.030125 2016-03-17 0.031278 2016-03-18 0.013590 2016-03-19 0.033687 2016-03-20 0.037949 2016-03-21 0.037579 2016-03-22 0.032801 2016-03-23 0.032060 2016-03-24 0.029280 2016-03-25 0.031751 2016-03-26 0.032266 2016-03-27 0.030989 2016-03-28 0.034984 2016-03-29 0.034037 2016-03-30 0.033501 2016-03-31 0.031875 2016-04-01 0.033687 2016-04-02 0.035149 2016-04-03 0.038855 2016-04-04 0.036858 2016-04-05 0.011819 2016-04-06 0.003253 2016-04-07 0.001256 Name: ad_created_dt, Length: 76, dtype: float64 last_seen 2016-03-05 0.001071 2016-03-06 0.004324 2016-03-07 0.005395 2016-03-08 0.007413 2016-03-09 0.009595 2016-03-10 0.010666 2016-03-11 0.012375 2016-03-12 0.023783 2016-03-13 0.008895 2016-03-14 0.012602 2016-03-15 0.015876 2016-03-16 0.016452 2016-03-17 0.028086 2016-03-18 0.007351 2016-03-19 0.015834 2016-03-20 0.020653 2016-03-21 0.020632 2016-03-22 0.021373 2016-03-23 0.018532 2016-03-24 0.019767 2016-03-25 0.019211 2016-03-26 0.016802 2016-03-27 0.015649 2016-03-28 0.020859 2016-03-29 0.022341 2016-03-30 0.024771 2016-03-31 0.023783 2016-04-01 0.022794 2016-04-02 0.024915 2016-04-03 0.025203 2016-04-04 0.024483 2016-04-05 0.124761 2016-04-06 0.221806 2016-04-07 0.131947 Name: last_seen_dt, dtype: float64
In terms of frequency distribution, no pattern really pops out. Looking at the `date_crawled_dt` column, we learn that the crawler collected this data over a span of about a month in March/April 2016.
The oldest ad seen by the crawler, as shown by the `ad_created_dt` column, was from June 2015, while the latest one (April 7, 2016) coincides with the last day the crawler collected data.
Not much can be said about the `last_seen` column, other than that over 40% of the ads were still seen by the crawler during its last three days (April 5-7) of crawling.
It's now time to look at `registration_year` and `registration_month`. Again, we have not created a new series that combines these two. Let's explore them separately for now.
# Checking the registration year and month information
print(autos[['registration_year', 'registration_month']].describe())
print('\n')
# Checking frequency distributions
print(autos['registration_year'].value_counts(dropna=False).head(5))
print(autos['registration_year'].value_counts(dropna=False).tail(5))
print('\n')
# Checking frequency distribution of registration month
print(autos['registration_month'].value_counts(dropna=False))
       registration_year  registration_month
count       48565.000000        48565.000000
mean         2004.755421            5.782251
std            88.643887            3.685595
min          1000.000000            0.000000
25%          1999.000000            3.000000
50%          2004.000000            6.000000
75%          2008.000000            9.000000
max          9999.000000           12.000000

2000    3156
2005    2936
1999    2897
2004    2703
2003    2699
Name: registration_year, dtype: int64

9000    1
1948    1
1000    1
1931    1
1952    1
Name: registration_year, dtype: int64

3     5003
0     4480
6     4271
4     4036
5     4031
7     3857
10    3588
12    3374
9     3330
11    3313
1     3219
8     3126
2     2937
Name: registration_month, dtype: int64
Discussion of the observations from the above results is in the next section.
Upon checking the results of the `describe` method for `registration_year`, we see some outliers that don't make sense. The minimum is 1,000 while the maximum is 9,999. The lower outliers don't make sense because these years are too early, while the higher outliers are years that haven't occurred yet.
We may want to remove these using the criterion of keeping only registration years between 1900 and 2016. The reason for the lower bound is that mass production of vehicles only really started around the 1900s. The reason for the upper bound is that the crawler couldn't possibly have detected vehicles registered in the future (the crawler's last day in our data set is April 2016).
As for `registration_month`, we see that March is the most frequent month of registration for the vehicles in our data set. However, we also see vehicles identified as being registered in month zero. We have several options for dealing with this: (1) drop the rows where the registration month is zero, (2) convert the zero months to 1 (January), or (3) convert the zero months to missing (NaN).
In doing the steps above, we need to clean the year data first, because it's likely that the zero-month rows coincide with the outlier years, where the data is questionable. For the rows that remain after dropping the outlier years, we will then convert the zero months to missing.
# Removing outlier registration years
autos = autos[autos['registration_year'].between(1900,2016)]
# Checking data and months
print('registration year')
print(autos['registration_year'].describe())
print('\n')
print('registration month')
print(autos['registration_month'].value_counts())
registration year
count    46681.000000
mean      2002.910756
std          7.185103
min       1910.000000
25%       1999.000000
50%       2003.000000
75%       2008.000000
max       2016.000000
Name: registration_year, dtype: float64

registration month
3     4838
6     4119
0     4026
4     3894
5     3877
7     3722
10    3495
12    3262
9     3243
11    3229
1     3105
8     3021
2     2850
Name: registration_month, dtype: int64
We dropped around 2,000 observations by removing the outliers. The registration years look okay now, with the earliest observation tagged as 1910.
Upon checking the registration month, there are still months listed as zero. We'll assume these months are 'unknown' and tag them as missing. This way, our frequency distribution for registration months can still be analyzed, while removing the zeros so they don't affect any numerical computations or aggregations.
Note that if we wanted to create a new datetime column, we might go with the January conversion instead, although this would inflate the numbers for January registrations, making any trend analysis tenuous.
# replacing zero months with NaN
zero_month = (autos['registration_month'] == 0)
autos.loc[zero_month, 'registration_month'] = np.nan
# re-checking the registration month column
print(autos['registration_month'].value_counts())
3.0     4838
6.0     4119
4.0     3894
5.0     3877
7.0     3722
10.0    3495
12.0    3262
9.0     3243
11.0    3229
1.0     3105
8.0     3021
2.0     2850
Name: registration_month, dtype: int64
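The boolean-mask assignment above can also be written as a single `Series.replace` call. A sketch on a toy series of made-up month values:

```python
import numpy as np
import pandas as pd

months = pd.Series([3, 0, 6, 0, 12])

# replace swaps every 0 for NaN in one step, no mask needed
cleaned = months.replace(0, np.nan)
print(cleaned.isna().sum())  # → 2
```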
Let's now check how price patterns vary across brands. Let's start with looking at what the unique brands in our data set are.
print('Unique brands: ', len(autos['brand'].unique()))
autos['brand'].value_counts(normalize=True)
Unique brands: 40
volkswagen 0.211264 bmw 0.110045 opel 0.107581 mercedes_benz 0.096463 audi 0.086566 ford 0.069900 renault 0.047150 peugeot 0.029841 fiat 0.025642 seat 0.018273 skoda 0.016409 nissan 0.015274 mazda 0.015188 smart 0.014160 citroen 0.014010 toyota 0.012703 hyundai 0.010025 sonstige_autos 0.009811 volvo 0.009147 mini 0.008762 mitsubishi 0.008226 honda 0.007840 kia 0.007069 alfa_romeo 0.006641 porsche 0.006127 suzuki 0.005934 chevrolet 0.005698 chrysler 0.003513 dacia 0.002635 daihatsu 0.002506 jeep 0.002271 subaru 0.002142 land_rover 0.002099 saab 0.001649 jaguar 0.001564 daewoo 0.001500 trabant 0.001392 rover 0.001328 lancia 0.001071 lada 0.000578 Name: brand, dtype: float64
We have 40 brands of automobiles in our data set, with Volkswagen the most common. Since that's a lot of brands, we'll perform the aggregation for all of them (it's fairly simple to do using loops) but will only delve deeper into the most common brands.
# Creating dictionary on average price per brand
mean_price_brand = {}
brands = autos['brand'].unique()
for b in brands:
brand = autos[autos['brand'] == b]
mean_price = brand['price'].mean()
mean_price_brand[b] = int(mean_price)
mean_price_brand
{'alfa_romeo': 4087, 'audi': 9336, 'bmw': 8332, 'chevrolet': 6684, 'chrysler': 3465, 'citroen': 3779, 'dacia': 5915, 'daewoo': 1049, 'daihatsu': 1636, 'fiat': 2813, 'ford': 3749, 'honda': 4107, 'hyundai': 5365, 'jaguar': 11635, 'jeep': 11650, 'kia': 5982, 'lada': 2688, 'lancia': 3376, 'land_rover': 19108, 'mazda': 4112, 'mercedes_benz': 8628, 'mini': 10613, 'mitsubishi': 3394, 'nissan': 4743, 'opel': 2975, 'peugeot': 3094, 'porsche': 45643, 'renault': 2474, 'rover': 1602, 'saab': 3211, 'seat': 4397, 'skoda': 6368, 'smart': 3580, 'sonstige_autos': 12338, 'subaru': 3953, 'suzuki': 4096, 'toyota': 5167, 'trabant': 1790, 'volkswagen': 5402, 'volvo': 4946}
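As a cross-check, the same per-brand averages can be computed without an explicit loop using `groupby`. This is a sketch on a toy dataframe; the brand and price values here are made up, not taken from the project data:

```python
import pandas as pd

# toy stand-in for the autos dataframe, same column names
autos_toy = pd.DataFrame({
    'brand': ['audi', 'audi', 'bmw', 'bmw'],
    'price': [10000, 8000, 9000, 7000],
})

# groupby replaces the explicit loop over unique brands
mean_price_brand = autos_toy.groupby('brand')['price'].mean().astype(int).to_dict()
print(mean_price_brand['audi'])  # → 9000
```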
It turns out that analyzing the dictionary with all the brands is quite difficult at our current skill level. Let's do the aggregation again, this time only for the brands that account for at least 5% of the total number of observations.
# Getting the top brands
brand_share = autos['brand'].value_counts(normalize=True)
top_brands = brand_share[brand_share > 0.05].index
mean_price_top_brands = {}
for b in top_brands:
brand = autos[autos['brand'] == b]
mean_price = brand['price'].mean()
mean_price_top_brands[b] = int(mean_price)
mean_price_top_brands
{'audi': 9336, 'bmw': 8332, 'ford': 3749, 'mercedes_benz': 8628, 'opel': 2975, 'volkswagen': 5402}
We can convert dictionaries to series, then to dataframes, using pandas functionality that lets us analyze the aggregated data more easily.
# Converting a couple of dictionaries we've already created
mean_price_all = pd.Series(mean_price_brand)
mean_price_top = pd.Series(mean_price_top_brands)
# Converting the series into dataframes
mean_price_all_df = pd.DataFrame(mean_price_all, columns=['mean_price'])
mean_price_top_df = pd.DataFrame(mean_price_top, columns=['mean_price'])
mean_price_all_df.sort_values('mean_price', ascending=False)
mean_price | |
---|---|
porsche | 45643 |
land_rover | 19108 |
sonstige_autos | 12338 |
jeep | 11650 |
jaguar | 11635 |
mini | 10613 |
audi | 9336 |
mercedes_benz | 8628 |
bmw | 8332 |
chevrolet | 6684 |
skoda | 6368 |
kia | 5982 |
dacia | 5915 |
volkswagen | 5402 |
hyundai | 5365 |
toyota | 5167 |
volvo | 4946 |
nissan | 4743 |
seat | 4397 |
mazda | 4112 |
honda | 4107 |
suzuki | 4096 |
alfa_romeo | 4087 |
subaru | 3953 |
citroen | 3779 |
ford | 3749 |
smart | 3580 |
chrysler | 3465 |
mitsubishi | 3394 |
lancia | 3376 |
saab | 3211 |
peugeot | 3094 |
opel | 2975 |
fiat | 2813 |
lada | 2688 |
renault | 2474 |
trabant | 1790 |
daihatsu | 1636 |
rover | 1602 |
daewoo | 1049 |
mean_price_top_df.sort_values('mean_price', ascending=False)
mean_price | |
---|---|
audi | 9336 |
mercedes_benz | 8628 |
bmw | 8332 |
volkswagen | 5402 |
ford | 3749 |
opel | 2975 |
Accounting for all the brands, the most expensive brand, on average, is Porsche, with Land Rover a distant second. The cheapest brand, on average, is Daewoo.
When we consider only the top brands (those accounting for >5% of total observations), Audi cars are the most expensive with a mean price of 9,336 while Opel is the cheapest with a mean price of 2,975.
Among the top brands, Audi, Mercedes-Benz, and BMW are relatively expensive (average prices greater than 8,000) while Ford and Opel are cheap by comparison (average prices less than 4,000).
The brands themselves may be creating this price differential. However, it's also possible that the differentials result from differences in the average mileage or age of the vehicles: certain brands may have, on average, more mileage or older cars, hence the lower average price.
Let's check whether our guess is correct by aggregating mileage (`odometer_km`) and registration year (`registration_year`) by brand, then appending these to a single dataframe along with the mean prices we've already generated. This time, we limit our analysis to the top brands.
# Creating a copy dataframe for our aggregations
aggregate_df = mean_price_top_df.copy()
# For mileage
mean_mileage_brands = {}
for b in top_brands:
brand = autos[autos['brand'] == b]
mean_mileage = brand['odometer_km'].mean()
mean_mileage_brands[b] = int(mean_mileage)
mean_mileage_series = pd.Series(mean_mileage_brands)
aggregate_df['mean_mileage'] = mean_mileage_series
# For registration year
mean_reg_year_brands = {}
for b in top_brands:
brand = autos[autos['brand'] == b]
mean_reg_year = brand['registration_year'].mean()
mean_reg_year_brands[b] = int(mean_reg_year)
mean_reg_year_series = pd.Series(mean_reg_year_brands)
aggregate_df['mean_reg_year'] = mean_reg_year_series
# Displaying our new dataframe
aggregate_df.sort_values('mean_price', ascending=False)
mean_price | mean_mileage | mean_reg_year | |
---|---|---|---|
audi | 9336 | 129157 | 2004 |
mercedes_benz | 8628 | 130788 | 2002 |
bmw | 8332 | 132572 | 2003 |
volkswagen | 5402 | 128707 | 2002 |
ford | 3749 | 124266 | 2002 |
opel | 2975 | 129310 | 2002 |
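The three separate loops used to build this table can also be collapsed into a single `groupby().agg()` call with named aggregations. A sketch on made-up toy data that reuses the project's column names:

```python
import pandas as pd

# toy stand-in with the same column names as the project data
autos_toy = pd.DataFrame({
    'brand': ['audi', 'audi', 'opel', 'opel'],
    'price': [10000, 8000, 3000, 2950],
    'odometer_km': [125000, 135000, 130000, 128000],
    'registration_year': [2004, 2004, 2002, 2002],
})

# one agg call builds the whole summary table at once
summary = (autos_toy.groupby('brand')
                    .agg(mean_price=('price', 'mean'),
                         mean_mileage=('odometer_km', 'mean'),
                         mean_reg_year=('registration_year', 'mean'))
                    .astype(int)
                    .sort_values('mean_price', ascending=False))
print(summary.loc['audi', 'mean_price'])  # → 9000
```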
From the table above, we see that our initial guess about the relationship between price and mileage for the top brands may be incorrect. In fact, Audi cars have more mileage (129,157 km), on average, than Ford cars (124,266 km), and only slightly less than the much cheaper Opel cars (129,310 km).
However, we also see that Audis may be newer cars, with an average registration year of 2004 compared to Ford's and Opel's 2002. We can't draw any definitive conclusions for now, especially since Mercedes-Benz cars are, on average, just as old as the cheaper brands but still more expensive. We could expand this analysis to all the brands in our data set, but we'll settle for this for now.
For this project, we were able to practice a range of data cleaning and analysis skills, and these will prove useful in future data science tasks.