On the German eBay website, cars can be offered for sale in a classifieds section 'eBay Kleinanzeigen'. In this study we we'll do an analysis of data that was collected from this website.
The dataset was originally scraped and uploaded to Kaggle. Find the original dataset here. It contains approximately 370,000 records. For this study we will make use of a modified version, which was prepared by Dataquest, and can be found here. This is a sample of 50,000 records.
The goal of this study? Nothing in particular. For sure there are interesting facts and statistics to find in here. Let's go explore!
Let's start with importing the data, and doing some initial exploration by looking at some main statistics and data samples.
# Import NumPy and Pandas libraries
import numpy as np
import pandas as pd
# Import the data
autos = pd.read_csv('autos.csv', encoding = 'Latin-1')
# An initial view of the data
autos
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
5 | 2016-03-21 13:47:45 | Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto... | privat | Angebot | $7,900 | test | bus | 2006 | automatik | 150 | voyager | 150,000km | 4 | diesel | chrysler | NaN | 2016-03-21 00:00:00 | 0 | 22962 | 2016-04-06 09:45:21 |
6 | 2016-03-20 17:55:21 | VW_Golf_III_GT_Special_Electronic_Green_Metall... | privat | Angebot | $300 | test | limousine | 1995 | manuell | 90 | golf | 150,000km | 8 | benzin | volkswagen | NaN | 2016-03-20 00:00:00 | 0 | 31535 | 2016-03-23 02:48:59 |
7 | 2016-03-16 18:55:19 | Golf_IV_1.9_TDI_90PS | privat | Angebot | $1,990 | control | limousine | 1998 | manuell | 90 | golf | 150,000km | 12 | diesel | volkswagen | nein | 2016-03-16 00:00:00 | 0 | 53474 | 2016-04-07 03:17:32 |
8 | 2016-03-22 16:51:34 | Seat_Arosa | privat | Angebot | $250 | test | NaN | 2000 | manuell | 0 | arosa | 150,000km | 10 | NaN | seat | nein | 2016-03-22 00:00:00 | 0 | 7426 | 2016-03-26 18:18:10 |
9 | 2016-03-16 13:47:02 | Renault_Megane_Scenic_1.6e_RT_Klimaanlage | privat | Angebot | $590 | control | bus | 1997 | manuell | 90 | megane | 150,000km | 7 | benzin | renault | nein | 2016-03-16 00:00:00 | 0 | 15749 | 2016-04-06 10:46:35 |
10 | 2016-03-15 01:41:36 | VW_Golf_Tuning_in_siber/grau | privat | Angebot | $999 | test | NaN | 2017 | manuell | 90 | NaN | 150,000km | 4 | benzin | volkswagen | nein | 2016-03-14 00:00:00 | 0 | 86157 | 2016-04-07 03:16:21 |
11 | 2016-03-16 18:45:34 | Mercedes_A140_Motorschaden | privat | Angebot | $350 | control | NaN | 2000 | NaN | 0 | NaN | 150,000km | 0 | benzin | mercedes_benz | NaN | 2016-03-16 00:00:00 | 0 | 17498 | 2016-03-16 18:45:34 |
12 | 2016-03-31 19:48:22 | Smart_smart_fortwo_coupe_softouch_pure_MHD_Pan... | privat | Angebot | $5,299 | control | kleinwagen | 2010 | automatik | 71 | fortwo | 50,000km | 9 | benzin | smart | nein | 2016-03-31 00:00:00 | 0 | 34590 | 2016-04-06 14:17:52 |
13 | 2016-03-23 10:48:32 | Audi_A3_1.6_tuning | privat | Angebot | $1,350 | control | limousine | 1999 | manuell | 101 | a3 | 150,000km | 11 | benzin | audi | nein | 2016-03-23 00:00:00 | 0 | 12043 | 2016-04-01 14:17:13 |
14 | 2016-03-23 11:50:46 | Renault_Clio_3__Dynamique_1.2__16_V;_viele_Ver... | privat | Angebot | $3,999 | test | kleinwagen | 2007 | manuell | 75 | clio | 150,000km | 9 | benzin | renault | NaN | 2016-03-23 00:00:00 | 0 | 81737 | 2016-04-01 15:46:47 |
15 | 2016-04-01 12:06:20 | Corvette_C3_Coupe_T_Top_Crossfire_Injection | privat | Angebot | $18,900 | test | coupe | 1982 | automatik | 203 | NaN | 80,000km | 6 | benzin | sonstige_autos | nein | 2016-04-01 00:00:00 | 0 | 61276 | 2016-04-02 21:10:48 |
16 | 2016-03-16 14:59:02 | Opel_Vectra_B_Kombi | privat | Angebot | $350 | test | kombi | 1999 | manuell | 101 | vectra | 150,000km | 5 | benzin | opel | nein | 2016-03-16 00:00:00 | 0 | 57299 | 2016-03-18 05:29:37 |
17 | 2016-03-29 11:46:22 | Volkswagen_Scirocco_2_G60 | privat | Angebot | $5,500 | test | coupe | 1990 | manuell | 205 | scirocco | 150,000km | 6 | benzin | volkswagen | nein | 2016-03-29 00:00:00 | 0 | 74821 | 2016-04-05 20:46:26 |
18 | 2016-03-26 19:57:44 | Verkaufen_mein_bmw_e36_320_i_touring | privat | Angebot | $300 | control | bus | 1995 | manuell | 150 | 3er | 150,000km | 0 | benzin | bmw | NaN | 2016-03-26 00:00:00 | 0 | 54329 | 2016-04-02 12:16:41 |
19 | 2016-03-17 13:36:21 | mazda_tribute_2.0_mit_gas_und_tuev_neu_2018 | privat | Angebot | $4,150 | control | suv | 2004 | manuell | 124 | andere | 150,000km | 2 | lpg | mazda | nein | 2016-03-17 00:00:00 | 0 | 40878 | 2016-03-17 14:45:58 |
20 | 2016-03-05 19:57:31 | Audi_A4_Avant_1.9_TDI_*6_Gang*AHK*Klimatronik*... | privat | Angebot | $3,500 | test | kombi | 2003 | manuell | 131 | a4 | 150,000km | 5 | diesel | audi | NaN | 2016-03-05 00:00:00 | 0 | 53913 | 2016-03-07 05:46:46 |
21 | 2016-03-06 19:07:10 | Porsche_911_Carrera_4S_Cabrio | privat | Angebot | $41,500 | test | cabrio | 2004 | manuell | 320 | 911 | 150,000km | 4 | benzin | porsche | nein | 2016-03-06 00:00:00 | 0 | 65428 | 2016-04-05 23:46:19 |
22 | 2016-03-28 20:50:54 | MINI_Cooper_S_Cabrio | privat | Angebot | $25,450 | control | cabrio | 2015 | manuell | 184 | cooper | 10,000km | 1 | benzin | mini | nein | 2016-03-28 00:00:00 | 0 | 44789 | 2016-04-01 06:45:30 |
23 | 2016-03-10 19:55:34 | Peugeot_Boxer_2_2_HDi_120_Ps_9_Sitzer_inkl_Klima | privat | Angebot | $7,999 | control | bus | 2010 | manuell | 120 | NaN | 150,000km | 2 | diesel | peugeot | nein | 2016-03-10 00:00:00 | 0 | 30900 | 2016-03-17 08:45:17 |
24 | 2016-04-03 11:57:02 | BMW_535i_xDrive_Sport_Aut. | privat | Angebot | $48,500 | control | limousine | 2014 | automatik | 306 | 5er | 30,000km | 12 | benzin | bmw | nein | 2016-04-03 00:00:00 | 0 | 22547 | 2016-04-07 13:16:50 |
25 | 2016-03-21 21:56:18 | Ford_escort_kombi_an_bastler_mit_ghia_ausstattung | privat | Angebot | $90 | control | kombi | 1996 | manuell | 116 | NaN | 150,000km | 4 | benzin | ford | ja | 2016-03-21 00:00:00 | 0 | 27574 | 2016-04-01 05:16:49 |
26 | 2016-04-03 22:46:28 | Volkswagen_Polo_Fox | privat | Angebot | $777 | control | kleinwagen | 1992 | manuell | 54 | polo | 125,000km | 2 | benzin | volkswagen | nein | 2016-04-03 00:00:00 | 0 | 38110 | 2016-04-05 23:46:48 |
27 | 2016-03-27 18:45:01 | Hat_einer_Ahnung_mit_Ford_Galaxy_HILFE | privat | Angebot | $0 | control | NaN | 2005 | NaN | 0 | NaN | 150,000km | 0 | NaN | ford | NaN | 2016-03-27 00:00:00 | 0 | 66701 | 2016-03-27 18:45:01 |
28 | 2016-03-19 21:56:19 | MINI_Cooper_D | privat | Angebot | $5,250 | control | kleinwagen | 2007 | manuell | 110 | cooper | 150,000km | 7 | diesel | mini | ja | 2016-03-19 00:00:00 | 0 | 15745 | 2016-04-07 14:58:48 |
29 | 2016-04-02 12:45:44 | Mercedes_Benz_E_320_T_CDI_Avantgarde_DPF7_Sitz... | privat | Angebot | $4,999 | test | kombi | 2004 | automatik | 204 | e_klasse | 150,000km | 10 | diesel | mercedes_benz | nein | 2016-04-02 00:00:00 | 0 | 47638 | 2016-04-02 12:45:44 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
49970 | 2016-03-21 22:47:37 | c4_Grand_Picasso_mit_Automatik_Leder_Navi_Temp... | privat | Angebot | $15,800 | control | bus | 2010 | automatik | 136 | c4 | 60,000km | 4 | diesel | citroen | nein | 2016-03-21 00:00:00 | 0 | 14947 | 2016-04-07 04:17:34 |
49971 | 2016-03-29 14:54:12 | W.Lupo_1.0 | privat | Angebot | $950 | test | kleinwagen | 2001 | manuell | 50 | lupo | 150,000km | 4 | benzin | volkswagen | nein | 2016-03-29 00:00:00 | 0 | 65197 | 2016-03-29 20:41:51 |
49972 | 2016-03-26 22:25:23 | Mercedes_Benz_Vito_115_CDI_Extralang_Aut. | privat | Angebot | $3,300 | control | bus | 2004 | automatik | 150 | vito | 150,000km | 10 | diesel | mercedes_benz | ja | 2016-03-26 00:00:00 | 0 | 65326 | 2016-03-28 11:28:18 |
49973 | 2016-03-27 05:32:39 | Mercedes_Benz_SLK_200_Kompressor | privat | Angebot | $6,000 | control | cabrio | 2004 | manuell | 163 | slk | 150,000km | 11 | benzin | mercedes_benz | nein | 2016-03-27 00:00:00 | 0 | 53567 | 2016-03-27 08:25:24 |
49974 | 2016-03-20 10:52:31 | Golf_1_Cabrio_Tuev_Neu_viele_Extras_alles_eing... | privat | Angebot | $0 | control | cabrio | 1983 | manuell | 70 | golf | 150,000km | 2 | benzin | volkswagen | nein | 2016-03-20 00:00:00 | 0 | 8209 | 2016-03-27 19:48:16 |
49975 | 2016-03-27 20:51:39 | Honda_Jazz_1.3_DSi_i_VTEC_IMA_CVT_Comfort | privat | Angebot | $9,700 | control | kleinwagen | 2012 | automatik | 88 | jazz | 100,000km | 11 | hybrid | honda | nein | 2016-03-27 00:00:00 | 0 | 84385 | 2016-04-05 19:45:34 |
49976 | 2016-03-19 18:56:05 | Audi_80_Avant_2.6_E__Vollausstattung!!_Einziga... | privat | Angebot | $5,900 | test | kombi | 1992 | automatik | 150 | 80 | 150,000km | 12 | benzin | audi | nein | 2016-03-19 00:00:00 | 0 | 36100 | 2016-04-07 06:16:44 |
49977 | 2016-03-31 18:37:18 | Mercedes_Benz_C200_Cdi_W203 | privat | Angebot | $5,500 | control | limousine | 2003 | manuell | 116 | c_klasse | 150,000km | 2 | diesel | mercedes_benz | nein | 2016-03-31 00:00:00 | 0 | 33739 | 2016-04-06 12:16:11 |
49978 | 2016-04-04 10:37:14 | Mercedes_Benz_E_200_Classic | privat | Angebot | $900 | control | limousine | 1996 | automatik | 136 | e_klasse | 150,000km | 9 | benzin | mercedes_benz | ja | 2016-04-04 00:00:00 | 0 | 24405 | 2016-04-06 12:44:20 |
49979 | 2016-03-20 18:38:40 | Volkswagen_Polo_1.6_TDI_Style | privat | Angebot | $11,000 | test | kleinwagen | 2011 | manuell | 90 | polo | 70,000km | 11 | diesel | volkswagen | nein | 2016-03-20 00:00:00 | 0 | 48455 | 2016-04-07 01:45:12 |
49980 | 2016-03-12 10:55:54 | Ford_Escort_Turnier_16V | privat | Angebot | $400 | control | kombi | 1995 | manuell | 105 | escort | 125,000km | 3 | benzin | ford | NaN | 2016-03-12 00:00:00 | 0 | 56218 | 2016-04-06 17:16:49 |
49981 | 2016-03-15 09:38:21 | Opel_Astra_Kombi_mit_Anhaengerkupplung | privat | Angebot | $2,000 | control | kombi | 1998 | manuell | 115 | astra | 150,000km | 12 | benzin | opel | nein | 2016-03-15 00:00:00 | 0 | 86859 | 2016-04-05 17:21:46 |
49982 | 2016-03-29 18:51:08 | Skoda_Fabia_4_Tuerer_Bj:2004__85.000Tkm | privat | Angebot | $1,950 | control | kleinwagen | 2004 | manuell | 0 | fabia | 90,000km | 7 | benzin | skoda | NaN | 2016-03-29 00:00:00 | 0 | 45884 | 2016-03-29 18:51:08 |
49983 | 2016-03-06 12:43:04 | Ford_focus_99 | privat | Angebot | $600 | test | kleinwagen | 1999 | manuell | 101 | focus | 150,000km | 4 | benzin | ford | NaN | 2016-03-06 00:00:00 | 0 | 52477 | 2016-03-09 06:16:08 |
49984 | 2016-03-31 22:48:48 | Student_sucht_ein__Anfaengerauto___ab_2000_BJ_... | privat | Angebot | $0 | test | NaN | 2000 | NaN | 0 | NaN | 150,000km | 0 | NaN | sonstige_autos | NaN | 2016-03-31 00:00:00 | 0 | 12103 | 2016-04-02 19:44:53 |
49985 | 2016-04-02 16:38:23 | Verkaufe_meinen_vw_vento! | privat | Angebot | $1,000 | control | NaN | 1995 | automatik | 0 | NaN | 150,000km | 0 | benzin | volkswagen | NaN | 2016-04-02 00:00:00 | 0 | 30900 | 2016-04-06 15:17:52 |
49986 | 2016-04-04 20:46:02 | Chrysler_300C_3.0_CRD_DPF_Automatik_Voll_Ausst... | privat | Angebot | $15,900 | control | limousine | 2010 | automatik | 218 | 300c | 125,000km | 11 | diesel | chrysler | nein | 2016-04-04 00:00:00 | 0 | 73527 | 2016-04-06 23:16:00 |
49987 | 2016-03-22 20:47:27 | Audi_A3_Limousine_2.0_TDI_DPF_Ambition__NAVI__... | privat | Angebot | $21,990 | control | limousine | 2013 | manuell | 150 | a3 | 50,000km | 11 | diesel | audi | nein | 2016-03-22 00:00:00 | 0 | 94362 | 2016-03-26 22:46:06 |
49988 | 2016-03-28 19:49:51 | BMW_330_Ci | privat | Angebot | $9,550 | control | coupe | 2001 | manuell | 231 | 3er | 150,000km | 10 | benzin | bmw | nein | 2016-03-28 00:00:00 | 0 | 83646 | 2016-04-07 02:17:40 |
49989 | 2016-03-11 19:50:37 | VW_Polo_zum_Ausschlachten_oder_Wiederaufbau | privat | Angebot | $150 | test | kleinwagen | 1997 | manuell | 0 | polo | 150,000km | 5 | benzin | volkswagen | ja | 2016-03-11 00:00:00 | 0 | 21244 | 2016-03-12 10:17:55 |
49990 | 2016-03-21 19:54:19 | Mercedes_Benz_A_200__BlueEFFICIENCY__Urban | privat | Angebot | $17,500 | test | limousine | 2012 | manuell | 156 | a_klasse | 30,000km | 12 | benzin | mercedes_benz | nein | 2016-03-21 00:00:00 | 0 | 58239 | 2016-04-06 22:46:57 |
49991 | 2016-03-06 15:25:19 | Kleinwagen | privat | Angebot | $500 | control | NaN | 2016 | manuell | 0 | twingo | 150,000km | 0 | benzin | renault | NaN | 2016-03-06 00:00:00 | 0 | 61350 | 2016-03-06 18:24:19 |
49992 | 2016-03-10 19:37:38 | Fiat_Grande_Punto_1.4_T_Jet_16V_Sport | privat | Angebot | $4,800 | control | kleinwagen | 2009 | manuell | 120 | andere | 125,000km | 9 | lpg | fiat | nein | 2016-03-10 00:00:00 | 0 | 68642 | 2016-03-13 01:44:51 |
49993 | 2016-03-15 18:47:35 | Audi_A3__1_8l__Silber;_schoenes_Fahrzeug | privat | Angebot | $1,650 | control | kleinwagen | 1997 | manuell | 0 | NaN | 150,000km | 7 | benzin | audi | NaN | 2016-03-15 00:00:00 | 0 | 65203 | 2016-04-06 19:46:53 |
49994 | 2016-03-22 17:36:42 | Audi_A6__S6__Avant_4.2_quattro_eventuell_Tausc... | privat | Angebot | $5,000 | control | kombi | 2001 | automatik | 299 | a6 | 150,000km | 1 | benzin | audi | nein | 2016-03-22 00:00:00 | 0 | 46537 | 2016-04-06 08:16:39 |
49995 | 2016-03-27 14:38:19 | Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon | privat | Angebot | $24,900 | control | limousine | 2011 | automatik | 239 | q5 | 100,000km | 1 | diesel | audi | nein | 2016-03-27 00:00:00 | 0 | 82131 | 2016-04-01 13:47:40 |
49996 | 2016-03-28 10:50:25 | Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... | privat | Angebot | $1,980 | control | cabrio | 1996 | manuell | 75 | astra | 150,000km | 5 | benzin | opel | nein | 2016-03-28 00:00:00 | 0 | 44807 | 2016-04-02 14:18:02 |
49997 | 2016-04-02 14:44:48 | Fiat_500_C_1.2_Dualogic_Lounge | privat | Angebot | $13,200 | test | cabrio | 2014 | automatik | 69 | 500 | 5,000km | 11 | benzin | fiat | nein | 2016-04-02 00:00:00 | 0 | 73430 | 2016-04-04 11:47:27 |
49998 | 2016-03-08 19:25:42 | Audi_A3_2.0_TDI_Sportback_Ambition | privat | Angebot | $22,900 | control | kombi | 2013 | manuell | 150 | a3 | 40,000km | 11 | diesel | audi | nein | 2016-03-08 00:00:00 | 0 | 35683 | 2016-04-05 16:45:07 |
49999 | 2016-03-14 00:42:12 | Opel_Vectra_1.6_16V | privat | Angebot | $1,250 | control | limousine | 1996 | manuell | 101 | vectra | 150,000km | 1 | benzin | opel | nein | 2016-03-13 00:00:00 | 0 | 45897 | 2016-04-06 21:18:48 |
50000 rows × 20 columns
# Information about the columns, and the first couple of rows
autos.info()
autos.head()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): dateCrawled 50000 non-null object name 50000 non-null object seller 50000 non-null object offerType 50000 non-null object price 50000 non-null object abtest 50000 non-null object vehicleType 44905 non-null object yearOfRegistration 50000 non-null int64 gearbox 47320 non-null object powerPS 50000 non-null int64 model 47242 non-null object odometer 50000 non-null object monthOfRegistration 50000 non-null int64 fuelType 45518 non-null object brand 50000 non-null object notRepairedDamage 40171 non-null object dateCreated 50000 non-null object nrOfPictures 50000 non-null int64 postalCode 50000 non-null int64 lastSeen 50000 non-null object dtypes: int64(5), object(15) memory usage: 7.6+ MB
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
Some initial observations about the data:
Before further analyzing the data, let's start with cleaning it. We'll start with improving the column names (that is, convert them to a standard convention).
# Print the current column names
autos.columns
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model', 'odometer', 'monthOfRegistration', 'fuelType', 'brand', 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object')
# Clean column names
# Create a cleaning function
def clean_column_name(col):
# direct replacement of some
col = col.replace('yearOfRegistration', 'registration_year')
col = col.replace('monthOfRegistration', 'registration_month')
col = col.replace('notRepairedDamage', 'unrepaired_damage')
col = col.replace('dateCreated', 'ad_created')
# change from camelCase to snake_case for others (just hard-code, given there are only a few)
col = col.replace('dateCrawled', 'date_crawled')
col = col.replace('offerType', 'offer_type')
col = col.replace('vehicleType', 'vehicle_type')
col = col.replace('powerPS', 'power_ps')
col = col.replace('fuelType', 'fuel_type')
col = col.replace('nrOfPictures', 'nr_of_pictures')
col = col.replace('postalCode', 'postal_code')
col = col.replace('lastSeen', 'last_seen')
return col
# Build a list with new column names
updated_column_names = []
for name in autos.columns:
updated_column_names.append(clean_column_name(name))
# Replace the column names
autos.columns = updated_column_names
# Print the first part again to check the result
autos.head()
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
So we have the same data still, but now with consistent and standardized (snake-case) column names. Which will help us with the next steps.
Next let us get some descriptive statistics for all columns.
# Get descriptive statistics about all columns
autos.describe(include = 'all')
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-03-19 17:36:18 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
Observations (of these statistics, in combination with the sample data earlier):
Let's convert price and odometer to numeric values (by taking off the non-numeric characters).
# Remove the 'S' and ',' signs from 'price', and convert it to a float
autos['price'] = autos['price'].str.replace('$', '')
autos['price'] = autos['price'].str.replace(',', '').astype(float)
# Remove the 'km' and ',' signs from 'odometer', and convert it to a float
autos['odometer'] = autos['odometer'].str.replace('km','')
autos['odometer'] = autos['odometer'].str.replace(',','').astype(float)
# Let's rename the odometer column to make clear this is in km
autos.rename({"odometer":"odometer_km"}, axis = 1, inplace = True)
Let's check the statistics again (for numeric columns only; which should now include price and odometer_km.
# Desribe (without include='all', so it will only including numeric columns)
autos.describe()
price | registration_year | power_ps | odometer_km | registration_month | nr_of_pictures | postal_code | |
---|---|---|---|---|---|---|---|
count | 5.000000e+04 | 50000.000000 | 50000.000000 | 50000.000000 | 50000.000000 | 50000.0 | 50000.000000 |
mean | 9.840044e+03 | 2005.073280 | 116.355920 | 125732.700000 | 5.723360 | 0.0 | 50813.627300 |
std | 4.811044e+05 | 105.712813 | 209.216627 | 40042.211706 | 3.711984 | 0.0 | 25779.747957 |
min | 0.000000e+00 | 1000.000000 | 0.000000 | 5000.000000 | 0.000000 | 0.0 | 1067.000000 |
25% | 1.100000e+03 | 1999.000000 | 70.000000 | 125000.000000 | 3.000000 | 0.0 | 30451.000000 |
50% | 2.950000e+03 | 2003.000000 | 105.000000 | 150000.000000 | 6.000000 | 0.0 | 49577.000000 |
75% | 7.200000e+03 | 2008.000000 | 150.000000 | 150000.000000 | 9.000000 | 0.0 | 71540.000000 |
max | 1.000000e+08 | 9999.000000 | 17700.000000 | 150000.000000 | 12.000000 | 0.0 | 99998.000000 |
That appears to have worked. With some cleaning, we have now prepared the data to do further analysis, which we will do in the next section. Some (remarkable) observations already based on the descriptive statistics of the numeric columns:
# print (autos["odometer_km"].unique()) commented out as the other statistics provide better information
# print (autos["odometer_km"].describe()) commented out as the other statistics provide better information
print (autos["odometer_km"].value_counts().sort_index(ascending = True))
5000.0 967 10000.0 264 20000.0 784 30000.0 789 40000.0 819 50000.0 1027 60000.0 1164 70000.0 1230 80000.0 1436 90000.0 1757 100000.0 2169 125000.0 5170 150000.0 32424 Name: odometer_km, dtype: int64
Observation is that there are only 13 unique entries for odemeter, and that the vast majority of these are 150,000 km. One possibly explanation would be that - apparently - people simply don't want to put any higher figure than 150,000 km into their advertisement since that won't sell. And that in reality this value should be interpreted as "150,000 km or above". Another possibility (maybe more likely actually): the website gave a limited number of options only, with "150,000+" being the highest possibly entry. Either way, not much to do about this and since all values are realistic as such, let's leave all values as they are.
Then, let's explore price.
# print (autos["price"].unique()) commented out as the other statistics provide better information
print (autos["price"].describe())
print (autos["price"].value_counts().sort_index(ascending = True))
count 5.000000e+04 mean 9.840044e+03 std 4.811044e+05 min 0.000000e+00 25% 1.100000e+03 50% 2.950000e+03 75% 7.200000e+03 max 1.000000e+08 Name: price, dtype: float64 0.0 1421 1.0 156 2.0 3 3.0 1 5.0 2 8.0 1 9.0 1 10.0 7 11.0 2 12.0 3 13.0 2 14.0 1 15.0 2 17.0 3 18.0 1 20.0 4 25.0 5 29.0 1 30.0 7 35.0 1 40.0 6 45.0 4 47.0 1 49.0 4 50.0 49 55.0 2 59.0 1 60.0 9 65.0 5 66.0 1 ... 151990.0 1 155000.0 1 163500.0 1 163991.0 1 169000.0 1 169999.0 1 175000.0 1 180000.0 1 190000.0 1 194000.0 1 197000.0 1 198000.0 1 220000.0 1 250000.0 1 259000.0 1 265000.0 1 295000.0 1 299000.0 1 345000.0 1 350000.0 1 999990.0 1 999999.0 2 1234566.0 1 1300000.0 1 3890000.0 1 10000000.0 1 11111111.0 2 12345678.0 3 27322222.0 1 99999999.0 1 Name: price, Length: 2357, dtype: int64
Already earlier we had observed that there was a maximum price of (almost) 100 M in the list. And also the statistics (with a very high standard deviation compared to the mean) indicate that some outliers may have very large impact. Then, indeed, we can see that on the upper end of the price range, there appear to be some very extreme values.
All values of 10 M and above can be safely assumed to be incorrect. Below those, we see values of 999,999 up to 3,89 M. Yes, there could be some supercars for sale, but it may as well be incorrect. The big gap between 350,000 and 999,999 seems to be a good point to cut things off.
Let's double check how many that would be.
print ((autos["price"] > 400000).value_counts())
False 49986 True 14 Name: price, dtype: int64
So that is 14 rows. Arguably, we could replace these values by something more realistic, e.g. the average or median of all other prices. Not to loose the other data of these 14 cars. Given the small amount though, let's choose to just get rid of those 14 rows altogether.
# Keep rows with a price below 400,000
autos = autos[autos["price"] < 400000]
Then let's next look at the lower end of the price range. What we could see above is that there are a lot of cars offered for free (price = 0). But also for prices of 1, 2, 3, 5. Let's put in this some categories to get a better view.
print ('Price 0:',(autos[autos["price"] == 0].shape[0]))
print ('Price 1-100:', (autos[autos["price"].between(1,100)].shape[0]))
print ('Price 101-200:', (autos[autos["price"].between(101,200)].shape[0]))
print ('Price 201-300:', (autos[autos["price"].between(201,300)].shape[0]) )
print ('Price 301-400:', (autos[autos["price"].between(301,400)].shape[0]) )
Price 0: 1421 Price 1-100: 475 Price 101-200: 711 Price 201-300: 895 Price 301-400: 898
In a way it could be tempting to throw away the rows with a price of 0, considering it 'incorrect data'. But then, is a price of 1 more realistic? Or a price of 29, 65 or 123? It is hard to set a particular limit for this, and consider everything below that 'outliers' or 'incorrect'. And maybe people do just want to get rid of their old worn-out cars? Available for who still sees some value in it - maybe just for spare parts - and wants to put the effort to collect the cars.
In addition, these low numbers will have much less influence on the statistics as the extreme values at the upper end of the range.
With those two arguments: we'll keep the rows.
Let's look at the new statistics (having removed the 14 price outliers on the upper end of the scale.)
print (autos["price"].describe())
count 49986.000000 mean 5721.525167 std 8983.617820 min 0.000000 25% 1100.000000 50% 2950.000000 75% 7200.000000 max 350000.000000 Name: price, dtype: float64
What we can observe (apart from indeed having 14 rows left and a maximum price of 350,000) is that the average price went down to 5721, coming from 9840 before. That's a very significant change. The massively reduced standard deviation, and the unchanged values for the 25%, 50% and 75% percentiles seem to confirm that we did a good step with removing those 14 row.
So the average price is just above 5700 euro, while the median price is (only) 2950 euro. Only for 25% of the cars, the ask price is 7200 euro or above.
There are several fields that contains dates, let's explore those.
We'll first look at distributions of:
All as percentages of the total.
print(autos['ad_created'].str[:10].value_counts(normalize = True, dropna = False).sort_index())
2015-06-11 0.000020 2015-08-10 0.000020 2015-09-09 0.000020 2015-11-10 0.000020 2015-12-05 0.000020 2015-12-30 0.000020 2016-01-03 0.000020 2016-01-07 0.000020 2016-01-10 0.000040 2016-01-13 0.000020 2016-01-14 0.000020 2016-01-16 0.000020 2016-01-22 0.000020 2016-01-27 0.000060 2016-01-29 0.000020 2016-02-01 0.000020 2016-02-02 0.000040 2016-02-05 0.000040 2016-02-07 0.000020 2016-02-08 0.000020 2016-02-09 0.000040 2016-02-11 0.000020 2016-02-12 0.000060 2016-02-14 0.000040 2016-02-16 0.000020 2016-02-17 0.000020 2016-02-18 0.000040 2016-02-19 0.000060 2016-02-20 0.000040 2016-02-21 0.000060 ... 2016-03-09 0.033229 2016-03-10 0.031869 2016-03-11 0.032789 2016-03-12 0.036610 2016-03-13 0.016925 2016-03-14 0.035230 2016-03-15 0.033749 2016-03-16 0.030008 2016-03-17 0.031189 2016-03-18 0.013724 2016-03-19 0.033849 2016-03-20 0.037871 2016-03-21 0.037691 2016-03-22 0.032769 2016-03-23 0.032189 2016-03-24 0.029088 2016-03-25 0.031889 2016-03-26 0.032569 2016-03-27 0.030909 2016-03-28 0.034970 2016-03-29 0.034110 2016-03-30 0.033449 2016-03-31 0.031909 2016-04-01 0.033809 2016-04-02 0.035090 2016-04-03 0.038931 2016-04-04 0.036850 2016-04-05 0.011843 2016-04-06 0.003261 2016-04-07 0.001280 Name: ad_created, Length: 76, dtype: float64
print(autos['date_crawled'].str[:10].value_counts(normalize = True, dropna = False).sort_index())
2016-03-05 0.025387 2016-03-06 0.013944 2016-03-07 0.035970 2016-03-08 0.033269 2016-03-09 0.033209 2016-03-10 0.032129 2016-03-11 0.032489 2016-03-12 0.036770 2016-03-13 0.015564 2016-03-14 0.036630 2016-03-15 0.033990 2016-03-16 0.029508 2016-03-17 0.031509 2016-03-18 0.013064 2016-03-19 0.034910 2016-03-20 0.037831 2016-03-21 0.037490 2016-03-22 0.032909 2016-03-23 0.032389 2016-03-24 0.029108 2016-03-25 0.031749 2016-03-26 0.032489 2016-03-27 0.031049 2016-03-28 0.034850 2016-03-29 0.034150 2016-03-30 0.033629 2016-03-31 0.031909 2016-04-01 0.033809 2016-04-02 0.035410 2016-04-03 0.038691 2016-04-04 0.036490 2016-04-05 0.013104 2016-04-06 0.003181 2016-04-07 0.001420 Name: date_crawled, dtype: float64
print(autos['last_seen'].str[:10].value_counts(normalize = True, dropna = False).sort_index())
2016-03-05 0.001080 2016-03-06 0.004421 2016-03-07 0.005362 2016-03-08 0.007582 2016-03-09 0.009843 2016-03-10 0.010763 2016-03-11 0.012524 2016-03-12 0.023807 2016-03-13 0.008983 2016-03-14 0.012804 2016-03-15 0.015884 2016-03-16 0.016445 2016-03-17 0.027928 2016-03-18 0.007422 2016-03-19 0.015744 2016-03-20 0.020706 2016-03-21 0.020726 2016-03-22 0.021586 2016-03-23 0.018585 2016-03-24 0.019565 2016-03-25 0.019205 2016-03-26 0.016965 2016-03-27 0.016024 2016-03-28 0.020846 2016-03-29 0.022326 2016-03-30 0.024847 2016-03-31 0.023827 2016-04-01 0.023106 2016-04-02 0.024887 2016-04-03 0.025367 2016-04-04 0.024627 2016-04-05 0.124275 2016-04-06 0.220982 2016-04-07 0.130957 Name: last_seen, dtype: float64
So the crawler was doing its job in the period of 5th March 2016 to 7th april 2016. The ads that it found went back as far as 11th June 2015, some 9 months before that.
We can also observe that the site is actively being used. In the period of crawling, every day new advertisements appeared, and others disappeared. Up to several percents of the total volume of advertisements.
Let's also explore registration_year, being the year that the car was registered (that is, how old are the cars).
autos['registration_year'].describe()
count 49986.000000 mean 2005.075721 std 105.727161 min 1000.000000 25% 1999.000000 50% 2003.000000 75% 2008.000000 max 9999.000000 Name: registration_year, dtype: float64
autos['registration_year'].value_counts().sort_index(ascending=True)
1000 1 1001 1 1111 1 1500 1 1800 2 1910 9 1927 1 1929 1 1931 1 1934 2 1937 4 1938 1 1939 1 1941 2 1943 1 1948 1 1950 3 1951 2 1952 1 1953 1 1954 2 1955 2 1956 5 1957 2 1958 4 1959 7 1960 33 1961 6 1962 4 1963 9 ... 2001 2702 2002 2533 2003 2727 2004 2737 2005 3015 2006 2707 2007 2304 2008 2231 2009 2097 2010 1597 2011 1634 2012 1323 2013 806 2014 665 2015 399 2016 1316 2017 1452 2018 491 2019 3 2800 1 4100 1 4500 1 4800 1 5000 4 5911 1 6200 1 8888 1 9000 2 9996 1 9999 4 Name: registration_year, Length: 97, dtype: int64
The observations seem to be heavily impacted here by some 'outliers', or rather: invalid values. Let us remove all values before 1900. And also all values after 2016. (Given that these were ads in 2016.)
# Create a series with registration year only. (And show it to check it is correct).
reg_year_only = autos['registration_year']
reg_year_only.value_counts().sort_index(ascending=True)
1000 1 1001 1 1111 1 1500 1 1800 2 1910 9 1927 1 1929 1 1931 1 1934 2 1937 4 1938 1 1939 1 1941 2 1943 1 1948 1 1950 3 1951 2 1952 1 1953 1 1954 2 1955 2 1956 5 1957 2 1958 4 1959 7 1960 33 1961 6 1962 4 1963 9 ... 2001 2702 2002 2533 2003 2727 2004 2737 2005 3015 2006 2707 2007 2304 2008 2231 2009 2097 2010 1597 2011 1634 2012 1323 2013 806 2014 665 2015 399 2016 1316 2017 1452 2018 491 2019 3 2800 1 4100 1 4500 1 4800 1 5000 4 5911 1 6200 1 8888 1 9000 2 9996 1 9999 4 Name: registration_year, Length: 97, dtype: int64
# Filter it to values between 1900 and 2016
reg_year_only = reg_year_only[reg_year_only.between(1850, 2016)]
reg_year_only.value_counts().sort_index(ascending=True)
1910 9 1927 1 1929 1 1931 1 1934 2 1937 4 1938 1 1939 1 1941 2 1943 1 1948 1 1950 3 1951 2 1952 1 1953 1 1954 2 1955 2 1956 5 1957 2 1958 4 1959 7 1960 33 1961 6 1962 4 1963 9 1964 12 1965 17 1966 22 1967 27 1968 26 ... 1987 75 1988 142 1989 181 1990 395 1991 356 1992 390 1993 445 1994 660 1995 1312 1996 1444 1997 2028 1998 2453 1999 2998 2000 3354 2001 2702 2002 2533 2003 2727 2004 2737 2005 3015 2006 2707 2007 2304 2008 2231 2009 2097 2010 1597 2011 1634 2012 1323 2013 806 2014 665 2015 399 2016 1316 Name: registration_year, Length: 78, dtype: int64
# Describe the statistics again
reg_year_only.describe()
count 48016.000000 mean 2002.806002 std 7.306212 min 1910.000000 25% 1999.000000 50% 2003.000000 75% 2008.000000 max 2016.000000 Name: registration_year, dtype: float64
That looks better.
What we can see now is that both 'mean' and '50%' are (around) 2003. Cars on sale are on average about 13 years old. 25% of the cars is more than 17 years old, while 25% is less than 8 years old.
# JQ, added later (after sharing my project)
# The above could have been achieved in a more direct way, without first creating separate series for the registration year, with these two commands:
# autos.loc[autos["registration_year"].between(1900, 2016), "registration_year"].describe()
# autos.loc[autos["registration_year"].between (1900, 2016), "registration_year"].value_counts().sort_index(ascending=True)
Let's also look at the registration year as percentages, looking only at the registration years that are most common
reg_year_only.value_counts(normalize=True).head(10).sort_index(ascending=True)
1998 0.051087 1999 0.062438 2000 0.069852 2001 0.056273 2002 0.052753 2003 0.056794 2004 0.057002 2005 0.062792 2006 0.056377 2007 0.047984 Name: registration_year, dtype: float64
Of course, these are the years around the median year 2003, and each of these years accounts for some 5-7% of the total.
Now let us explore more about the brands that are on offer.
print (autos["brand"].unique())
print ('\n','Number of brands: ', len(autos["brand"].unique()))
['peugeot' 'bmw' 'volkswagen' 'smart' 'ford' 'chrysler' 'seat' 'renault' 'mercedes_benz' 'audi' 'sonstige_autos' 'opel' 'mazda' 'porsche' 'mini' 'toyota' 'dacia' 'nissan' 'jeep' 'saab' 'volvo' 'mitsubishi' 'jaguar' 'fiat' 'skoda' 'subaru' 'kia' 'citroen' 'chevrolet' 'hyundai' 'honda' 'daewoo' 'suzuki' 'trabant' 'land_rover' 'alfa_romeo' 'lada' 'rover' 'daihatsu' 'lancia'] Number of brands: 40
autos["brand"].value_counts()
volkswagen 10684 opel 5460 bmw 5428 mercedes_benz 4733 audi 4283 ford 3477 renault 2404 peugeot 1456 fiat 1307 seat 941 skoda 786 mazda 757 nissan 754 smart 701 citroen 700 toyota 617 sonstige_autos 543 hyundai 488 volvo 456 mini 424 mitsubishi 406 honda 399 kia 356 alfa_romeo 329 porsche 294 suzuki 293 chevrolet 283 chrysler 181 dacia 129 daihatsu 128 jeep 110 subaru 109 land_rover 99 saab 80 daewoo 79 trabant 78 jaguar 77 rover 69 lancia 57 lada 31 Name: brand, dtype: int64
autos["brand"].value_counts(normalize = True)
volkswagen 0.213740 opel 0.109231 bmw 0.108590 mercedes_benz 0.094687 audi 0.085684 ford 0.069559 renault 0.048093 peugeot 0.029128 fiat 0.026147 seat 0.018825 skoda 0.015724 mazda 0.015144 nissan 0.015084 smart 0.014024 citroen 0.014004 toyota 0.012343 sonstige_autos 0.010863 hyundai 0.009763 volvo 0.009123 mini 0.008482 mitsubishi 0.008122 honda 0.007982 kia 0.007122 alfa_romeo 0.006582 porsche 0.005882 suzuki 0.005862 chevrolet 0.005662 chrysler 0.003621 dacia 0.002581 daihatsu 0.002561 jeep 0.002201 subaru 0.002181 land_rover 0.001981 saab 0.001600 daewoo 0.001580 trabant 0.001560 jaguar 0.001540 rover 0.001380 lancia 0.001140 lada 0.000620 Name: brand, dtype: float64
Observations:
Data appears to be clean (e.g. not both 'Volkswagen' and 'VW').
For further analysis, let's only consider the 16 brands that have at least 1% of the count. (We also ignore "sonstige_autos"). For these 16 brands, let us explore the average price.
# Take the top 16 brands
brands = autos["brand"].value_counts().head(16).index
# Create a dictionary with the average price for each of these brands
average_price_per_brand = {}
for brand in brands:
selected_rows = autos[autos["brand"] == brand]
brand_mean_price = int(selected_rows["price"].mean())
average_price_per_brand[brand] = brand_mean_price
print (average_price_per_brand)
{'volkswagen': 5158, 'opel': 2845, 'bmw': 8026, 'mercedes_benz': 8389, 'audi': 8965, 'ford': 3626, 'renault': 2351, 'peugeot': 3010, 'fiat': 2697, 'seat': 4219, 'skoda': 6305, 'mazda': 3962, 'nissan': 4588, 'smart': 3482, 'citroen': 3686, 'toyota': 5097}
Quite significant differences between the brands. German high-end brands Audi, BMW and Mercedes-Benz are on top with average prices between 8000 and 9000 euro. While at the other end of the spectrum, Fiat and Renault are averaging at around 2500 euro only. Very popular brands like Volkswagen, Ford and Opel are in between (having significant differences between them though.)
Surprising is that Skoda (a fairly modest Czech car brand), appears to have an average price which comes right next after Audi, BMW, and Mercedes-Benz. With all respect, that looks a bit suspect actually; let's investigate that.
skodas_only = autos[autos["brand"] == 'skoda']
print (skodas_only["price"].describe())
print (skodas_only["price"].value_counts().sort_index(ascending = False).head(20))
count 786.000000 mean 6305.044529 std 5248.937529 min 0.000000 25% 2500.000000 50% 4945.000000 75% 8427.500000 max 32700.000000 Name: price, dtype: float64 32700.0 1 28450.0 1 28000.0 1 27800.0 1 27500.0 2 26499.0 1 25999.0 2 25990.0 1 25000.0 1 22999.0 1 22900.0 2 22490.0 1 22250.0 1 22000.0 1 21900.0 2 21400.0 1 21300.0 1 20999.0 1 20990.0 1 20589.0 1 Name: price, dtype: int64
Nothing looking particularly suspect there actually. No extreme outliers. Okay, so be it. (So does this mean that people already sell off their Skoda when it is still pretty new....?)
Let's get back to the 6 brands that are advertized most: Volkswagen, Opel, BMW, Mercedes-Benz, Audi and Ford. It's of course not really surprising that luxury brands Audi, BMW and Mercedes-Benz are most expensive on average. However, let's check whether mileage (that is: odometer reading in km) can explain some of the price differences.
# Take the top 6 brands
brands = autos["brand"].value_counts().head(6).index
# Create a dictionary with the average price for each of these brands
average_price_per_brand_top6 = {}
for brand in brands:
selected_rows = autos[autos["brand"] == brand]
brand_mean_price = int(selected_rows["price"].mean())
average_price_per_brand_top6[brand] = brand_mean_price
# print (average_price_per_brand_top6) commented out after checking correctness
# Create a dictionary with the average odometer for each of these brands
average_odometerkm_per_brand_top6 = {}
for brand in brands:
selected_rows = autos[autos["brand"] == brand]
brand_mean_odometerkm = int(selected_rows["odometer_km"].mean())
average_odometerkm_per_brand_top6[brand] = brand_mean_odometerkm
# print (average_odometerkm_per_brand_top6) commented out after checking correctness
# Convert this into a dataframe with both price and odometer for the top 6 brands
top6_price = pd.Series(average_price_per_brand_top6)
top6_odometer = pd.Series(average_odometerkm_per_brand_top6)
top6_data = pd.DataFrame(top6_price, columns = ['average_price'])
top6_data['average_odometer_km']= top6_odometer
print (top6_data)
average_price average_odometer_km volkswagen 5158 128949 opel 2845 129294 bmw 8026 132518 mercedes_benz 8389 130882 audi 8965 129643 ford 3626 124153
Those values are actually pretty close, so that clearly isn't any explanation for the price differences. (Actually, we could have known that. Remember that we observed earlier that a very large part of the odometer readings is 150,000 and there are non above. So we do not have true insight in the actual mileages.)
In this study, we've analyzed data of cars-for-sale in a German section of eBay, on a sample of 50,000 advertisements in 2016. Observations were written throughout the analysis above. Some highlights were:
For sure there are many more interesting things to explore in this data - maybe we'll do so on a next occassion!