In this project, we will do an exploratory study of used car ads on eBay Kleinnanzeigen, a classified section of the German eBay website.
Users can use this analysis when buying a used car, to understand whether or not the car seller is asking for a reasonable price.
To understand what drives, car prices, we will try to find out:
After analyzing the data, we can conclude that the higher the mileage, the lower the price of the cars and that car brand can be categorized as follow:
For more details, please refer to the full analysis below.
In this project, we will use [this dataset][1] from Kaggle. This data was scraped from eBay Kleinanzeigen. [1]:https://www.kaggle.com/orgesleka/used-cars-database
import pandas as pd
import numpy as np
autos = pd.read_csv('autos.csv', encoding = "Latin-1")
In the step above, we have imported the auto csv using lating-1 encoding. In the next section, we will print a subset of the dataset.
autos
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
5 | 2016-03-21 13:47:45 | Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto... | privat | Angebot | $7,900 | test | bus | 2006 | automatik | 150 | voyager | 150,000km | 4 | diesel | chrysler | NaN | 2016-03-21 00:00:00 | 0 | 22962 | 2016-04-06 09:45:21 |
6 | 2016-03-20 17:55:21 | VW_Golf_III_GT_Special_Electronic_Green_Metall... | privat | Angebot | $300 | test | limousine | 1995 | manuell | 90 | golf | 150,000km | 8 | benzin | volkswagen | NaN | 2016-03-20 00:00:00 | 0 | 31535 | 2016-03-23 02:48:59 |
7 | 2016-03-16 18:55:19 | Golf_IV_1.9_TDI_90PS | privat | Angebot | $1,990 | control | limousine | 1998 | manuell | 90 | golf | 150,000km | 12 | diesel | volkswagen | nein | 2016-03-16 00:00:00 | 0 | 53474 | 2016-04-07 03:17:32 |
8 | 2016-03-22 16:51:34 | Seat_Arosa | privat | Angebot | $250 | test | NaN | 2000 | manuell | 0 | arosa | 150,000km | 10 | NaN | seat | nein | 2016-03-22 00:00:00 | 0 | 7426 | 2016-03-26 18:18:10 |
9 | 2016-03-16 13:47:02 | Renault_Megane_Scenic_1.6e_RT_Klimaanlage | privat | Angebot | $590 | control | bus | 1997 | manuell | 90 | megane | 150,000km | 7 | benzin | renault | nein | 2016-03-16 00:00:00 | 0 | 15749 | 2016-04-06 10:46:35 |
10 | 2016-03-15 01:41:36 | VW_Golf_Tuning_in_siber/grau | privat | Angebot | $999 | test | NaN | 2017 | manuell | 90 | NaN | 150,000km | 4 | benzin | volkswagen | nein | 2016-03-14 00:00:00 | 0 | 86157 | 2016-04-07 03:16:21 |
11 | 2016-03-16 18:45:34 | Mercedes_A140_Motorschaden | privat | Angebot | $350 | control | NaN | 2000 | NaN | 0 | NaN | 150,000km | 0 | benzin | mercedes_benz | NaN | 2016-03-16 00:00:00 | 0 | 17498 | 2016-03-16 18:45:34 |
12 | 2016-03-31 19:48:22 | Smart_smart_fortwo_coupe_softouch_pure_MHD_Pan... | privat | Angebot | $5,299 | control | kleinwagen | 2010 | automatik | 71 | fortwo | 50,000km | 9 | benzin | smart | nein | 2016-03-31 00:00:00 | 0 | 34590 | 2016-04-06 14:17:52 |
13 | 2016-03-23 10:48:32 | Audi_A3_1.6_tuning | privat | Angebot | $1,350 | control | limousine | 1999 | manuell | 101 | a3 | 150,000km | 11 | benzin | audi | nein | 2016-03-23 00:00:00 | 0 | 12043 | 2016-04-01 14:17:13 |
14 | 2016-03-23 11:50:46 | Renault_Clio_3__Dynamique_1.2__16_V;_viele_Ver... | privat | Angebot | $3,999 | test | kleinwagen | 2007 | manuell | 75 | clio | 150,000km | 9 | benzin | renault | NaN | 2016-03-23 00:00:00 | 0 | 81737 | 2016-04-01 15:46:47 |
15 | 2016-04-01 12:06:20 | Corvette_C3_Coupe_T_Top_Crossfire_Injection | privat | Angebot | $18,900 | test | coupe | 1982 | automatik | 203 | NaN | 80,000km | 6 | benzin | sonstige_autos | nein | 2016-04-01 00:00:00 | 0 | 61276 | 2016-04-02 21:10:48 |
16 | 2016-03-16 14:59:02 | Opel_Vectra_B_Kombi | privat | Angebot | $350 | test | kombi | 1999 | manuell | 101 | vectra | 150,000km | 5 | benzin | opel | nein | 2016-03-16 00:00:00 | 0 | 57299 | 2016-03-18 05:29:37 |
17 | 2016-03-29 11:46:22 | Volkswagen_Scirocco_2_G60 | privat | Angebot | $5,500 | test | coupe | 1990 | manuell | 205 | scirocco | 150,000km | 6 | benzin | volkswagen | nein | 2016-03-29 00:00:00 | 0 | 74821 | 2016-04-05 20:46:26 |
18 | 2016-03-26 19:57:44 | Verkaufen_mein_bmw_e36_320_i_touring | privat | Angebot | $300 | control | bus | 1995 | manuell | 150 | 3er | 150,000km | 0 | benzin | bmw | NaN | 2016-03-26 00:00:00 | 0 | 54329 | 2016-04-02 12:16:41 |
19 | 2016-03-17 13:36:21 | mazda_tribute_2.0_mit_gas_und_tuev_neu_2018 | privat | Angebot | $4,150 | control | suv | 2004 | manuell | 124 | andere | 150,000km | 2 | lpg | mazda | nein | 2016-03-17 00:00:00 | 0 | 40878 | 2016-03-17 14:45:58 |
20 | 2016-03-05 19:57:31 | Audi_A4_Avant_1.9_TDI_*6_Gang*AHK*Klimatronik*... | privat | Angebot | $3,500 | test | kombi | 2003 | manuell | 131 | a4 | 150,000km | 5 | diesel | audi | NaN | 2016-03-05 00:00:00 | 0 | 53913 | 2016-03-07 05:46:46 |
21 | 2016-03-06 19:07:10 | Porsche_911_Carrera_4S_Cabrio | privat | Angebot | $41,500 | test | cabrio | 2004 | manuell | 320 | 911 | 150,000km | 4 | benzin | porsche | nein | 2016-03-06 00:00:00 | 0 | 65428 | 2016-04-05 23:46:19 |
22 | 2016-03-28 20:50:54 | MINI_Cooper_S_Cabrio | privat | Angebot | $25,450 | control | cabrio | 2015 | manuell | 184 | cooper | 10,000km | 1 | benzin | mini | nein | 2016-03-28 00:00:00 | 0 | 44789 | 2016-04-01 06:45:30 |
23 | 2016-03-10 19:55:34 | Peugeot_Boxer_2_2_HDi_120_Ps_9_Sitzer_inkl_Klima | privat | Angebot | $7,999 | control | bus | 2010 | manuell | 120 | NaN | 150,000km | 2 | diesel | peugeot | nein | 2016-03-10 00:00:00 | 0 | 30900 | 2016-03-17 08:45:17 |
24 | 2016-04-03 11:57:02 | BMW_535i_xDrive_Sport_Aut. | privat | Angebot | $48,500 | control | limousine | 2014 | automatik | 306 | 5er | 30,000km | 12 | benzin | bmw | nein | 2016-04-03 00:00:00 | 0 | 22547 | 2016-04-07 13:16:50 |
25 | 2016-03-21 21:56:18 | Ford_escort_kombi_an_bastler_mit_ghia_ausstattung | privat | Angebot | $90 | control | kombi | 1996 | manuell | 116 | NaN | 150,000km | 4 | benzin | ford | ja | 2016-03-21 00:00:00 | 0 | 27574 | 2016-04-01 05:16:49 |
26 | 2016-04-03 22:46:28 | Volkswagen_Polo_Fox | privat | Angebot | $777 | control | kleinwagen | 1992 | manuell | 54 | polo | 125,000km | 2 | benzin | volkswagen | nein | 2016-04-03 00:00:00 | 0 | 38110 | 2016-04-05 23:46:48 |
27 | 2016-03-27 18:45:01 | Hat_einer_Ahnung_mit_Ford_Galaxy_HILFE | privat | Angebot | $0 | control | NaN | 2005 | NaN | 0 | NaN | 150,000km | 0 | NaN | ford | NaN | 2016-03-27 00:00:00 | 0 | 66701 | 2016-03-27 18:45:01 |
28 | 2016-03-19 21:56:19 | MINI_Cooper_D | privat | Angebot | $5,250 | control | kleinwagen | 2007 | manuell | 110 | cooper | 150,000km | 7 | diesel | mini | ja | 2016-03-19 00:00:00 | 0 | 15745 | 2016-04-07 14:58:48 |
29 | 2016-04-02 12:45:44 | Mercedes_Benz_E_320_T_CDI_Avantgarde_DPF7_Sitz... | privat | Angebot | $4,999 | test | kombi | 2004 | automatik | 204 | e_klasse | 150,000km | 10 | diesel | mercedes_benz | nein | 2016-04-02 00:00:00 | 0 | 47638 | 2016-04-02 12:45:44 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
49970 | 2016-03-21 22:47:37 | c4_Grand_Picasso_mit_Automatik_Leder_Navi_Temp... | privat | Angebot | $15,800 | control | bus | 2010 | automatik | 136 | c4 | 60,000km | 4 | diesel | citroen | nein | 2016-03-21 00:00:00 | 0 | 14947 | 2016-04-07 04:17:34 |
49971 | 2016-03-29 14:54:12 | W.Lupo_1.0 | privat | Angebot | $950 | test | kleinwagen | 2001 | manuell | 50 | lupo | 150,000km | 4 | benzin | volkswagen | nein | 2016-03-29 00:00:00 | 0 | 65197 | 2016-03-29 20:41:51 |
49972 | 2016-03-26 22:25:23 | Mercedes_Benz_Vito_115_CDI_Extralang_Aut. | privat | Angebot | $3,300 | control | bus | 2004 | automatik | 150 | vito | 150,000km | 10 | diesel | mercedes_benz | ja | 2016-03-26 00:00:00 | 0 | 65326 | 2016-03-28 11:28:18 |
49973 | 2016-03-27 05:32:39 | Mercedes_Benz_SLK_200_Kompressor | privat | Angebot | $6,000 | control | cabrio | 2004 | manuell | 163 | slk | 150,000km | 11 | benzin | mercedes_benz | nein | 2016-03-27 00:00:00 | 0 | 53567 | 2016-03-27 08:25:24 |
49974 | 2016-03-20 10:52:31 | Golf_1_Cabrio_Tuev_Neu_viele_Extras_alles_eing... | privat | Angebot | $0 | control | cabrio | 1983 | manuell | 70 | golf | 150,000km | 2 | benzin | volkswagen | nein | 2016-03-20 00:00:00 | 0 | 8209 | 2016-03-27 19:48:16 |
49975 | 2016-03-27 20:51:39 | Honda_Jazz_1.3_DSi_i_VTEC_IMA_CVT_Comfort | privat | Angebot | $9,700 | control | kleinwagen | 2012 | automatik | 88 | jazz | 100,000km | 11 | hybrid | honda | nein | 2016-03-27 00:00:00 | 0 | 84385 | 2016-04-05 19:45:34 |
49976 | 2016-03-19 18:56:05 | Audi_80_Avant_2.6_E__Vollausstattung!!_Einziga... | privat | Angebot | $5,900 | test | kombi | 1992 | automatik | 150 | 80 | 150,000km | 12 | benzin | audi | nein | 2016-03-19 00:00:00 | 0 | 36100 | 2016-04-07 06:16:44 |
49977 | 2016-03-31 18:37:18 | Mercedes_Benz_C200_Cdi_W203 | privat | Angebot | $5,500 | control | limousine | 2003 | manuell | 116 | c_klasse | 150,000km | 2 | diesel | mercedes_benz | nein | 2016-03-31 00:00:00 | 0 | 33739 | 2016-04-06 12:16:11 |
49978 | 2016-04-04 10:37:14 | Mercedes_Benz_E_200_Classic | privat | Angebot | $900 | control | limousine | 1996 | automatik | 136 | e_klasse | 150,000km | 9 | benzin | mercedes_benz | ja | 2016-04-04 00:00:00 | 0 | 24405 | 2016-04-06 12:44:20 |
49979 | 2016-03-20 18:38:40 | Volkswagen_Polo_1.6_TDI_Style | privat | Angebot | $11,000 | test | kleinwagen | 2011 | manuell | 90 | polo | 70,000km | 11 | diesel | volkswagen | nein | 2016-03-20 00:00:00 | 0 | 48455 | 2016-04-07 01:45:12 |
49980 | 2016-03-12 10:55:54 | Ford_Escort_Turnier_16V | privat | Angebot | $400 | control | kombi | 1995 | manuell | 105 | escort | 125,000km | 3 | benzin | ford | NaN | 2016-03-12 00:00:00 | 0 | 56218 | 2016-04-06 17:16:49 |
49981 | 2016-03-15 09:38:21 | Opel_Astra_Kombi_mit_Anhaengerkupplung | privat | Angebot | $2,000 | control | kombi | 1998 | manuell | 115 | astra | 150,000km | 12 | benzin | opel | nein | 2016-03-15 00:00:00 | 0 | 86859 | 2016-04-05 17:21:46 |
49982 | 2016-03-29 18:51:08 | Skoda_Fabia_4_Tuerer_Bj:2004__85.000Tkm | privat | Angebot | $1,950 | control | kleinwagen | 2004 | manuell | 0 | fabia | 90,000km | 7 | benzin | skoda | NaN | 2016-03-29 00:00:00 | 0 | 45884 | 2016-03-29 18:51:08 |
49983 | 2016-03-06 12:43:04 | Ford_focus_99 | privat | Angebot | $600 | test | kleinwagen | 1999 | manuell | 101 | focus | 150,000km | 4 | benzin | ford | NaN | 2016-03-06 00:00:00 | 0 | 52477 | 2016-03-09 06:16:08 |
49984 | 2016-03-31 22:48:48 | Student_sucht_ein__Anfaengerauto___ab_2000_BJ_... | privat | Angebot | $0 | test | NaN | 2000 | NaN | 0 | NaN | 150,000km | 0 | NaN | sonstige_autos | NaN | 2016-03-31 00:00:00 | 0 | 12103 | 2016-04-02 19:44:53 |
49985 | 2016-04-02 16:38:23 | Verkaufe_meinen_vw_vento! | privat | Angebot | $1,000 | control | NaN | 1995 | automatik | 0 | NaN | 150,000km | 0 | benzin | volkswagen | NaN | 2016-04-02 00:00:00 | 0 | 30900 | 2016-04-06 15:17:52 |
49986 | 2016-04-04 20:46:02 | Chrysler_300C_3.0_CRD_DPF_Automatik_Voll_Ausst... | privat | Angebot | $15,900 | control | limousine | 2010 | automatik | 218 | 300c | 125,000km | 11 | diesel | chrysler | nein | 2016-04-04 00:00:00 | 0 | 73527 | 2016-04-06 23:16:00 |
49987 | 2016-03-22 20:47:27 | Audi_A3_Limousine_2.0_TDI_DPF_Ambition__NAVI__... | privat | Angebot | $21,990 | control | limousine | 2013 | manuell | 150 | a3 | 50,000km | 11 | diesel | audi | nein | 2016-03-22 00:00:00 | 0 | 94362 | 2016-03-26 22:46:06 |
49988 | 2016-03-28 19:49:51 | BMW_330_Ci | privat | Angebot | $9,550 | control | coupe | 2001 | manuell | 231 | 3er | 150,000km | 10 | benzin | bmw | nein | 2016-03-28 00:00:00 | 0 | 83646 | 2016-04-07 02:17:40 |
49989 | 2016-03-11 19:50:37 | VW_Polo_zum_Ausschlachten_oder_Wiederaufbau | privat | Angebot | $150 | test | kleinwagen | 1997 | manuell | 0 | polo | 150,000km | 5 | benzin | volkswagen | ja | 2016-03-11 00:00:00 | 0 | 21244 | 2016-03-12 10:17:55 |
49990 | 2016-03-21 19:54:19 | Mercedes_Benz_A_200__BlueEFFICIENCY__Urban | privat | Angebot | $17,500 | test | limousine | 2012 | manuell | 156 | a_klasse | 30,000km | 12 | benzin | mercedes_benz | nein | 2016-03-21 00:00:00 | 0 | 58239 | 2016-04-06 22:46:57 |
49991 | 2016-03-06 15:25:19 | Kleinwagen | privat | Angebot | $500 | control | NaN | 2016 | manuell | 0 | twingo | 150,000km | 0 | benzin | renault | NaN | 2016-03-06 00:00:00 | 0 | 61350 | 2016-03-06 18:24:19 |
49992 | 2016-03-10 19:37:38 | Fiat_Grande_Punto_1.4_T_Jet_16V_Sport | privat | Angebot | $4,800 | control | kleinwagen | 2009 | manuell | 120 | andere | 125,000km | 9 | lpg | fiat | nein | 2016-03-10 00:00:00 | 0 | 68642 | 2016-03-13 01:44:51 |
49993 | 2016-03-15 18:47:35 | Audi_A3__1_8l__Silber;_schoenes_Fahrzeug | privat | Angebot | $1,650 | control | kleinwagen | 1997 | manuell | 0 | NaN | 150,000km | 7 | benzin | audi | NaN | 2016-03-15 00:00:00 | 0 | 65203 | 2016-04-06 19:46:53 |
49994 | 2016-03-22 17:36:42 | Audi_A6__S6__Avant_4.2_quattro_eventuell_Tausc... | privat | Angebot | $5,000 | control | kombi | 2001 | automatik | 299 | a6 | 150,000km | 1 | benzin | audi | nein | 2016-03-22 00:00:00 | 0 | 46537 | 2016-04-06 08:16:39 |
49995 | 2016-03-27 14:38:19 | Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon | privat | Angebot | $24,900 | control | limousine | 2011 | automatik | 239 | q5 | 100,000km | 1 | diesel | audi | nein | 2016-03-27 00:00:00 | 0 | 82131 | 2016-04-01 13:47:40 |
49996 | 2016-03-28 10:50:25 | Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... | privat | Angebot | $1,980 | control | cabrio | 1996 | manuell | 75 | astra | 150,000km | 5 | benzin | opel | nein | 2016-03-28 00:00:00 | 0 | 44807 | 2016-04-02 14:18:02 |
49997 | 2016-04-02 14:44:48 | Fiat_500_C_1.2_Dualogic_Lounge | privat | Angebot | $13,200 | test | cabrio | 2014 | automatik | 69 | 500 | 5,000km | 11 | benzin | fiat | nein | 2016-04-02 00:00:00 | 0 | 73430 | 2016-04-04 11:47:27 |
49998 | 2016-03-08 19:25:42 | Audi_A3_2.0_TDI_Sportback_Ambition | privat | Angebot | $22,900 | control | kombi | 2013 | manuell | 150 | a3 | 40,000km | 11 | diesel | audi | nein | 2016-03-08 00:00:00 | 0 | 35683 | 2016-04-05 16:45:07 |
49999 | 2016-03-14 00:42:12 | Opel_Vectra_1.6_16V | privat | Angebot | $1,250 | control | limousine | 1996 | manuell | 101 | vectra | 150,000km | 1 | benzin | opel | nein | 2016-03-13 00:00:00 | 0 | 45897 | 2016-04-06 21:18:48 |
50000 rows × 20 columns
As we can see, the data set has 50 000 rows and 20 columns. See below information for each column:
In our next step, we are going to explore the autos
dataframe.
autos.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): dateCrawled 50000 non-null object name 50000 non-null object seller 50000 non-null object offerType 50000 non-null object price 50000 non-null object abtest 50000 non-null object vehicleType 44905 non-null object yearOfRegistration 50000 non-null int64 gearbox 47320 non-null object powerPS 50000 non-null int64 model 47242 non-null object odometer 50000 non-null object monthOfRegistration 50000 non-null int64 fuelType 45518 non-null object brand 50000 non-null object notRepairedDamage 40171 non-null object dateCreated 50000 non-null object nrOfPictures 50000 non-null int64 postalCode 50000 non-null int64 lastSeen 50000 non-null object dtypes: int64(5), object(15) memory usage: 7.6+ MB
From the table above we can see that the data is either integers or objects. We can also see that some colunm names have a mix of upper and lower case and use camelcase instead of snakecase
Moreover, we can see that the columns vehicleType
, gearbox
, model
, fuelType
, notRepairedDamage
have some missing values.
In the next step, we are going to reword some of the column names using snake case.
autos.columns =['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
'kilometer', 'registration_mont', 'fuel_type', 'brand',
'unrepaired_damage', 'ad_created', 'nr_pictures', 'postal_code',
'last_seen']
autos.head(3)
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | kilometer | registration_mont | fuel_type | brand | unrepaired_damage | ad_created | nr_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
In the step above, we have updated the column names to snake case, as it has been proved to make it easier to read than camelcase.
In the next step, we are going to do some further data exploration to determine what other cleaning tasks need to be done.
autos.describe(include ='all')
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | kilometer | registration_mont | fuel_type | brand | unrepaired_damage | ad_created | nr_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-04-02 11:37:04 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
From the exploration above, we can deduce that we need to do some further data cleaning.
The columns seller
, offer_type
, and number of pictures
all contain mostly one value only. Therefore they can be dropped as they don't have useful information for the analysis.
Moreover, we can see that the price
and Kilometer
columns have numeric data stored as text. Therefore we will need to clean up this column.
Furthermore, we will need to investigate the columns, registration_year
further, as it has a maximum of 9999 and minimum of 1000, and registration_month
, as it has a minimum of 0, whereas all months should be from 1 to 12.
We will start by converting the columns' price
and kilometer
to numeric values.
print(autos['price'].head(10))
0 $5,000 1 $8,500 2 $8,990 3 $4,350 4 $1,350 5 $7,900 6 $300 7 $1,990 8 $250 9 $590 Name: price, dtype: object
Values in the price
columns are text with the special character $ and a comma as the thousand separators. To convert the values to numerical values, we will have to remove the comma and the special character.
#removing special characters and converting to float
autos['price'] = autos['price'].str.replace("$",'').str.replace(",","")
autos['price'] = autos['price'].astype(float)
autos['price'].describe()
count 5.000000e+04 mean 9.840044e+03 std 4.811044e+05 min 0.000000e+00 25% 1.100000e+03 50% 2.950000e+03 75% 7.200000e+03 max 1.000000e+08 Name: price, dtype: float64
In the step above, we have successfully converted price
column into the float type. However, looking at the summary statistics, we can see that some of the values will need to be corrected as the minimum price is 0, and the maximum price is 100 M euros.
the next step is to convert kilometer
column to numeric value.
print(autos['kilometer'].head(10))
0 150,000km 1 150,000km 2 70,000km 3 70,000km 4 150,000km 5 150,000km 6 150,000km 7 150,000km 8 150,000km 9 150,000km Name: kilometer, dtype: object
Values in kilometer
column have a thousand separator comma and finish with the characters km. in the next step, we will remove the comma and km to convert the values to integers.
autos['kilometer'] = autos['kilometer'].str.replace("km",'').str.replace(",","")
autos['kilometer'] = autos['kilometer'].astype(int)
autos['kilometer'].describe()
count 50000.000000 mean 125732.700000 std 40042.211706 min 5000.000000 25% 125000.000000 50% 150000.000000 75% 150000.000000 max 150000.000000 Name: kilometer, dtype: float64
In the step above, we have converted kilometer values to float type.
We will continue our data cleaning by looking for data that doesn't look right.
As we saw above, some of the price information doesn't look correct as the min and max values are unrealistic. Therefore we are going to proceed to remove those outliers.
unique_price= autos['price'].unique().shape[0]
print(' There are '+ str(unique_price)+' prices in autos dataset')
There are 2357 prices in autos dataset
autos['price'].value_counts().sort_index(ascending= True).head(10)
0.0 1421 1.0 156 2.0 3 3.0 1 5.0 2 8.0 1 9.0 1 10.0 7 11.0 2 12.0 3 Name: price, dtype: int64
autos['price'].value_counts().sort_index(ascending= False).head(10)
99999999.0 1 27322222.0 1 12345678.0 3 11111111.0 2 10000000.0 1 3890000.0 1 1300000.0 1 1234566.0 1 999999.0 2 999990.0 1 Name: price, dtype: int64
autos['price'].describe()
count 5.000000e+04 mean 9.840044e+03 std 4.811044e+05 min 0.000000e+00 25% 1.100000e+03 50% 2.950000e+03 75% 7.200000e+03 max 1.000000e+08 Name: price, dtype: float64
Looking at the statics summary, we can see that one-quarter of the values are less than 1 100 euros and 3 quarters of the values are less or equal to 7 200 euros. To avoid data loss for our analysis, we are going to check if there are car brands with average prices above 10 000 euros
for brand in autos['brand'].unique():
selected_row=autos.loc[autos['brand'] == brand]
average_price = selected_row['price'].mean()
if average_price > 10000:
output= '{} brand average price is {:,.2f}'.format(brand,average_price)
print(output)
else:
pass
mercedes_benz brand average price is 29,511.96 sonstige_autos brand average price is 38,300.84 porsche brand average price is 44,537.98 mini brand average price is 10,392.39 jeep brand average price is 11,363.21 volvo brand average price is 31,689.91 jaguar brand average price is 11,076.51 fiat brand average price is 12,134.21 citroen brand average price is 42,657.46 land_rover brand average price is 18,934.27
As we can see from the data above, some brands have average prices of up to 44 000 euros. We do not want to remove this data; therefore for the outliers, we are going to remove rows prices below 1 000 euros and rows with price above 50 000 euros
autos = autos[autos["price"].between(1000,50000)]
autos['price'].describe()
count 38440.000000 mean 6863.851067 std 6973.665588 min 1000.000000 25% 2200.000000 50% 4300.000000 75% 8900.000000 max 50000.000000 Name: price, dtype: float64
After removing the price
outliers, autos data set has 38440 rows.
Next, we are goign to explore tha columns that contain date information.
There are 5 columns that represent date values. The columns date_crawled
, ad_created
and last_seen
are string values.
We are going to explore those 3 columns to understand the dataset date range.
autos['date_crawled'].str[:10].value_counts(normalize=True,dropna=False).sort_index()
2016-03-05 0.025572 2016-03-06 0.013918 2016-03-07 0.035120 2016-03-08 0.032622 2016-03-09 0.032492 2016-03-10 0.033455 2016-03-11 0.032752 2016-03-12 0.037461 2016-03-13 0.016051 2016-03-14 0.036707 2016-03-15 0.033637 2016-03-16 0.029058 2016-03-17 0.030463 2016-03-18 0.012825 2016-03-19 0.035146 2016-03-20 0.038241 2016-03-21 0.037227 2016-03-22 0.032466 2016-03-23 0.032128 2016-03-24 0.029032 2016-03-25 0.030489 2016-03-26 0.033143 2016-03-27 0.031322 2016-03-28 0.035354 2016-03-29 0.034001 2016-03-30 0.033117 2016-03-31 0.031374 2016-04-01 0.034573 2016-04-02 0.036238 2016-04-03 0.039126 2016-04-04 0.036863 2016-04-05 0.013241 2016-04-06 0.003278 2016-04-07 0.001509 Name: date_crawled, dtype: float64
From the table above, we can see that date_crawled
date range goes from March 5th 2016 to April 7th 2016.
In the next step, we will explore ad_created
.
autos['ad_created'].str[:10].value_counts(normalize=True,dropna=False).sort_index()
2015-06-11 0.000026 2015-08-10 0.000026 2015-09-09 0.000026 2015-11-10 0.000026 2015-12-30 0.000026 2016-01-03 0.000026 2016-01-07 0.000026 2016-01-10 0.000052 2016-01-13 0.000026 2016-01-14 0.000026 2016-01-16 0.000026 2016-01-22 0.000026 2016-01-27 0.000078 2016-01-29 0.000026 2016-02-01 0.000026 2016-02-02 0.000052 2016-02-05 0.000052 2016-02-07 0.000026 2016-02-09 0.000026 2016-02-11 0.000026 2016-02-12 0.000052 2016-02-14 0.000052 2016-02-16 0.000026 2016-02-17 0.000026 2016-02-18 0.000052 2016-02-19 0.000078 2016-02-20 0.000026 2016-02-21 0.000052 2016-02-22 0.000026 2016-02-23 0.000104 ... 2016-03-09 0.032700 2016-03-10 0.033117 2016-03-11 0.033012 2016-03-12 0.037201 2016-03-13 0.017716 2016-03-14 0.035016 2016-03-15 0.033481 2016-03-16 0.029631 2016-03-17 0.030125 2016-03-18 0.013293 2016-03-19 0.034079 2016-03-20 0.038345 2016-03-21 0.037513 2016-03-22 0.032258 2016-03-23 0.031868 2016-03-24 0.029032 2016-03-25 0.030645 2016-03-26 0.033221 2016-03-27 0.031191 2016-03-28 0.035406 2016-03-29 0.033975 2016-03-30 0.032986 2016-03-31 0.031530 2016-04-01 0.034417 2016-04-02 0.035900 2016-04-03 0.039412 2016-04-04 0.037305 2016-04-05 0.011863 2016-04-06 0.003382 2016-04-07 0.001327 Name: ad_created, Length: 74, dtype: float64
From the table above, we can see that ad_created
date range goes from June 11th 2015 to April 7th 2016.
In the next step, we will explore `last_seen.
autos['last_seen'].str[:10].value_counts(normalize=True,dropna=False).sort_index()
2016-03-05 0.001093 2016-03-06 0.003590 2016-03-07 0.004527 2016-03-08 0.006270 2016-03-09 0.008897 2016-03-10 0.009860 2016-03-11 0.011785 2016-03-12 0.022216 2016-03-13 0.008429 2016-03-14 0.012019 2016-03-15 0.015010 2016-03-16 0.015427 2016-03-17 0.026509 2016-03-18 0.007362 2016-03-19 0.014646 2016-03-20 0.019771 2016-03-21 0.019745 2016-03-22 0.020838 2016-03-23 0.017924 2016-03-24 0.018522 2016-03-25 0.017794 2016-03-26 0.016077 2016-03-27 0.014074 2016-03-28 0.019459 2016-03-29 0.020864 2016-03-30 0.023491 2016-03-31 0.022737 2016-04-01 0.023283 2016-04-02 0.024974 2016-04-03 0.024454 2016-04-04 0.023491 2016-04-05 0.130645 2016-04-06 0.234547 2016-04-07 0.139672 Name: last_seen, dtype: float64
From the table above, we can see that last_seen
date range goes from March 5th 2016 to April 7th 2016.
For our next step, we are going to explore the data in registration_year
.
autos['registration_year'].describe()
count 38440.000000 mean 2005.681009 std 86.888904 min 1000.000000 25% 2001.000000 50% 2005.000000 75% 2009.000000 max 9999.000000 Name: registration_year, dtype: float64
From the summary statistics above, we can see that the minimum year is 1000 and the maximum year is 9999; this suggests that some rows have errors in registration_year
autos['registration_year'].value_counts().sort_index()
1000 1 1001 1 1927 1 1929 1 1931 1 1934 2 1937 4 1938 1 1939 1 1941 2 1943 1 1948 1 1950 1 1951 1 1952 1 1953 1 1954 2 1956 4 1957 2 1958 2 1959 6 1960 17 1961 6 1962 4 1963 7 1964 8 1965 17 1966 20 1967 23 1968 25 ... 1999 1723 2000 1999 2001 2064 2002 2134 2003 2476 2004 2605 2005 2784 2006 2650 2007 2252 2008 2193 2009 2071 2010 1573 2011 1609 2012 1289 2013 774 2014 633 2015 331 2016 643 2017 1008 2018 396 2019 1 2800 1 4100 1 4500 1 5000 2 5911 1 6200 1 8888 1 9000 1 9999 2 Name: registration_year, Length: 90, dtype: int64
For the accuracy of our analysis, we will remove the rows which registration year is 1000 and 1001. Moreover, form the data range analysis done previously, the ads were created in 2016, suggesting that rows with registration_year
after 2016 are inaccurate; therefore, we will also remove those rows.
autos =autos[autos["registration_year"].between(1927,2016)]
autos['registration_year'].value_counts(normalize=True)
2005 0.075199 2006 0.071579 2004 0.070364 2003 0.066879 2007 0.060829 2008 0.059235 2002 0.057641 2009 0.055940 2001 0.055751 2000 0.053995 1999 0.046540 2011 0.043461 2010 0.042488 2012 0.034817 1998 0.034547 1997 0.021177 2013 0.020906 2016 0.017368 2014 0.017098 1996 0.014586 1995 0.011912 2015 0.008941 1994 0.007320 1993 0.005861 1992 0.005834 1991 0.005564 1990 0.004997 1989 0.003322 1988 0.002836 1985 0.002107 ... 1976 0.000567 1966 0.000540 1977 0.000513 1975 0.000486 1965 0.000459 1960 0.000459 1969 0.000432 1964 0.000216 1963 0.000189 1961 0.000162 1959 0.000162 1937 0.000108 1962 0.000108 1956 0.000108 1954 0.000054 1941 0.000054 1958 0.000054 1957 0.000054 1934 0.000054 1951 0.000027 1953 0.000027 1943 0.000027 1927 0.000027 1929 0.000027 1931 0.000027 1950 0.000027 1948 0.000027 1938 0.000027 1939 0.000027 1952 0.000027 Name: registration_year, Length: 76, dtype: float64
The data shows that the majority of cars in the Ebay dataset have been registered between the years 2000 and 2008.
In this project, we are interested in understanding if the price asked for a car is reasonable. First, we will start by exploring what the most commom car brands are.
autos['brand'].value_counts(normalize= True)
volkswagen 0.211793 bmw 0.125601 mercedes_benz 0.111339 audi 0.097483 opel 0.089514 ford 0.058830 renault 0.037437 peugeot 0.028037 fiat 0.021177 skoda 0.019151 seat 0.017368 smart 0.016693 toyota 0.014694 mazda 0.014316 citroen 0.013965 nissan 0.013695 mini 0.010939 hyundai 0.010804 sonstige_autos 0.010129 volvo 0.009022 kia 0.007725 honda 0.007374 mitsubishi 0.006915 chevrolet 0.006618 alfa_romeo 0.006240 suzuki 0.005753 porsche 0.005240 dacia 0.003295 chrysler 0.003187 jeep 0.002755 land_rover 0.002485 jaguar 0.001810 subaru 0.001729 daihatsu 0.001702 saab 0.001378 daewoo 0.000918 trabant 0.000864 rover 0.000729 lancia 0.000675 lada 0.000621 Name: brand, dtype: float64
As we can see, 21% of the ad are for volkswagen
, followed by bmw
and mercedes-benz
.
For the rest of the analysis we will focus on brands that present at least 2% of the total ads.
in the next step, we will calculate the average price for those top brands
avg_price_brand ={}
top_brands = autos['brand'].value_counts(normalize= True)
top_brands = top_brands[top_brands> 0.02]
for brand in top_brands.index:
selected_row = autos.loc[autos['brand'] == brand]
average_price = selected_row['price'].mean()
avg_price_brand[brand] = round(average_price,2)
avg_price_brand
{'audi': 9981.2, 'bmw': 8859.2, 'fiat': 4008.17, 'ford': 5143.46, 'mercedes_benz': 8833.54, 'opel': 4219.95, 'peugeot': 3955.17, 'renault': 3526.43, 'volkswagen': 6630.4}
From the data above, we can see that audi
, mercedes_benz
and bmw
brands have the highest average prices, from 9 981euros to 8 833 euros. volkswagen
and ford
are in the mid-range pricing, with average prices from 6 630 to 5 143 euros. fiat
, opel
, peugeot
and renault are the cheapest brands.
In the next section, we are going to explore the impact of mileage on the average price of a car.
avg_price_mileage ={}
for mileage in autos['kilometer'].unique():
selected_row = autos.loc[autos['kilometer'] == mileage]
average_price = selected_row['price'].mean()
avg_price_mileage[str(mileage)] = round(average_price,2)
avg_price_mileage
{'10000': 18913.46, '100000': 8687.72, '125000': 6946.01, '150000': 4815.88, '20000': 16619.09, '30000': 15104.49, '40000': 14498.52, '5000': 8903.3, '50000': 13367.42, '60000': 11942.39, '70000': 11224.25, '80000': 10135.09, '90000': 9009.52}
From the data above, we can see that the higher the mileage, the lower the rate. For cars from 10 000km to 80 000km, prices the average price range from 18 913 to 10 135. For cars with mileage from 90 000 km and above, the average price range from 9 009 to 4 815.
The average price for 5 000 km seems against this trend; therefore, further, we need to dig deeper to understand what is causing this outlier.
autos[autos['kilometer']== 5000].describe(include='all')
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | kilometer | registration_mont | fuel_type | brand | unrepaired_damage | ad_created | nr_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 455 | 455 | 455 | 455 | 455.000000 | 455 | 393 | 455.000000 | 401 | 455.000000 | 407 | 455.0 | 455.000000 | 392 | 455 | 310 | 455 | 455.0 | 455.000000 | 455 |
unique | 455 | 444 | 1 | 1 | NaN | 2 | 8 | NaN | 2 | NaN | 114 | NaN | NaN | 6 | 37 | 2 | 36 | NaN | NaN | 455 |
top | 2016-04-03 12:48:58 | Volkswagen_Kaefer | privat | Angebot | NaN | test | limousine | NaN | manuell | NaN | andere | NaN | NaN | benzin | volkswagen | nein | 2016-03-24 00:00:00 | NaN | NaN | 2016-03-24 08:50:39 |
freq | 1 | 4 | 455 | 455 | NaN | 244 | 97 | NaN | 310 | NaN | 62 | NaN | NaN | 289 | 84 | 278 | 22 | NaN | NaN | 1 |
mean | NaN | NaN | NaN | NaN | 8903.298901 | NaN | NaN | 2000.821978 | NaN | 111.523077 | NaN | 5000.0 | 4.767033 | NaN | NaN | NaN | NaN | 0.0 | 48631.197802 | NaN |
std | NaN | NaN | NaN | NaN | 9850.754764 | NaN | NaN | 15.731683 | NaN | 162.425866 | NaN | 0.0 | 3.910706 | NaN | NaN | NaN | NaN | 0.0 | 25983.876250 | NaN |
min | NaN | NaN | NaN | NaN | 1000.000000 | NaN | NaN | 1927.000000 | NaN | 0.000000 | NaN | 5000.0 | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1069.000000 | NaN |
25% | NaN | NaN | NaN | NaN | 2100.000000 | NaN | NaN | 1998.000000 | NaN | 50.000000 | NaN | 5000.0 | 1.000000 | NaN | NaN | NaN | NaN | 0.0 | 26650.000000 | NaN |
50% | NaN | NaN | NaN | NaN | 4800.000000 | NaN | NaN | 2004.000000 | NaN | 101.000000 | NaN | 5000.0 | 4.000000 | NaN | NaN | NaN | NaN | 0.0 | 49626.000000 | NaN |
75% | NaN | NaN | NaN | NaN | 12000.000000 | NaN | NaN | 2014.000000 | NaN | 143.500000 | NaN | 5000.0 | 8.000000 | NaN | NaN | NaN | NaN | 0.0 | 67614.500000 | NaN |
max | NaN | NaN | NaN | NaN | 50000.000000 | NaN | NaN | 2016.000000 | NaN | 1998.000000 | NaN | 5000.0 | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99974.000000 | NaN |
From the table above, we can see that there are 455 car ads with 5 000km. We can see that the minimum price is 1000, which seems very low. These low prices could be skewing the average price.
At this point, we can say that generally, higher mileage translates into lower prices.
.
In this project, we have analyzed used car ads data from eBay Kleinnanzeigen to understand what are some of the drivers of the price. We can conclude that the higher the mileage, the lower the price of the cars and that car brand can be categorized as follow:
audi
, mercedes_benz
and bmw
brands have the highest average pricesvolkswagen
and ford
are in the mid-range pricing.fiat
, opel
, peugeot
and renault
are the cheapest brands.