In this guided project, we'll work with a dataset of used cars from eBay Kleinanzeigen, a classifieds section of the German eBay website.
The data dictionary provided with data is as follows:
dateCrawled
- When this ad was first crawled. All field-values are taken from this date.name
- Name of the car.seller
- Whether the seller is private or a dealer.offerType
- The type of listingprice
- The price on the ad to sell the car.abtest
- Whether the listing is included in an A/B test.vehicleType
- The vehicle Type.yearOfRegistration
- The year in which the car was first registered.gearbox
- The transmission type.powerPS
- The power of the car in PS.model
- The car model name.kilometer
- How many kilometers the car has driven.monthOfRegistration
- The month in which the car was first registered.fuelType
- What type of fuel the car uses.brand
- The brand of the car.notRepairedDamage
- If the car has a damage which is not yet repaired.dateCreated
- The date for which the ad at ebay was created.nrOfPictures
- The number of pictures in the ad.postalCode
- The postal code for the location of the vehicle.lastSeenOnline
- When the crawler saw this ad last online.The aim of this project is to clean the data and analyze the used car listings.
Let's get our dataset. We get an encoding error with UTF-8
, so we read the file with Latin-1
:
import numpy as np
import pandas as pd
autos = pd.read_csv('autos.csv', encoding='Latin-1')
autos
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
5 | 2016-03-21 13:47:45 | Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto... | privat | Angebot | $7,900 | test | bus | 2006 | automatik | 150 | voyager | 150,000km | 4 | diesel | chrysler | NaN | 2016-03-21 00:00:00 | 0 | 22962 | 2016-04-06 09:45:21 |
6 | 2016-03-20 17:55:21 | VW_Golf_III_GT_Special_Electronic_Green_Metall... | privat | Angebot | $300 | test | limousine | 1995 | manuell | 90 | golf | 150,000km | 8 | benzin | volkswagen | NaN | 2016-03-20 00:00:00 | 0 | 31535 | 2016-03-23 02:48:59 |
7 | 2016-03-16 18:55:19 | Golf_IV_1.9_TDI_90PS | privat | Angebot | $1,990 | control | limousine | 1998 | manuell | 90 | golf | 150,000km | 12 | diesel | volkswagen | nein | 2016-03-16 00:00:00 | 0 | 53474 | 2016-04-07 03:17:32 |
8 | 2016-03-22 16:51:34 | Seat_Arosa | privat | Angebot | $250 | test | NaN | 2000 | manuell | 0 | arosa | 150,000km | 10 | NaN | seat | nein | 2016-03-22 00:00:00 | 0 | 7426 | 2016-03-26 18:18:10 |
9 | 2016-03-16 13:47:02 | Renault_Megane_Scenic_1.6e_RT_Klimaanlage | privat | Angebot | $590 | control | bus | 1997 | manuell | 90 | megane | 150,000km | 7 | benzin | renault | nein | 2016-03-16 00:00:00 | 0 | 15749 | 2016-04-06 10:46:35 |
10 | 2016-03-15 01:41:36 | VW_Golf_Tuning_in_siber/grau | privat | Angebot | $999 | test | NaN | 2017 | manuell | 90 | NaN | 150,000km | 4 | benzin | volkswagen | nein | 2016-03-14 00:00:00 | 0 | 86157 | 2016-04-07 03:16:21 |
11 | 2016-03-16 18:45:34 | Mercedes_A140_Motorschaden | privat | Angebot | $350 | control | NaN | 2000 | NaN | 0 | NaN | 150,000km | 0 | benzin | mercedes_benz | NaN | 2016-03-16 00:00:00 | 0 | 17498 | 2016-03-16 18:45:34 |
12 | 2016-03-31 19:48:22 | Smart_smart_fortwo_coupe_softouch_pure_MHD_Pan... | privat | Angebot | $5,299 | control | kleinwagen | 2010 | automatik | 71 | fortwo | 50,000km | 9 | benzin | smart | nein | 2016-03-31 00:00:00 | 0 | 34590 | 2016-04-06 14:17:52 |
13 | 2016-03-23 10:48:32 | Audi_A3_1.6_tuning | privat | Angebot | $1,350 | control | limousine | 1999 | manuell | 101 | a3 | 150,000km | 11 | benzin | audi | nein | 2016-03-23 00:00:00 | 0 | 12043 | 2016-04-01 14:17:13 |
14 | 2016-03-23 11:50:46 | Renault_Clio_3__Dynamique_1.2__16_V;_viele_Ver... | privat | Angebot | $3,999 | test | kleinwagen | 2007 | manuell | 75 | clio | 150,000km | 9 | benzin | renault | NaN | 2016-03-23 00:00:00 | 0 | 81737 | 2016-04-01 15:46:47 |
15 | 2016-04-01 12:06:20 | Corvette_C3_Coupe_T_Top_Crossfire_Injection | privat | Angebot | $18,900 | test | coupe | 1982 | automatik | 203 | NaN | 80,000km | 6 | benzin | sonstige_autos | nein | 2016-04-01 00:00:00 | 0 | 61276 | 2016-04-02 21:10:48 |
16 | 2016-03-16 14:59:02 | Opel_Vectra_B_Kombi | privat | Angebot | $350 | test | kombi | 1999 | manuell | 101 | vectra | 150,000km | 5 | benzin | opel | nein | 2016-03-16 00:00:00 | 0 | 57299 | 2016-03-18 05:29:37 |
17 | 2016-03-29 11:46:22 | Volkswagen_Scirocco_2_G60 | privat | Angebot | $5,500 | test | coupe | 1990 | manuell | 205 | scirocco | 150,000km | 6 | benzin | volkswagen | nein | 2016-03-29 00:00:00 | 0 | 74821 | 2016-04-05 20:46:26 |
18 | 2016-03-26 19:57:44 | Verkaufen_mein_bmw_e36_320_i_touring | privat | Angebot | $300 | control | bus | 1995 | manuell | 150 | 3er | 150,000km | 0 | benzin | bmw | NaN | 2016-03-26 00:00:00 | 0 | 54329 | 2016-04-02 12:16:41 |
19 | 2016-03-17 13:36:21 | mazda_tribute_2.0_mit_gas_und_tuev_neu_2018 | privat | Angebot | $4,150 | control | suv | 2004 | manuell | 124 | andere | 150,000km | 2 | lpg | mazda | nein | 2016-03-17 00:00:00 | 0 | 40878 | 2016-03-17 14:45:58 |
20 | 2016-03-05 19:57:31 | Audi_A4_Avant_1.9_TDI_*6_Gang*AHK*Klimatronik*... | privat | Angebot | $3,500 | test | kombi | 2003 | manuell | 131 | a4 | 150,000km | 5 | diesel | audi | NaN | 2016-03-05 00:00:00 | 0 | 53913 | 2016-03-07 05:46:46 |
21 | 2016-03-06 19:07:10 | Porsche_911_Carrera_4S_Cabrio | privat | Angebot | $41,500 | test | cabrio | 2004 | manuell | 320 | 911 | 150,000km | 4 | benzin | porsche | nein | 2016-03-06 00:00:00 | 0 | 65428 | 2016-04-05 23:46:19 |
22 | 2016-03-28 20:50:54 | MINI_Cooper_S_Cabrio | privat | Angebot | $25,450 | control | cabrio | 2015 | manuell | 184 | cooper | 10,000km | 1 | benzin | mini | nein | 2016-03-28 00:00:00 | 0 | 44789 | 2016-04-01 06:45:30 |
23 | 2016-03-10 19:55:34 | Peugeot_Boxer_2_2_HDi_120_Ps_9_Sitzer_inkl_Klima | privat | Angebot | $7,999 | control | bus | 2010 | manuell | 120 | NaN | 150,000km | 2 | diesel | peugeot | nein | 2016-03-10 00:00:00 | 0 | 30900 | 2016-03-17 08:45:17 |
24 | 2016-04-03 11:57:02 | BMW_535i_xDrive_Sport_Aut. | privat | Angebot | $48,500 | control | limousine | 2014 | automatik | 306 | 5er | 30,000km | 12 | benzin | bmw | nein | 2016-04-03 00:00:00 | 0 | 22547 | 2016-04-07 13:16:50 |
25 | 2016-03-21 21:56:18 | Ford_escort_kombi_an_bastler_mit_ghia_ausstattung | privat | Angebot | $90 | control | kombi | 1996 | manuell | 116 | NaN | 150,000km | 4 | benzin | ford | ja | 2016-03-21 00:00:00 | 0 | 27574 | 2016-04-01 05:16:49 |
26 | 2016-04-03 22:46:28 | Volkswagen_Polo_Fox | privat | Angebot | $777 | control | kleinwagen | 1992 | manuell | 54 | polo | 125,000km | 2 | benzin | volkswagen | nein | 2016-04-03 00:00:00 | 0 | 38110 | 2016-04-05 23:46:48 |
27 | 2016-03-27 18:45:01 | Hat_einer_Ahnung_mit_Ford_Galaxy_HILFE | privat | Angebot | $0 | control | NaN | 2005 | NaN | 0 | NaN | 150,000km | 0 | NaN | ford | NaN | 2016-03-27 00:00:00 | 0 | 66701 | 2016-03-27 18:45:01 |
28 | 2016-03-19 21:56:19 | MINI_Cooper_D | privat | Angebot | $5,250 | control | kleinwagen | 2007 | manuell | 110 | cooper | 150,000km | 7 | diesel | mini | ja | 2016-03-19 00:00:00 | 0 | 15745 | 2016-04-07 14:58:48 |
29 | 2016-04-02 12:45:44 | Mercedes_Benz_E_320_T_CDI_Avantgarde_DPF7_Sitz... | privat | Angebot | $4,999 | test | kombi | 2004 | automatik | 204 | e_klasse | 150,000km | 10 | diesel | mercedes_benz | nein | 2016-04-02 00:00:00 | 0 | 47638 | 2016-04-02 12:45:44 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
49970 | 2016-03-21 22:47:37 | c4_Grand_Picasso_mit_Automatik_Leder_Navi_Temp... | privat | Angebot | $15,800 | control | bus | 2010 | automatik | 136 | c4 | 60,000km | 4 | diesel | citroen | nein | 2016-03-21 00:00:00 | 0 | 14947 | 2016-04-07 04:17:34 |
49971 | 2016-03-29 14:54:12 | W.Lupo_1.0 | privat | Angebot | $950 | test | kleinwagen | 2001 | manuell | 50 | lupo | 150,000km | 4 | benzin | volkswagen | nein | 2016-03-29 00:00:00 | 0 | 65197 | 2016-03-29 20:41:51 |
49972 | 2016-03-26 22:25:23 | Mercedes_Benz_Vito_115_CDI_Extralang_Aut. | privat | Angebot | $3,300 | control | bus | 2004 | automatik | 150 | vito | 150,000km | 10 | diesel | mercedes_benz | ja | 2016-03-26 00:00:00 | 0 | 65326 | 2016-03-28 11:28:18 |
49973 | 2016-03-27 05:32:39 | Mercedes_Benz_SLK_200_Kompressor | privat | Angebot | $6,000 | control | cabrio | 2004 | manuell | 163 | slk | 150,000km | 11 | benzin | mercedes_benz | nein | 2016-03-27 00:00:00 | 0 | 53567 | 2016-03-27 08:25:24 |
49974 | 2016-03-20 10:52:31 | Golf_1_Cabrio_Tuev_Neu_viele_Extras_alles_eing... | privat | Angebot | $0 | control | cabrio | 1983 | manuell | 70 | golf | 150,000km | 2 | benzin | volkswagen | nein | 2016-03-20 00:00:00 | 0 | 8209 | 2016-03-27 19:48:16 |
49975 | 2016-03-27 20:51:39 | Honda_Jazz_1.3_DSi_i_VTEC_IMA_CVT_Comfort | privat | Angebot | $9,700 | control | kleinwagen | 2012 | automatik | 88 | jazz | 100,000km | 11 | hybrid | honda | nein | 2016-03-27 00:00:00 | 0 | 84385 | 2016-04-05 19:45:34 |
49976 | 2016-03-19 18:56:05 | Audi_80_Avant_2.6_E__Vollausstattung!!_Einziga... | privat | Angebot | $5,900 | test | kombi | 1992 | automatik | 150 | 80 | 150,000km | 12 | benzin | audi | nein | 2016-03-19 00:00:00 | 0 | 36100 | 2016-04-07 06:16:44 |
49977 | 2016-03-31 18:37:18 | Mercedes_Benz_C200_Cdi_W203 | privat | Angebot | $5,500 | control | limousine | 2003 | manuell | 116 | c_klasse | 150,000km | 2 | diesel | mercedes_benz | nein | 2016-03-31 00:00:00 | 0 | 33739 | 2016-04-06 12:16:11 |
49978 | 2016-04-04 10:37:14 | Mercedes_Benz_E_200_Classic | privat | Angebot | $900 | control | limousine | 1996 | automatik | 136 | e_klasse | 150,000km | 9 | benzin | mercedes_benz | ja | 2016-04-04 00:00:00 | 0 | 24405 | 2016-04-06 12:44:20 |
49979 | 2016-03-20 18:38:40 | Volkswagen_Polo_1.6_TDI_Style | privat | Angebot | $11,000 | test | kleinwagen | 2011 | manuell | 90 | polo | 70,000km | 11 | diesel | volkswagen | nein | 2016-03-20 00:00:00 | 0 | 48455 | 2016-04-07 01:45:12 |
49980 | 2016-03-12 10:55:54 | Ford_Escort_Turnier_16V | privat | Angebot | $400 | control | kombi | 1995 | manuell | 105 | escort | 125,000km | 3 | benzin | ford | NaN | 2016-03-12 00:00:00 | 0 | 56218 | 2016-04-06 17:16:49 |
49981 | 2016-03-15 09:38:21 | Opel_Astra_Kombi_mit_Anhaengerkupplung | privat | Angebot | $2,000 | control | kombi | 1998 | manuell | 115 | astra | 150,000km | 12 | benzin | opel | nein | 2016-03-15 00:00:00 | 0 | 86859 | 2016-04-05 17:21:46 |
49982 | 2016-03-29 18:51:08 | Skoda_Fabia_4_Tuerer_Bj:2004__85.000Tkm | privat | Angebot | $1,950 | control | kleinwagen | 2004 | manuell | 0 | fabia | 90,000km | 7 | benzin | skoda | NaN | 2016-03-29 00:00:00 | 0 | 45884 | 2016-03-29 18:51:08 |
49983 | 2016-03-06 12:43:04 | Ford_focus_99 | privat | Angebot | $600 | test | kleinwagen | 1999 | manuell | 101 | focus | 150,000km | 4 | benzin | ford | NaN | 2016-03-06 00:00:00 | 0 | 52477 | 2016-03-09 06:16:08 |
49984 | 2016-03-31 22:48:48 | Student_sucht_ein__Anfaengerauto___ab_2000_BJ_... | privat | Angebot | $0 | test | NaN | 2000 | NaN | 0 | NaN | 150,000km | 0 | NaN | sonstige_autos | NaN | 2016-03-31 00:00:00 | 0 | 12103 | 2016-04-02 19:44:53 |
49985 | 2016-04-02 16:38:23 | Verkaufe_meinen_vw_vento! | privat | Angebot | $1,000 | control | NaN | 1995 | automatik | 0 | NaN | 150,000km | 0 | benzin | volkswagen | NaN | 2016-04-02 00:00:00 | 0 | 30900 | 2016-04-06 15:17:52 |
49986 | 2016-04-04 20:46:02 | Chrysler_300C_3.0_CRD_DPF_Automatik_Voll_Ausst... | privat | Angebot | $15,900 | control | limousine | 2010 | automatik | 218 | 300c | 125,000km | 11 | diesel | chrysler | nein | 2016-04-04 00:00:00 | 0 | 73527 | 2016-04-06 23:16:00 |
49987 | 2016-03-22 20:47:27 | Audi_A3_Limousine_2.0_TDI_DPF_Ambition__NAVI__... | privat | Angebot | $21,990 | control | limousine | 2013 | manuell | 150 | a3 | 50,000km | 11 | diesel | audi | nein | 2016-03-22 00:00:00 | 0 | 94362 | 2016-03-26 22:46:06 |
49988 | 2016-03-28 19:49:51 | BMW_330_Ci | privat | Angebot | $9,550 | control | coupe | 2001 | manuell | 231 | 3er | 150,000km | 10 | benzin | bmw | nein | 2016-03-28 00:00:00 | 0 | 83646 | 2016-04-07 02:17:40 |
49989 | 2016-03-11 19:50:37 | VW_Polo_zum_Ausschlachten_oder_Wiederaufbau | privat | Angebot | $150 | test | kleinwagen | 1997 | manuell | 0 | polo | 150,000km | 5 | benzin | volkswagen | ja | 2016-03-11 00:00:00 | 0 | 21244 | 2016-03-12 10:17:55 |
49990 | 2016-03-21 19:54:19 | Mercedes_Benz_A_200__BlueEFFICIENCY__Urban | privat | Angebot | $17,500 | test | limousine | 2012 | manuell | 156 | a_klasse | 30,000km | 12 | benzin | mercedes_benz | nein | 2016-03-21 00:00:00 | 0 | 58239 | 2016-04-06 22:46:57 |
49991 | 2016-03-06 15:25:19 | Kleinwagen | privat | Angebot | $500 | control | NaN | 2016 | manuell | 0 | twingo | 150,000km | 0 | benzin | renault | NaN | 2016-03-06 00:00:00 | 0 | 61350 | 2016-03-06 18:24:19 |
49992 | 2016-03-10 19:37:38 | Fiat_Grande_Punto_1.4_T_Jet_16V_Sport | privat | Angebot | $4,800 | control | kleinwagen | 2009 | manuell | 120 | andere | 125,000km | 9 | lpg | fiat | nein | 2016-03-10 00:00:00 | 0 | 68642 | 2016-03-13 01:44:51 |
49993 | 2016-03-15 18:47:35 | Audi_A3__1_8l__Silber;_schoenes_Fahrzeug | privat | Angebot | $1,650 | control | kleinwagen | 1997 | manuell | 0 | NaN | 150,000km | 7 | benzin | audi | NaN | 2016-03-15 00:00:00 | 0 | 65203 | 2016-04-06 19:46:53 |
49994 | 2016-03-22 17:36:42 | Audi_A6__S6__Avant_4.2_quattro_eventuell_Tausc... | privat | Angebot | $5,000 | control | kombi | 2001 | automatik | 299 | a6 | 150,000km | 1 | benzin | audi | nein | 2016-03-22 00:00:00 | 0 | 46537 | 2016-04-06 08:16:39 |
49995 | 2016-03-27 14:38:19 | Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon | privat | Angebot | $24,900 | control | limousine | 2011 | automatik | 239 | q5 | 100,000km | 1 | diesel | audi | nein | 2016-03-27 00:00:00 | 0 | 82131 | 2016-04-01 13:47:40 |
49996 | 2016-03-28 10:50:25 | Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... | privat | Angebot | $1,980 | control | cabrio | 1996 | manuell | 75 | astra | 150,000km | 5 | benzin | opel | nein | 2016-03-28 00:00:00 | 0 | 44807 | 2016-04-02 14:18:02 |
49997 | 2016-04-02 14:44:48 | Fiat_500_C_1.2_Dualogic_Lounge | privat | Angebot | $13,200 | test | cabrio | 2014 | automatik | 69 | 500 | 5,000km | 11 | benzin | fiat | nein | 2016-04-02 00:00:00 | 0 | 73430 | 2016-04-04 11:47:27 |
49998 | 2016-03-08 19:25:42 | Audi_A3_2.0_TDI_Sportback_Ambition | privat | Angebot | $22,900 | control | kombi | 2013 | manuell | 150 | a3 | 40,000km | 11 | diesel | audi | nein | 2016-03-08 00:00:00 | 0 | 35683 | 2016-04-05 16:45:07 |
49999 | 2016-03-14 00:42:12 | Opel_Vectra_1.6_16V | privat | Angebot | $1,250 | control | limousine | 1996 | manuell | 101 | vectra | 150,000km | 1 | benzin | opel | nein | 2016-03-13 00:00:00 | 0 | 45897 | 2016-04-06 21:18:48 |
50000 rows × 20 columns
autos.info()
autos.head()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): dateCrawled 50000 non-null object name 50000 non-null object seller 50000 non-null object offerType 50000 non-null object price 50000 non-null object abtest 50000 non-null object vehicleType 44905 non-null object yearOfRegistration 50000 non-null int64 gearbox 47320 non-null object powerPS 50000 non-null int64 model 47242 non-null object odometer 50000 non-null object monthOfRegistration 50000 non-null int64 fuelType 45518 non-null object brand 50000 non-null object notRepairedDamage 40171 non-null object dateCreated 50000 non-null object nrOfPictures 50000 non-null int64 postalCode 50000 non-null int64 lastSeen 50000 non-null object dtypes: int64(5), object(15) memory usage: 7.6+ MB
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
We can see:
yearOfRegistration
, powerPS
, monthOfRegistration
, nrOfPictures
, postalCode
).vehicleType
, gearbox
, model
, fuelType
, notRepairedDamage
.We'll convert the column names from camelcase to snakecase and reword some of the column names based on the data dictionary to be more descriptive.
Let's print an array of the existing column names:
autos.columns
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model', 'odometer', 'monthOfRegistration', 'fuelType', 'brand', 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object')
We rename the columns and assign the modified column names back to the autos.columns
:
autos.columns = ['ad_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
'odometer', 'registration_month', 'fuel_type', 'brand',
'unrepaired_damage', 'ad_created', 'nr_pictures', 'postal_code',
'ad_last_seen']
autos.head()
ad_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_pictures | postal_code | ad_last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
Now let's do some basic data exploration to determine what other cleaning tasks need to be done. Initially we will look for:
autos.describe(include='all')
ad_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_pictures | postal_code | ad_last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-03-05 16:57:05 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
Note the following observations:
seller
, offer_type
) can be excluded from the research, because each of them has same value for all rows except one row.price
and odometer
columns to numeric type.price
and odometer
columns for outliers.ad_crawled
, ad_created
, ad_last_seen
, registration_month
, registration_year
).We'll drop the seller
and offer_type
columns which have mostly one value.
autos.drop(columns=['seller', 'offer_type'], inplace=True)
autos.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 18 columns): ad_crawled 50000 non-null object name 50000 non-null object price 50000 non-null object abtest 50000 non-null object vehicle_type 44905 non-null object registration_year 50000 non-null int64 gearbox 47320 non-null object power_ps 50000 non-null int64 model 47242 non-null object odometer 50000 non-null object registration_month 50000 non-null int64 fuel_type 45518 non-null object brand 50000 non-null object unrepaired_damage 40171 non-null object ad_created 50000 non-null object nr_pictures 50000 non-null int64 postal_code 50000 non-null int64 ad_last_seen 50000 non-null object dtypes: int64(5), object(13) memory usage: 6.9+ MB
Let's convert the price
and odometer
columns to numeric type.
print('price col:\n', autos['price'].value_counts().head(10), '\n')
autos['price'] = (autos['price']
.str.replace('$', '')
.str.replace(',', '')
.astype(float)
)
print(autos['price'].value_counts().head(10), '\n')
price col: $0 1421 $500 781 $1,500 734 $2,500 643 $1,000 639 $1,200 639 $600 531 $800 498 $3,500 498 $2,000 460 Name: price, dtype: int64 0.0 1421 500.0 781 1500.0 734 2500.0 643 1200.0 639 1000.0 639 600.0 531 800.0 498 3500.0 498 2000.0 460 Name: price, dtype: int64
We found that all prices are in US dollars, so we'll rename the column price
to price_usd
.
autos.rename(columns={'price' : 'price_usd'}, inplace=True)
autos.columns
Index(['ad_crawled', 'name', 'price_usd', 'abtest', 'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model', 'odometer', 'registration_month', 'fuel_type', 'brand', 'unrepaired_damage', 'ad_created', 'nr_pictures', 'postal_code', 'ad_last_seen'], dtype='object')
And for the odometer
column:
print('odometer col:\n', autos['odometer'].value_counts().head(10), '\n')
autos['odometer'] = (autos['odometer']
.str.replace('km', '')
.str.replace(',', '')
.astype(float)
)
print(autos['odometer'].value_counts().head(10), '\n')
odometer col: 150,000km 32424 125,000km 5170 100,000km 2169 90,000km 1757 80,000km 1436 70,000km 1230 60,000km 1164 50,000km 1027 5,000km 967 40,000km 819 Name: odometer, dtype: int64 150000.0 32424 125000.0 5170 100000.0 2169 90000.0 1757 80000.0 1436 70000.0 1230 60000.0 1164 50000.0 1027 5000.0 967 40000.0 819 Name: odometer, dtype: int64
We found that all distances are in kilometers, so we'll rename the column odometer
to odometer_km
.
autos.rename(columns={'odometer' : 'odometer_km'}, inplace=True)
autos.columns
Index(['ad_crawled', 'name', 'price_usd', 'abtest', 'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model', 'odometer_km', 'registration_month', 'fuel_type', 'brand', 'unrepaired_damage', 'ad_created', 'nr_pictures', 'postal_code', 'ad_last_seen'], dtype='object')
We'll start by analyzing the price_usd
and odometer_km
columns. Here's the steps we'll take:
print('price_usd col:\nNumber of unique values: ',
autos['price_usd'].unique().shape, '\n')
print((autos['price_usd']
.describe()
.apply('{0:.2f}'.format)
), '\n')
print('Lowest values:')
print((autos['price_usd']
.value_counts()
.sort_index(ascending=True)
.head(20)
), '\n')
print('Highest values:')
print((autos['price_usd']
.value_counts()
.sort_index(ascending=False)
.head(12)
))
price_usd col: Number of unique values: (2357,) count 50000.00 mean 9840.04 std 481104.38 min 0.00 25% 1100.00 50% 2950.00 75% 7200.00 max 99999999.00 Name: price_usd, dtype: object Lowest values: 0.0 1421 1.0 156 2.0 3 3.0 1 5.0 2 8.0 1 9.0 1 10.0 7 11.0 2 12.0 3 13.0 2 14.0 1 15.0 2 17.0 3 18.0 1 20.0 4 25.0 5 29.0 1 30.0 7 35.0 1 Name: price_usd, dtype: int64 Highest values: 99999999.0 1 27322222.0 1 12345678.0 3 11111111.0 2 10000000.0 1 3890000.0 1 1300000.0 1 1234566.0 1 999999.0 2 999990.0 1 350000.0 1 345000.0 1 Name: price_usd, dtype: int64
The selling price for sale cannot be zero. Also, the highest values 99999999.0
, 27322222.0
, 12345678.0
, 11111111.0
, 10000000.0
seem to be specified at random. So, the values of the price_usd
column less 1USD and more 4,000,000USD don't look right. We'll remove these outliers from the data set.
autos = autos[autos['price_usd'].between(1,4000000)]
print((autos['price_usd']
.describe()
.apply('{0:.2f}'.format)
), '\n')
print('Lowest values:')
print((autos['price_usd']
.value_counts()
.sort_index(ascending=True)
.head(20)
), '\n')
print('Highest values:')
print((autos['price_usd']
.value_counts()
.sort_index(ascending=False)
.head(12)
))
count 48571.00 mean 6082.24 std 22786.98 min 1.00 25% 1200.00 50% 3000.00 75% 7490.00 max 3890000.00 Name: price_usd, dtype: object Lowest values: 1.0 156 2.0 3 3.0 1 5.0 2 8.0 1 9.0 1 10.0 7 11.0 2 12.0 3 13.0 2 14.0 1 15.0 2 17.0 3 18.0 1 20.0 4 25.0 5 29.0 1 30.0 7 35.0 1 40.0 6 Name: price_usd, dtype: int64 Highest values: 3890000.0 1 1300000.0 1 1234566.0 1 999999.0 2 999990.0 1 350000.0 1 345000.0 1 299000.0 1 295000.0 1 265000.0 1 259000.0 1 250000.0 1 Name: price_usd, dtype: int64
Now, the new prices look real. The values such as 1
, 2
, 3
, 5
, 50
are questionable, but we leave them for further research.
Let's look at the odometer_km
column.
print('odometer_km col:\nNumber of unique values: ',
autos['odometer_km'].unique().shape, '\n')
print((autos['odometer_km']
.describe()
.apply('{0:.2f}'.format)
), '\n')
print('Lowest values:')
print((autos['odometer_km']
.value_counts()
.sort_index(ascending=True)
.head(20)
), '\n')
print('Highest values:')
print((autos['odometer_km']
.value_counts()
.sort_index(ascending=False)
.head(12)
))
odometer_km col: Number of unique values: (13,) count 48571.00 mean 125767.54 std 39791.89 min 5000.00 25% 125000.00 50% 150000.00 75% 150000.00 max 150000.00 Name: odometer_km, dtype: object Lowest values: 5000.0 837 10000.0 253 20000.0 762 30000.0 780 40000.0 815 50000.0 1013 60000.0 1155 70000.0 1217 80000.0 1415 90000.0 1734 100000.0 2115 125000.0 5058 150000.0 31417 Name: odometer_km, dtype: int64 Highest values: 150000.0 31417 125000.0 5058 100000.0 2115 90000.0 1734 80000.0 1415 70000.0 1217 60000.0 1155 50000.0 1013 40000.0 815 30000.0 780 20000.0 762 10000.0 253 Name: odometer_km, dtype: int64
The mileage values are between 5000km
and 150,000km
, that is normal for used cars. The values of the odometer_km
column look realistic. We don't need to clean them.
There are 5 columns that should represent date values. Some of these columns were created by the crawler, some came from the website itself. We can differentiate by referring to the data dictionary:
date_crawled
: added by the crawlerlast_seen
: added by the crawlerad_created
: from the websiteregistration_month
: from the websiteregistration_year
: from the websiteThe date_crawled
, last_seen
, and ad_created
columns are all identified as string values by pandas. Because these three columns are represented as strings, we need to convert the data into a numerical representation so we can understand it quantitatively. The other two columns are represented as numeric values, so we can use methods like Series.describe()
to understand the distribution without any extra data processing.
We'll start by exploring the date_crawled
, last_seen
and ad_created
columns, which are now the ad_crawled
, ad_last_seen
and ad_created
columns respectively.
autos[['ad_crawled','ad_created','ad_last_seen']][0:5]
ad_crawled | ad_created | ad_last_seen | |
---|---|---|---|
0 | 2016-03-26 17:47:46 | 2016-03-26 00:00:00 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | 2016-04-04 00:00:00 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | 2016-03-26 00:00:00 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | 2016-03-12 00:00:00 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | 2016-04-01 00:00:00 | 2016-04-01 14:38:50 |
The first 10 characters represent the day (e.g. 2016-03-12
). To understand the date range, we can extract just the date values and then sort by the index.
print('ad_crawled col:\n')
print((autos['ad_crawled'].str[:10]
.describe()
), '\n')
print('Earliest dates:')
print((autos['ad_crawled'].str[:10]
.value_counts(normalize=True, dropna=False)
.sort_index(ascending=True)
.head(12)
), '\n')
print('Latest dates:')
print((autos['ad_crawled'].str[:10]
.value_counts(normalize=True, dropna=False)
.sort_index(ascending=False)
.head(12)
))
ad_crawled col: count 48571 unique 34 top 2016-04-03 freq 1875 Name: ad_crawled, dtype: object Earliest dates: 2016-03-05 0.025324 2016-03-06 0.014041 2016-03-07 0.036009 2016-03-08 0.033291 2016-03-09 0.033086 2016-03-10 0.032180 2016-03-11 0.032571 2016-03-12 0.036915 2016-03-13 0.015668 2016-03-14 0.036544 2016-03-15 0.034280 2016-03-16 0.029606 Name: ad_crawled, dtype: float64 Latest dates: 2016-04-07 0.001400 2016-04-06 0.003171 2016-04-05 0.013094 2016-04-04 0.036524 2016-04-03 0.038603 2016-04-02 0.035474 2016-04-01 0.033683 2016-03-31 0.031830 2016-03-30 0.033683 2016-03-29 0.034115 2016-03-28 0.034856 2016-03-27 0.031089 Name: ad_crawled, dtype: float64
We see, that the ads in our dataset were crawled in March and April 2016.
The dates of the ad_crawled
column look valid. Let's continue with the ad_created
column.
print('ad_created col:\n')
print((autos['ad_created'].str[:10]
.describe()
), '\n')
print('Earliest dates:')
print((autos['ad_created'].str[:10]
.value_counts(normalize=True, dropna=False)
.sort_index(ascending=True)
.head(12)
), '\n')
print('Latest dates:')
print((autos['ad_created'].str[:10]
.value_counts(normalize=True, dropna=False)
.sort_index(ascending=False)
.head(12)
))
ad_created col: count 48571 unique 76 top 2016-04-03 freq 1887 Name: ad_created, dtype: object Earliest dates: 2015-06-11 0.000021 2015-08-10 0.000021 2015-09-09 0.000021 2015-11-10 0.000021 2015-12-05 0.000021 2015-12-30 0.000021 2016-01-03 0.000021 2016-01-07 0.000021 2016-01-10 0.000041 2016-01-13 0.000021 2016-01-14 0.000021 2016-01-16 0.000021 Name: ad_created, dtype: float64 Latest dates: 2016-04-07 0.001256 2016-04-06 0.003253 2016-04-05 0.011818 2016-04-04 0.036894 2016-04-03 0.038850 2016-04-02 0.035144 2016-04-01 0.033683 2016-03-31 0.031871 2016-03-30 0.033497 2016-03-29 0.034053 2016-03-28 0.034980 2016-03-27 0.030986 Name: ad_created, dtype: float64
The ads in our dataset were created on different days from June 2015 to April 2016. Also, the dates of the ad_created
column look valid. Let's see at the ad_last_seen
column.
print('ad_last_seen col:\n')
print((autos['ad_last_seen'].str[:10]
.describe()
), '\n')
print('Earliest dates:')
print((autos['ad_last_seen'].str[:10]
.value_counts(normalize=True, dropna=False)
.sort_index(ascending=True)
.head(12)
), '\n')
print('Latest dates:')
print((autos['ad_last_seen'].str[:10]
.value_counts(normalize=True, dropna=False)
.sort_index(ascending=False)
.head(12)
))
ad_last_seen col: count 48571 unique 34 top 2016-04-06 freq 10773 Name: ad_last_seen, dtype: object Earliest dates: 2016-03-05 0.001071 2016-03-06 0.004324 2016-03-07 0.005394 2016-03-08 0.007412 2016-03-09 0.009594 2016-03-10 0.010665 2016-03-11 0.012374 2016-03-12 0.023780 2016-03-13 0.008894 2016-03-14 0.012600 2016-03-15 0.015874 2016-03-16 0.016450 Name: ad_last_seen, dtype: float64 Latest dates: 2016-04-07 0.131931 2016-04-06 0.221799 2016-04-05 0.124786 2016-04-04 0.024480 2016-04-03 0.025200 2016-04-02 0.024933 2016-04-01 0.022791 2016-03-31 0.023780 2016-03-30 0.024768 2016-03-29 0.022359 2016-03-28 0.020877 2016-03-27 0.015647 Name: ad_last_seen, dtype: float64
There are the data for March and April 2016.
Let's look at the registration_year
column.
autos['registration_year'].describe()
count 48571.000000 mean 2004.754257 std 88.638548 min 1000.000000 25% 1999.000000 50% 2004.000000 75% 2008.000000 max 9999.000000 Name: registration_year, dtype: float64
We can see, the registration_year
column contains some odd values: 1000
and 9999
. A car can't be first registered in 1000
or 9999
year. Also, a car can't be first registered after the listing was seen, therefore any vehicle with a registration year above 2016 is definitely inaccurate. Determining the earliest valid year is more difficult. Realistically, it could be somewhere in the first few decades of the 1900s.
Let's count the number of listings with cars that fall outside the 1900 - 2016 interval and see if it's safe to remove those rows entirely, or if we need more custom logic.
(autos.loc[~(autos['registration_year'].between(1900,2016)), 'registration_year']
.describe()
)
count 1884.000000 mean 2050.461783 std 446.321293 min 1000.000000 25% 2017.000000 50% 2017.000000 75% 2018.000000 max 9999.000000 Name: registration_year, dtype: float64
We found 1884 listings, where the registration year falls outside the 1900 - 2016 interval. Let's try to see the 1850 - 2016 interval.
(autos.loc[~(autos['registration_year'].between(1850,2016)), 'registration_year']
.describe()
)
count 1884.000000 mean 2050.461783 std 446.321293 min 1000.000000 25% 2017.000000 50% 2017.000000 75% 2018.000000 max 9999.000000 Name: registration_year, dtype: float64
As we can see, the 1884 ads we discovered have the registration year outside the 1850 - 2016 interval. It looks safe to remove these rows completely.
print('Before:')
autos.info(verbose=False)
autos = autos[autos['registration_year'].between(1850,2016)]
print('\nAfter:')
autos.info(verbose=False)
(autos['registration_year'].value_counts(normalize=True)
.sort_index(ascending=True)
)
Before: <class 'pandas.core.frame.DataFrame'> Int64Index: 48571 entries, 0 to 49999 Columns: 18 entries, ad_crawled to ad_last_seen dtypes: float64(2), int64(5), object(11) memory usage: 7.0+ MB After: <class 'pandas.core.frame.DataFrame'> Int64Index: 46687 entries, 0 to 49999 Columns: 18 entries, ad_crawled to ad_last_seen dtypes: float64(2), int64(5), object(11) memory usage: 6.8+ MB
1910 0.000107 1927 0.000021 1929 0.000021 1931 0.000021 1934 0.000043 1937 0.000086 1938 0.000021 1939 0.000021 1941 0.000043 1943 0.000021 1948 0.000021 1950 0.000064 1951 0.000043 1952 0.000021 1953 0.000021 1954 0.000043 1955 0.000043 1956 0.000086 1957 0.000043 1958 0.000086 1959 0.000129 1960 0.000493 1961 0.000129 1962 0.000086 1963 0.000171 1964 0.000257 1965 0.000364 1966 0.000471 1967 0.000557 1968 0.000557 ... 1987 0.001542 1988 0.002892 1989 0.003727 1990 0.007432 1991 0.007261 1992 0.007947 1993 0.009103 1994 0.013473 1995 0.026281 1996 0.029409 1997 0.041789 1998 0.050614 1999 0.062073 2000 0.067599 2001 0.056461 2002 0.053248 2003 0.057811 2004 0.057896 2005 0.062887 2006 0.057211 2007 0.048772 2008 0.047444 2009 0.044681 2010 0.034035 2011 0.034763 2012 0.028059 2013 0.017200 2014 0.014201 2015 0.008396 2016 0.026131 Name: registration_year, Length: 78, dtype: float64
We have left the listings with cars registered from 1910 to 2016.
Below, we get statistics by brands in our dataset.
autos['brand'].value_counts(normalize=True)
volkswagen 0.211279 bmw 0.110052 opel 0.107567 mercedes_benz 0.096451 audi 0.086555 ford 0.069912 renault 0.047144 peugeot 0.029837 fiat 0.025639 seat 0.018271 skoda 0.016407 nissan 0.015272 mazda 0.015186 smart 0.014158 citroen 0.014008 toyota 0.012702 hyundai 0.010024 sonstige_autos 0.009853 volvo 0.009146 mini 0.008760 mitsubishi 0.008225 honda 0.007839 kia 0.007068 alfa_romeo 0.006640 porsche 0.006126 suzuki 0.005933 chevrolet 0.005698 chrysler 0.003513 dacia 0.002635 daihatsu 0.002506 jeep 0.002270 subaru 0.002142 land_rover 0.002099 saab 0.001649 jaguar 0.001564 daewoo 0.001499 trabant 0.001392 rover 0.001328 lancia 0.001071 lada 0.000578 Name: brand, dtype: float64
We can see that the top5 brands of used cars on the German eBay website are German brands. It seems, they are popular in Germany.
It would be interesting to explore variations across top20 brands and, in addition, the Russian car brand lada
.
Let's get the selected brands.
brands = (autos['brand']
.value_counts()
.drop(autos['brand'].value_counts().index[20:-1])
.index
)
print(brands)
Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford', 'renault', 'peugeot', 'fiat', 'seat', 'skoda', 'nissan', 'mazda', 'smart', 'citroen', 'toyota', 'hyundai', 'sonstige_autos', 'volvo', 'mini', 'lada'], dtype='object')
Now we'll aggregate the data and get the mean price by each brand.
brands_price = {}
for b in brands:
price = autos.loc[autos['brand'] == b, 'price_usd'].mean().round(2)
brands_price[b] = price
sorted(brands_price.items(), key=lambda x: x[1], reverse=True)
[('sonstige_autos', 23567.51), ('mini', 10613.46), ('audi', 9336.69), ('mercedes_benz', 8628.45), ('bmw', 8571.48), ('skoda', 6368.0), ('volkswagen', 5604.07), ('hyundai', 5365.25), ('toyota', 5167.09), ('volvo', 4946.5), ('nissan', 4743.4), ('seat', 4397.23), ('mazda', 4112.6), ('ford', 4054.69), ('citroen', 3779.14), ('smart', 3580.22), ('peugeot', 3094.02), ('opel', 2975.24), ('fiat', 2813.75), ('lada', 2688.3), ('renault', 2474.86)]
We observed that in the top6 brands, there's a distinct price gap.
For the top6 brands and Lada, let's use aggregation to understand the average mileage for those cars and if there's any visible link with mean price. Finally, we'll make a single dataframe with all the aggregate data and display it to make visual analysis.
# Get the selected brands
brands = (autos['brand']
.value_counts()
.drop(autos['brand'].value_counts().index[6:-1])
.index
)
print(brands)
# Get the mean price and the mean mileage by each selected brand
brands_price = {}
brands_mileage = {}
for b in brands:
price = autos.loc[autos['brand'] == b, 'price_usd'].mean().round(2)
mileage = autos.loc[autos['brand'] == b, 'odometer_km'].mean().round(2)
brands_price[b] = price
brands_mileage[b] = mileage
# Convert both dictionaries to series objects
bp_series = pd.Series(brands_price)
bm_series = pd.Series(brands_mileage)
# Create a report dataframe
brands_report = pd.DataFrame(bp_series, columns=['mean_price_usd'])
brands_report['mean_odometer_km'] = bm_series
print('\n')
print(brands_report.sort_values(by=['mean_price_usd'], ascending=False))
Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford', 'lada'], dtype='object') mean_price_usd mean_odometer_km audi 9336.69 129157.39 mercedes_benz 8628.45 130788.36 bmw 8571.48 132575.91 volkswagen 5604.07 128711.48 ford 4054.69 124266.24 opel 2975.24 129310.04 lada 2688.30 83518.52
We see, that the mean price of the car doesn't clearly depend on the average mileage.
Let's calculate the mean ages of the cars and add them to our report dataframe. We'll compute the age of the car as the difference between the year from ad_last_seen
date and the year of registration.
# Get the mean age by each selected brand
brands_age = {}
for b in brands:
age = (autos.loc[autos['brand'] == b, 'ad_last_seen'].str[0:4].astype(int)
- autos.loc[autos['brand'] == b, 'registration_year']
).mean().round(2)
brands_age[b] = age
# Convert the dictionary to series object
ba_series = pd.Series(brands_age)
# Add to the report dataframe
brands_report['mean_age_years'] = ba_series
print(brands_report.sort_values(by=['mean_price_usd'], ascending=False))
mean_price_usd mean_odometer_km mean_age_years audi 9336.69 129157.39 11.84 mercedes_benz 8628.45 130788.36 13.97 bmw 8571.48 132575.91 12.97 volkswagen 5604.07 128711.48 13.54 ford 4054.69 124266.24 13.23 opel 2975.24 129310.04 13.78 lada 2688.30 83518.52 16.44
There's no any visible link with mean price and mileage or age of car among the top6 brands. But, the values for Lada are very different from the others. This can be explained by the fact that this car brand is not typical for the German car market and is only popular among a small number of automobile enthusiasts. In general, cars of the popular German brands are sold after 10 years of ownership based on the ads for March, April 2016.