In this project I will work with 50,000 row sample from an dataset of German eBay car sales. The base dataset had been cleaned for analysis. This sample has been "dirtied" again so that it resembles what one might find in a scraped dataset.
dateCrawled
- When this ad was first crawled. All field-values are taken from this date.name
- Name of the car.seller
- Whether the seller is private or a dealer.offerType
- The type of listingprice
- The price on the ad to sell the car.abtest
- Whether the listing is included in an A/B test.vehicleType
- The vehicle Type.yearOfRegistration
- The year in which the car was first registered.gearbox
- The transmission type.powerPS
- The power of the car in PS.model
- The car model name.odometer
- How many kilometers the car has driven.monthOfRegistration
- The month in which the car was first registered.fuelType
- What type of fuel the car uses.brand
- The brand of the car.notRepairedDamage
- If the car has a damage which is not yet repaired.dateCreated
- The date on which the eBay listing was created.nrOfPictures
- The number of pictures in the ad.postalCode
- The postal code for the location of the vehicle.lastSeenOnline
- When the crawler saw this ad last online.import pandas as pd
import numpy as np
import pprint as pp
autos = pd.read_csv("autos.csv", encoding="Latin-1" )
autos
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
5 | 2016-03-21 13:47:45 | Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto... | privat | Angebot | $7,900 | test | bus | 2006 | automatik | 150 | voyager | 150,000km | 4 | diesel | chrysler | NaN | 2016-03-21 00:00:00 | 0 | 22962 | 2016-04-06 09:45:21 |
6 | 2016-03-20 17:55:21 | VW_Golf_III_GT_Special_Electronic_Green_Metall... | privat | Angebot | $300 | test | limousine | 1995 | manuell | 90 | golf | 150,000km | 8 | benzin | volkswagen | NaN | 2016-03-20 00:00:00 | 0 | 31535 | 2016-03-23 02:48:59 |
7 | 2016-03-16 18:55:19 | Golf_IV_1.9_TDI_90PS | privat | Angebot | $1,990 | control | limousine | 1998 | manuell | 90 | golf | 150,000km | 12 | diesel | volkswagen | nein | 2016-03-16 00:00:00 | 0 | 53474 | 2016-04-07 03:17:32 |
8 | 2016-03-22 16:51:34 | Seat_Arosa | privat | Angebot | $250 | test | NaN | 2000 | manuell | 0 | arosa | 150,000km | 10 | NaN | seat | nein | 2016-03-22 00:00:00 | 0 | 7426 | 2016-03-26 18:18:10 |
9 | 2016-03-16 13:47:02 | Renault_Megane_Scenic_1.6e_RT_Klimaanlage | privat | Angebot | $590 | control | bus | 1997 | manuell | 90 | megane | 150,000km | 7 | benzin | renault | nein | 2016-03-16 00:00:00 | 0 | 15749 | 2016-04-06 10:46:35 |
10 | 2016-03-15 01:41:36 | VW_Golf_Tuning_in_siber/grau | privat | Angebot | $999 | test | NaN | 2017 | manuell | 90 | NaN | 150,000km | 4 | benzin | volkswagen | nein | 2016-03-14 00:00:00 | 0 | 86157 | 2016-04-07 03:16:21 |
11 | 2016-03-16 18:45:34 | Mercedes_A140_Motorschaden | privat | Angebot | $350 | control | NaN | 2000 | NaN | 0 | NaN | 150,000km | 0 | benzin | mercedes_benz | NaN | 2016-03-16 00:00:00 | 0 | 17498 | 2016-03-16 18:45:34 |
12 | 2016-03-31 19:48:22 | Smart_smart_fortwo_coupe_softouch_pure_MHD_Pan... | privat | Angebot | $5,299 | control | kleinwagen | 2010 | automatik | 71 | fortwo | 50,000km | 9 | benzin | smart | nein | 2016-03-31 00:00:00 | 0 | 34590 | 2016-04-06 14:17:52 |
13 | 2016-03-23 10:48:32 | Audi_A3_1.6_tuning | privat | Angebot | $1,350 | control | limousine | 1999 | manuell | 101 | a3 | 150,000km | 11 | benzin | audi | nein | 2016-03-23 00:00:00 | 0 | 12043 | 2016-04-01 14:17:13 |
14 | 2016-03-23 11:50:46 | Renault_Clio_3__Dynamique_1.2__16_V;_viele_Ver... | privat | Angebot | $3,999 | test | kleinwagen | 2007 | manuell | 75 | clio | 150,000km | 9 | benzin | renault | NaN | 2016-03-23 00:00:00 | 0 | 81737 | 2016-04-01 15:46:47 |
15 | 2016-04-01 12:06:20 | Corvette_C3_Coupe_T_Top_Crossfire_Injection | privat | Angebot | $18,900 | test | coupe | 1982 | automatik | 203 | NaN | 80,000km | 6 | benzin | sonstige_autos | nein | 2016-04-01 00:00:00 | 0 | 61276 | 2016-04-02 21:10:48 |
16 | 2016-03-16 14:59:02 | Opel_Vectra_B_Kombi | privat | Angebot | $350 | test | kombi | 1999 | manuell | 101 | vectra | 150,000km | 5 | benzin | opel | nein | 2016-03-16 00:00:00 | 0 | 57299 | 2016-03-18 05:29:37 |
17 | 2016-03-29 11:46:22 | Volkswagen_Scirocco_2_G60 | privat | Angebot | $5,500 | test | coupe | 1990 | manuell | 205 | scirocco | 150,000km | 6 | benzin | volkswagen | nein | 2016-03-29 00:00:00 | 0 | 74821 | 2016-04-05 20:46:26 |
18 | 2016-03-26 19:57:44 | Verkaufen_mein_bmw_e36_320_i_touring | privat | Angebot | $300 | control | bus | 1995 | manuell | 150 | 3er | 150,000km | 0 | benzin | bmw | NaN | 2016-03-26 00:00:00 | 0 | 54329 | 2016-04-02 12:16:41 |
19 | 2016-03-17 13:36:21 | mazda_tribute_2.0_mit_gas_und_tuev_neu_2018 | privat | Angebot | $4,150 | control | suv | 2004 | manuell | 124 | andere | 150,000km | 2 | lpg | mazda | nein | 2016-03-17 00:00:00 | 0 | 40878 | 2016-03-17 14:45:58 |
20 | 2016-03-05 19:57:31 | Audi_A4_Avant_1.9_TDI_*6_Gang*AHK*Klimatronik*... | privat | Angebot | $3,500 | test | kombi | 2003 | manuell | 131 | a4 | 150,000km | 5 | diesel | audi | NaN | 2016-03-05 00:00:00 | 0 | 53913 | 2016-03-07 05:46:46 |
21 | 2016-03-06 19:07:10 | Porsche_911_Carrera_4S_Cabrio | privat | Angebot | $41,500 | test | cabrio | 2004 | manuell | 320 | 911 | 150,000km | 4 | benzin | porsche | nein | 2016-03-06 00:00:00 | 0 | 65428 | 2016-04-05 23:46:19 |
22 | 2016-03-28 20:50:54 | MINI_Cooper_S_Cabrio | privat | Angebot | $25,450 | control | cabrio | 2015 | manuell | 184 | cooper | 10,000km | 1 | benzin | mini | nein | 2016-03-28 00:00:00 | 0 | 44789 | 2016-04-01 06:45:30 |
23 | 2016-03-10 19:55:34 | Peugeot_Boxer_2_2_HDi_120_Ps_9_Sitzer_inkl_Klima | privat | Angebot | $7,999 | control | bus | 2010 | manuell | 120 | NaN | 150,000km | 2 | diesel | peugeot | nein | 2016-03-10 00:00:00 | 0 | 30900 | 2016-03-17 08:45:17 |
24 | 2016-04-03 11:57:02 | BMW_535i_xDrive_Sport_Aut. | privat | Angebot | $48,500 | control | limousine | 2014 | automatik | 306 | 5er | 30,000km | 12 | benzin | bmw | nein | 2016-04-03 00:00:00 | 0 | 22547 | 2016-04-07 13:16:50 |
25 | 2016-03-21 21:56:18 | Ford_escort_kombi_an_bastler_mit_ghia_ausstattung | privat | Angebot | $90 | control | kombi | 1996 | manuell | 116 | NaN | 150,000km | 4 | benzin | ford | ja | 2016-03-21 00:00:00 | 0 | 27574 | 2016-04-01 05:16:49 |
26 | 2016-04-03 22:46:28 | Volkswagen_Polo_Fox | privat | Angebot | $777 | control | kleinwagen | 1992 | manuell | 54 | polo | 125,000km | 2 | benzin | volkswagen | nein | 2016-04-03 00:00:00 | 0 | 38110 | 2016-04-05 23:46:48 |
27 | 2016-03-27 18:45:01 | Hat_einer_Ahnung_mit_Ford_Galaxy_HILFE | privat | Angebot | $0 | control | NaN | 2005 | NaN | 0 | NaN | 150,000km | 0 | NaN | ford | NaN | 2016-03-27 00:00:00 | 0 | 66701 | 2016-03-27 18:45:01 |
28 | 2016-03-19 21:56:19 | MINI_Cooper_D | privat | Angebot | $5,250 | control | kleinwagen | 2007 | manuell | 110 | cooper | 150,000km | 7 | diesel | mini | ja | 2016-03-19 00:00:00 | 0 | 15745 | 2016-04-07 14:58:48 |
29 | 2016-04-02 12:45:44 | Mercedes_Benz_E_320_T_CDI_Avantgarde_DPF7_Sitz... | privat | Angebot | $4,999 | test | kombi | 2004 | automatik | 204 | e_klasse | 150,000km | 10 | diesel | mercedes_benz | nein | 2016-04-02 00:00:00 | 0 | 47638 | 2016-04-02 12:45:44 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
49970 | 2016-03-21 22:47:37 | c4_Grand_Picasso_mit_Automatik_Leder_Navi_Temp... | privat | Angebot | $15,800 | control | bus | 2010 | automatik | 136 | c4 | 60,000km | 4 | diesel | citroen | nein | 2016-03-21 00:00:00 | 0 | 14947 | 2016-04-07 04:17:34 |
49971 | 2016-03-29 14:54:12 | W.Lupo_1.0 | privat | Angebot | $950 | test | kleinwagen | 2001 | manuell | 50 | lupo | 150,000km | 4 | benzin | volkswagen | nein | 2016-03-29 00:00:00 | 0 | 65197 | 2016-03-29 20:41:51 |
49972 | 2016-03-26 22:25:23 | Mercedes_Benz_Vito_115_CDI_Extralang_Aut. | privat | Angebot | $3,300 | control | bus | 2004 | automatik | 150 | vito | 150,000km | 10 | diesel | mercedes_benz | ja | 2016-03-26 00:00:00 | 0 | 65326 | 2016-03-28 11:28:18 |
49973 | 2016-03-27 05:32:39 | Mercedes_Benz_SLK_200_Kompressor | privat | Angebot | $6,000 | control | cabrio | 2004 | manuell | 163 | slk | 150,000km | 11 | benzin | mercedes_benz | nein | 2016-03-27 00:00:00 | 0 | 53567 | 2016-03-27 08:25:24 |
49974 | 2016-03-20 10:52:31 | Golf_1_Cabrio_Tuev_Neu_viele_Extras_alles_eing... | privat | Angebot | $0 | control | cabrio | 1983 | manuell | 70 | golf | 150,000km | 2 | benzin | volkswagen | nein | 2016-03-20 00:00:00 | 0 | 8209 | 2016-03-27 19:48:16 |
49975 | 2016-03-27 20:51:39 | Honda_Jazz_1.3_DSi_i_VTEC_IMA_CVT_Comfort | privat | Angebot | $9,700 | control | kleinwagen | 2012 | automatik | 88 | jazz | 100,000km | 11 | hybrid | honda | nein | 2016-03-27 00:00:00 | 0 | 84385 | 2016-04-05 19:45:34 |
49976 | 2016-03-19 18:56:05 | Audi_80_Avant_2.6_E__Vollausstattung!!_Einziga... | privat | Angebot | $5,900 | test | kombi | 1992 | automatik | 150 | 80 | 150,000km | 12 | benzin | audi | nein | 2016-03-19 00:00:00 | 0 | 36100 | 2016-04-07 06:16:44 |
49977 | 2016-03-31 18:37:18 | Mercedes_Benz_C200_Cdi_W203 | privat | Angebot | $5,500 | control | limousine | 2003 | manuell | 116 | c_klasse | 150,000km | 2 | diesel | mercedes_benz | nein | 2016-03-31 00:00:00 | 0 | 33739 | 2016-04-06 12:16:11 |
49978 | 2016-04-04 10:37:14 | Mercedes_Benz_E_200_Classic | privat | Angebot | $900 | control | limousine | 1996 | automatik | 136 | e_klasse | 150,000km | 9 | benzin | mercedes_benz | ja | 2016-04-04 00:00:00 | 0 | 24405 | 2016-04-06 12:44:20 |
49979 | 2016-03-20 18:38:40 | Volkswagen_Polo_1.6_TDI_Style | privat | Angebot | $11,000 | test | kleinwagen | 2011 | manuell | 90 | polo | 70,000km | 11 | diesel | volkswagen | nein | 2016-03-20 00:00:00 | 0 | 48455 | 2016-04-07 01:45:12 |
49980 | 2016-03-12 10:55:54 | Ford_Escort_Turnier_16V | privat | Angebot | $400 | control | kombi | 1995 | manuell | 105 | escort | 125,000km | 3 | benzin | ford | NaN | 2016-03-12 00:00:00 | 0 | 56218 | 2016-04-06 17:16:49 |
49981 | 2016-03-15 09:38:21 | Opel_Astra_Kombi_mit_Anhaengerkupplung | privat | Angebot | $2,000 | control | kombi | 1998 | manuell | 115 | astra | 150,000km | 12 | benzin | opel | nein | 2016-03-15 00:00:00 | 0 | 86859 | 2016-04-05 17:21:46 |
49982 | 2016-03-29 18:51:08 | Skoda_Fabia_4_Tuerer_Bj:2004__85.000Tkm | privat | Angebot | $1,950 | control | kleinwagen | 2004 | manuell | 0 | fabia | 90,000km | 7 | benzin | skoda | NaN | 2016-03-29 00:00:00 | 0 | 45884 | 2016-03-29 18:51:08 |
49983 | 2016-03-06 12:43:04 | Ford_focus_99 | privat | Angebot | $600 | test | kleinwagen | 1999 | manuell | 101 | focus | 150,000km | 4 | benzin | ford | NaN | 2016-03-06 00:00:00 | 0 | 52477 | 2016-03-09 06:16:08 |
49984 | 2016-03-31 22:48:48 | Student_sucht_ein__Anfaengerauto___ab_2000_BJ_... | privat | Angebot | $0 | test | NaN | 2000 | NaN | 0 | NaN | 150,000km | 0 | NaN | sonstige_autos | NaN | 2016-03-31 00:00:00 | 0 | 12103 | 2016-04-02 19:44:53 |
49985 | 2016-04-02 16:38:23 | Verkaufe_meinen_vw_vento! | privat | Angebot | $1,000 | control | NaN | 1995 | automatik | 0 | NaN | 150,000km | 0 | benzin | volkswagen | NaN | 2016-04-02 00:00:00 | 0 | 30900 | 2016-04-06 15:17:52 |
49986 | 2016-04-04 20:46:02 | Chrysler_300C_3.0_CRD_DPF_Automatik_Voll_Ausst... | privat | Angebot | $15,900 | control | limousine | 2010 | automatik | 218 | 300c | 125,000km | 11 | diesel | chrysler | nein | 2016-04-04 00:00:00 | 0 | 73527 | 2016-04-06 23:16:00 |
49987 | 2016-03-22 20:47:27 | Audi_A3_Limousine_2.0_TDI_DPF_Ambition__NAVI__... | privat | Angebot | $21,990 | control | limousine | 2013 | manuell | 150 | a3 | 50,000km | 11 | diesel | audi | nein | 2016-03-22 00:00:00 | 0 | 94362 | 2016-03-26 22:46:06 |
49988 | 2016-03-28 19:49:51 | BMW_330_Ci | privat | Angebot | $9,550 | control | coupe | 2001 | manuell | 231 | 3er | 150,000km | 10 | benzin | bmw | nein | 2016-03-28 00:00:00 | 0 | 83646 | 2016-04-07 02:17:40 |
49989 | 2016-03-11 19:50:37 | VW_Polo_zum_Ausschlachten_oder_Wiederaufbau | privat | Angebot | $150 | test | kleinwagen | 1997 | manuell | 0 | polo | 150,000km | 5 | benzin | volkswagen | ja | 2016-03-11 00:00:00 | 0 | 21244 | 2016-03-12 10:17:55 |
49990 | 2016-03-21 19:54:19 | Mercedes_Benz_A_200__BlueEFFICIENCY__Urban | privat | Angebot | $17,500 | test | limousine | 2012 | manuell | 156 | a_klasse | 30,000km | 12 | benzin | mercedes_benz | nein | 2016-03-21 00:00:00 | 0 | 58239 | 2016-04-06 22:46:57 |
49991 | 2016-03-06 15:25:19 | Kleinwagen | privat | Angebot | $500 | control | NaN | 2016 | manuell | 0 | twingo | 150,000km | 0 | benzin | renault | NaN | 2016-03-06 00:00:00 | 0 | 61350 | 2016-03-06 18:24:19 |
49992 | 2016-03-10 19:37:38 | Fiat_Grande_Punto_1.4_T_Jet_16V_Sport | privat | Angebot | $4,800 | control | kleinwagen | 2009 | manuell | 120 | andere | 125,000km | 9 | lpg | fiat | nein | 2016-03-10 00:00:00 | 0 | 68642 | 2016-03-13 01:44:51 |
49993 | 2016-03-15 18:47:35 | Audi_A3__1_8l__Silber;_schoenes_Fahrzeug | privat | Angebot | $1,650 | control | kleinwagen | 1997 | manuell | 0 | NaN | 150,000km | 7 | benzin | audi | NaN | 2016-03-15 00:00:00 | 0 | 65203 | 2016-04-06 19:46:53 |
49994 | 2016-03-22 17:36:42 | Audi_A6__S6__Avant_4.2_quattro_eventuell_Tausc... | privat | Angebot | $5,000 | control | kombi | 2001 | automatik | 299 | a6 | 150,000km | 1 | benzin | audi | nein | 2016-03-22 00:00:00 | 0 | 46537 | 2016-04-06 08:16:39 |
49995 | 2016-03-27 14:38:19 | Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon | privat | Angebot | $24,900 | control | limousine | 2011 | automatik | 239 | q5 | 100,000km | 1 | diesel | audi | nein | 2016-03-27 00:00:00 | 0 | 82131 | 2016-04-01 13:47:40 |
49996 | 2016-03-28 10:50:25 | Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... | privat | Angebot | $1,980 | control | cabrio | 1996 | manuell | 75 | astra | 150,000km | 5 | benzin | opel | nein | 2016-03-28 00:00:00 | 0 | 44807 | 2016-04-02 14:18:02 |
49997 | 2016-04-02 14:44:48 | Fiat_500_C_1.2_Dualogic_Lounge | privat | Angebot | $13,200 | test | cabrio | 2014 | automatik | 69 | 500 | 5,000km | 11 | benzin | fiat | nein | 2016-04-02 00:00:00 | 0 | 73430 | 2016-04-04 11:47:27 |
49998 | 2016-03-08 19:25:42 | Audi_A3_2.0_TDI_Sportback_Ambition | privat | Angebot | $22,900 | control | kombi | 2013 | manuell | 150 | a3 | 40,000km | 11 | diesel | audi | nein | 2016-03-08 00:00:00 | 0 | 35683 | 2016-04-05 16:45:07 |
49999 | 2016-03-14 00:42:12 | Opel_Vectra_1.6_16V | privat | Angebot | $1,250 | control | limousine | 1996 | manuell | 101 | vectra | 150,000km | 1 | benzin | opel | nein | 2016-03-13 00:00:00 | 0 | 45897 | 2016-04-06 21:18:48 |
50000 rows × 20 columns
autos.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): dateCrawled 50000 non-null object name 50000 non-null object seller 50000 non-null object offerType 50000 non-null object price 50000 non-null object abtest 50000 non-null object vehicleType 44905 non-null object yearOfRegistration 50000 non-null int64 gearbox 47320 non-null object powerPS 50000 non-null int64 model 47242 non-null object odometer 50000 non-null object monthOfRegistration 50000 non-null int64 fuelType 45518 non-null object brand 50000 non-null object notRepairedDamage 40171 non-null object dateCreated 50000 non-null object nrOfPictures 50000 non-null int64 postalCode 50000 non-null int64 lastSeen 50000 non-null object dtypes: int64(5), object(15) memory usage: 7.6+ MB
See above for details.
price
format is $X,XXX (not Euros)odometer
format is XXX,XXXXkmdateCrawled
and adCreated
and lastSeen
weren't recognized as such.vehicleType
, gearbox
, model
, fuelType
, notRepairedDamage
notRepairedDamage
column has "NaN", "nein" & 'ja'autos.columns.tolist()
['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model', 'odometer', 'monthOfRegistration', 'fuelType', 'brand', 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode', 'lastSeen']
Updating coulumn names from CamelCase to snakecase, and to be more descriptive
new_column_names = ['date_crawled',
'name',
'seller',
'offer_type',
'price',
'abtest',
'vehicle_type',
'registration_year',
'gearbox',
'power_ps',
'model',
'odometer',
'registration_month',
'fuel_type',
'brand',
'unrepaired_damage',
'ad_created',
'number_of_pictures',
'postal_code',
'last_seen']
autos.columns = new_column_names
autos.head()
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | number_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
autos.describe()
registration_year | power_ps | registration_month | number_of_pictures | postal_code | |
---|---|---|---|---|---|
count | 50000.000000 | 50000.000000 | 50000.000000 | 50000.0 | 50000.000000 |
mean | 2005.073280 | 116.355920 | 5.723360 | 0.0 | 50813.627300 |
std | 105.712813 | 209.216627 | 3.711984 | 0.0 | 25779.747957 |
min | 1000.000000 | 0.000000 | 0.000000 | 0.0 | 1067.000000 |
25% | 1999.000000 | 70.000000 | 3.000000 | 0.0 | 30451.000000 |
50% | 2003.000000 | 105.000000 | 6.000000 | 0.0 | 49577.000000 |
75% | 2008.000000 | 150.000000 | 9.000000 | 0.0 | 71540.000000 |
max | 9999.000000 | 17700.000000 | 12.000000 | 0.0 | 99998.000000 |
autos.describe(include=np.object)
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | gearbox | model | odometer | fuel_type | brand | unrepaired_damage | ad_created | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 47320 | 47242 | 50000 | 45518 | 50000 | 40171 | 50000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | 2 | 245 | 13 | 7 | 40 | 2 | 76 | 39481 |
top | 2016-03-22 09:51:06 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | manuell | golf | 150,000km | benzin | volkswagen | nein | 2016-04-03 00:00:00 | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | 36993 | 4024 | 32424 | 30107 | 10687 | 35232 | 1946 | 8 |
for c in ["seller", "offer_type", "abtest", "vehicle_type", "gearbox", "fuel_type", "odometer", "brand"]:
print(c, ":\n", autos[c].value_counts(), "\n")
seller : privat 49999 gewerblich 1 Name: seller, dtype: int64 offer_type : Angebot 49999 Gesuch 1 Name: offer_type, dtype: int64 abtest : test 25756 control 24244 Name: abtest, dtype: int64 vehicle_type : limousine 12859 kleinwagen 10822 kombi 9127 bus 4093 cabrio 3061 coupe 2537 suv 1986 andere 420 Name: vehicle_type, dtype: int64 gearbox : manuell 36993 automatik 10327 Name: gearbox, dtype: int64 fuel_type : benzin 30107 diesel 14567 lpg 691 cng 75 hybrid 37 andere 22 elektro 19 Name: fuel_type, dtype: int64 odometer : 150,000km 32424 125,000km 5170 100,000km 2169 90,000km 1757 80,000km 1436 70,000km 1230 60,000km 1164 50,000km 1027 5,000km 967 40,000km 819 30,000km 789 20,000km 784 10,000km 264 Name: odometer, dtype: int64 brand : volkswagen 10687 opel 5461 bmw 5429 mercedes_benz 4734 audi 4283 ford 3479 renault 2404 peugeot 1456 fiat 1308 seat 941 skoda 786 mazda 757 nissan 754 citroen 701 smart 701 toyota 617 sonstige_autos 546 hyundai 488 volvo 457 mini 424 mitsubishi 406 honda 399 kia 356 alfa_romeo 329 porsche 294 suzuki 293 chevrolet 283 chrysler 181 dacia 129 daihatsu 128 jeep 110 subaru 109 land_rover 99 saab 80 daewoo 79 trabant 78 jaguar 77 rover 69 lancia 57 lada 31 Name: brand, dtype: int64
Drop:
seller
and offer_type
, because all but one row have same value.number_of_photos
, because the value is 0 for all rows.abtest
because we don't have enough information for it to be useful.Fix:
price
and odometer
to numeric valuesodometer
to odometer_km
date_crawled
and last_seen
to datetimeautos["price"] = autos["price"].str.replace("$","").str.replace(",","").astype(int)
autos["odometer"] = autos["odometer"].str.replace(",","").str.replace("km","").astype(int)
autos.rename({"odometer": "odometer_km"}, axis=1, inplace=True)
autos.columns
Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest', 'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model', 'odometer_km', 'registration_month', 'fuel_type', 'brand', 'unrepaired_damage', 'ad_created', 'number_of_pictures', 'postal_code', 'last_seen'], dtype='object')
autos.drop(columns=["seller", "offer_type", "number_of_pictures"], inplace=True)
autos.columns
Index(['date_crawled', 'name', 'price', 'abtest', 'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model', 'odometer_km', 'registration_month', 'fuel_type', 'brand', 'unrepaired_damage', 'ad_created', 'postal_code', 'last_seen'], dtype='object')
autos["odometer_km"].value_counts(dropna=False)
150000 32424 125000 5170 100000 2169 90000 1757 80000 1436 70000 1230 60000 1164 50000 1027 5000 967 40000 819 30000 789 20000 784 10000 264 Name: odometer_km, dtype: int64
The above values seem reasonable. Nothing seems out of place
autos["price"].describe()
count 5.000000e+04 mean 9.840044e+03 std 4.811044e+05 min 0.000000e+00 25% 1.100000e+03 50% 2.950000e+03 75% 7.200000e+03 max 1.000000e+08 Name: price, dtype: float64
autos["price"].dtype
#Weird, considering what's reported by describe.
dtype('int64')
autos["price"].isnull().sum()
0
autos["price"].value_counts(dropna=False).sort_index()
0 1421 1 156 2 3 3 1 5 2 8 1 9 1 10 7 11 2 12 3 13 2 14 1 15 2 17 3 18 1 20 4 25 5 29 1 30 7 35 1 40 6 45 4 47 1 49 4 50 49 55 2 59 1 60 9 65 5 66 1 ... 151990 1 155000 1 163500 1 163991 1 169000 1 169999 1 175000 1 180000 1 190000 1 194000 1 197000 1 198000 1 220000 1 250000 1 259000 1 265000 1 295000 1 299000 1 345000 1 350000 1 999990 1 999999 2 1234566 1 1300000 1 3890000 1 10000000 1 11111111 2 12345678 3 27322222 1 99999999 1 Name: price, Length: 2357, dtype: int64
Some of these seem a little sketchy.
How likely is it that a car sold for $99.9M? $27M seems like a stretch, but its possible.
Similarly, how likely is it that 1421 sold for $0, or 156 for $1? Does eBay even accept $0?
I am going to discard the 1421 rows with a price of $0.
autos = autos[autos["price"] > 0 ]
len(autos)
48579
Now lets look at the rows with a price >= $900K since there are only ~10.
autos[autos["price"] >= 900000]
date_crawled | name | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
514 | 2016-03-17 09:53:08 | Ford_Focus_Turnier_1.6_16V_Style | 999999 | test | kombi | 2009 | manuell | 101 | focus | 125000 | 4 | benzin | ford | nein | 2016-03-17 00:00:00 | 12205 | 2016-04-06 07:17:35 |
2897 | 2016-03-12 21:50:57 | Escort_MK_1_Hundeknochen_zum_umbauen_auf_RS_2000 | 11111111 | test | limousine | 1973 | manuell | 48 | escort | 50000 | 3 | benzin | ford | nein | 2016-03-12 00:00:00 | 94469 | 2016-03-12 22:45:27 |
7814 | 2016-04-04 11:53:31 | Ferrari_F40 | 1300000 | control | coupe | 1992 | NaN | 0 | NaN | 50000 | 12 | NaN | sonstige_autos | nein | 2016-04-04 00:00:00 | 60598 | 2016-04-05 11:34:11 |
11137 | 2016-03-29 23:52:57 | suche_maserati_3200_gt_Zustand_unwichtig_laufe... | 10000000 | control | coupe | 1960 | manuell | 368 | NaN | 100000 | 1 | benzin | sonstige_autos | nein | 2016-03-29 00:00:00 | 73033 | 2016-04-06 21:18:11 |
22947 | 2016-03-22 12:54:19 | Bmw_530d_zum_ausschlachten | 1234566 | control | kombi | 1999 | automatik | 190 | NaN | 150000 | 2 | diesel | bmw | NaN | 2016-03-22 00:00:00 | 17454 | 2016-04-02 03:17:32 |
24384 | 2016-03-21 13:57:51 | Schlachte_Golf_3_gt_tdi | 11111111 | test | NaN | 1995 | NaN | 0 | NaN | 150000 | 0 | NaN | volkswagen | NaN | 2016-03-21 00:00:00 | 18519 | 2016-03-21 14:40:18 |
27371 | 2016-03-09 15:45:47 | Fiat_Punto | 12345678 | control | NaN | 2017 | NaN | 95 | punto | 150000 | 0 | NaN | fiat | NaN | 2016-03-09 00:00:00 | 96110 | 2016-03-09 15:45:47 |
37585 | 2016-03-29 11:38:54 | Volkswagen_Jetta_GT | 999990 | test | limousine | 1985 | manuell | 111 | jetta | 150000 | 12 | benzin | volkswagen | ja | 2016-03-29 00:00:00 | 50997 | 2016-03-29 11:38:54 |
39377 | 2016-03-08 23:53:51 | Tausche_volvo_v40_gegen_van | 12345678 | control | NaN | 2018 | manuell | 95 | v40 | 150000 | 6 | NaN | volvo | nein | 2016-03-08 00:00:00 | 14542 | 2016-04-06 23:17:31 |
39705 | 2016-03-22 14:58:27 | Tausch_gegen_gleichwertiges | 99999999 | control | limousine | 1999 | automatik | 224 | s_klasse | 150000 | 9 | benzin | mercedes_benz | NaN | 2016-03-22 00:00:00 | 73525 | 2016-04-06 05:15:30 |
42221 | 2016-03-08 20:39:05 | Leasinguebernahme | 27322222 | control | limousine | 2014 | manuell | 163 | c4 | 40000 | 2 | diesel | citroen | NaN | 2016-03-08 00:00:00 | 76532 | 2016-03-08 20:39:05 |
43049 | 2016-03-21 19:53:52 | 2_VW_Busse_T3 | 999999 | test | bus | 1981 | manuell | 70 | transporter | 150000 | 1 | benzin | volkswagen | NaN | 2016-03-21 00:00:00 | 99880 | 2016-03-28 17:18:28 |
47598 | 2016-03-31 18:56:54 | Opel_Vectra_B_1_6i_16V_Facelift_Tuning_Showcar... | 12345678 | control | limousine | 2001 | manuell | 101 | vectra | 150000 | 3 | benzin | opel | nein | 2016-03-31 00:00:00 | 4356 | 2016-03-31 18:56:54 |
47634 | 2016-04-04 21:25:21 | Ferrari_FXX | 3890000 | test | coupe | 2006 | NaN | 799 | NaN | 5000 | 7 | NaN | sonstige_autos | nein | 2016-04-04 00:00:00 | 60313 | 2016-04-05 12:07:37 |
The Ferrari prices are probably legitimate, as is the Maserati. The Fords, VW Golf & Bus and BMW stationwagon probably aren't legitimate. The rest I don't know enough about.
Since there are only a few Ferraris in this price range, and since later analysis focuses on popular brands (which Ferarri isn't), I'm just going to drop everything above $900K.
autos = autos[autos["price"] < 900000]
len(autos)
48565
date_crawled
, ad_created
and last_seen
all have timestamps, but are stored as objects.
autos[["date_crawled", "ad_created", "last_seen"]][0:5]
date_crawled | ad_created | last_seen | |
---|---|---|---|
0 | 2016-03-26 17:47:46 | 2016-03-26 00:00:00 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | 2016-04-04 00:00:00 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | 2016-03-26 00:00:00 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | 2016-03-12 00:00:00 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | 2016-04-01 00:00:00 | 2016-04-01 14:38:50 |
for c in ["date_crawled", "ad_created", "last_seen"]:
print(c,":")
print(autos[c].str[:10].value_counts(normalize=True, dropna=False))
print()
date_crawled : 2016-04-03 0.038608 2016-03-20 0.037887 2016-03-21 0.037373 2016-03-12 0.036920 2016-03-14 0.036549 2016-04-04 0.036487 2016-03-07 0.036014 2016-04-02 0.035478 2016-03-28 0.034860 2016-03-19 0.034778 2016-03-15 0.034284 2016-03-29 0.034099 2016-04-01 0.033687 2016-03-30 0.033687 2016-03-08 0.033296 2016-03-09 0.033090 2016-03-22 0.032987 2016-03-11 0.032575 2016-03-23 0.032225 2016-03-26 0.032204 2016-03-10 0.032184 2016-03-31 0.031834 2016-03-17 0.031628 2016-03-25 0.031607 2016-03-27 0.031092 2016-03-16 0.029610 2016-03-24 0.029342 2016-03-05 0.025327 2016-03-13 0.015670 2016-03-06 0.014043 2016-04-05 0.013096 2016-03-18 0.012911 2016-04-06 0.003171 2016-04-07 0.001400 Name: date_crawled, dtype: float64 ad_created : 2016-04-03 0.038855 2016-03-20 0.037949 2016-03-21 0.037579 2016-04-04 0.036858 2016-03-12 0.036755 2016-03-14 0.035190 2016-04-02 0.035149 2016-03-28 0.034984 2016-03-07 0.034737 2016-03-29 0.034037 2016-03-15 0.034016 2016-03-19 0.033687 2016-04-01 0.033687 2016-03-30 0.033501 2016-03-08 0.033316 2016-03-09 0.033151 2016-03-11 0.032904 2016-03-22 0.032801 2016-03-26 0.032266 2016-03-23 0.032060 2016-03-10 0.031895 2016-03-31 0.031875 2016-03-25 0.031751 2016-03-17 0.031278 2016-03-27 0.030989 2016-03-16 0.030125 2016-03-24 0.029280 2016-03-05 0.022897 2016-03-13 0.017008 2016-03-06 0.015320 ... 2016-01-10 0.000041 2016-02-24 0.000041 2016-02-20 0.000041 2016-02-05 0.000041 2016-02-18 0.000041 2016-02-14 0.000041 2016-02-26 0.000041 2016-02-02 0.000041 2016-02-12 0.000041 2016-01-16 0.000021 2015-11-10 0.000021 2016-02-09 0.000021 2016-02-08 0.000021 2016-02-07 0.000021 2016-01-29 0.000021 2016-01-13 0.000021 2016-02-16 0.000021 2016-01-07 0.000021 2015-06-11 0.000021 2016-02-17 0.000021 2016-02-01 0.000021 2015-12-30 0.000021 2016-02-11 0.000021 2016-01-22 0.000021 2015-09-09 0.000021 2015-12-05 0.000021 2016-02-22 0.000021 2016-01-14 0.000021 2015-08-10 0.000021 2016-01-03 0.000021 Name: ad_created, Length: 76, dtype: float64 last_seen : 2016-04-06 0.221806 2016-04-07 0.131947 2016-04-05 0.124761 2016-03-17 0.028086 2016-04-03 0.025203 2016-04-02 0.024915 2016-03-30 0.024771 2016-04-04 0.024483 2016-03-31 0.023783 2016-03-12 0.023783 2016-04-01 0.022794 2016-03-29 0.022341 2016-03-22 0.021373 2016-03-28 0.020859 2016-03-20 0.020653 2016-03-21 0.020632 2016-03-24 0.019767 2016-03-25 0.019211 2016-03-23 0.018532 2016-03-26 0.016802 2016-03-16 0.016452 2016-03-15 0.015876 2016-03-19 0.015834 2016-03-27 0.015649 2016-03-14 0.012602 2016-03-11 0.012375 2016-03-10 0.010666 2016-03-09 0.009595 2016-03-13 0.008895 2016-03-08 0.007413 2016-03-18 0.007351 2016-03-07 0.005395 2016-03-06 0.004324 2016-03-05 0.001071 Name: last_seen, dtype: float64
for c in ["date_crawled", "ad_created", "last_seen"]:
print(c,":")
print(autos[c].str[:10].value_counts(normalize=True, dropna=False).sort_index())
print()
date_crawled : 2016-03-05 0.025327 2016-03-06 0.014043 2016-03-07 0.036014 2016-03-08 0.033296 2016-03-09 0.033090 2016-03-10 0.032184 2016-03-11 0.032575 2016-03-12 0.036920 2016-03-13 0.015670 2016-03-14 0.036549 2016-03-15 0.034284 2016-03-16 0.029610 2016-03-17 0.031628 2016-03-18 0.012911 2016-03-19 0.034778 2016-03-20 0.037887 2016-03-21 0.037373 2016-03-22 0.032987 2016-03-23 0.032225 2016-03-24 0.029342 2016-03-25 0.031607 2016-03-26 0.032204 2016-03-27 0.031092 2016-03-28 0.034860 2016-03-29 0.034099 2016-03-30 0.033687 2016-03-31 0.031834 2016-04-01 0.033687 2016-04-02 0.035478 2016-04-03 0.038608 2016-04-04 0.036487 2016-04-05 0.013096 2016-04-06 0.003171 2016-04-07 0.001400 Name: date_crawled, dtype: float64 ad_created : 2015-06-11 0.000021 2015-08-10 0.000021 2015-09-09 0.000021 2015-11-10 0.000021 2015-12-05 0.000021 2015-12-30 0.000021 2016-01-03 0.000021 2016-01-07 0.000021 2016-01-10 0.000041 2016-01-13 0.000021 2016-01-14 0.000021 2016-01-16 0.000021 2016-01-22 0.000021 2016-01-27 0.000062 2016-01-29 0.000021 2016-02-01 0.000021 2016-02-02 0.000041 2016-02-05 0.000041 2016-02-07 0.000021 2016-02-08 0.000021 2016-02-09 0.000021 2016-02-11 0.000021 2016-02-12 0.000041 2016-02-14 0.000041 2016-02-16 0.000021 2016-02-17 0.000021 2016-02-18 0.000041 2016-02-19 0.000062 2016-02-20 0.000041 2016-02-21 0.000062 ... 2016-03-09 0.033151 2016-03-10 0.031895 2016-03-11 0.032904 2016-03-12 0.036755 2016-03-13 0.017008 2016-03-14 0.035190 2016-03-15 0.034016 2016-03-16 0.030125 2016-03-17 0.031278 2016-03-18 0.013590 2016-03-19 0.033687 2016-03-20 0.037949 2016-03-21 0.037579 2016-03-22 0.032801 2016-03-23 0.032060 2016-03-24 0.029280 2016-03-25 0.031751 2016-03-26 0.032266 2016-03-27 0.030989 2016-03-28 0.034984 2016-03-29 0.034037 2016-03-30 0.033501 2016-03-31 0.031875 2016-04-01 0.033687 2016-04-02 0.035149 2016-04-03 0.038855 2016-04-04 0.036858 2016-04-05 0.011819 2016-04-06 0.003253 2016-04-07 0.001256 Name: ad_created, Length: 76, dtype: float64 last_seen : 2016-03-05 0.001071 2016-03-06 0.004324 2016-03-07 0.005395 2016-03-08 0.007413 2016-03-09 0.009595 2016-03-10 0.010666 2016-03-11 0.012375 2016-03-12 0.023783 2016-03-13 0.008895 2016-03-14 0.012602 2016-03-15 0.015876 2016-03-16 0.016452 2016-03-17 0.028086 2016-03-18 0.007351 2016-03-19 0.015834 2016-03-20 0.020653 2016-03-21 0.020632 2016-03-22 0.021373 2016-03-23 0.018532 2016-03-24 0.019767 2016-03-25 0.019211 2016-03-26 0.016802 2016-03-27 0.015649 2016-03-28 0.020859 2016-03-29 0.022341 2016-03-30 0.024771 2016-03-31 0.023783 2016-04-01 0.022794 2016-04-02 0.024915 2016-04-03 0.025203 2016-04-04 0.024483 2016-04-05 0.124761 2016-04-06 0.221806 2016-04-07 0.131947 Name: last_seen, dtype: float64
date_crawled
spans a ~one month range of dates from 2016/3/5 to 2016/4/7. From 0.3-3.8%ad_created
ranges from 2015/06/11 to 2016/04/07.last_seen
ranges from 2016-03-05 to 2016-04-07for c in ["date_crawled", "ad_created", "last_seen"]:
print(c,":")
print(pd.to_datetime(autos[c].str[:10]).describe())
print()
date_crawled : count 48565 unique 34 top 2016-04-03 00:00:00 freq 1875 first 2016-03-05 00:00:00 last 2016-04-07 00:00:00 Name: date_crawled, dtype: object ad_created : count 48565 unique 76 top 2016-04-03 00:00:00 freq 1887 first 2015-06-11 00:00:00 last 2016-04-07 00:00:00 Name: ad_created, dtype: object last_seen : count 48565 unique 34 top 2016-04-06 00:00:00 freq 10772 first 2016-03-05 00:00:00 last 2016-04-07 00:00:00 Name: last_seen, dtype: object
It is strange that out of 50000 listings, created over a span of almost a year, their are only 76 unique creation dates.
The range in the number of pages crawled on a given day is interesting. It varies over a pretty wide range. Thinking it through, it's probably a product of both the rate of the spider and the rate of ad creation over time. If the spider can't crawl all the ads in a given run, it may not detect a new ad for some period of time. Ads may be missed if they end before detection.
autos["registration_year"].describe()
count 48565.000000 mean 2004.755421 std 88.643887 min 1000.000000 25% 1999.000000 50% 2004.000000 75% 2008.000000 max 9999.000000 Name: registration_year, dtype: float64
autos["registration_year"].value_counts().sort_index()
1000 1 1001 1 1111 1 1800 2 1910 5 1927 1 1929 1 1931 1 1934 2 1937 4 1938 1 1939 1 1941 2 1943 1 1948 1 1950 3 1951 2 1952 1 1953 1 1954 2 1955 2 1956 4 1957 2 1958 4 1959 6 1960 23 1961 6 1962 4 1963 8 1964 12 ... 2000 3156 2001 2636 2002 2486 2003 2699 2004 2703 2005 2936 2006 2670 2007 2277 2008 2215 2009 2085 2010 1589 2011 1623 2012 1310 2013 803 2014 663 2015 392 2016 1220 2017 1392 2018 470 2019 2 2800 1 4100 1 4500 1 4800 1 5000 4 5911 1 6200 1 8888 1 9000 1 9999 3 Name: registration_year, Length: 95, dtype: int64
autos.loc[autos["registration_year"].between(1900,2016),"registration_year"].describe()
count 46681.000000 mean 2002.910756 std 7.185103 min 1910.000000 25% 1999.000000 50% 2003.000000 75% 2008.000000 max 2016.000000 Name: registration_year, dtype: float64
There are a number of entries with bogus registration years. I've eliminated anything more recent than 2016, when the crawls were performed. More precise would be to compare ad creation date to year and month of registration, but this is good enough to get a sense of the data.
I've also elimated anything with a registration date earlier than 1900. This is, on the one hand a bit anal, since it eliminates less than a dozen listings, while also possibly a bit lax, since I don't know when Germany first started registering cars.
The average car sold was ~15y old, and 75% of cars were at least ~8y old.
autos.loc[autos["registration_year"] < 1935,
["name", "model", "brand", "registration_year", "registration_month"]
]
name | model | brand | registration_year | registration_month | |
---|---|---|---|---|---|
2221 | Sehr_seltener_Oldtimer_Opel_1210_zum_Restaurieren | andere | opel | 1934 | 0 |
2573 | Hanomag_rekord_15k_Suche_ersatz_teile | NaN | sonstige_autos | 1934 | 1 |
3679 | Suche_Auto | NaN | sonstige_autos | 1910 | 0 |
10556 | UNFAL_Auto | NaN | mitsubishi | 1800 | 2 |
11246 | Ford_Model_A_Roadster_Deluxe_1931 | andere | ford | 1931 | 7 |
21416 | Essex_super_six__Ford_A | andere | ford | 1927 | 5 |
22101 | BMW_Andere | andere | bmw | 1929 | 1 |
22316 | VW_Kaefer.__Zwei_zum_Preis_von_einem. | kaefer | volkswagen | 1000 | 0 |
22659 | Opel_Corsa_B | corsa | opel | 1910 | 0 |
24511 | Trabant__wartburg__Ostalgie | NaN | trabant | 1111 | 0 |
28693 | Renault_Twingo | NaN | renault | 1910 | 0 |
30781 | Opel_Calibra_V6_DTM_Bausatz_1:24 | calibra | opel | 1910 | 0 |
32585 | UNFAL_Auto | NaN | mitsubishi | 1800 | 2 |
45157 | Motorhaube | NaN | trabant | 1910 | 0 |
49283 | Citroen_HY | andere | citroen | 1001 | 0 |
To my eye, the earliest genuine registration is a 1927 Ford Model A. The next is a 1931 Ford Model A. In between the two there is a BMW Andere (Other) registered 1929/1. There are some other early 30s listings that appear to be old cars. I'm going to pick 1927 as the earliest legitimate registration year.
autos.loc[autos["registration_year"] > 2016,
["name", "model", "brand", "registration_year", "registration_month"]
]
name | model | brand | registration_year | registration_month | |
---|---|---|---|---|---|
10 | VW_Golf_Tuning_in_siber/grau | NaN | volkswagen | 2017 | 4 |
55 | Mercedes_E320_AMG_zu_Tauschen! | e_klasse | mercedes_benz | 2017 | 7 |
65 | Ford_Fiesta_zum_ausschlachten | fiesta | ford | 2017 | 9 |
68 | Mini_cooper_s_clubman_/vollausstattung_/_Navi/... | clubman | mini | 2017 | 0 |
84 | Renault_twingo | twingo | renault | 2018 | 0 |
113 | Golf_4_Anfaenger_auto | golf | volkswagen | 2017 | 7 |
164 | Opel_Meriva__nur_76000_Km__unfallfrei__scheckh... | meriva | opel | 2018 | 4 |
197 | VW_Polo_9N_an_Bastler | polo | volkswagen | 2017 | 7 |
253 | Ford_mondeo_Gas_anlage_mit_TÜV_04.2017 | mondeo | ford | 2017 | 8 |
348 | VW_Beetle_1.8Turbo_mit_Vollausstattung_und_seh... | beetle | volkswagen | 2017 | 7 |
390 | Fiat_Bertone_X_1_9__X_1/9__X19__X_19__X1_9__X_19 | andere | fiat | 2018 | 6 |
438 | VW_Golf_5_V_1.6_2004_Klima_Tempomat | golf | volkswagen | 2017 | 0 |
453 | Armee_Jeep | andere | jeep | 4500 | 0 |
454 | Ford_Fiesta_Servo_D3_Kat_86_Steuern_65.000_tkm | fiesta | ford | 2017 | 0 |
457 | Mercedes_Benz_W203_C200_Kombi_Kompressor_c_kla... | c_klasse | mercedes_benz | 2017 | 8 |
477 | Citroën_C4 | NaN | citroen | 2017 | 2 |
545 | Bmw_x5_/__E53 | x_reihe | bmw | 2017 | 7 |
548 | Peugeot_206_1.9_Hdi_4_Tueren_Tuev_02.17 | 2_reihe | peugeot | 2017 | 6 |
557 | Smart_Fortwo_Passion_MHD | fortwo | smart | 2017 | 4 |
568 | PRIMA_ANFÄNGERAUTO_smart_fortwo_cabrio | fortwo | smart | 2017 | 12 |
577 | Volkswagen_Golf_3 | golf | volkswagen | 2017 | 1 |
582 | V10_TDI_TOUAREG | touareg | volkswagen | 2017 | 7 |
654 | Ich_verkaufe_meinen_gepflegten_daewoo | kalos | daewoo | 2017 | 5 |
740 | Seat_Leon_zu_verkaufen! | leon | seat | 2017 | 2 |
754 | Ford_Ka_Kleinwagen_BJ_2008 | ka | ford | 2017 | 3 |
758 | Honda_Civic_mit_TÜV_!!! | civic | honda | 2017 | 12 |
765 | BMW_3er_Limousine_mit_1_Jahr_TÜV!!!! | 3er | bmw | 2017 | 3 |
802 | Lada_mit_wenig_km_neuem_Tuev_bj_08 | kalina | lada | 2018 | 0 |
871 | Citroen_C2_60_PS | c2 | citroen | 2017 | 3 |
889 | Verkaufe_Opel_Corsa_C_1_2 | corsa | opel | 2017 | 10 |
... | ... | ... | ... | ... | ... |
49108 | TDI_20_L_DIESEL140_PS..AUTOMATICK | NaN | volkswagen | 2017 | 1 |
49130 | Renault_Modus | modus | renault | 2017 | 6 |
49153 | Corsa_c20xe | corsa | opel | 5000 | 0 |
49154 | Auto_mit_TÜV_bis_September_2017 | ibiza | seat | 2017 | 7 |
49178 | Mazda_323_F_TÜV__neu_2te_Hand_viele_Neuteile | 3_reihe | mazda | 2018 | 6 |
49185 | Alfa_Romeo_156_Sportwagon_1_8_T_Spark | 156 | alfa_romeo | 2019 | 5 |
49218 | Ford_Fiesta_1.3_4/5_Tuerer_16_Monate_Tuev | fiesta | ford | 2018 | 4 |
49245 | Audi_A3_1.8T_tausch_moegl. | a3 | audi | 2017 | 4 |
49259 | Volkswagen_Lupo_TÜV_06.2017_Klima | lupo | volkswagen | 2017 | 0 |
49261 | Toyota_aygo | aygo | toyota | 2017 | 6 |
49262 | Opel_vectra | NaN | opel | 2018 | 0 |
49266 | Toyota_Corolla_Verso_7_Sitzer_104.000km_2_2_Di... | NaN | toyota | 2017 | 12 |
49343 | bmw_320i_touring | 3er | bmw | 2017 | 0 |
49347 | Honda_civic_mit_TÜV_3/2017 | civic | honda | 2017 | 0 |
49354 | Bmw_e39_523i_mit_neuem_Tuev | 5er | bmw | 2018 | 8 |
49389 | Astra_G_CC | astra | opel | 2017 | 0 |
49411 | Renault_twingo_Tuev_neu | twingo | renault | 2018 | 0 |
49522 | Mazda_Demio_zu_verkaufen!_RESERVIERT | NaN | mazda | 2017 | 12 |
49557 | E320_T_cdi_avangart | e_klasse | mercedes_benz | 2017 | 3 |
49561 | Ford_Galaxy_1_8_TDCI_7_Sitze_Paoramadach_DPF_AHK | galaxy | ford | 2017 | 6 |
49653 | Top_zustand_privat | scenic | renault | 2017 | 8 |
49662 | Daihatsu_Coure_122tkm_Tuev_1/17_2.Hand | cuore | daihatsu | 2017 | 0 |
49689 | VW_GOLF_V_1.9 | NaN | volkswagen | 2017 | 0 |
49696 | Audi_A3_1_6_Attraction_mit_TÜV_10/17 | a3 | audi | 2017 | 8 |
49731 | Renault_MEGANE_1.6_16V__neu_ZANHRIEMEN_TOP_ZUS... | megane | renault | 2017 | 7 |
49770 | VW_Polo_6n_Tuev_Neu!__1.6_75PS | polo | volkswagen | 2018 | 12 |
49796 | Opel_corsa_1.4_zu_verkaufen | corsa | opel | 2017 | 7 |
49841 | Passat_abzugeben. | passat | volkswagen | 2017 | 7 |
49910 | Schoener_fast_neuer_Opel_Mokka_in_Zell_Mosel_m... | andere | opel | 9000 | 3 |
49935 | Mercedes_A_klasse_angemeldet_mit_Tuef_und_Auto... | a_klasse | mercedes_benz | 2017 | 9 |
1879 rows × 5 columns
autos.loc[autos["registration_year"] > 2016, "registration_year"].value_counts()
2017 1392 2018 470 5000 4 9999 3 2019 2 5911 1 9000 1 2800 1 4800 1 8888 1 4500 1 6200 1 4100 1 Name: registration_year, dtype: int64
There are 1453 entries with a registration date of 2017, and 492 with a registration date of 2018. This makes me think it may be worthwhile to look within 2016 for bogus registration dates. The easiest is to look for anythin with a reg month more recent than 4/2016, since that's when the last crawl was done.
autos.loc[(autos["registration_year"] == 2016) & (autos["registration_month"] > 4),
["name", "model", "brand", "registration_year", "registration_month"]]
name | model | brand | registration_year | registration_month | |
---|---|---|---|---|---|
135 | Opel_Meriva_B_Panoramadach__Sitz__und_Lenkradh... | meriva | opel | 2016 | 8 |
256 | Passat_1.9TDI_4Motion_Highline | passat | volkswagen | 2016 | 11 |
295 | Privat_anbiter | astra | opel | 2016 | 8 |
307 | Giessen_ford | focus | ford | 2016 | 8 |
437 | Mazda__klima_leder__Alufelgen | NaN | mazda | 2016 | 7 |
495 | BMW_1er_116i | 1er | bmw | 2016 | 12 |
507 | Citroën_Xsara | NaN | citroen | 2016 | 7 |
574 | VW_vento_2_8_vr6 | andere | volkswagen | 2016 | 5 |
675 | Verkaufe_honda_civic | civic | honda | 2016 | 11 |
756 | Golf_3_Variante_tuev_bis_Juni | golf | volkswagen | 2016 | 8 |
914 | Opel_Zafira__AHK__Standheizung__Mp3 | zafira | opel | 2016 | 11 |
942 | Colt_zu_verkaufen | colt | mitsubishi | 2016 | 10 |
1021 | BMW_316i_Leder_Voll_FAHRBETEIT | 1er | bmw | 2016 | 6 |
1039 | Vw_polo_classic | NaN | volkswagen | 2016 | 7 |
1153 | Audi_A3_Sportback_2.0TDI_S_Line | a3 | audi | 2016 | 5 |
1268 | PKW._Volkswagen_Golf_2_Madison_Edition_Typ_19E | golf | volkswagen | 2016 | 6 |
1326 | Opel_Meriva_A_in_einem_gepflegten_Zustand | meriva | opel | 2016 | 11 |
1389 | Fiat_Punto_1.2_60ps | punto | fiat | 2016 | 5 |
1474 | Golf_Vr6_Turbo_HGP_Turbo_Alles_Eingetragen | golf | volkswagen | 2016 | 5 |
1593 | BMW_520I_AUTOMATIKGETRIEBE | 5er | bmw | 2016 | 6 |
1663 | Opel_vectra_b_2_5_V6_mit_TÜV | vectra | opel | 2016 | 5 |
1741 | Vw_Golf_4_2.0__121000_km | NaN | volkswagen | 2016 | 8 |
2014 | Honda_civic_ej2_Coupe_USA_Model | NaN | honda | 2016 | 5 |
2170 | VW_Lupo_1.0_wenig_Km_Sitzheizung_._El._Fenster | lupo | volkswagen | 2016 | 7 |
2333 | Mitsubishi_Colt_+_Klimaanlage | colt | mitsubishi | 2016 | 11 |
2350 | Opel_corsa_b_Edition_100_cool | corsa | opel | 2016 | 7 |
2371 | Dringet_zuverkaufen_audi_a4_mit_maengel | NaN | audi | 2016 | 12 |
2446 | Golf_5__TDI_1.9 | golf | volkswagen | 2016 | 8 |
2776 | Bastlerfahrzeug_Seat_Ibiza | ibiza | seat | 2016 | 12 |
2837 | Seat_Leon_1.4_Euro3_und_D4_Klima | leon | seat | 2016 | 6 |
... | ... | ... | ... | ... | ... |
47982 | Renault_Espace_2.0_136_PS_1998ccm | espace | renault | 2016 | 6 |
48021 | Golf_4_1_4_tuev_bis_6/16 | golf | volkswagen | 2016 | 5 |
48058 | BMW_120D_Coupe_NAVI/_KEYLESS/LEDER/_TEMPOMAT/_PDC | 1er | bmw | 2016 | 5 |
48062 | Fiat_Scudo_1_6benzin | NaN | fiat | 2016 | 8 |
48198 | VW_Passat_3bg_tdi_131ps | passat | volkswagen | 2016 | 12 |
48317 | e36_325i_coupe_neulack_bilder_folgen_bbs_rf_st... | 3er | bmw | 2016 | 6 |
48347 | OPEL_CORSA_1.2 | corsa | opel | 2016 | 9 |
48352 | Ford_Mondeo_1.6_16v | mondeo | ford | 2016 | 12 |
48464 | A_Klasse_170_CDI_Klima_sitzheizung | a_klasse | mercedes_benz | 2016 | 8 |
48514 | Renault_Senic_Megane_16V_nur_34500km_TÜV_neu | NaN | renault | 2016 | 8 |
48634 | NUR_HEUTE_1100_MAZDA_PREMACY | andere | mazda | 2016 | 11 |
48736 | Nissan_Micra_1l_mit_klima | NaN | nissan | 2016 | 12 |
48752 | Opel_Astra_1_4T_MAHAGONIBRAUN_1.HAND | NaN | opel | 2016 | 10 |
48828 | Grand_Cherokee_ZJ__5_2l__neue_Stossdaempfer__a... | NaN | jeep | 2016 | 6 |
48866 | Opel_vectra_B_1_8_16v | vectra | opel | 2016 | 5 |
48893 | Polo_Kombi | polo | volkswagen | 2016 | 6 |
48952 | Audi_80 | NaN | audi | 2016 | 6 |
49145 | Grand_Voyager_Chrysler_mit_TÜV_bis_08.2016 | voyager | chrysler | 2016 | 10 |
49195 | Golf_IV_Generation__1_4l__55_kw__Klima | golf | volkswagen | 2016 | 9 |
49239 | Audi_Coupe_typ_89_quattro_2_6_s2 | NaN | audi | 2016 | 12 |
49258 | BMW_e36_328i_Cabrio_!!!Neu_lackiert!!! | NaN | bmw | 2016 | 5 |
49385 | Golf_3_500_Festpreis | golf | volkswagen | 2016 | 5 |
49417 | Opel_vectra_gruene_Plakette_guter_Zustand | vectra | opel | 2016 | 7 |
49528 | BMW_E36__316i_Compact_M_Paket_"Schlachtfest" | 3er | bmw | 2016 | 12 |
49531 | Fiat_Merena_Diesel_Kombi | NaN | fiat | 2016 | 9 |
49547 | Smart_Passion_mit_Panorama_Dach | fortwo | smart | 2016 | 10 |
49852 | TOP__Golf_3_1.8l | golf | volkswagen | 2016 | 7 |
49876 | Audi_a5_3.0_tdi_s_line | NaN | audi | 2016 | 10 |
49919 | Fiat_Punto | punto | fiat | 2016 | 8 |
49938 | Mercedes_Benz_A_160_Avantgarde | a_klasse | mercedes_benz | 2016 | 6 |
618 rows × 5 columns
autos = autos[autos["registration_year"].between(1917,2016)]
autos = autos[~((autos["registration_year"] == 2016) & (autos["registration_month"] > 4))]
autos["registration_year"].value_counts(normalize=True)
2000 0.068522 2005 0.063746 1999 0.062899 2004 0.058687 2003 0.058600 2006 0.057970 2001 0.057232 2002 0.053975 1998 0.051305 2007 0.049438 2008 0.048092 2009 0.045269 1997 0.042360 2011 0.035238 2010 0.034500 1996 0.029810 2012 0.028442 1995 0.026640 2013 0.017435 2014 0.014395 1994 0.013657 2016 0.013070 1993 0.009227 2015 0.008511 1992 0.008033 1990 0.007534 1991 0.007360 1989 0.003778 1988 0.002931 1985 0.002063 ... 1966 0.000478 1977 0.000478 1976 0.000456 1969 0.000413 1975 0.000391 1965 0.000369 1964 0.000261 1963 0.000174 1961 0.000130 1959 0.000130 1956 0.000087 1958 0.000087 1937 0.000087 1962 0.000087 1950 0.000065 1954 0.000043 1941 0.000043 1934 0.000043 1957 0.000043 1951 0.000043 1955 0.000043 1931 0.000022 1953 0.000022 1943 0.000022 1938 0.000022 1939 0.000022 1927 0.000022 1929 0.000022 1948 0.000022 1952 0.000022 Name: registration_year, Length: 77, dtype: float64
autos["registration_year"].describe()
count 46058.000000 mean 2002.745213 std 7.004326 min 1927.000000 25% 1999.000000 50% 2003.000000 75% 2007.000000 max 2016.000000 Name: registration_year, dtype: float64
I've eliminated 2638 entries with problematic registration years. My observations after my first refinement of the dataset still holds:
The average car sold was ~15y old, and 75% of cars were at least ~8y old.
len(autos)
46058
autos["brand"].describe()
count 46058 unique 40 top volkswagen freq 9708 Name: brand, dtype: object
autos["brand"].value_counts(normalize=True)
volkswagen 0.210778 bmw 0.110621 opel 0.107061 mercedes_benz 0.096813 audi 0.086912 ford 0.069716 renault 0.047071 peugeot 0.029767 fiat 0.025598 seat 0.018129 skoda 0.016501 nissan 0.015198 mazda 0.015155 smart 0.014069 citroen 0.013939 toyota 0.012723 hyundai 0.010053 sonstige_autos 0.009879 volvo 0.009227 mini 0.008793 mitsubishi 0.008207 honda 0.007838 kia 0.007165 alfa_romeo 0.006622 porsche 0.006166 suzuki 0.005949 chevrolet 0.005710 chrysler 0.003496 dacia 0.002627 daihatsu 0.002519 jeep 0.002280 subaru 0.002149 land_rover 0.002106 saab 0.001672 jaguar 0.001585 daewoo 0.001520 trabant 0.001390 rover 0.001324 lancia 0.001086 lada 0.000586 Name: brand, dtype: float64
top10_counts = autos["brand"].value_counts(normalize=True)[:10]
top10_counts.sum()
0.8024664553389205
There are forty unique auto brands represented in the dataset. The top four brands account for ~50% of the entries. Three are German domestic manufacturers, the fourth is Opel, which is GM's German subsidiary. The fifth most common brand is Audi, also German. The top five account for ~60% of the cars in the dataset.
The top ten brands account for 80% of the dataset.
I am going to look at some key aggregate details for the top ten brands.
top_brands = [brand for brand in top10_counts.index]
top_brands
['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford', 'renault', 'peugeot', 'fiat', 'seat']
brand_mean_prices = {}
for brand in top_brands:
selection = autos.loc[autos["brand"] == brand]
brand_mean_prices[brand] = selection["price"].mean()
brand_mean_prices
{'audi': 9387.175118661004, 'bmw': 8367.863788027478, 'fiat': 2830.815945716709, 'ford': 3790.819059483027, 'mercedes_benz': 8681.313298945952, 'opel': 2996.046643682823, 'peugeot': 3108.5754923413565, 'renault': 2493.9561808118083, 'seat': 4443.226347305389, 'volkswagen': 5444.281314379893}
sum(brand_mean_prices.values())/len(brand_mean_prices.values())
5154.407318935543
autos["price"].describe()
count 46058.000000 mean 6022.932954 std 9219.593098 min 1.000000 25% 1250.000000 50% 3150.000000 75% 7600.000000 max 350000.000000 Name: price, dtype: float64
The mean price for all cars in the dataset is $6,022. (It was $9,685 before removing (~10) cars above $900,000 from the dataset)
The mean price for cars from the top ten brands is $5154
The mean price of cars from the other top-ten brands is below the dataset mean.
Among USA-based manufactuers, the average Ford sells for 26% more than the average Opel.
Among the top tend brands, Audi, Mercedes and BMW cars command a premium. Volkswagon is right in the middle and Fiat, Ford, Opel, Peugot and Renault are budget brands.
brand_mean_mileage = {}
for brand in top_brands:
selection = autos.loc[autos["brand"] == brand]
brand_mean_mileage[brand] = selection["odometer_km"].mean()
brand_mean_mileage
{'audi': 129056.95728203848, 'bmw': 132490.67713444555, 'fiat': 116849.0245971162, 'ford': 124051.6972905637, 'mercedes_benz': 130638.0354339538, 'opel': 129196.91746096127, 'peugeot': 126867.25018234865, 'renault': 127984.31734317343, 'seat': 120838.32335329341, 'volkswagen': 128550.16481252575}
bmp_series = pd.Series(brand_mean_prices)
bmm_series = pd.Series(brand_mean_mileage)
brand_data = pd.DataFrame(bmp_series, columns=['mean_price'])
brand_data["mean_mileage"] = bmm_series
brand_data
mean_price | mean_mileage | |
---|---|---|
audi | 9387.175119 | 129056.957282 |
bmw | 8367.863788 | 132490.677134 |
fiat | 2830.815946 | 116849.024597 |
ford | 3790.819059 | 124051.697291 |
mercedes_benz | 8681.313299 | 130638.035434 |
opel | 2996.046644 | 129196.917461 |
peugeot | 3108.575492 | 126867.250182 |
renault | 2493.956181 | 127984.317343 |
seat | 4443.226347 | 120838.323353 |
volkswagen | 5444.281314 | 128550.164813 |
brand_data.describe()
mean_price | mean_mileage | |
---|---|---|
count | 10.000000 | 10.000000 |
mean | 5154.407319 | 126652.336489 |
std | 2676.064597 | 4764.710803 |
min | 2493.956181 | 116849.024597 |
25% | 3024.178856 | 124755.585514 |
50% | 4117.022703 | 128267.241078 |
75% | 7636.968170 | 129161.927416 |
max | 9387.175119 | 132490.677134 |
It would be interesting to know the original selling price of the vehicles, so one could look at % depreciation vs mileage.
autos.head()
date_crawled | name | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | 5000 | control | bus | 2004 | manuell | 158 | andere | 150000 | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | 8500 | control | limousine | 1997 | automatik | 286 | 7er | 150000 | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | 8990 | test | limousine | 2009 | manuell | 102 | golf | 70000 | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | 4350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70000 | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | 1350 | test | kombi | 2003 | manuell | 0 | focus | 150000 | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 39218 | 2016-04-01 14:38:50 |
#saving myself some typing?
to_translate =["vehicle_type", "gearbox", "fuel_type", "unrepaired_damage"]
for c in to_translate:
values = autos[c].unique()
temp_dict = {v:v for v in values}
print(c," = ", end="")
pp.pprint(temp_dict)
vehicle_type = {nan: nan, 'andere': 'andere', 'bus': 'bus', 'cabrio': 'cabrio', 'coupe': 'coupe', 'kleinwagen': 'kleinwagen', 'kombi': 'kombi', 'limousine': 'limousine', 'suv': 'suv'} gearbox = {nan: nan, 'automatik': 'automatik', 'manuell': 'manuell'} fuel_type = {nan: nan, 'andere': 'andere', 'benzin': 'benzin', 'cng': 'cng', 'diesel': 'diesel', 'elektro': 'elektro', 'hybrid': 'hybrid', 'lpg': 'lpg'} unrepaired_damage = {nan: nan, 'ja': 'ja', 'nein': 'nein'}
vehicle_type_translation = {'andere': 'other',
'bus': 'bus',
'cabrio': 'convertable',
'coupe': 'coupe',
'kleinwagen': 'subcompact',
'kombi': 'stationwagon',
'limousine': 'limousine',
'suv': 'suv'}
autos["vehicle_type"] = autos["vehicle_type"].map(vehicle_type_translation)
gearbox_translation = {'automatik': 'automatic', 'manuell': 'manual'}
autos["gearbox"] = autos["gearbox"].map(gearbox_translation)
fuel_type_translation = {'andere': 'other',
'benzin': 'gasoline',
'cng': 'cng',
'diesel': 'diesel',
'elektro': 'electric',
'hybrid': 'hybrid',
'lpg': 'lpg'}
autos["fuel_type"] = autos["fuel_type"].map(fuel_type_translation)
unrepaired_damage_translation = {'ja': 'yes', 'nein': 'no'}
autos["unrepaired_damage"] = autos["unrepaired_damage"].map(unrepaired_damage_translation)
autos["model"] = autos["model"].str.replace("andere","other")
autos.head()
date_crawled | name | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | 5000 | control | bus | 2004 | manual | 158 | other | 150000 | 3 | lpg | peugeot | no | 2016-03-26 00:00:00 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | 8500 | control | limousine | 1997 | automatic | 286 | 7er | 150000 | 6 | gasoline | bmw | no | 2016-04-04 00:00:00 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | 8990 | test | limousine | 2009 | manual | 102 | golf | 70000 | 7 | gasoline | volkswagen | no | 2016-03-26 00:00:00 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | 4350 | control | subcompact | 2007 | automatic | 71 | fortwo | 70000 | 6 | gasoline | smart | no | 2016-03-12 00:00:00 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | 1350 | test | stationwagon | 2003 | manual | 0 | focus | 150000 | 7 | gasoline | ford | no | 2016-04-01 00:00:00 | 39218 | 2016-04-01 14:38:50 |
date_cols = ["date_crawled", "ad_created", "last_seen"]
for c in date_cols:
autos[c] = autos[c].str[:10].str.replace("-","").astype(int)
autos.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 46058 entries, 0 to 49999 Data columns (total 17 columns): date_crawled 46058 non-null int64 name 46058 non-null object price 46058 non-null int64 abtest 46058 non-null object vehicle_type 43975 non-null object registration_year 46058 non-null int64 gearbox 44009 non-null object power_ps 46058 non-null int64 model 43938 non-null object odometer_km 46058 non-null int64 registration_month 46058 non-null int64 fuel_type 42999 non-null object brand 46058 non-null object unrepaired_damage 38004 non-null object ad_created 46058 non-null int64 postal_code 46058 non-null int64 last_seen 46058 non-null int64 dtypes: int64(9), object(8) memory usage: 6.3+ MB
autos[date_cols].head()
date_crawled | ad_created | last_seen | |
---|---|---|---|
0 | 20160326 | 20160326 | 20160406 |
1 | 20160404 | 20160404 | 20160406 |
2 | 20160326 | 20160326 | 20160406 |
3 | 20160312 | 20160312 | 20160315 |
4 | 20160401 | 20160401 | 20160401 |
autos["name"]
0 Peugeot_807_160_NAVTECH_ON_BOARD 1 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik 2 Volkswagen_Golf_1.6_United 3 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... 4 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... 5 Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto... 6 VW_Golf_III_GT_Special_Electronic_Green_Metall... 7 Golf_IV_1.9_TDI_90PS 8 Seat_Arosa 9 Renault_Megane_Scenic_1.6e_RT_Klimaanlage 11 Mercedes_A140_Motorschaden 12 Smart_smart_fortwo_coupe_softouch_pure_MHD_Pan... 13 Audi_A3_1.6_tuning 14 Renault_Clio_3__Dynamique_1.2__16_V;_viele_Ver... 15 Corvette_C3_Coupe_T_Top_Crossfire_Injection 16 Opel_Vectra_B_Kombi 17 Volkswagen_Scirocco_2_G60 18 Verkaufen_mein_bmw_e36_320_i_touring 19 mazda_tribute_2.0_mit_gas_und_tuev_neu_2018 20 Audi_A4_Avant_1.9_TDI_*6_Gang*AHK*Klimatronik*... 21 Porsche_911_Carrera_4S_Cabrio 22 MINI_Cooper_S_Cabrio 23 Peugeot_Boxer_2_2_HDi_120_Ps_9_Sitzer_inkl_Klima 24 BMW_535i_xDrive_Sport_Aut. 25 Ford_escort_kombi_an_bastler_mit_ghia_ausstattung 26 Volkswagen_Polo_Fox 28 MINI_Cooper_D 29 Mercedes_Benz_E_320_T_CDI_Avantgarde_DPF7_Sitz... 30 Peugeot_206_Unfallfahrzeug 31 Noch_gut_erhaltenen_C_320 ... 49968 Mercedes_Benz_190_D_2.5_Automatik 49969 Nissan_X_Trail_2.2_dCi_4x4_Sport_m.AHZ 49970 c4_Grand_Picasso_mit_Automatik_Leder_Navi_Temp... 49971 W.Lupo_1.0 49972 Mercedes_Benz_Vito_115_CDI_Extralang_Aut. 49973 Mercedes_Benz_SLK_200_Kompressor 49975 Honda_Jazz_1.3_DSi_i_VTEC_IMA_CVT_Comfort 49976 Audi_80_Avant_2.6_E__Vollausstattung!!_Einziga... 49977 Mercedes_Benz_C200_Cdi_W203 49978 Mercedes_Benz_E_200_Classic 49979 Volkswagen_Polo_1.6_TDI_Style 49980 Ford_Escort_Turnier_16V 49981 Opel_Astra_Kombi_mit_Anhaengerkupplung 49982 Skoda_Fabia_4_Tuerer_Bj:2004__85.000Tkm 49983 Ford_focus_99 49985 Verkaufe_meinen_vw_vento! 49986 Chrysler_300C_3.0_CRD_DPF_Automatik_Voll_Ausst... 49987 Audi_A3_Limousine_2.0_TDI_DPF_Ambition__NAVI__... 49988 BMW_330_Ci 49989 VW_Polo_zum_Ausschlachten_oder_Wiederaufbau 49990 Mercedes_Benz_A_200__BlueEFFICIENCY__Urban 49991 Kleinwagen 49992 Fiat_Grande_Punto_1.4_T_Jet_16V_Sport 49993 Audi_A3__1_8l__Silber;_schoenes_Fahrzeug 49994 Audi_A6__S6__Avant_4.2_quattro_eventuell_Tausc... 49995 Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon 49996 Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... 49997 Fiat_500_C_1.2_Dualogic_Lounge 49998 Audi_A3_2.0_TDI_Sportback_Ambition 49999 Opel_Vectra_1.6_16V Name: name, Length: 46058, dtype: object
One might be able to extract engine engine details, like displacement and 16V from the name column in some of the rows.
autos["name"].str.extract(r'_(\d\.\d+)').value_counts()
/dataquest/system/env/python3/lib/python3.4/site-packages/ipykernel/__main__.py:1: FutureWarning: currently extract(expand=None) means expand=False (return Index/Series/DataFrame) but in a future version of pandas this will be changed to expand=True (return DataFrame) if __name__ == '__main__':
2.0 3619 1.6 2643 1.4 2169 1.2 1623 1.9 1593 1.8 1444 3.0 671 1.0 648 2.5 497 2.2 487 1.3 374 2.4 207 1.1 190 1.7 186 1.5 179 2.7 153 2.8 144 2.3 96 3.2 80 4.2 65 1.25 65 2.6 46 4.0 29 2.9 21 3.3 20 3.5 16 5.0 16 0.8 16 1.33 13 5.2 10 ... 9.16 1 4.18 1 1.200 1 7.7 1 5.28 1 3.9 1 9.5 1 4.8 1 3.800 1 3.1 1 5.9 1 6.5 1 1.500 1 3.2016 1 8.17 1 3.16 1 5.3 1 6.7 1 5.2009 1 7.2017 1 8.2016 1 4.17 1 7.4 1 6.2016 1 5.17 1 5.2017 1 2.2005 1 4.9 1 5.16 1 8.4 1 Name: name, Length: 99, dtype: int64
A simple regular expression matches a lot of plausible values for displacement, as well as a number of implausible values. These implausible values could probably be discarded based some string and numeric filtering. This would likely still yield some bogus values, though. I will leave the rest of this exercise to the reader ;).
brands = np.sort(autos["brand"].unique())
for b in brands:
selection = (autos["brand"] == b)
print(b,":")
print(autos.loc[selection, "model"].value_counts(normalize=True)[:3])
print()
alfa_romeo : 156 0.297945 147 0.267123 other 0.202055 Name: model, dtype: float64 audi : a4 0.314763 a3 0.207445 a6 0.203594 Name: model, dtype: float64 bmw : 3er 0.530892 5er 0.230974 1er 0.105565 Name: model, dtype: float64 chevrolet : other 0.588710 matiz 0.133065 captiva 0.096774 Name: model, dtype: float64 chrysler : other 0.339869 voyager 0.274510 ptcruiser 0.196078 Name: model, dtype: float64 citroen : other 0.350081 berlingo 0.155592 c5 0.116694 Name: model, dtype: float64 dacia : sandero 0.361345 logan 0.352941 duster 0.226891 Name: model, dtype: float64 daewoo : matiz 0.333333 kalos 0.257576 lanos 0.151515 Name: model, dtype: float64 daihatsu : cuore 0.548673 other 0.123894 sirion 0.097345 Name: model, dtype: float64 fiat : punto 0.360889 other 0.202667 500 0.105778 Name: model, dtype: float64 ford : focus 0.243673 fiesta 0.229721 mondeo 0.153796 Name: model, dtype: float64 honda : civic 0.479412 other 0.150000 cr_reihe 0.147059 Name: model, dtype: float64 hyundai : i_reihe 0.395402 other 0.333333 getz 0.110345 Name: model, dtype: float64 jaguar : other 0.541667 x_type 0.277778 s_type 0.180556 Name: model, dtype: float64 jeep : grand 0.43 wrangler 0.24 cherokee 0.18 Name: model, dtype: float64 kia : other 0.207668 sorento 0.166134 sportage 0.166134 Name: model, dtype: float64 lada : niva 0.60 other 0.28 samara 0.12 Name: model, dtype: float64 lancia : ypsilon 0.404255 other 0.234043 lybra 0.212766 Name: model, dtype: float64 land_rover : freelander 0.322917 defender 0.229167 discovery 0.135417 Name: model, dtype: float64 mazda : 3_reihe 0.244344 6_reihe 0.238311 other 0.206637 Name: model, dtype: float64 mercedes_benz : c_klasse 0.260198 e_klasse 0.218944 a_klasse 0.120765 Name: model, dtype: float64 mini : cooper 0.668317 one 0.237624 clubman 0.064356 Name: model, dtype: float64 mitsubishi : colt 0.304469 other 0.265363 carisma 0.159218 Name: model, dtype: float64 nissan : micra 0.351032 other 0.176991 almera 0.131268 Name: model, dtype: float64 opel : corsa 0.328635 astra 0.280034 vectra 0.112350 Name: model, dtype: float64 peugeot : 2_reihe 0.459984 3_reihe 0.245532 1_reihe 0.118104 Name: model, dtype: float64 porsche : 911 0.500000 boxster 0.167857 cayenne 0.167857 Name: model, dtype: float64 renault : twingo 0.286667 clio 0.221905 megane 0.158095 Name: model, dtype: float64 rover : other 0.931034 freelander 0.034483 rangerover 0.017241 Name: model, dtype: float64 saab : other 0.716216 900 0.216216 9000 0.067568 Name: model, dtype: float64 seat : ibiza 0.391626 leon 0.233990 arosa 0.128079 Name: model, dtype: float64 skoda : octavia 0.407507 fabia 0.369973 superb 0.068365 Name: model, dtype: float64 smart : fortwo 0.864952 forfour 0.072347 roadster 0.051447 Name: model, dtype: float64 sonstige_autos : Series([], Name: model, dtype: float64) subaru : legacy 0.322581 impreza 0.247312 justy 0.193548 Name: model, dtype: float64 suzuki : other 0.443609 swift 0.308271 jimny 0.124060 Name: model, dtype: float64 toyota : yaris 0.221053 other 0.185965 corolla 0.149123 Name: model, dtype: float64 trabant : 601 0.854545 other 0.145455 Name: model, dtype: float64 volkswagen : golf 0.389596 polo 0.168254 passat 0.143209 Name: model, dtype: float64 volvo : v70 0.218978 v40 0.211679 other 0.199513 Name: model, dtype: float64
Above we can see the three most common models for each brand. In many cases "other" is one of the most common values, which suggests deficiencies in the listing-creation process.
I'm going to add a new column to the dataset that combines brand and model in case it is useful for later analysis.
autos["brand_model"] = autos["brand"] + "/" + autos["model"]
autos["brand_model"].value_counts().head(20)
volkswagen/golf 3640 bmw/3er 2595 volkswagen/polo 1572 opel/corsa 1562 volkswagen/passat 1338 opel/astra 1331 audi/a4 1226 bmw/5er 1129 mercedes_benz/c_klasse 1129 mercedes_benz/e_klasse 950 audi/a3 808 audi/a6 793 ford/focus 751 ford/fiesta 708 volkswagen/transporter 670 renault/twingo 602 peugeot/2_reihe 592 smart/fortwo 538 opel/vectra 534 mercedes_benz/a_klasse 524 Name: brand_model, dtype: int64
An investigation of the relationship between selling price and odometer readings will have to take into account the wide range in average selling prices between the top ten brands. So, I've aggregated them by brand.
NOTE: Odometer readings in the dataset have already been binned, so no further binning has been performed.
for b in sorted(top_brands):
brand_subset = autos[autos["brand"] == b]
odometer_readings = np.sort(brand_subset["odometer_km"].unique())
print(b,":")
print(brand_subset[["price", "odometer_km"]].mean())
for r in odometer_readings:
odometer_subset = brand_subset[brand_subset["odometer_km"] == r]
print(r,":", round(odometer_subset["price"].mean(),-3))
price_40kkm = brand_subset.loc[brand_subset["odometer_km"] == 40000, "price"].mean()
price_90kkm = brand_subset.loc[brand_subset["odometer_km"] == 90000, "price"].mean()
print("90K/40K price ratio:",
round(price_90kkm/price_40kkm,1))
print()
audi : price 9387.175119 odometer_km 129056.957282 dtype: float64 5000 : 16000.0 10000 : 28000.0 20000 : 27000.0 30000 : 27000.0 40000 : 26000.0 50000 : 23000.0 60000 : 20000.0 70000 : 22000.0 80000 : 16000.0 90000 : 16000.0 100000 : 15000.0 125000 : 12000.0 150000 : 6000.0 90K/40K price ratio: 0.6 bmw : price 8367.863788 odometer_km 132490.677134 dtype: float64 5000 : 10000.0 10000 : 31000.0 20000 : 27000.0 30000 : 24000.0 40000 : 25000.0 50000 : 24000.0 60000 : 21000.0 70000 : 17000.0 80000 : 16000.0 90000 : 15000.0 100000 : 13000.0 125000 : 11000.0 150000 : 6000.0 90K/40K price ratio: 0.6 fiat : price 2830.815946 odometer_km 116849.024597 dtype: float64 5000 : 5000.0 10000 : 10000.0 20000 : 7000.0 30000 : 8000.0 40000 : 7000.0 50000 : 5000.0 60000 : 5000.0 70000 : 5000.0 80000 : 4000.0 90000 : 4000.0 100000 : 3000.0 125000 : 2000.0 150000 : 2000.0 90K/40K price ratio: 0.5 ford : price 3790.819059 odometer_km 124051.697291 dtype: float64 5000 : 8000.0 10000 : 16000.0 20000 : 12000.0 30000 : 13000.0 40000 : 10000.0 50000 : 10000.0 60000 : 7000.0 70000 : 8000.0 80000 : 6000.0 90000 : 4000.0 100000 : 5000.0 125000 : 4000.0 150000 : 2000.0 90K/40K price ratio: 0.4 mercedes_benz : price 8681.313299 odometer_km 130638.035434 dtype: float64 5000 : 11000.0 10000 : 29000.0 20000 : 28000.0 30000 : 19000.0 40000 : 28000.0 50000 : 24000.0 60000 : 22000.0 70000 : 17000.0 80000 : 16000.0 90000 : 16000.0 100000 : 14000.0 125000 : 10000.0 150000 : 6000.0 90K/40K price ratio: 0.6 opel : price 2996.046644 odometer_km 129196.917461 dtype: float64 5000 : 4000.0 10000 : 12000.0 20000 : 9000.0 30000 : 11000.0 40000 : 9000.0 50000 : 9000.0 60000 : 7000.0 70000 : 8000.0 80000 : 6000.0 90000 : 4000.0 100000 : 5000.0 125000 : 3000.0 150000 : 2000.0 90K/40K price ratio: 0.5 peugeot : price 3108.575492 odometer_km 126867.250182 dtype: float64 5000 : 3000.0 10000 : 13000.0 20000 : 5000.0 30000 : 12000.0 40000 : 10000.0 50000 : 8000.0 60000 : 6000.0 70000 : 7000.0 80000 : 5000.0 90000 : 4000.0 100000 : 5000.0 125000 : 4000.0 150000 : 2000.0 90K/40K price ratio: 0.4 renault : price 2493.956181 odometer_km 127984.317343 dtype: float64 5000 : 5000.0 10000 : 8000.0 20000 : 9000.0 30000 : 12000.0 40000 : 7000.0 50000 : 7000.0 60000 : 7000.0 70000 : 6000.0 80000 : 4000.0 90000 : 4000.0 100000 : 3000.0 125000 : 2000.0 150000 : 2000.0 90K/40K price ratio: 0.5 seat : price 4443.226347 odometer_km 120838.323353 dtype: float64 5000 : 6000.0 10000 : 20000.0 20000 : 11000.0 30000 : 15000.0 40000 : 12000.0 50000 : 10000.0 60000 : 10000.0 70000 : 8000.0 80000 : 7000.0 90000 : 6000.0 100000 : 5000.0 125000 : 4000.0 150000 : 2000.0 90K/40K price ratio: 0.5 volkswagen : price 5444.281314 odometer_km 128550.164813 dtype: float64 5000 : 4000.0 10000 : 19000.0 20000 : 17000.0 30000 : 15000.0 40000 : 16000.0 50000 : 13000.0 60000 : 13000.0 70000 : 12000.0 80000 : 11000.0 90000 : 9000.0 100000 : 8000.0 125000 : 6000.0 150000 : 4000.0 90K/40K price ratio: 0.6
As an alternative, lets display the aggreagated data as a data frame.
brand_data['90_40_price_ratio'] = np.NAN #done to place early in column order
odometer_readings = np.sort(autos["odometer_km"].unique())
for r in odometer_readings:
odometer_subset = autos[autos["odometer_km"] == r]
for b in brand_data.index:
brand_subset = odometer_subset[odometer_subset["brand"] == b]
brand_data.loc[b,r] = round(brand_subset["price"].mean(),-3)
price_ratios = {}
for b in brand_data.index:
price_ratios[b] = round(brand_data.loc[b,90000]/brand_data.loc[b,40000],2)
price_ratios = pd.Series(price_ratios)
brand_data["90_40_price_ratio"] = price_ratios
brand_data
mean_price | mean_mileage | 90_40_price_ratio | 5000 | 10000 | 20000 | 30000 | 40000 | 50000 | 60000 | 70000 | 80000 | 90000 | 100000 | 125000 | 150000 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
audi | 9387.175119 | 129056.957282 | 0.62 | 16000.0 | 28000.0 | 27000.0 | 27000.0 | 26000.0 | 23000.0 | 20000.0 | 22000.0 | 16000.0 | 16000.0 | 15000.0 | 12000.0 | 6000.0 |
bmw | 8367.863788 | 132490.677134 | 0.60 | 10000.0 | 31000.0 | 27000.0 | 24000.0 | 25000.0 | 24000.0 | 21000.0 | 17000.0 | 16000.0 | 15000.0 | 13000.0 | 11000.0 | 6000.0 |
fiat | 2830.815946 | 116849.024597 | 0.57 | 5000.0 | 10000.0 | 7000.0 | 8000.0 | 7000.0 | 5000.0 | 5000.0 | 5000.0 | 4000.0 | 4000.0 | 3000.0 | 2000.0 | 2000.0 |
ford | 3790.819059 | 124051.697291 | 0.40 | 8000.0 | 16000.0 | 12000.0 | 13000.0 | 10000.0 | 10000.0 | 7000.0 | 8000.0 | 6000.0 | 4000.0 | 5000.0 | 4000.0 | 2000.0 |
mercedes_benz | 8681.313299 | 130638.035434 | 0.57 | 11000.0 | 29000.0 | 28000.0 | 19000.0 | 28000.0 | 24000.0 | 22000.0 | 17000.0 | 16000.0 | 16000.0 | 14000.0 | 10000.0 | 6000.0 |
opel | 2996.046644 | 129196.917461 | 0.44 | 4000.0 | 12000.0 | 9000.0 | 11000.0 | 9000.0 | 9000.0 | 7000.0 | 8000.0 | 6000.0 | 4000.0 | 5000.0 | 3000.0 | 2000.0 |
peugeot | 3108.575492 | 126867.250182 | 0.40 | 3000.0 | 13000.0 | 5000.0 | 12000.0 | 10000.0 | 8000.0 | 6000.0 | 7000.0 | 5000.0 | 4000.0 | 5000.0 | 4000.0 | 2000.0 |
renault | 2493.956181 | 127984.317343 | 0.57 | 5000.0 | 8000.0 | 9000.0 | 12000.0 | 7000.0 | 7000.0 | 7000.0 | 6000.0 | 4000.0 | 4000.0 | 3000.0 | 2000.0 | 2000.0 |
seat | 4443.226347 | 120838.323353 | 0.50 | 6000.0 | 20000.0 | 11000.0 | 15000.0 | 12000.0 | 10000.0 | 10000.0 | 8000.0 | 7000.0 | 6000.0 | 5000.0 | 4000.0 | 2000.0 |
volkswagen | 5444.281314 | 128550.164813 | 0.56 | 4000.0 | 19000.0 | 17000.0 | 15000.0 | 16000.0 | 13000.0 | 13000.0 | 12000.0 | 11000.0 | 9000.0 | 8000.0 | 6000.0 | 4000.0 |
Notice that the prices for the lowest mileage cars are unusually low, lower than for higher mileage cars. One hypothesis is that these tend to be wrecked cars with unrepaired damage. This could be investigated with the available data.
I've experimented with finding a quick metric for comparing depreciation among brands. For each brand, I've computed the ratio of mean price for 40,000km vehicles with those for 90,000km.
Between 40,000km and 90,000km:
damaged = autos[autos["unrepaired_damage"] == "yes"]
undamaged = autos[autos["unrepaired_damage"] == "no"]
print("Mean Damaged Price:", round(damaged["price"].mean(),-1))
print("Mean Undamaged Price:", round(undamaged["price"].mean(),-1))
Mean Damaged Price: 2260.0 Mean Undamaged Price: 7200.0
Cars with unrepaired damage sell, on average, for less than 1/3 the price of cars without unrepaired damage.
We can also look at price differences for damaged cars on a model by model basis.
autos.groupby(["brand_model","unrepaired_damage"])["price"].mean()
brand_model unrepaired_damage alfa_romeo/145 no 2750.000000 alfa_romeo/147 no 2722.854545 yes 1286.846154 alfa_romeo/156 no 1689.516667 yes 1190.388889 alfa_romeo/159 no 6659.653846 yes 7800.000000 alfa_romeo/other no 8204.046512 yes 3548.625000 alfa_romeo/spider no 9182.925926 audi/100 no 2031.725000 yes 1166.500000 audi/200 no 2250.000000 audi/80 no 1771.739583 yes 641.421053 audi/90 no 1699.600000 audi/a1 no 14793.827160 audi/a2 no 3808.848485 yes 2333.333333 audi/a3 no 9280.805466 yes 2845.285714 audi/a4 no 8183.572204 yes 2886.800000 audi/a5 no 22454.473214 yes 21649.800000 audi/a6 no 9773.868552 yes 3201.715789 audi/a8 no 11933.750000 yes 3780.000000 audi/other no 22860.139175 ... volkswagen/polo yes 988.312139 volkswagen/scirocco no 12298.285714 volkswagen/sharan no 6684.231707 yes 5235.214286 volkswagen/tiguan no 18086.550459 volkswagen/touareg no 16235.134146 yes 20933.333333 volkswagen/touran no 8642.105882 yes 5496.187500 volkswagen/transporter no 11924.129225 yes 4932.614035 volkswagen/up no 8371.279070 yes 6449.333333 volvo/850 no 2280.388889 yes 580.000000 volvo/c_reihe no 8339.076923 yes 2999.000000 volvo/other no 4245.936508 yes 2022.500000 volvo/s60 no 7379.142857 yes 3500.000000 volvo/v40 no 2227.868852 yes 756.461538 volvo/v50 no 5545.375000 yes 3495.000000 volvo/v60 no 21000.000000 volvo/v70 no 5009.258065 yes 1570.615385 volvo/xc_reihe no 13378.673913 yes 11450.000000 Name: price, Length: 524, dtype: float64
Another way of doing it...
damaged_comparison = pd.DataFrame(undamaged.groupby("brand_model")["price"].mean())
damaged_comparison.rename({"price": "mean_undamaged_price"}, axis=1, inplace=True)
damaged_comparison["mean_damaged_price"] = damaged.groupby("brand_model")["price"].mean()
damaged_comparison
mean_undamaged_price | mean_damaged_price | |
---|---|---|
brand_model | ||
alfa_romeo/145 | 2750.000000 | NaN |
alfa_romeo/147 | 2722.854545 | 1286.846154 |
alfa_romeo/156 | 1689.516667 | 1190.388889 |
alfa_romeo/159 | 6659.653846 | 7800.000000 |
alfa_romeo/other | 8204.046512 | 3548.625000 |
alfa_romeo/spider | 9182.925926 | NaN |
audi/100 | 2031.725000 | 1166.500000 |
audi/200 | 2250.000000 | NaN |
audi/80 | 1771.739583 | 641.421053 |
audi/90 | 1699.600000 | NaN |
audi/a1 | 14793.827160 | NaN |
audi/a2 | 3808.848485 | 2333.333333 |
audi/a3 | 9280.805466 | 2845.285714 |
audi/a4 | 8183.572204 | 2886.800000 |
audi/a5 | 22454.473214 | 21649.800000 |
audi/a6 | 9773.868552 | 3201.715789 |
audi/a8 | 11933.750000 | 3780.000000 |
audi/other | 22860.139175 | 4823.000000 |
audi/q3 | 28346.962963 | NaN |
audi/q5 | 26827.660714 | NaN |
audi/q7 | 22040.184211 | 11950.000000 |
audi/tt | 13743.041667 | 7521.111111 |
bmw/1er | 12035.286353 | 6220.645161 |
bmw/3er | 6960.184942 | 2335.140845 |
bmw/5er | 8891.072855 | 3926.722772 |
bmw/6er | 19530.782609 | NaN |
bmw/7er | 10414.252747 | 5389.250000 |
bmw/m_reihe | 28971.975610 | 8900.000000 |
bmw/other | 19302.000000 | 10579.800000 |
bmw/x_reihe | 19259.575290 | 11625.333333 |
... | ... | ... |
volkswagen/beetle | 7549.046729 | 3094.000000 |
volkswagen/bora | 2441.763889 | 1762.416667 |
volkswagen/caddy | 7958.980519 | 4643.125000 |
volkswagen/cc | 15144.352941 | NaN |
volkswagen/eos | 11690.161290 | 7500.000000 |
volkswagen/fox | 2891.621212 | 1818.750000 |
volkswagen/golf | 6114.460083 | 1877.132450 |
volkswagen/jetta | 5339.961538 | NaN |
volkswagen/kaefer | 8983.866667 | 2960.000000 |
volkswagen/lupo | 1546.163462 | 781.853659 |
volkswagen/other | 5389.897059 | 7116.666667 |
volkswagen/passat | 5838.317526 | 2178.945946 |
volkswagen/phaeton | 13704.222222 | NaN |
volkswagen/polo | 3234.969981 | 988.312139 |
volkswagen/scirocco | 12298.285714 | NaN |
volkswagen/sharan | 6684.231707 | 5235.214286 |
volkswagen/tiguan | 18086.550459 | NaN |
volkswagen/touareg | 16235.134146 | 20933.333333 |
volkswagen/touran | 8642.105882 | 5496.187500 |
volkswagen/transporter | 11924.129225 | 4932.614035 |
volkswagen/up | 8371.279070 | 6449.333333 |
volvo/850 | 2280.388889 | 580.000000 |
volvo/c_reihe | 8339.076923 | 2999.000000 |
volvo/other | 4245.936508 | 2022.500000 |
volvo/s60 | 7379.142857 | 3500.000000 |
volvo/v40 | 2227.868852 | 756.461538 |
volvo/v50 | 5545.375000 | 3495.000000 |
volvo/v60 | 21000.000000 | NaN |
volvo/v70 | 5009.258065 | 1570.615385 |
volvo/xc_reihe | 13378.673913 | 11450.000000 |
287 rows × 2 columns
damaged_comparison
mean_undamaged_price | mean_damaged_price | |
---|---|---|
brand_model | ||
alfa_romeo/145 | 2750.000000 | NaN |
alfa_romeo/147 | 2722.854545 | 1286.846154 |
alfa_romeo/156 | 1689.516667 | 1190.388889 |
alfa_romeo/159 | 6659.653846 | 7800.000000 |
alfa_romeo/other | 8204.046512 | 3548.625000 |
alfa_romeo/spider | 9182.925926 | NaN |
audi/100 | 2031.725000 | 1166.500000 |
audi/200 | 2250.000000 | NaN |
audi/80 | 1771.739583 | 641.421053 |
audi/90 | 1699.600000 | NaN |
audi/a1 | 14793.827160 | NaN |
audi/a2 | 3808.848485 | 2333.333333 |
audi/a3 | 9280.805466 | 2845.285714 |
audi/a4 | 8183.572204 | 2886.800000 |
audi/a5 | 22454.473214 | 21649.800000 |
audi/a6 | 9773.868552 | 3201.715789 |
audi/a8 | 11933.750000 | 3780.000000 |
audi/other | 22860.139175 | 4823.000000 |
audi/q3 | 28346.962963 | NaN |
audi/q5 | 26827.660714 | NaN |
audi/q7 | 22040.184211 | 11950.000000 |
audi/tt | 13743.041667 | 7521.111111 |
bmw/1er | 12035.286353 | 6220.645161 |
bmw/3er | 6960.184942 | 2335.140845 |
bmw/5er | 8891.072855 | 3926.722772 |
bmw/6er | 19530.782609 | NaN |
bmw/7er | 10414.252747 | 5389.250000 |
bmw/m_reihe | 28971.975610 | 8900.000000 |
bmw/other | 19302.000000 | 10579.800000 |
bmw/x_reihe | 19259.575290 | 11625.333333 |
... | ... | ... |
volkswagen/beetle | 7549.046729 | 3094.000000 |
volkswagen/bora | 2441.763889 | 1762.416667 |
volkswagen/caddy | 7958.980519 | 4643.125000 |
volkswagen/cc | 15144.352941 | NaN |
volkswagen/eos | 11690.161290 | 7500.000000 |
volkswagen/fox | 2891.621212 | 1818.750000 |
volkswagen/golf | 6114.460083 | 1877.132450 |
volkswagen/jetta | 5339.961538 | NaN |
volkswagen/kaefer | 8983.866667 | 2960.000000 |
volkswagen/lupo | 1546.163462 | 781.853659 |
volkswagen/other | 5389.897059 | 7116.666667 |
volkswagen/passat | 5838.317526 | 2178.945946 |
volkswagen/phaeton | 13704.222222 | NaN |
volkswagen/polo | 3234.969981 | 988.312139 |
volkswagen/scirocco | 12298.285714 | NaN |
volkswagen/sharan | 6684.231707 | 5235.214286 |
volkswagen/tiguan | 18086.550459 | NaN |
volkswagen/touareg | 16235.134146 | 20933.333333 |
volkswagen/touran | 8642.105882 | 5496.187500 |
volkswagen/transporter | 11924.129225 | 4932.614035 |
volkswagen/up | 8371.279070 | 6449.333333 |
volvo/850 | 2280.388889 | 580.000000 |
volvo/c_reihe | 8339.076923 | 2999.000000 |
volvo/other | 4245.936508 | 2022.500000 |
volvo/s60 | 7379.142857 | 3500.000000 |
volvo/v40 | 2227.868852 | 756.461538 |
volvo/v50 | 5545.375000 | 3495.000000 |
volvo/v60 | 21000.000000 | NaN |
volvo/v70 | 5009.258065 | 1570.615385 |
volvo/xc_reihe | 13378.673913 | 11450.000000 |
287 rows × 2 columns
See above ;)