In this guided project, we'll work with a dataset of used cars from eBay Kleinanzeigen, a classifieds section of the German eBay website.
Few modifications are made to the original dataset:
The data dictionary provided with data is as follows:
dateCrawled
- When this ad was first crawled. All field-values are taken from this date.name
- Name of the car.seller
- Whether the seller is private or a dealer.offerType
- The type of listingprice
- The price on the ad to sell the car.abtest
- Whether the listing is included in an A/B test.vehicleType
- The vehicle Type.yearOfRegistration
- The year in which the car was first registered.gearbox
- The transmission type.powerPS
- The power of the car in PS.model
- The car model name.kilometer
- How many kilometers the car has driven.monthOfRegistration
- The month in which the car was first registered.fuelType
- What type of fuel the car uses.brand
- The brand of the car.notRepairedDamage
- If the car has a damage which is not yet repaired.dateCreated
- The date on which the eBay listing was created.nrOfPictures
- The number of pictures in the ad.postalCode
- The postal code for the location of the vehicle.lastSeenOnline
- When the crawler saw this ad last online.The aim of this project is to clean the data and analyze the included used car listings.
We start by importing the needed libraries and reading the dataset into pandas.
# import libraries
import pandas as pd
import numpy as np
# read the file 'autos.csv' into pandas
autos = pd.read_csv('autos.csv', encoding = 'Latin-1')
# render the first few and last few values
# of the pandas object 'autos'
autos
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
5 | 2016-03-21 13:47:45 | Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto... | privat | Angebot | $7,900 | test | bus | 2006 | automatik | 150 | voyager | 150,000km | 4 | diesel | chrysler | NaN | 2016-03-21 00:00:00 | 0 | 22962 | 2016-04-06 09:45:21 |
6 | 2016-03-20 17:55:21 | VW_Golf_III_GT_Special_Electronic_Green_Metall... | privat | Angebot | $300 | test | limousine | 1995 | manuell | 90 | golf | 150,000km | 8 | benzin | volkswagen | NaN | 2016-03-20 00:00:00 | 0 | 31535 | 2016-03-23 02:48:59 |
7 | 2016-03-16 18:55:19 | Golf_IV_1.9_TDI_90PS | privat | Angebot | $1,990 | control | limousine | 1998 | manuell | 90 | golf | 150,000km | 12 | diesel | volkswagen | nein | 2016-03-16 00:00:00 | 0 | 53474 | 2016-04-07 03:17:32 |
8 | 2016-03-22 16:51:34 | Seat_Arosa | privat | Angebot | $250 | test | NaN | 2000 | manuell | 0 | arosa | 150,000km | 10 | NaN | seat | nein | 2016-03-22 00:00:00 | 0 | 7426 | 2016-03-26 18:18:10 |
9 | 2016-03-16 13:47:02 | Renault_Megane_Scenic_1.6e_RT_Klimaanlage | privat | Angebot | $590 | control | bus | 1997 | manuell | 90 | megane | 150,000km | 7 | benzin | renault | nein | 2016-03-16 00:00:00 | 0 | 15749 | 2016-04-06 10:46:35 |
10 | 2016-03-15 01:41:36 | VW_Golf_Tuning_in_siber/grau | privat | Angebot | $999 | test | NaN | 2017 | manuell | 90 | NaN | 150,000km | 4 | benzin | volkswagen | nein | 2016-03-14 00:00:00 | 0 | 86157 | 2016-04-07 03:16:21 |
11 | 2016-03-16 18:45:34 | Mercedes_A140_Motorschaden | privat | Angebot | $350 | control | NaN | 2000 | NaN | 0 | NaN | 150,000km | 0 | benzin | mercedes_benz | NaN | 2016-03-16 00:00:00 | 0 | 17498 | 2016-03-16 18:45:34 |
12 | 2016-03-31 19:48:22 | Smart_smart_fortwo_coupe_softouch_pure_MHD_Pan... | privat | Angebot | $5,299 | control | kleinwagen | 2010 | automatik | 71 | fortwo | 50,000km | 9 | benzin | smart | nein | 2016-03-31 00:00:00 | 0 | 34590 | 2016-04-06 14:17:52 |
13 | 2016-03-23 10:48:32 | Audi_A3_1.6_tuning | privat | Angebot | $1,350 | control | limousine | 1999 | manuell | 101 | a3 | 150,000km | 11 | benzin | audi | nein | 2016-03-23 00:00:00 | 0 | 12043 | 2016-04-01 14:17:13 |
14 | 2016-03-23 11:50:46 | Renault_Clio_3__Dynamique_1.2__16_V;_viele_Ver... | privat | Angebot | $3,999 | test | kleinwagen | 2007 | manuell | 75 | clio | 150,000km | 9 | benzin | renault | NaN | 2016-03-23 00:00:00 | 0 | 81737 | 2016-04-01 15:46:47 |
15 | 2016-04-01 12:06:20 | Corvette_C3_Coupe_T_Top_Crossfire_Injection | privat | Angebot | $18,900 | test | coupe | 1982 | automatik | 203 | NaN | 80,000km | 6 | benzin | sonstige_autos | nein | 2016-04-01 00:00:00 | 0 | 61276 | 2016-04-02 21:10:48 |
16 | 2016-03-16 14:59:02 | Opel_Vectra_B_Kombi | privat | Angebot | $350 | test | kombi | 1999 | manuell | 101 | vectra | 150,000km | 5 | benzin | opel | nein | 2016-03-16 00:00:00 | 0 | 57299 | 2016-03-18 05:29:37 |
17 | 2016-03-29 11:46:22 | Volkswagen_Scirocco_2_G60 | privat | Angebot | $5,500 | test | coupe | 1990 | manuell | 205 | scirocco | 150,000km | 6 | benzin | volkswagen | nein | 2016-03-29 00:00:00 | 0 | 74821 | 2016-04-05 20:46:26 |
18 | 2016-03-26 19:57:44 | Verkaufen_mein_bmw_e36_320_i_touring | privat | Angebot | $300 | control | bus | 1995 | manuell | 150 | 3er | 150,000km | 0 | benzin | bmw | NaN | 2016-03-26 00:00:00 | 0 | 54329 | 2016-04-02 12:16:41 |
19 | 2016-03-17 13:36:21 | mazda_tribute_2.0_mit_gas_und_tuev_neu_2018 | privat | Angebot | $4,150 | control | suv | 2004 | manuell | 124 | andere | 150,000km | 2 | lpg | mazda | nein | 2016-03-17 00:00:00 | 0 | 40878 | 2016-03-17 14:45:58 |
20 | 2016-03-05 19:57:31 | Audi_A4_Avant_1.9_TDI_*6_Gang*AHK*Klimatronik*... | privat | Angebot | $3,500 | test | kombi | 2003 | manuell | 131 | a4 | 150,000km | 5 | diesel | audi | NaN | 2016-03-05 00:00:00 | 0 | 53913 | 2016-03-07 05:46:46 |
21 | 2016-03-06 19:07:10 | Porsche_911_Carrera_4S_Cabrio | privat | Angebot | $41,500 | test | cabrio | 2004 | manuell | 320 | 911 | 150,000km | 4 | benzin | porsche | nein | 2016-03-06 00:00:00 | 0 | 65428 | 2016-04-05 23:46:19 |
22 | 2016-03-28 20:50:54 | MINI_Cooper_S_Cabrio | privat | Angebot | $25,450 | control | cabrio | 2015 | manuell | 184 | cooper | 10,000km | 1 | benzin | mini | nein | 2016-03-28 00:00:00 | 0 | 44789 | 2016-04-01 06:45:30 |
23 | 2016-03-10 19:55:34 | Peugeot_Boxer_2_2_HDi_120_Ps_9_Sitzer_inkl_Klima | privat | Angebot | $7,999 | control | bus | 2010 | manuell | 120 | NaN | 150,000km | 2 | diesel | peugeot | nein | 2016-03-10 00:00:00 | 0 | 30900 | 2016-03-17 08:45:17 |
24 | 2016-04-03 11:57:02 | BMW_535i_xDrive_Sport_Aut. | privat | Angebot | $48,500 | control | limousine | 2014 | automatik | 306 | 5er | 30,000km | 12 | benzin | bmw | nein | 2016-04-03 00:00:00 | 0 | 22547 | 2016-04-07 13:16:50 |
25 | 2016-03-21 21:56:18 | Ford_escort_kombi_an_bastler_mit_ghia_ausstattung | privat | Angebot | $90 | control | kombi | 1996 | manuell | 116 | NaN | 150,000km | 4 | benzin | ford | ja | 2016-03-21 00:00:00 | 0 | 27574 | 2016-04-01 05:16:49 |
26 | 2016-04-03 22:46:28 | Volkswagen_Polo_Fox | privat | Angebot | $777 | control | kleinwagen | 1992 | manuell | 54 | polo | 125,000km | 2 | benzin | volkswagen | nein | 2016-04-03 00:00:00 | 0 | 38110 | 2016-04-05 23:46:48 |
27 | 2016-03-27 18:45:01 | Hat_einer_Ahnung_mit_Ford_Galaxy_HILFE | privat | Angebot | $0 | control | NaN | 2005 | NaN | 0 | NaN | 150,000km | 0 | NaN | ford | NaN | 2016-03-27 00:00:00 | 0 | 66701 | 2016-03-27 18:45:01 |
28 | 2016-03-19 21:56:19 | MINI_Cooper_D | privat | Angebot | $5,250 | control | kleinwagen | 2007 | manuell | 110 | cooper | 150,000km | 7 | diesel | mini | ja | 2016-03-19 00:00:00 | 0 | 15745 | 2016-04-07 14:58:48 |
29 | 2016-04-02 12:45:44 | Mercedes_Benz_E_320_T_CDI_Avantgarde_DPF7_Sitz... | privat | Angebot | $4,999 | test | kombi | 2004 | automatik | 204 | e_klasse | 150,000km | 10 | diesel | mercedes_benz | nein | 2016-04-02 00:00:00 | 0 | 47638 | 2016-04-02 12:45:44 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
49970 | 2016-03-21 22:47:37 | c4_Grand_Picasso_mit_Automatik_Leder_Navi_Temp... | privat | Angebot | $15,800 | control | bus | 2010 | automatik | 136 | c4 | 60,000km | 4 | diesel | citroen | nein | 2016-03-21 00:00:00 | 0 | 14947 | 2016-04-07 04:17:34 |
49971 | 2016-03-29 14:54:12 | W.Lupo_1.0 | privat | Angebot | $950 | test | kleinwagen | 2001 | manuell | 50 | lupo | 150,000km | 4 | benzin | volkswagen | nein | 2016-03-29 00:00:00 | 0 | 65197 | 2016-03-29 20:41:51 |
49972 | 2016-03-26 22:25:23 | Mercedes_Benz_Vito_115_CDI_Extralang_Aut. | privat | Angebot | $3,300 | control | bus | 2004 | automatik | 150 | vito | 150,000km | 10 | diesel | mercedes_benz | ja | 2016-03-26 00:00:00 | 0 | 65326 | 2016-03-28 11:28:18 |
49973 | 2016-03-27 05:32:39 | Mercedes_Benz_SLK_200_Kompressor | privat | Angebot | $6,000 | control | cabrio | 2004 | manuell | 163 | slk | 150,000km | 11 | benzin | mercedes_benz | nein | 2016-03-27 00:00:00 | 0 | 53567 | 2016-03-27 08:25:24 |
49974 | 2016-03-20 10:52:31 | Golf_1_Cabrio_Tuev_Neu_viele_Extras_alles_eing... | privat | Angebot | $0 | control | cabrio | 1983 | manuell | 70 | golf | 150,000km | 2 | benzin | volkswagen | nein | 2016-03-20 00:00:00 | 0 | 8209 | 2016-03-27 19:48:16 |
49975 | 2016-03-27 20:51:39 | Honda_Jazz_1.3_DSi_i_VTEC_IMA_CVT_Comfort | privat | Angebot | $9,700 | control | kleinwagen | 2012 | automatik | 88 | jazz | 100,000km | 11 | hybrid | honda | nein | 2016-03-27 00:00:00 | 0 | 84385 | 2016-04-05 19:45:34 |
49976 | 2016-03-19 18:56:05 | Audi_80_Avant_2.6_E__Vollausstattung!!_Einziga... | privat | Angebot | $5,900 | test | kombi | 1992 | automatik | 150 | 80 | 150,000km | 12 | benzin | audi | nein | 2016-03-19 00:00:00 | 0 | 36100 | 2016-04-07 06:16:44 |
49977 | 2016-03-31 18:37:18 | Mercedes_Benz_C200_Cdi_W203 | privat | Angebot | $5,500 | control | limousine | 2003 | manuell | 116 | c_klasse | 150,000km | 2 | diesel | mercedes_benz | nein | 2016-03-31 00:00:00 | 0 | 33739 | 2016-04-06 12:16:11 |
49978 | 2016-04-04 10:37:14 | Mercedes_Benz_E_200_Classic | privat | Angebot | $900 | control | limousine | 1996 | automatik | 136 | e_klasse | 150,000km | 9 | benzin | mercedes_benz | ja | 2016-04-04 00:00:00 | 0 | 24405 | 2016-04-06 12:44:20 |
49979 | 2016-03-20 18:38:40 | Volkswagen_Polo_1.6_TDI_Style | privat | Angebot | $11,000 | test | kleinwagen | 2011 | manuell | 90 | polo | 70,000km | 11 | diesel | volkswagen | nein | 2016-03-20 00:00:00 | 0 | 48455 | 2016-04-07 01:45:12 |
49980 | 2016-03-12 10:55:54 | Ford_Escort_Turnier_16V | privat | Angebot | $400 | control | kombi | 1995 | manuell | 105 | escort | 125,000km | 3 | benzin | ford | NaN | 2016-03-12 00:00:00 | 0 | 56218 | 2016-04-06 17:16:49 |
49981 | 2016-03-15 09:38:21 | Opel_Astra_Kombi_mit_Anhaengerkupplung | privat | Angebot | $2,000 | control | kombi | 1998 | manuell | 115 | astra | 150,000km | 12 | benzin | opel | nein | 2016-03-15 00:00:00 | 0 | 86859 | 2016-04-05 17:21:46 |
49982 | 2016-03-29 18:51:08 | Skoda_Fabia_4_Tuerer_Bj:2004__85.000Tkm | privat | Angebot | $1,950 | control | kleinwagen | 2004 | manuell | 0 | fabia | 90,000km | 7 | benzin | skoda | NaN | 2016-03-29 00:00:00 | 0 | 45884 | 2016-03-29 18:51:08 |
49983 | 2016-03-06 12:43:04 | Ford_focus_99 | privat | Angebot | $600 | test | kleinwagen | 1999 | manuell | 101 | focus | 150,000km | 4 | benzin | ford | NaN | 2016-03-06 00:00:00 | 0 | 52477 | 2016-03-09 06:16:08 |
49984 | 2016-03-31 22:48:48 | Student_sucht_ein__Anfaengerauto___ab_2000_BJ_... | privat | Angebot | $0 | test | NaN | 2000 | NaN | 0 | NaN | 150,000km | 0 | NaN | sonstige_autos | NaN | 2016-03-31 00:00:00 | 0 | 12103 | 2016-04-02 19:44:53 |
49985 | 2016-04-02 16:38:23 | Verkaufe_meinen_vw_vento! | privat | Angebot | $1,000 | control | NaN | 1995 | automatik | 0 | NaN | 150,000km | 0 | benzin | volkswagen | NaN | 2016-04-02 00:00:00 | 0 | 30900 | 2016-04-06 15:17:52 |
49986 | 2016-04-04 20:46:02 | Chrysler_300C_3.0_CRD_DPF_Automatik_Voll_Ausst... | privat | Angebot | $15,900 | control | limousine | 2010 | automatik | 218 | 300c | 125,000km | 11 | diesel | chrysler | nein | 2016-04-04 00:00:00 | 0 | 73527 | 2016-04-06 23:16:00 |
49987 | 2016-03-22 20:47:27 | Audi_A3_Limousine_2.0_TDI_DPF_Ambition__NAVI__... | privat | Angebot | $21,990 | control | limousine | 2013 | manuell | 150 | a3 | 50,000km | 11 | diesel | audi | nein | 2016-03-22 00:00:00 | 0 | 94362 | 2016-03-26 22:46:06 |
49988 | 2016-03-28 19:49:51 | BMW_330_Ci | privat | Angebot | $9,550 | control | coupe | 2001 | manuell | 231 | 3er | 150,000km | 10 | benzin | bmw | nein | 2016-03-28 00:00:00 | 0 | 83646 | 2016-04-07 02:17:40 |
49989 | 2016-03-11 19:50:37 | VW_Polo_zum_Ausschlachten_oder_Wiederaufbau | privat | Angebot | $150 | test | kleinwagen | 1997 | manuell | 0 | polo | 150,000km | 5 | benzin | volkswagen | ja | 2016-03-11 00:00:00 | 0 | 21244 | 2016-03-12 10:17:55 |
49990 | 2016-03-21 19:54:19 | Mercedes_Benz_A_200__BlueEFFICIENCY__Urban | privat | Angebot | $17,500 | test | limousine | 2012 | manuell | 156 | a_klasse | 30,000km | 12 | benzin | mercedes_benz | nein | 2016-03-21 00:00:00 | 0 | 58239 | 2016-04-06 22:46:57 |
49991 | 2016-03-06 15:25:19 | Kleinwagen | privat | Angebot | $500 | control | NaN | 2016 | manuell | 0 | twingo | 150,000km | 0 | benzin | renault | NaN | 2016-03-06 00:00:00 | 0 | 61350 | 2016-03-06 18:24:19 |
49992 | 2016-03-10 19:37:38 | Fiat_Grande_Punto_1.4_T_Jet_16V_Sport | privat | Angebot | $4,800 | control | kleinwagen | 2009 | manuell | 120 | andere | 125,000km | 9 | lpg | fiat | nein | 2016-03-10 00:00:00 | 0 | 68642 | 2016-03-13 01:44:51 |
49993 | 2016-03-15 18:47:35 | Audi_A3__1_8l__Silber;_schoenes_Fahrzeug | privat | Angebot | $1,650 | control | kleinwagen | 1997 | manuell | 0 | NaN | 150,000km | 7 | benzin | audi | NaN | 2016-03-15 00:00:00 | 0 | 65203 | 2016-04-06 19:46:53 |
49994 | 2016-03-22 17:36:42 | Audi_A6__S6__Avant_4.2_quattro_eventuell_Tausc... | privat | Angebot | $5,000 | control | kombi | 2001 | automatik | 299 | a6 | 150,000km | 1 | benzin | audi | nein | 2016-03-22 00:00:00 | 0 | 46537 | 2016-04-06 08:16:39 |
49995 | 2016-03-27 14:38:19 | Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon | privat | Angebot | $24,900 | control | limousine | 2011 | automatik | 239 | q5 | 100,000km | 1 | diesel | audi | nein | 2016-03-27 00:00:00 | 0 | 82131 | 2016-04-01 13:47:40 |
49996 | 2016-03-28 10:50:25 | Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... | privat | Angebot | $1,980 | control | cabrio | 1996 | manuell | 75 | astra | 150,000km | 5 | benzin | opel | nein | 2016-03-28 00:00:00 | 0 | 44807 | 2016-04-02 14:18:02 |
49997 | 2016-04-02 14:44:48 | Fiat_500_C_1.2_Dualogic_Lounge | privat | Angebot | $13,200 | test | cabrio | 2014 | automatik | 69 | 500 | 5,000km | 11 | benzin | fiat | nein | 2016-04-02 00:00:00 | 0 | 73430 | 2016-04-04 11:47:27 |
49998 | 2016-03-08 19:25:42 | Audi_A3_2.0_TDI_Sportback_Ambition | privat | Angebot | $22,900 | control | kombi | 2013 | manuell | 150 | a3 | 40,000km | 11 | diesel | audi | nein | 2016-03-08 00:00:00 | 0 | 35683 | 2016-04-05 16:45:07 |
49999 | 2016-03-14 00:42:12 | Opel_Vectra_1.6_16V | privat | Angebot | $1,250 | control | limousine | 1996 | manuell | 101 | vectra | 150,000km | 1 | benzin | opel | nein | 2016-03-13 00:00:00 | 0 | 45897 | 2016-04-06 21:18:48 |
50000 rows × 20 columns
autos.info()
autos.head()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): dateCrawled 50000 non-null object name 50000 non-null object seller 50000 non-null object offerType 50000 non-null object price 50000 non-null object abtest 50000 non-null object vehicleType 44905 non-null object yearOfRegistration 50000 non-null int64 gearbox 47320 non-null object powerPS 50000 non-null int64 model 47242 non-null object odometer 50000 non-null object monthOfRegistration 50000 non-null int64 fuelType 45518 non-null object brand 50000 non-null object notRepairedDamage 40171 non-null object dateCreated 50000 non-null object nrOfPictures 50000 non-null int64 postalCode 50000 non-null int64 lastSeen 50000 non-null object dtypes: int64(5), object(15) memory usage: 7.6+ MB
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
Few observations on the programming side:
print(autos.info())
and print(autos.head())
.print(autos.head())
then the output is not as nice as when you just do autos.head()
. With autos.head()
, the output is in a nice table, as we above.Few observations based on the output (not sure how relevant):
Will convert here column names from camelCase to snamecase. We will also reword some of the column names based on the data dictionary to make them more descriptive. So, we will do the following:
yearOfRegistration
to registration_year
.dateCreated
to ad_created
.# get column names
autos_cols = autos.columns
# print them, then cut and paste result
# to make the changes easier
print(autos_cols)
# rename columns
autos_cols = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
'odometer', 'registration_month', 'fuel_type', 'brand',
'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code',
'last_seen']
# now put the renamed columns back to the data frame
autos.columns = autos_cols
# chec
autos.head()
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model', 'odometer', 'monthOfRegistration', 'fuelType', 'brand', 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object')
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
Now will do some data exploration to determine what extra cleaning tasks to do. We will look for:
autos.describe(include = 'all')
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-03-23 18:39:34 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
Based on the output above, we see the following:
nr_of_pictures
column as all the numbers there are zero.registration_year
is odd: min = 1000, max = 9999 and std = 105. This needs more cleaning.power_ps
has minimum of zero. Need to look at.registration_month
has minimum of zero instead of 1, so need to clean this.price
zero.vehicle_type
is 44905. So some are NaN or null.gearbox
count is 47320. So some are NaN or null.power_ps
.odometer
looks to be saved as text (because of km and comma separator), this is why it could not calculate means,...etc.price
is also stored as text (because of the \$ sign and the comma separator).seller
values are mostly the same.offer_type
values are mostly the same.Will start with cleaning price
and odometer
and will convert the numbers to float
.
autos['price'] = (autos['price'].str.replace('$','')
.str.replace(',','')
.astype(float))
autos['odometer'] = (autos['odometer']
.str.replace('km','')
.str.replace(',','')
.astype(float))
# rename odometer column to odometer_km
autos.rename(columns={'odometer' : 'odometer_km'}, inplace=True)
# check
autos[['price', 'odometer_km']].head()
price | odometer_km | |
---|---|---|
0 | 5000.0 | 150000.0 |
1 | 8500.0 | 150000.0 |
2 | 8990.0 | 70000.0 |
3 | 4350.0 | 70000.0 |
4 | 1350.0 | 150000.0 |
print(autos['odometer_km'].unique().shape)
autos['odometer_km'].describe()
(13,)
count 50000.000000 mean 125732.700000 std 40042.211706 min 5000.000000 25% 125000.000000 50% 150000.000000 75% 150000.000000 max 150000.000000 Name: odometer_km, dtype: float64
autos['odometer_km'].value_counts()
150000.0 32424 125000.0 5170 100000.0 2169 90000.0 1757 80000.0 1436 70000.0 1230 60000.0 1164 50000.0 1027 5000.0 967 40000.0 819 30000.0 789 20000.0 784 10000.0 264 Name: odometer_km, dtype: int64
print(autos['price'].unique().shape)
autos['price'].describe()
(2357,)
count 5.000000e+04 mean 9.840044e+03 std 4.811044e+05 min 0.000000e+00 25% 1.100000e+03 50% 2.950000e+03 75% 7.200000e+03 max 1.000000e+08 Name: price, dtype: float64
autos['price'].value_counts().head()
0.0 1421 500.0 781 1500.0 734 2500.0 643 1200.0 639 Name: price, dtype: int64
autos['price'].value_counts().sort_index(ascending = False).head(20)
99999999.0 1 27322222.0 1 12345678.0 3 11111111.0 2 10000000.0 1 3890000.0 1 1300000.0 1 1234566.0 1 999999.0 2 999990.0 1 350000.0 1 345000.0 1 299000.0 1 295000.0 1 265000.0 1 259000.0 1 250000.0 1 220000.0 1 198000.0 1 197000.0 1 Name: price, dtype: int64
There are outliers in odometer_km
and price
. Which one to choose? I will choose the one on price
. Generally, the larger the price the larger the odometer reading, although few exceptions may arise, but one can live with them!
In the above, the price increases by a small factor from one price to the one immediately above it (for example, \$198000 to $197000 is about 1.005) until we get to $350000 to $999990, where the factor is about 2.86, and then one gets to the really expensive cars range, regardless of the factor between the numbers. So I will take the cutoff at $350000 and will call every car more than that as an outlier.
To get rid o these outliers, I will fill these particular cells with NaN and then remove these cells.
autos[autos['price'] > 350000] = np.NaN
(autos['price'].value_counts()
.sort_index(ascending = False)
.head(10))
350000.0 1 345000.0 1 299000.0 1 295000.0 1 265000.0 1 259000.0 1 250000.0 1 220000.0 1 198000.0 1 197000.0 1 Name: price, dtype: int64
# remove ROWS through 'dropna()'
autos.dropna()
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | 5000.0 | control | bus | 2004.0 | manuell | 158.0 | andere | 150000.0 | 3.0 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0.0 | 79588.0 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | 8500.0 | control | limousine | 1997.0 | automatik | 286.0 | 7er | 150000.0 | 6.0 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0.0 | 71034.0 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | 8990.0 | test | limousine | 2009.0 | manuell | 102.0 | golf | 70000.0 | 7.0 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0.0 | 35394.0 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | 4350.0 | control | kleinwagen | 2007.0 | automatik | 71.0 | fortwo | 70000.0 | 6.0 | benzin | smart | nein | 2016-03-12 00:00:00 | 0.0 | 33729.0 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | 1350.0 | test | kombi | 2003.0 | manuell | 0.0 | focus | 150000.0 | 7.0 | benzin | ford | nein | 2016-04-01 00:00:00 | 0.0 | 39218.0 | 2016-04-01 14:38:50 |
7 | 2016-03-16 18:55:19 | Golf_IV_1.9_TDI_90PS | privat | Angebot | 1990.0 | control | limousine | 1998.0 | manuell | 90.0 | golf | 150000.0 | 12.0 | diesel | volkswagen | nein | 2016-03-16 00:00:00 | 0.0 | 53474.0 | 2016-04-07 03:17:32 |
9 | 2016-03-16 13:47:02 | Renault_Megane_Scenic_1.6e_RT_Klimaanlage | privat | Angebot | 590.0 | control | bus | 1997.0 | manuell | 90.0 | megane | 150000.0 | 7.0 | benzin | renault | nein | 2016-03-16 00:00:00 | 0.0 | 15749.0 | 2016-04-06 10:46:35 |
12 | 2016-03-31 19:48:22 | Smart_smart_fortwo_coupe_softouch_pure_MHD_Pan... | privat | Angebot | 5299.0 | control | kleinwagen | 2010.0 | automatik | 71.0 | fortwo | 50000.0 | 9.0 | benzin | smart | nein | 2016-03-31 00:00:00 | 0.0 | 34590.0 | 2016-04-06 14:17:52 |
13 | 2016-03-23 10:48:32 | Audi_A3_1.6_tuning | privat | Angebot | 1350.0 | control | limousine | 1999.0 | manuell | 101.0 | a3 | 150000.0 | 11.0 | benzin | audi | nein | 2016-03-23 00:00:00 | 0.0 | 12043.0 | 2016-04-01 14:17:13 |
16 | 2016-03-16 14:59:02 | Opel_Vectra_B_Kombi | privat | Angebot | 350.0 | test | kombi | 1999.0 | manuell | 101.0 | vectra | 150000.0 | 5.0 | benzin | opel | nein | 2016-03-16 00:00:00 | 0.0 | 57299.0 | 2016-03-18 05:29:37 |
17 | 2016-03-29 11:46:22 | Volkswagen_Scirocco_2_G60 | privat | Angebot | 5500.0 | test | coupe | 1990.0 | manuell | 205.0 | scirocco | 150000.0 | 6.0 | benzin | volkswagen | nein | 2016-03-29 00:00:00 | 0.0 | 74821.0 | 2016-04-05 20:46:26 |
19 | 2016-03-17 13:36:21 | mazda_tribute_2.0_mit_gas_und_tuev_neu_2018 | privat | Angebot | 4150.0 | control | suv | 2004.0 | manuell | 124.0 | andere | 150000.0 | 2.0 | lpg | mazda | nein | 2016-03-17 00:00:00 | 0.0 | 40878.0 | 2016-03-17 14:45:58 |
21 | 2016-03-06 19:07:10 | Porsche_911_Carrera_4S_Cabrio | privat | Angebot | 41500.0 | test | cabrio | 2004.0 | manuell | 320.0 | 911 | 150000.0 | 4.0 | benzin | porsche | nein | 2016-03-06 00:00:00 | 0.0 | 65428.0 | 2016-04-05 23:46:19 |
22 | 2016-03-28 20:50:54 | MINI_Cooper_S_Cabrio | privat | Angebot | 25450.0 | control | cabrio | 2015.0 | manuell | 184.0 | cooper | 10000.0 | 1.0 | benzin | mini | nein | 2016-03-28 00:00:00 | 0.0 | 44789.0 | 2016-04-01 06:45:30 |
24 | 2016-04-03 11:57:02 | BMW_535i_xDrive_Sport_Aut. | privat | Angebot | 48500.0 | control | limousine | 2014.0 | automatik | 306.0 | 5er | 30000.0 | 12.0 | benzin | bmw | nein | 2016-04-03 00:00:00 | 0.0 | 22547.0 | 2016-04-07 13:16:50 |
26 | 2016-04-03 22:46:28 | Volkswagen_Polo_Fox | privat | Angebot | 777.0 | control | kleinwagen | 1992.0 | manuell | 54.0 | polo | 125000.0 | 2.0 | benzin | volkswagen | nein | 2016-04-03 00:00:00 | 0.0 | 38110.0 | 2016-04-05 23:46:48 |
28 | 2016-03-19 21:56:19 | MINI_Cooper_D | privat | Angebot | 5250.0 | control | kleinwagen | 2007.0 | manuell | 110.0 | cooper | 150000.0 | 7.0 | diesel | mini | ja | 2016-03-19 00:00:00 | 0.0 | 15745.0 | 2016-04-07 14:58:48 |
29 | 2016-04-02 12:45:44 | Mercedes_Benz_E_320_T_CDI_Avantgarde_DPF7_Sitz... | privat | Angebot | 4999.0 | test | kombi | 2004.0 | automatik | 204.0 | e_klasse | 150000.0 | 10.0 | diesel | mercedes_benz | nein | 2016-04-02 00:00:00 | 0.0 | 47638.0 | 2016-04-02 12:45:44 |
30 | 2016-03-14 11:47:31 | Peugeot_206_Unfallfahrzeug | privat | Angebot | 80.0 | test | kleinwagen | 2002.0 | manuell | 60.0 | 2_reihe | 150000.0 | 6.0 | benzin | peugeot | ja | 2016-03-14 00:00:00 | 0.0 | 57076.0 | 2016-03-14 11:47:31 |
31 | 2016-03-14 16:53:09 | Noch_gut_erhaltenen_C_320 | privat | Angebot | 2850.0 | test | kombi | 2002.0 | automatik | 218.0 | c_klasse | 150000.0 | 7.0 | benzin | mercedes_benz | nein | 2016-03-14 00:00:00 | 0.0 | 41065.0 | 2016-03-16 07:19:04 |
33 | 2016-03-15 12:25:26 | Audi_A3_2.0_TDI_Sportback_DPF_Kupplung_defekt_... | privat | Angebot | 4800.0 | test | kombi | 2006.0 | manuell | 140.0 | a3 | 150000.0 | 8.0 | diesel | audi | nein | 2016-03-15 00:00:00 | 0.0 | 61169.0 | 2016-04-07 13:15:31 |
35 | 2016-03-29 21:52:56 | Ford_Mondeo | privat | Angebot | 1200.0 | test | kombi | 1998.0 | manuell | 0.0 | mondeo | 150000.0 | 12.0 | benzin | ford | nein | 2016-03-29 00:00:00 | 0.0 | 52428.0 | 2016-04-06 08:45:35 |
36 | 2016-03-16 23:46:31 | Ford_Mondeo_2.0_Turnier_TDCi_DPF_Titanium | privat | Angebot | 7500.0 | test | kombi | 2007.0 | manuell | 131.0 | mondeo | 150000.0 | 11.0 | diesel | ford | nein | 2016-03-16 00:00:00 | 0.0 | 34246.0 | 2016-04-05 16:21:16 |
37 | 2016-03-23 16:47:46 | Seat_Altea_XL_2.0_TDI_CR_DPF_Style | privat | Angebot | 13800.0 | test | bus | 2012.0 | manuell | 140.0 | altea | 90000.0 | 2.0 | diesel | seat | nein | 2016-03-23 00:00:00 | 0.0 | 85113.0 | 2016-04-05 20:17:53 |
38 | 2016-03-21 15:51:10 | Volkswagen_Golf_1.4_Special | privat | Angebot | 2850.0 | control | limousine | 2002.0 | manuell | 75.0 | golf | 125000.0 | 2.0 | benzin | volkswagen | nein | 2016-03-21 00:00:00 | 0.0 | 63674.0 | 2016-03-28 12:16:06 |
39 | 2016-03-11 13:58:38 | Mercedes_C_180_Kombi__zu_verkaufen | privat | Angebot | 1450.0 | test | kombi | 1999.0 | automatik | 122.0 | c_klasse | 150000.0 | 6.0 | benzin | mercedes_benz | nein | 2016-03-11 00:00:00 | 0.0 | 12683.0 | 2016-04-06 15:17:52 |
40 | 2016-03-07 14:50:03 | VW_Golf__4_Cabrio_2.0_GTI_16V___Leder_MFA_Alus... | privat | Angebot | 3500.0 | control | cabrio | 1999.0 | manuell | 150.0 | golf | 150000.0 | 1.0 | benzin | volkswagen | nein | 2016-03-07 00:00:00 | 0.0 | 6780.0 | 2016-03-12 02:15:52 |
43 | 2016-03-08 16:51:22 | Audi_A1_1.2_TFSI_Attraction_mit_Garantie | privat | Angebot | 11500.0 | test | kleinwagen | 2011.0 | manuell | 86.0 | a1 | 60000.0 | 9.0 | benzin | audi | nein | 2016-03-08 00:00:00 | 0.0 | 85049.0 | 2016-04-07 03:16:05 |
44 | 2016-03-12 10:38:06 | Zu_verkaufen | privat | Angebot | 900.0 | control | kleinwagen | 1998.0 | manuell | 43.0 | twingo | 150000.0 | 7.0 | benzin | renault | nein | 2016-03-12 00:00:00 | 0.0 | 79725.0 | 2016-04-06 01:17:27 |
45 | 2016-03-12 17:39:55 | Toyota_Starlet_1.3 | privat | Angebot | 1000.0 | test | kleinwagen | 1995.0 | manuell | 75.0 | andere | 150000.0 | 3.0 | benzin | toyota | nein | 2016-03-12 00:00:00 | 0.0 | 74597.0 | 2016-04-07 08:45:19 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
49959 | 2016-03-28 21:53:21 | Mercedes_Benz_C_180_T_Kompressor_Classic | privat | Angebot | 4200.0 | control | kombi | 2003.0 | manuell | 143.0 | c_klasse | 150000.0 | 12.0 | benzin | mercedes_benz | nein | 2016-03-28 00:00:00 | 0.0 | 53844.0 | 2016-04-07 05:16:23 |
49962 | 2016-03-14 17:57:15 | Mitsubishi_Space_Star_1__3_L__Bj_2004_Standhei... | privat | Angebot | 2200.0 | test | limousine | 2004.0 | manuell | 82.0 | andere | 150000.0 | 4.0 | benzin | mitsubishi | ja | 2016-03-14 00:00:00 | 0.0 | 45481.0 | 2016-03-15 13:45:08 |
49963 | 2016-03-26 19:52:59 | Mercedes_Benz_B_200_CDI_Special_Edition | privat | Angebot | 7600.0 | control | bus | 2007.0 | manuell | 140.0 | b_klasse | 125000.0 | 5.0 | diesel | mercedes_benz | nein | 2016-03-26 00:00:00 | 0.0 | 92272.0 | 2016-03-28 09:45:44 |
49965 | 2016-03-11 10:53:51 | Opel_Astra_1.6_Lenkradheizung~Sitzheizung | privat | Angebot | 6700.0 | test | limousine | 2011.0 | manuell | 116.0 | astra | 125000.0 | 3.0 | benzin | opel | nein | 2016-03-11 00:00:00 | 0.0 | 37627.0 | 2016-03-15 01:44:57 |
49966 | 2016-04-02 19:49:19 | Citroën_C1_1.0_**Euro4**TÜV_OKT_2017**Scheiten... | privat | Angebot | 1490.0 | control | kleinwagen | 2006.0 | manuell | 68.0 | c1 | 150000.0 | 7.0 | benzin | citroen | ja | 2016-04-02 00:00:00 | 0.0 | 26603.0 | 2016-04-02 19:49:19 |
49968 | 2016-04-01 17:49:15 | Mercedes_Benz_190_D_2.5_Automatik | privat | Angebot | 2100.0 | test | limousine | 1986.0 | automatik | 90.0 | andere | 150000.0 | 9.0 | diesel | mercedes_benz | nein | 2016-04-01 00:00:00 | 0.0 | 40227.0 | 2016-04-05 13:16:35 |
49969 | 2016-03-17 18:49:02 | Nissan_X_Trail_2.2_dCi_4x4_Sport_m.AHZ | privat | Angebot | 4500.0 | control | suv | 2005.0 | manuell | 136.0 | x_trail | 150000.0 | 5.0 | diesel | nissan | nein | 2016-03-17 00:00:00 | 0.0 | 17379.0 | 2016-03-25 23:18:15 |
49970 | 2016-03-21 22:47:37 | c4_Grand_Picasso_mit_Automatik_Leder_Navi_Temp... | privat | Angebot | 15800.0 | control | bus | 2010.0 | automatik | 136.0 | c4 | 60000.0 | 4.0 | diesel | citroen | nein | 2016-03-21 00:00:00 | 0.0 | 14947.0 | 2016-04-07 04:17:34 |
49971 | 2016-03-29 14:54:12 | W.Lupo_1.0 | privat | Angebot | 950.0 | test | kleinwagen | 2001.0 | manuell | 50.0 | lupo | 150000.0 | 4.0 | benzin | volkswagen | nein | 2016-03-29 00:00:00 | 0.0 | 65197.0 | 2016-03-29 20:41:51 |
49972 | 2016-03-26 22:25:23 | Mercedes_Benz_Vito_115_CDI_Extralang_Aut. | privat | Angebot | 3300.0 | control | bus | 2004.0 | automatik | 150.0 | vito | 150000.0 | 10.0 | diesel | mercedes_benz | ja | 2016-03-26 00:00:00 | 0.0 | 65326.0 | 2016-03-28 11:28:18 |
49973 | 2016-03-27 05:32:39 | Mercedes_Benz_SLK_200_Kompressor | privat | Angebot | 6000.0 | control | cabrio | 2004.0 | manuell | 163.0 | slk | 150000.0 | 11.0 | benzin | mercedes_benz | nein | 2016-03-27 00:00:00 | 0.0 | 53567.0 | 2016-03-27 08:25:24 |
49974 | 2016-03-20 10:52:31 | Golf_1_Cabrio_Tuev_Neu_viele_Extras_alles_eing... | privat | Angebot | 0.0 | control | cabrio | 1983.0 | manuell | 70.0 | golf | 150000.0 | 2.0 | benzin | volkswagen | nein | 2016-03-20 00:00:00 | 0.0 | 8209.0 | 2016-03-27 19:48:16 |
49975 | 2016-03-27 20:51:39 | Honda_Jazz_1.3_DSi_i_VTEC_IMA_CVT_Comfort | privat | Angebot | 9700.0 | control | kleinwagen | 2012.0 | automatik | 88.0 | jazz | 100000.0 | 11.0 | hybrid | honda | nein | 2016-03-27 00:00:00 | 0.0 | 84385.0 | 2016-04-05 19:45:34 |
49976 | 2016-03-19 18:56:05 | Audi_80_Avant_2.6_E__Vollausstattung!!_Einziga... | privat | Angebot | 5900.0 | test | kombi | 1992.0 | automatik | 150.0 | 80 | 150000.0 | 12.0 | benzin | audi | nein | 2016-03-19 00:00:00 | 0.0 | 36100.0 | 2016-04-07 06:16:44 |
49977 | 2016-03-31 18:37:18 | Mercedes_Benz_C200_Cdi_W203 | privat | Angebot | 5500.0 | control | limousine | 2003.0 | manuell | 116.0 | c_klasse | 150000.0 | 2.0 | diesel | mercedes_benz | nein | 2016-03-31 00:00:00 | 0.0 | 33739.0 | 2016-04-06 12:16:11 |
49978 | 2016-04-04 10:37:14 | Mercedes_Benz_E_200_Classic | privat | Angebot | 900.0 | control | limousine | 1996.0 | automatik | 136.0 | e_klasse | 150000.0 | 9.0 | benzin | mercedes_benz | ja | 2016-04-04 00:00:00 | 0.0 | 24405.0 | 2016-04-06 12:44:20 |
49979 | 2016-03-20 18:38:40 | Volkswagen_Polo_1.6_TDI_Style | privat | Angebot | 11000.0 | test | kleinwagen | 2011.0 | manuell | 90.0 | polo | 70000.0 | 11.0 | diesel | volkswagen | nein | 2016-03-20 00:00:00 | 0.0 | 48455.0 | 2016-04-07 01:45:12 |
49981 | 2016-03-15 09:38:21 | Opel_Astra_Kombi_mit_Anhaengerkupplung | privat | Angebot | 2000.0 | control | kombi | 1998.0 | manuell | 115.0 | astra | 150000.0 | 12.0 | benzin | opel | nein | 2016-03-15 00:00:00 | 0.0 | 86859.0 | 2016-04-05 17:21:46 |
49986 | 2016-04-04 20:46:02 | Chrysler_300C_3.0_CRD_DPF_Automatik_Voll_Ausst... | privat | Angebot | 15900.0 | control | limousine | 2010.0 | automatik | 218.0 | 300c | 125000.0 | 11.0 | diesel | chrysler | nein | 2016-04-04 00:00:00 | 0.0 | 73527.0 | 2016-04-06 23:16:00 |
49987 | 2016-03-22 20:47:27 | Audi_A3_Limousine_2.0_TDI_DPF_Ambition__NAVI__... | privat | Angebot | 21990.0 | control | limousine | 2013.0 | manuell | 150.0 | a3 | 50000.0 | 11.0 | diesel | audi | nein | 2016-03-22 00:00:00 | 0.0 | 94362.0 | 2016-03-26 22:46:06 |
49988 | 2016-03-28 19:49:51 | BMW_330_Ci | privat | Angebot | 9550.0 | control | coupe | 2001.0 | manuell | 231.0 | 3er | 150000.0 | 10.0 | benzin | bmw | nein | 2016-03-28 00:00:00 | 0.0 | 83646.0 | 2016-04-07 02:17:40 |
49989 | 2016-03-11 19:50:37 | VW_Polo_zum_Ausschlachten_oder_Wiederaufbau | privat | Angebot | 150.0 | test | kleinwagen | 1997.0 | manuell | 0.0 | polo | 150000.0 | 5.0 | benzin | volkswagen | ja | 2016-03-11 00:00:00 | 0.0 | 21244.0 | 2016-03-12 10:17:55 |
49990 | 2016-03-21 19:54:19 | Mercedes_Benz_A_200__BlueEFFICIENCY__Urban | privat | Angebot | 17500.0 | test | limousine | 2012.0 | manuell | 156.0 | a_klasse | 30000.0 | 12.0 | benzin | mercedes_benz | nein | 2016-03-21 00:00:00 | 0.0 | 58239.0 | 2016-04-06 22:46:57 |
49992 | 2016-03-10 19:37:38 | Fiat_Grande_Punto_1.4_T_Jet_16V_Sport | privat | Angebot | 4800.0 | control | kleinwagen | 2009.0 | manuell | 120.0 | andere | 125000.0 | 9.0 | lpg | fiat | nein | 2016-03-10 00:00:00 | 0.0 | 68642.0 | 2016-03-13 01:44:51 |
49994 | 2016-03-22 17:36:42 | Audi_A6__S6__Avant_4.2_quattro_eventuell_Tausc... | privat | Angebot | 5000.0 | control | kombi | 2001.0 | automatik | 299.0 | a6 | 150000.0 | 1.0 | benzin | audi | nein | 2016-03-22 00:00:00 | 0.0 | 46537.0 | 2016-04-06 08:16:39 |
49995 | 2016-03-27 14:38:19 | Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon | privat | Angebot | 24900.0 | control | limousine | 2011.0 | automatik | 239.0 | q5 | 100000.0 | 1.0 | diesel | audi | nein | 2016-03-27 00:00:00 | 0.0 | 82131.0 | 2016-04-01 13:47:40 |
49996 | 2016-03-28 10:50:25 | Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... | privat | Angebot | 1980.0 | control | cabrio | 1996.0 | manuell | 75.0 | astra | 150000.0 | 5.0 | benzin | opel | nein | 2016-03-28 00:00:00 | 0.0 | 44807.0 | 2016-04-02 14:18:02 |
49997 | 2016-04-02 14:44:48 | Fiat_500_C_1.2_Dualogic_Lounge | privat | Angebot | 13200.0 | test | cabrio | 2014.0 | automatik | 69.0 | 500 | 5000.0 | 11.0 | benzin | fiat | nein | 2016-04-02 00:00:00 | 0.0 | 73430.0 | 2016-04-04 11:47:27 |
49998 | 2016-03-08 19:25:42 | Audi_A3_2.0_TDI_Sportback_Ambition | privat | Angebot | 22900.0 | control | kombi | 2013.0 | manuell | 150.0 | a3 | 40000.0 | 11.0 | diesel | audi | nein | 2016-03-08 00:00:00 | 0.0 | 35683.0 | 2016-04-05 16:45:07 |
49999 | 2016-03-14 00:42:12 | Opel_Vectra_1.6_16V | privat | Angebot | 1250.0 | control | limousine | 1996.0 | manuell | 101.0 | vectra | 150000.0 | 1.0 | benzin | opel | nein | 2016-03-13 00:00:00 | 0.0 | 45897.0 | 2016-04-06 21:18:48 |
35043 rows × 20 columns
From autos()
info above, data_crawled
, last_seen
and ad_created
are identified as objects, therefore, there are strings and possbily some NaN. We need to convert them to numbers to understand their distributions. Let us look at the first 5 rows.
autos[['date_crawled', 'ad_created', 'last_seen']][0:5]
date_crawled | ad_created | last_seen | |
---|---|---|---|
0 | 2016-03-26 17:47:46 | 2016-03-26 00:00:00 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | 2016-04-04 00:00:00 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | 2016-03-26 00:00:00 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | 2016-03-12 00:00:00 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | 2016-04-01 00:00:00 | 2016-04-01 14:38:50 |
Each day in the data frame above is made of 10 characters. Get these 10 characters!
autos['date_crawled'].str[:10].head()
0 2016-03-26 1 2016-04-04 2 2016-03-26 3 2016-03-12 4 2016-04-01 Name: date_crawled, dtype: object
# include missing values and use percentages
# instead of counts
(autos['date_crawled']
.str[:10]
.value_counts(normalize = True, dropna = False)
.sort_index(ascending = True))
2016-03-05 0.02538 2016-03-06 0.01394 2016-03-07 0.03596 2016-03-08 0.03326 2016-03-09 0.03320 2016-03-10 0.03212 2016-03-11 0.03248 2016-03-12 0.03676 2016-03-13 0.01556 2016-03-14 0.03662 2016-03-15 0.03398 2016-03-16 0.02950 2016-03-17 0.03150 2016-03-18 0.01306 2016-03-19 0.03490 2016-03-20 0.03782 2016-03-21 0.03748 2016-03-22 0.03290 2016-03-23 0.03238 2016-03-24 0.02910 2016-03-25 0.03174 2016-03-26 0.03248 2016-03-27 0.03104 2016-03-28 0.03484 2016-03-29 0.03414 2016-03-30 0.03362 2016-03-31 0.03190 2016-04-01 0.03380 2016-04-02 0.03540 2016-04-03 0.03868 2016-04-04 0.03648 2016-04-05 0.01310 2016-04-06 0.00318 2016-04-07 0.00142 NaN 0.00028 Name: date_crawled, dtype: float64
(autos['date_crawled']
.str[:10]
.value_counts(normalize = True, dropna = False)
.sort_index(ascending = True))
print('with dropping NaN, min and max are:')
(print(autos['date_crawled']
.str[:10]
.value_counts(normalize = True)
.sort_index(ascending = True).min()))
(print(autos['date_crawled']
.str[:10]
.value_counts(normalize = True)
.sort_index(ascending = True).max()))
print('\n','without dropping NaN, min and max are:')
(print(autos['date_crawled']
.str[:10]
.value_counts(normalize = True, dropna = False)
.sort_index(ascending = True).min()))
(print(autos['date_crawled']
.str[:10]
.value_counts(normalize = True, dropna = False)
.sort_index(ascending = True).max()))
(print('\n', 'descritopn \n', autos['date_crawled']
.str[:10]
.value_counts(normalize = True, dropna = False)
.sort_index(ascending = True).describe()))
with dropping NaN, min and max are: 0.0014203977113591807 0.03869083343336134 without dropping NaN, min and max are: 0.00028 0.03868 descritopn count 35.000000 mean 0.028571 std 0.010803 min 0.000280 25% 0.029300 50% 0.032480 75% 0.034870 max 0.038680 Name: date_crawled, dtype: float64
So without dropping the NaN, max relative frequency is for 2016-04-03 and min is for 2016-04-07. When NaN is dropped, max is the same date (naturally) and min is NaN (next min is 2016-04-07).
# do the same for ad_created
(print(autos['ad_created']
.str[:10]
.value_counts(normalize = True, dropna = False)
.sort_index(ascending = True)))
(print('min, max are \n', autos['ad_created']
.str[:10]
.value_counts(normalize = True, dropna = False)
.sort_index(ascending = True).min()))
(print(autos['ad_created']
.str[:10]
.value_counts(normalize = True, dropna = False)
.sort_index(ascending = True).max()))
(print('\n', 'descritopn \n', autos['ad_created']
.str[:10]
.value_counts(normalize = True, dropna = False)
.sort_index(ascending = True).describe()))
2015-06-11 0.00002 2015-08-10 0.00002 2015-09-09 0.00002 2015-11-10 0.00002 2015-12-05 0.00002 2015-12-30 0.00002 2016-01-03 0.00002 2016-01-07 0.00002 2016-01-10 0.00004 2016-01-13 0.00002 2016-01-14 0.00002 2016-01-16 0.00002 2016-01-22 0.00002 2016-01-27 0.00006 2016-01-29 0.00002 2016-02-01 0.00002 2016-02-02 0.00004 2016-02-05 0.00004 2016-02-07 0.00002 2016-02-08 0.00002 2016-02-09 0.00004 2016-02-11 0.00002 2016-02-12 0.00006 2016-02-14 0.00004 2016-02-16 0.00002 2016-02-17 0.00002 2016-02-18 0.00004 2016-02-19 0.00006 2016-02-20 0.00004 2016-02-21 0.00006 ... 2016-03-10 0.03186 2016-03-11 0.03278 2016-03-12 0.03660 2016-03-13 0.01692 2016-03-14 0.03522 2016-03-15 0.03374 2016-03-16 0.03000 2016-03-17 0.03118 2016-03-18 0.01372 2016-03-19 0.03384 2016-03-20 0.03786 2016-03-21 0.03768 2016-03-22 0.03276 2016-03-23 0.03218 2016-03-24 0.02908 2016-03-25 0.03188 2016-03-26 0.03256 2016-03-27 0.03090 2016-03-28 0.03496 2016-03-29 0.03410 2016-03-30 0.03344 2016-03-31 0.03190 2016-04-01 0.03380 2016-04-02 0.03508 2016-04-03 0.03892 2016-04-04 0.03684 2016-04-05 0.01184 2016-04-06 0.00326 2016-04-07 0.00128 NaN 0.00028 Name: ad_created, Length: 77, dtype: float64 min, max are 2e-05 0.03892 descritopn count 77.000000 mean 0.012987 std 0.015916 min 0.000020 25% 0.000020 50% 0.000160 75% 0.032180 max 0.038920 Name: ad_created, dtype: float64
For ad_created, min is for 2015 and Jan 2016. Max is for 2016-04-03.
# do the same for last_seen
(print(autos['last_seen']
.str[:10]
.value_counts(normalize = True, dropna = False)
.sort_index(ascending = True)))
print('min and max are \n')
(print(autos['last_seen']
.str[:10]
.value_counts(normalize = True, dropna = False)
.sort_index(ascending = True).min()))
(print(autos['last_seen']
.str[:10]
.value_counts(normalize = True, dropna = False)
.sort_index(ascending = True).max()))
(print('\n', 'descritopn \n', autos['last_seen']
.str[:10]
.value_counts(normalize = True, dropna = False)
.sort_index(ascending = True).describe()))
2016-03-05 0.00108 2016-03-06 0.00442 2016-03-07 0.00536 2016-03-08 0.00758 2016-03-09 0.00984 2016-03-10 0.01076 2016-03-11 0.01252 2016-03-12 0.02380 2016-03-13 0.00898 2016-03-14 0.01280 2016-03-15 0.01588 2016-03-16 0.01644 2016-03-17 0.02792 2016-03-18 0.00742 2016-03-19 0.01574 2016-03-20 0.02070 2016-03-21 0.02072 2016-03-22 0.02158 2016-03-23 0.01858 2016-03-24 0.01956 2016-03-25 0.01920 2016-03-26 0.01696 2016-03-27 0.01602 2016-03-28 0.02084 2016-03-29 0.02232 2016-03-30 0.02484 2016-03-31 0.02382 2016-04-01 0.02310 2016-04-02 0.02488 2016-04-03 0.02536 2016-04-04 0.02462 2016-04-05 0.12424 2016-04-06 0.22092 2016-04-07 0.13092 NaN 0.00028 Name: last_seen, dtype: float64 min and max are 0.00028 0.22092 descritopn count 35.000000 mean 0.028571 std 0.043109 min 0.000280 25% 0.011640 50% 0.019200 75% 0.023810 max 0.220920 Name: last_seen, dtype: float64
For last_seen DATES, min is for 2016-03-05 and max for 2016-04-06. The actual min is for NaN.
Let us look at registration_year
through describe
.
autos['registration_year'].describe()
count 49986.000000 mean 2005.075721 std 105.727161 min 1000.000000 25% 1999.000000 50% 2003.000000 75% 2008.000000 max 9999.000000 Name: registration_year, dtype: float64
As noted before, there are two odd years in registration year data: 1000 and 9999. Need to clean these.
First car appeared in 1886 (a Benz) and cars were available for the public by 1908 (Ford T model). So pick registration cutoff at 1900. Since cars in here are listed by 2016, they have to be already registered. So any registration after that is not correct. Therefore, max cutoff for registration year data is 2016.
Usual trick: relpace years < 1900 and > 2016 by NaN and then remove these rows.
autos[autos['registration_year'] > 2016] = np.NaN
autos[autos['registration_year'] < 1900] = np.NaN
autos.dropna()
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | 5000.0 | control | bus | 2004.0 | manuell | 158.0 | andere | 150000.0 | 3.0 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0.0 | 79588.0 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | 8500.0 | control | limousine | 1997.0 | automatik | 286.0 | 7er | 150000.0 | 6.0 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0.0 | 71034.0 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | 8990.0 | test | limousine | 2009.0 | manuell | 102.0 | golf | 70000.0 | 7.0 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0.0 | 35394.0 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | 4350.0 | control | kleinwagen | 2007.0 | automatik | 71.0 | fortwo | 70000.0 | 6.0 | benzin | smart | nein | 2016-03-12 00:00:00 | 0.0 | 33729.0 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | 1350.0 | test | kombi | 2003.0 | manuell | 0.0 | focus | 150000.0 | 7.0 | benzin | ford | nein | 2016-04-01 00:00:00 | 0.0 | 39218.0 | 2016-04-01 14:38:50 |
7 | 2016-03-16 18:55:19 | Golf_IV_1.9_TDI_90PS | privat | Angebot | 1990.0 | control | limousine | 1998.0 | manuell | 90.0 | golf | 150000.0 | 12.0 | diesel | volkswagen | nein | 2016-03-16 00:00:00 | 0.0 | 53474.0 | 2016-04-07 03:17:32 |
9 | 2016-03-16 13:47:02 | Renault_Megane_Scenic_1.6e_RT_Klimaanlage | privat | Angebot | 590.0 | control | bus | 1997.0 | manuell | 90.0 | megane | 150000.0 | 7.0 | benzin | renault | nein | 2016-03-16 00:00:00 | 0.0 | 15749.0 | 2016-04-06 10:46:35 |
12 | 2016-03-31 19:48:22 | Smart_smart_fortwo_coupe_softouch_pure_MHD_Pan... | privat | Angebot | 5299.0 | control | kleinwagen | 2010.0 | automatik | 71.0 | fortwo | 50000.0 | 9.0 | benzin | smart | nein | 2016-03-31 00:00:00 | 0.0 | 34590.0 | 2016-04-06 14:17:52 |
13 | 2016-03-23 10:48:32 | Audi_A3_1.6_tuning | privat | Angebot | 1350.0 | control | limousine | 1999.0 | manuell | 101.0 | a3 | 150000.0 | 11.0 | benzin | audi | nein | 2016-03-23 00:00:00 | 0.0 | 12043.0 | 2016-04-01 14:17:13 |
16 | 2016-03-16 14:59:02 | Opel_Vectra_B_Kombi | privat | Angebot | 350.0 | test | kombi | 1999.0 | manuell | 101.0 | vectra | 150000.0 | 5.0 | benzin | opel | nein | 2016-03-16 00:00:00 | 0.0 | 57299.0 | 2016-03-18 05:29:37 |
17 | 2016-03-29 11:46:22 | Volkswagen_Scirocco_2_G60 | privat | Angebot | 5500.0 | test | coupe | 1990.0 | manuell | 205.0 | scirocco | 150000.0 | 6.0 | benzin | volkswagen | nein | 2016-03-29 00:00:00 | 0.0 | 74821.0 | 2016-04-05 20:46:26 |
19 | 2016-03-17 13:36:21 | mazda_tribute_2.0_mit_gas_und_tuev_neu_2018 | privat | Angebot | 4150.0 | control | suv | 2004.0 | manuell | 124.0 | andere | 150000.0 | 2.0 | lpg | mazda | nein | 2016-03-17 00:00:00 | 0.0 | 40878.0 | 2016-03-17 14:45:58 |
21 | 2016-03-06 19:07:10 | Porsche_911_Carrera_4S_Cabrio | privat | Angebot | 41500.0 | test | cabrio | 2004.0 | manuell | 320.0 | 911 | 150000.0 | 4.0 | benzin | porsche | nein | 2016-03-06 00:00:00 | 0.0 | 65428.0 | 2016-04-05 23:46:19 |
22 | 2016-03-28 20:50:54 | MINI_Cooper_S_Cabrio | privat | Angebot | 25450.0 | control | cabrio | 2015.0 | manuell | 184.0 | cooper | 10000.0 | 1.0 | benzin | mini | nein | 2016-03-28 00:00:00 | 0.0 | 44789.0 | 2016-04-01 06:45:30 |
24 | 2016-04-03 11:57:02 | BMW_535i_xDrive_Sport_Aut. | privat | Angebot | 48500.0 | control | limousine | 2014.0 | automatik | 306.0 | 5er | 30000.0 | 12.0 | benzin | bmw | nein | 2016-04-03 00:00:00 | 0.0 | 22547.0 | 2016-04-07 13:16:50 |
26 | 2016-04-03 22:46:28 | Volkswagen_Polo_Fox | privat | Angebot | 777.0 | control | kleinwagen | 1992.0 | manuell | 54.0 | polo | 125000.0 | 2.0 | benzin | volkswagen | nein | 2016-04-03 00:00:00 | 0.0 | 38110.0 | 2016-04-05 23:46:48 |
28 | 2016-03-19 21:56:19 | MINI_Cooper_D | privat | Angebot | 5250.0 | control | kleinwagen | 2007.0 | manuell | 110.0 | cooper | 150000.0 | 7.0 | diesel | mini | ja | 2016-03-19 00:00:00 | 0.0 | 15745.0 | 2016-04-07 14:58:48 |
29 | 2016-04-02 12:45:44 | Mercedes_Benz_E_320_T_CDI_Avantgarde_DPF7_Sitz... | privat | Angebot | 4999.0 | test | kombi | 2004.0 | automatik | 204.0 | e_klasse | 150000.0 | 10.0 | diesel | mercedes_benz | nein | 2016-04-02 00:00:00 | 0.0 | 47638.0 | 2016-04-02 12:45:44 |
30 | 2016-03-14 11:47:31 | Peugeot_206_Unfallfahrzeug | privat | Angebot | 80.0 | test | kleinwagen | 2002.0 | manuell | 60.0 | 2_reihe | 150000.0 | 6.0 | benzin | peugeot | ja | 2016-03-14 00:00:00 | 0.0 | 57076.0 | 2016-03-14 11:47:31 |
31 | 2016-03-14 16:53:09 | Noch_gut_erhaltenen_C_320 | privat | Angebot | 2850.0 | test | kombi | 2002.0 | automatik | 218.0 | c_klasse | 150000.0 | 7.0 | benzin | mercedes_benz | nein | 2016-03-14 00:00:00 | 0.0 | 41065.0 | 2016-03-16 07:19:04 |
33 | 2016-03-15 12:25:26 | Audi_A3_2.0_TDI_Sportback_DPF_Kupplung_defekt_... | privat | Angebot | 4800.0 | test | kombi | 2006.0 | manuell | 140.0 | a3 | 150000.0 | 8.0 | diesel | audi | nein | 2016-03-15 00:00:00 | 0.0 | 61169.0 | 2016-04-07 13:15:31 |
35 | 2016-03-29 21:52:56 | Ford_Mondeo | privat | Angebot | 1200.0 | test | kombi | 1998.0 | manuell | 0.0 | mondeo | 150000.0 | 12.0 | benzin | ford | nein | 2016-03-29 00:00:00 | 0.0 | 52428.0 | 2016-04-06 08:45:35 |
36 | 2016-03-16 23:46:31 | Ford_Mondeo_2.0_Turnier_TDCi_DPF_Titanium | privat | Angebot | 7500.0 | test | kombi | 2007.0 | manuell | 131.0 | mondeo | 150000.0 | 11.0 | diesel | ford | nein | 2016-03-16 00:00:00 | 0.0 | 34246.0 | 2016-04-05 16:21:16 |
37 | 2016-03-23 16:47:46 | Seat_Altea_XL_2.0_TDI_CR_DPF_Style | privat | Angebot | 13800.0 | test | bus | 2012.0 | manuell | 140.0 | altea | 90000.0 | 2.0 | diesel | seat | nein | 2016-03-23 00:00:00 | 0.0 | 85113.0 | 2016-04-05 20:17:53 |
38 | 2016-03-21 15:51:10 | Volkswagen_Golf_1.4_Special | privat | Angebot | 2850.0 | control | limousine | 2002.0 | manuell | 75.0 | golf | 125000.0 | 2.0 | benzin | volkswagen | nein | 2016-03-21 00:00:00 | 0.0 | 63674.0 | 2016-03-28 12:16:06 |
39 | 2016-03-11 13:58:38 | Mercedes_C_180_Kombi__zu_verkaufen | privat | Angebot | 1450.0 | test | kombi | 1999.0 | automatik | 122.0 | c_klasse | 150000.0 | 6.0 | benzin | mercedes_benz | nein | 2016-03-11 00:00:00 | 0.0 | 12683.0 | 2016-04-06 15:17:52 |
40 | 2016-03-07 14:50:03 | VW_Golf__4_Cabrio_2.0_GTI_16V___Leder_MFA_Alus... | privat | Angebot | 3500.0 | control | cabrio | 1999.0 | manuell | 150.0 | golf | 150000.0 | 1.0 | benzin | volkswagen | nein | 2016-03-07 00:00:00 | 0.0 | 6780.0 | 2016-03-12 02:15:52 |
43 | 2016-03-08 16:51:22 | Audi_A1_1.2_TFSI_Attraction_mit_Garantie | privat | Angebot | 11500.0 | test | kleinwagen | 2011.0 | manuell | 86.0 | a1 | 60000.0 | 9.0 | benzin | audi | nein | 2016-03-08 00:00:00 | 0.0 | 85049.0 | 2016-04-07 03:16:05 |
44 | 2016-03-12 10:38:06 | Zu_verkaufen | privat | Angebot | 900.0 | control | kleinwagen | 1998.0 | manuell | 43.0 | twingo | 150000.0 | 7.0 | benzin | renault | nein | 2016-03-12 00:00:00 | 0.0 | 79725.0 | 2016-04-06 01:17:27 |
45 | 2016-03-12 17:39:55 | Toyota_Starlet_1.3 | privat | Angebot | 1000.0 | test | kleinwagen | 1995.0 | manuell | 75.0 | andere | 150000.0 | 3.0 | benzin | toyota | nein | 2016-03-12 00:00:00 | 0.0 | 74597.0 | 2016-04-07 08:45:19 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
49959 | 2016-03-28 21:53:21 | Mercedes_Benz_C_180_T_Kompressor_Classic | privat | Angebot | 4200.0 | control | kombi | 2003.0 | manuell | 143.0 | c_klasse | 150000.0 | 12.0 | benzin | mercedes_benz | nein | 2016-03-28 00:00:00 | 0.0 | 53844.0 | 2016-04-07 05:16:23 |
49962 | 2016-03-14 17:57:15 | Mitsubishi_Space_Star_1__3_L__Bj_2004_Standhei... | privat | Angebot | 2200.0 | test | limousine | 2004.0 | manuell | 82.0 | andere | 150000.0 | 4.0 | benzin | mitsubishi | ja | 2016-03-14 00:00:00 | 0.0 | 45481.0 | 2016-03-15 13:45:08 |
49963 | 2016-03-26 19:52:59 | Mercedes_Benz_B_200_CDI_Special_Edition | privat | Angebot | 7600.0 | control | bus | 2007.0 | manuell | 140.0 | b_klasse | 125000.0 | 5.0 | diesel | mercedes_benz | nein | 2016-03-26 00:00:00 | 0.0 | 92272.0 | 2016-03-28 09:45:44 |
49965 | 2016-03-11 10:53:51 | Opel_Astra_1.6_Lenkradheizung~Sitzheizung | privat | Angebot | 6700.0 | test | limousine | 2011.0 | manuell | 116.0 | astra | 125000.0 | 3.0 | benzin | opel | nein | 2016-03-11 00:00:00 | 0.0 | 37627.0 | 2016-03-15 01:44:57 |
49966 | 2016-04-02 19:49:19 | Citroën_C1_1.0_**Euro4**TÜV_OKT_2017**Scheiten... | privat | Angebot | 1490.0 | control | kleinwagen | 2006.0 | manuell | 68.0 | c1 | 150000.0 | 7.0 | benzin | citroen | ja | 2016-04-02 00:00:00 | 0.0 | 26603.0 | 2016-04-02 19:49:19 |
49968 | 2016-04-01 17:49:15 | Mercedes_Benz_190_D_2.5_Automatik | privat | Angebot | 2100.0 | test | limousine | 1986.0 | automatik | 90.0 | andere | 150000.0 | 9.0 | diesel | mercedes_benz | nein | 2016-04-01 00:00:00 | 0.0 | 40227.0 | 2016-04-05 13:16:35 |
49969 | 2016-03-17 18:49:02 | Nissan_X_Trail_2.2_dCi_4x4_Sport_m.AHZ | privat | Angebot | 4500.0 | control | suv | 2005.0 | manuell | 136.0 | x_trail | 150000.0 | 5.0 | diesel | nissan | nein | 2016-03-17 00:00:00 | 0.0 | 17379.0 | 2016-03-25 23:18:15 |
49970 | 2016-03-21 22:47:37 | c4_Grand_Picasso_mit_Automatik_Leder_Navi_Temp... | privat | Angebot | 15800.0 | control | bus | 2010.0 | automatik | 136.0 | c4 | 60000.0 | 4.0 | diesel | citroen | nein | 2016-03-21 00:00:00 | 0.0 | 14947.0 | 2016-04-07 04:17:34 |
49971 | 2016-03-29 14:54:12 | W.Lupo_1.0 | privat | Angebot | 950.0 | test | kleinwagen | 2001.0 | manuell | 50.0 | lupo | 150000.0 | 4.0 | benzin | volkswagen | nein | 2016-03-29 00:00:00 | 0.0 | 65197.0 | 2016-03-29 20:41:51 |
49972 | 2016-03-26 22:25:23 | Mercedes_Benz_Vito_115_CDI_Extralang_Aut. | privat | Angebot | 3300.0 | control | bus | 2004.0 | automatik | 150.0 | vito | 150000.0 | 10.0 | diesel | mercedes_benz | ja | 2016-03-26 00:00:00 | 0.0 | 65326.0 | 2016-03-28 11:28:18 |
49973 | 2016-03-27 05:32:39 | Mercedes_Benz_SLK_200_Kompressor | privat | Angebot | 6000.0 | control | cabrio | 2004.0 | manuell | 163.0 | slk | 150000.0 | 11.0 | benzin | mercedes_benz | nein | 2016-03-27 00:00:00 | 0.0 | 53567.0 | 2016-03-27 08:25:24 |
49974 | 2016-03-20 10:52:31 | Golf_1_Cabrio_Tuev_Neu_viele_Extras_alles_eing... | privat | Angebot | 0.0 | control | cabrio | 1983.0 | manuell | 70.0 | golf | 150000.0 | 2.0 | benzin | volkswagen | nein | 2016-03-20 00:00:00 | 0.0 | 8209.0 | 2016-03-27 19:48:16 |
49975 | 2016-03-27 20:51:39 | Honda_Jazz_1.3_DSi_i_VTEC_IMA_CVT_Comfort | privat | Angebot | 9700.0 | control | kleinwagen | 2012.0 | automatik | 88.0 | jazz | 100000.0 | 11.0 | hybrid | honda | nein | 2016-03-27 00:00:00 | 0.0 | 84385.0 | 2016-04-05 19:45:34 |
49976 | 2016-03-19 18:56:05 | Audi_80_Avant_2.6_E__Vollausstattung!!_Einziga... | privat | Angebot | 5900.0 | test | kombi | 1992.0 | automatik | 150.0 | 80 | 150000.0 | 12.0 | benzin | audi | nein | 2016-03-19 00:00:00 | 0.0 | 36100.0 | 2016-04-07 06:16:44 |
49977 | 2016-03-31 18:37:18 | Mercedes_Benz_C200_Cdi_W203 | privat | Angebot | 5500.0 | control | limousine | 2003.0 | manuell | 116.0 | c_klasse | 150000.0 | 2.0 | diesel | mercedes_benz | nein | 2016-03-31 00:00:00 | 0.0 | 33739.0 | 2016-04-06 12:16:11 |
49978 | 2016-04-04 10:37:14 | Mercedes_Benz_E_200_Classic | privat | Angebot | 900.0 | control | limousine | 1996.0 | automatik | 136.0 | e_klasse | 150000.0 | 9.0 | benzin | mercedes_benz | ja | 2016-04-04 00:00:00 | 0.0 | 24405.0 | 2016-04-06 12:44:20 |
49979 | 2016-03-20 18:38:40 | Volkswagen_Polo_1.6_TDI_Style | privat | Angebot | 11000.0 | test | kleinwagen | 2011.0 | manuell | 90.0 | polo | 70000.0 | 11.0 | diesel | volkswagen | nein | 2016-03-20 00:00:00 | 0.0 | 48455.0 | 2016-04-07 01:45:12 |
49981 | 2016-03-15 09:38:21 | Opel_Astra_Kombi_mit_Anhaengerkupplung | privat | Angebot | 2000.0 | control | kombi | 1998.0 | manuell | 115.0 | astra | 150000.0 | 12.0 | benzin | opel | nein | 2016-03-15 00:00:00 | 0.0 | 86859.0 | 2016-04-05 17:21:46 |
49986 | 2016-04-04 20:46:02 | Chrysler_300C_3.0_CRD_DPF_Automatik_Voll_Ausst... | privat | Angebot | 15900.0 | control | limousine | 2010.0 | automatik | 218.0 | 300c | 125000.0 | 11.0 | diesel | chrysler | nein | 2016-04-04 00:00:00 | 0.0 | 73527.0 | 2016-04-06 23:16:00 |
49987 | 2016-03-22 20:47:27 | Audi_A3_Limousine_2.0_TDI_DPF_Ambition__NAVI__... | privat | Angebot | 21990.0 | control | limousine | 2013.0 | manuell | 150.0 | a3 | 50000.0 | 11.0 | diesel | audi | nein | 2016-03-22 00:00:00 | 0.0 | 94362.0 | 2016-03-26 22:46:06 |
49988 | 2016-03-28 19:49:51 | BMW_330_Ci | privat | Angebot | 9550.0 | control | coupe | 2001.0 | manuell | 231.0 | 3er | 150000.0 | 10.0 | benzin | bmw | nein | 2016-03-28 00:00:00 | 0.0 | 83646.0 | 2016-04-07 02:17:40 |
49989 | 2016-03-11 19:50:37 | VW_Polo_zum_Ausschlachten_oder_Wiederaufbau | privat | Angebot | 150.0 | test | kleinwagen | 1997.0 | manuell | 0.0 | polo | 150000.0 | 5.0 | benzin | volkswagen | ja | 2016-03-11 00:00:00 | 0.0 | 21244.0 | 2016-03-12 10:17:55 |
49990 | 2016-03-21 19:54:19 | Mercedes_Benz_A_200__BlueEFFICIENCY__Urban | privat | Angebot | 17500.0 | test | limousine | 2012.0 | manuell | 156.0 | a_klasse | 30000.0 | 12.0 | benzin | mercedes_benz | nein | 2016-03-21 00:00:00 | 0.0 | 58239.0 | 2016-04-06 22:46:57 |
49992 | 2016-03-10 19:37:38 | Fiat_Grande_Punto_1.4_T_Jet_16V_Sport | privat | Angebot | 4800.0 | control | kleinwagen | 2009.0 | manuell | 120.0 | andere | 125000.0 | 9.0 | lpg | fiat | nein | 2016-03-10 00:00:00 | 0.0 | 68642.0 | 2016-03-13 01:44:51 |
49994 | 2016-03-22 17:36:42 | Audi_A6__S6__Avant_4.2_quattro_eventuell_Tausc... | privat | Angebot | 5000.0 | control | kombi | 2001.0 | automatik | 299.0 | a6 | 150000.0 | 1.0 | benzin | audi | nein | 2016-03-22 00:00:00 | 0.0 | 46537.0 | 2016-04-06 08:16:39 |
49995 | 2016-03-27 14:38:19 | Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon | privat | Angebot | 24900.0 | control | limousine | 2011.0 | automatik | 239.0 | q5 | 100000.0 | 1.0 | diesel | audi | nein | 2016-03-27 00:00:00 | 0.0 | 82131.0 | 2016-04-01 13:47:40 |
49996 | 2016-03-28 10:50:25 | Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... | privat | Angebot | 1980.0 | control | cabrio | 1996.0 | manuell | 75.0 | astra | 150000.0 | 5.0 | benzin | opel | nein | 2016-03-28 00:00:00 | 0.0 | 44807.0 | 2016-04-02 14:18:02 |
49997 | 2016-04-02 14:44:48 | Fiat_500_C_1.2_Dualogic_Lounge | privat | Angebot | 13200.0 | test | cabrio | 2014.0 | automatik | 69.0 | 500 | 5000.0 | 11.0 | benzin | fiat | nein | 2016-04-02 00:00:00 | 0.0 | 73430.0 | 2016-04-04 11:47:27 |
49998 | 2016-03-08 19:25:42 | Audi_A3_2.0_TDI_Sportback_Ambition | privat | Angebot | 22900.0 | control | kombi | 2013.0 | manuell | 150.0 | a3 | 40000.0 | 11.0 | diesel | audi | nein | 2016-03-08 00:00:00 | 0.0 | 35683.0 | 2016-04-05 16:45:07 |
49999 | 2016-03-14 00:42:12 | Opel_Vectra_1.6_16V | privat | Angebot | 1250.0 | control | limousine | 1996.0 | manuell | 101.0 | vectra | 150000.0 | 1.0 | benzin | opel | nein | 2016-03-13 00:00:00 | 0.0 | 45897.0 | 2016-04-06 21:18:48 |
35042 rows × 20 columns
# get the distribution
(autos['registration_year']
.value_counts(normalize = True)
.sort_index(ascending = True))
1910.0 0.000187 1927.0 0.000021 1929.0 0.000021 1931.0 0.000021 1934.0 0.000042 1937.0 0.000083 1938.0 0.000021 1939.0 0.000021 1941.0 0.000042 1943.0 0.000021 1948.0 0.000021 1950.0 0.000062 1951.0 0.000042 1952.0 0.000021 1953.0 0.000021 1954.0 0.000042 1955.0 0.000042 1956.0 0.000104 1957.0 0.000042 1958.0 0.000083 1959.0 0.000146 1960.0 0.000687 1961.0 0.000125 1962.0 0.000083 1963.0 0.000187 1964.0 0.000250 1965.0 0.000354 1966.0 0.000458 1967.0 0.000562 1968.0 0.000541 ... 1987.0 0.001562 1988.0 0.002957 1989.0 0.003770 1990.0 0.008226 1991.0 0.007414 1992.0 0.008122 1993.0 0.009268 1994.0 0.013745 1995.0 0.027324 1996.0 0.030073 1997.0 0.042236 1998.0 0.051087 1999.0 0.062438 2000.0 0.069852 2001.0 0.056273 2002.0 0.052753 2003.0 0.056794 2004.0 0.057002 2005.0 0.062792 2006.0 0.056377 2007.0 0.047984 2008.0 0.046464 2009.0 0.043673 2010.0 0.033260 2011.0 0.034030 2012.0 0.027553 2013.0 0.016786 2014.0 0.013850 2015.0 0.008310 2016.0 0.027408 Name: registration_year, Length: 78, dtype: float64
autos['registration_year'].describe()
count 48016.000000 mean 2002.806002 std 7.306212 min 1910.000000 25% 1999.000000 50% 2003.000000 75% 2008.000000 max 2016.000000 Name: registration_year, dtype: float64
Few comments on the above results:
No we explore variations across brands. Use aggregation technique for this.
First, get the frequency and absolute number of each brand
.
autos['brand'].value_counts(normalize=True)
volkswagen 0.212117 bmw 0.110026 opel 0.108172 mercedes_benz 0.095364 audi 0.086409 ford 0.069768 renault 0.047359 peugeot 0.029532 fiat 0.025866 seat 0.018181 skoda 0.016036 mazda 0.015141 nissan 0.015099 smart 0.013912 citroen 0.013912 toyota 0.012475 sonstige_autos 0.010892 hyundai 0.009851 volvo 0.009247 mini 0.008643 mitsubishi 0.008143 honda 0.007852 kia 0.007102 alfa_romeo 0.006623 porsche 0.006102 suzuki 0.005915 chevrolet 0.005706 chrysler 0.003665 dacia 0.002562 daihatsu 0.002562 jeep 0.002249 subaru 0.002187 land_rover 0.002041 saab 0.001604 jaguar 0.001583 trabant 0.001562 daewoo 0.001500 rover 0.001354 lancia 0.001083 lada 0.000604 Name: brand, dtype: float64
autos['brand'].value_counts()
volkswagen 10185 bmw 5283 opel 5194 mercedes_benz 4579 audi 4149 ford 3350 renault 2274 peugeot 1418 fiat 1242 seat 873 skoda 770 mazda 727 nissan 725 smart 668 citroen 668 toyota 599 sonstige_autos 523 hyundai 473 volvo 444 mini 415 mitsubishi 391 honda 377 kia 341 alfa_romeo 318 porsche 293 suzuki 284 chevrolet 274 chrysler 176 dacia 123 daihatsu 123 jeep 108 subaru 105 land_rover 98 saab 77 jaguar 76 trabant 75 daewoo 72 rover 65 lancia 52 lada 29 Name: brand, dtype: int64
I will take all brands with >= 1% of the values. So will start at sonstige_autos (which appears 523 times, a number that is reasonable enough to do some stats on). This is a total of 17 brands.
mean_price_per_brand = {}
for brnd in autos['brand'].value_counts().index[:17]:
select = autos[autos['brand'] == brnd]
mean_price = select['price'].mean()
mean_price_per_brand[brnd] = mean_price
sorted_dict_mean = (sorted(mean_price_per_brand.items(),
key = lambda kv: kv[1],
reverse = True))
sorted_dict_mean
[('sonstige_autos', 10805.078393881453), ('audi', 9093.65003615329), ('mercedes_benz', 8485.239571958942), ('bmw', 8102.536248343744), ('skoda', 6334.91948051948), ('volkswagen', 5231.081983308787), ('toyota', 5115.33388981636), ('nissan', 4664.891034482758), ('seat', 4296.492554410081), ('mazda', 4010.7716643741405), ('citroen', 3699.935628742515), ('ford', 3652.095223880597), ('smart', 3542.706586826347), ('peugeot', 3039.4682651622), ('opel', 2876.716403542549), ('fiat', 2711.8011272141707), ('renault', 2395.4164467897976)]
Most expensive cars are the 'sonstige' ones! These are cars, if I understand the results of my Google search, that cannot be put in a single category.
Below 'sonstige': With the exception of Opel, the most expensive cars on average are the German ones (no surprise here). Least expensive on average are the Fiat and Renault.
Since 'sonstige' cars are not of a particular brand, let us delete them from the dictionary using pop
.
# pop out the zeroth key
sorted_dict_mean.pop(0)
sorted_dict_mean
[('audi', 9093.65003615329), ('mercedes_benz', 8485.239571958942), ('bmw', 8102.536248343744), ('skoda', 6334.91948051948), ('volkswagen', 5231.081983308787), ('toyota', 5115.33388981636), ('nissan', 4664.891034482758), ('seat', 4296.492554410081), ('mazda', 4010.7716643741405), ('citroen', 3699.935628742515), ('ford', 3652.095223880597), ('smart', 3542.706586826347), ('peugeot', 3039.4682651622), ('opel', 2876.716403542549), ('fiat', 2711.8011272141707), ('renault', 2395.4164467897976)]
For the top 6 brands, we will use aggregation to see if there is any correlation between mean price and mean mileage.
We will combine the data from both 'brands' and 'odometer_km' into a single dataframe with a shared index. To this end, we will use the pandas series constructor pd.Series(dicionary)
and pd.DataFrame(series, columns=['our column here']
.
# store top 6 brand names in a list
top_6brands = []
for x in sorted_dict_mean[:6]:
top_6brands.append(x[0])
# check
top_6brands
['audi', 'mercedes_benz', 'bmw', 'skoda', 'volkswagen', 'toyota']
# get mean prices and mileage
mean_price_per_brand = {}
mean_km_per_brand = {}
for brnd in top_6brands:
select_cars = autos[autos['brand'] == brnd]
mean_price = select_cars['price'].mean()
mean_km = select_cars['odometer_km'].mean()
mean_price_per_brand[brnd] = mean_price
mean_km_per_brand[brnd] = mean_km
# convert dictionaries to series
mpb = pd.Series(mean_price_per_brand)
mkb = pd.Series(mean_km_per_brand)
# convert 1st series to a datafraem
df_top6_mpb_mkb = (pd.DataFrame(mpb,
columns = ['mean_price']))
# add the 2nd series as a new column to the dataframe
df_top6_mpb_mkb['mean_km'] = mkb
(df_top6_mpb_mkb.sort_values(by = 'mean_price',
ascending = False))
mean_price | mean_km | |
---|---|---|
audi | 9093.650036 | 129287.780188 |
mercedes_benz | 8485.239572 | 130856.082114 |
bmw | 8102.536248 | 132431.383684 |
skoda | 6334.919481 | 110954.545455 |
volkswagen | 5231.081983 | 128724.104075 |
toyota | 5115.333890 | 115709.515860 |
The top 3 brands in terms of price have a higher average mileage compaed to the lowest 3 ones. The difference between the top 3 and lower 3 is about 10,000 to 20,000 km. However, the top 3 are more expensive. Also, Volkswagen's average mileage is about 18000 km higher than Skoda's but about \$1000 cheaper. So there is no clear correlation between price and mileage.
It looks the average price is more determined by the brand name if the difference in mileage is within 20,000 km. This, to me, makes sense. Looking at the numbers, the average mileage for these top 6 cars is above 110,000 km. This means each car is > 11 years old, using a reasonable estimate usage of 10,000 km/yr. At this age, brand name should be the controlling factor for the average price.
The following cateegorical data use german words: seller, offer_type, gearbox, fuel_type and unrepaired_damage. We will use map
to to convert them to English.
First, let us get the German words used.
print('seller words:\n', autos['seller'].value_counts())
(print('\n', 'offer_type words:\n',
autos['offer_type'].value_counts()))
(print('\n', 'gearbox words:\n',
autos['gearbox'].value_counts()))
(print('\n', 'fuel_type words:\n',
autos['fuel_type'].value_counts()))
(print('\n', 'unrepaired_damage words:\n',
autos['unrepaired_damage'].value_counts()))
seller words: privat 48015 gewerblich 1 Name: seller, dtype: int64 offer_type words: Angebot 48015 Gesuch 1 Name: offer_type, dtype: int64 gearbox words: manuell 35553 automatik 10042 Name: gearbox, dtype: int64 fuel_type words: benzin 29241 diesel 14232 lpg 671 cng 72 hybrid 37 andere 20 elektro 19 Name: fuel_type, dtype: int64 unrepaired_damage words: nein 34249 ja 4784 Name: unrepaired_damage, dtype: int64
Now create all the needed dictionaries to do the mapping.
# seller
autos['seller'] = (autos['seller']
.map({'privat':'private',
'gewerblich': 'commercial'}))
# offer_type
autos['offer_type'] = (autos['offer_type']
.map({'Angebot':'offer',
'Gesuch': 'request'}))
# gearbox
autos['gearbox'] = (autos['gearbox']
.map({'manuell':'manual',
'automatik': 'automatic'}))
# fuel_type
autos['fuel_type'] = (autos['fuel_type']
.map({'benzin':'petrol',
'diesel': 'diesel',
'lpg':'liquified_petroleum_gas',
'cng':'compressed_natural_gas',
'hybrid':'hybrid',
'andere':'other',
'elektro':'electric'}))
# unrepaired_damage
autos['unrepaired_damage'] = (autos['unrepaired_damage']
.map({'nein':'no',
'ja': 'yes'}))
autos['date_crawled'] = autos['date_crawled'].str.replace('-','')
autos['ad_created'] = autos['ad_created'].str.replace('-','')
autos['last_seen'] = autos['last_seen'].str.replace('-','')
# check
autos.head()
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 20160326 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | private | offer | 5000.0 | control | bus | 2004.0 | manual | 158.0 | andere | 150000.0 | 3.0 | liquified_petroleum_gas | peugeot | no | 20160326 00:00:00 | 0.0 | 79588.0 | 20160406 06:45:54 |
1 | 20160404 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | private | offer | 8500.0 | control | limousine | 1997.0 | automatic | 286.0 | 7er | 150000.0 | 6.0 | petrol | bmw | no | 20160404 00:00:00 | 0.0 | 71034.0 | 20160406 14:45:08 |
2 | 20160326 18:57:24 | Volkswagen_Golf_1.6_United | private | offer | 8990.0 | test | limousine | 2009.0 | manual | 102.0 | golf | 70000.0 | 7.0 | petrol | volkswagen | no | 20160326 00:00:00 | 0.0 | 35394.0 | 20160406 20:15:37 |
3 | 20160312 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | private | offer | 4350.0 | control | kleinwagen | 2007.0 | automatic | 71.0 | fortwo | 70000.0 | 6.0 | petrol | smart | no | 20160312 00:00:00 | 0.0 | 33729.0 | 20160315 03:16:28 |
4 | 20160401 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | private | offer | 1350.0 | test | kombi | 2003.0 | manual | 0.0 | focus | 150000.0 | 7.0 | petrol | ford | no | 20160401 00:00:00 | 0.0 | 39218.0 | 20160401 14:38:50 |
Now we find the most common brand/model combinations. We use aggregation. Note that when one does a describe()
on a column of strings, then one gets the most common string. Below is an example which show that the Volkswagen brand is the most common brand.
# result of describe() on 'brand' column
print(autos['brand'].describe())
# extracting the 'top' (most common) result
autos['brand'].describe()['top']
count 48016 unique 40 top volkswagen freq 10185 Name: brand, dtype: object
'volkswagen'
Now we use aggregation. A couple of points 1st:
value_counts().index
to get these names.# get rid of 'sonstigue_autos' and 'andere'
new_autos = autos[~(autos['brand'] == 'sonstige_autos')]
new_autos = new_autos[~(new_autos['model'] == 'andere')]
# aggregate
most_common_brand_model = {}
for brnd in new_autos['brand'].value_counts().index:
select = new_autos[new_autos['brand'] == brnd]
model = select['model'].describe()
most_common_brand_model[brnd] = model['top']
for brnd in most_common_brand_model:
print(brnd, most_common_brand_model[brnd])
dacia sandero toyota yaris jaguar x_type lada niva seat ibiza opel corsa alfa_romeo 156 mazda 3_reihe trabant 601 renault twingo mitsubishi colt land_rover freelander peugeot 2_reihe audi a4 smart fortwo rover freelander hyundai i_reihe daewoo matiz citroen berlingo jeep grand mini cooper suzuki swift bmw 3er ford focus chrysler voyager honda civic kia sorento lancia ypsilon volkswagen golf mercedes_benz c_klasse chevrolet matiz volvo v70 fiat punto saab 900 daihatsu cuore porsche 911 subaru legacy skoda octavia nissan micra
Here, we split the odometer_km
into groups and use aggregation to see if average prices follow any pattern with mileage.
autos['odometer_km'].value_counts()
150000.0 31023 125000.0 4959 100000.0 2109 90000.0 1696 80000.0 1396 70000.0 1199 60000.0 1137 50000.0 1006 5000.0 910 40000.0 800 30000.0 769 20000.0 763 10000.0 249 Name: odometer_km, dtype: int64
# aggregate
avr_price_mileage= {}
for km in autos['odometer_km'].value_counts().index:
delta = 10000
if (km == 150000) | (km == 125000):
delta = 25000
elif (km < 10000):
delta = 5000
select = (autos[(autos['odometer_km'] < km)
& (autos['odometer_km'] >= km - delta)])
avr_price = select['price'].mean()
avr_price_mileage[km] = avr_price
sorted_dict = (sorted(avr_price_mileage.items(),
key = lambda kv: kv[1],
reverse = True))
sorted_dict
[(5000.0, nan), (20000.0, 19890.5983935743), (30000.0, 17940.720838794234), (40000.0, 16414.45513654096), (50000.0, 15441.445), (60000.0, 13633.68588469185), (70000.0, 12286.970096745823), (80000.0, 10817.819849874895), (90000.0, 9575.700573065902), (100000.0, 8350.228183962265), (125000.0, 7936.031768610716), (10000.0, 7654.626373626374), (150000.0, 6086.20790481952)]
Yep, the higher mileage, the lower is the average price!
Finally, the last task: How much cheaper are cars with damage than their non-damaged counterparts? Probably it is easiest to get the damaged cars and find their mean, repeat for the undamaged ones and then find the difference.
damaged = autos[autos['unrepaired_damage'] == "yes"]
avr_damaged = damaged['price'].mean()
undamaged = autos[autos['unrepaired_damage'] == 'no']
avr_undamaged = undamaged['price'].mean()
difference = int(avr_undamaged - avr_damaged)
print(difference)
4950
Undamaged cars are ~$5000 more expensive.