The aim of this project is to clean the data and analyze the included used car listings.
The data dictionary provided with data is as follows:
dateCrawled - When this ad was first crawled. All field-values are taken from this date. name - Name of the car. seller - Whether the seller is private or a dealer. offerType - The type of listing price - The price on the ad to sell the car. abtest - Whether the listing is included in an A/B test. vehicleType - The vehicle Type. yearOfRegistration - The year in which the car was first registered. gearbox - The transmission type. powerPS - The power of the car in PS. model - The car model name. kilometer - How many kilometers the car has driven. monthOfRegistration - The month in which the car was first registered. fuelType - What type of fuel the car uses. brand - The brand of the car. notRepairedDamage - If the car has a damage which is not yet repaired. dateCreated - The date on which the eBay listing was created. nrOfPictures - The number of pictures in the ad. postalCode- The postal code for the location of the vehicle. lastSeenOnline - When the crawler saw this ad last online
# let's start by importing the libraries we need and reading the
# dataset into pandas
import pandas as pd
import numpy as np
#we then read our csv file and assign a variable name
autos = pd.read_csv('autos.csv', encoding = "Latin-1")
autos
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
5 | 2016-03-21 13:47:45 | Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto... | privat | Angebot | $7,900 | test | bus | 2006 | automatik | 150 | voyager | 150,000km | 4 | diesel | chrysler | NaN | 2016-03-21 00:00:00 | 0 | 22962 | 2016-04-06 09:45:21 |
6 | 2016-03-20 17:55:21 | VW_Golf_III_GT_Special_Electronic_Green_Metall... | privat | Angebot | $300 | test | limousine | 1995 | manuell | 90 | golf | 150,000km | 8 | benzin | volkswagen | NaN | 2016-03-20 00:00:00 | 0 | 31535 | 2016-03-23 02:48:59 |
7 | 2016-03-16 18:55:19 | Golf_IV_1.9_TDI_90PS | privat | Angebot | $1,990 | control | limousine | 1998 | manuell | 90 | golf | 150,000km | 12 | diesel | volkswagen | nein | 2016-03-16 00:00:00 | 0 | 53474 | 2016-04-07 03:17:32 |
8 | 2016-03-22 16:51:34 | Seat_Arosa | privat | Angebot | $250 | test | NaN | 2000 | manuell | 0 | arosa | 150,000km | 10 | NaN | seat | nein | 2016-03-22 00:00:00 | 0 | 7426 | 2016-03-26 18:18:10 |
9 | 2016-03-16 13:47:02 | Renault_Megane_Scenic_1.6e_RT_Klimaanlage | privat | Angebot | $590 | control | bus | 1997 | manuell | 90 | megane | 150,000km | 7 | benzin | renault | nein | 2016-03-16 00:00:00 | 0 | 15749 | 2016-04-06 10:46:35 |
10 | 2016-03-15 01:41:36 | VW_Golf_Tuning_in_siber/grau | privat | Angebot | $999 | test | NaN | 2017 | manuell | 90 | NaN | 150,000km | 4 | benzin | volkswagen | nein | 2016-03-14 00:00:00 | 0 | 86157 | 2016-04-07 03:16:21 |
11 | 2016-03-16 18:45:34 | Mercedes_A140_Motorschaden | privat | Angebot | $350 | control | NaN | 2000 | NaN | 0 | NaN | 150,000km | 0 | benzin | mercedes_benz | NaN | 2016-03-16 00:00:00 | 0 | 17498 | 2016-03-16 18:45:34 |
12 | 2016-03-31 19:48:22 | Smart_smart_fortwo_coupe_softouch_pure_MHD_Pan... | privat | Angebot | $5,299 | control | kleinwagen | 2010 | automatik | 71 | fortwo | 50,000km | 9 | benzin | smart | nein | 2016-03-31 00:00:00 | 0 | 34590 | 2016-04-06 14:17:52 |
13 | 2016-03-23 10:48:32 | Audi_A3_1.6_tuning | privat | Angebot | $1,350 | control | limousine | 1999 | manuell | 101 | a3 | 150,000km | 11 | benzin | audi | nein | 2016-03-23 00:00:00 | 0 | 12043 | 2016-04-01 14:17:13 |
14 | 2016-03-23 11:50:46 | Renault_Clio_3__Dynamique_1.2__16_V;_viele_Ver... | privat | Angebot | $3,999 | test | kleinwagen | 2007 | manuell | 75 | clio | 150,000km | 9 | benzin | renault | NaN | 2016-03-23 00:00:00 | 0 | 81737 | 2016-04-01 15:46:47 |
15 | 2016-04-01 12:06:20 | Corvette_C3_Coupe_T_Top_Crossfire_Injection | privat | Angebot | $18,900 | test | coupe | 1982 | automatik | 203 | NaN | 80,000km | 6 | benzin | sonstige_autos | nein | 2016-04-01 00:00:00 | 0 | 61276 | 2016-04-02 21:10:48 |
16 | 2016-03-16 14:59:02 | Opel_Vectra_B_Kombi | privat | Angebot | $350 | test | kombi | 1999 | manuell | 101 | vectra | 150,000km | 5 | benzin | opel | nein | 2016-03-16 00:00:00 | 0 | 57299 | 2016-03-18 05:29:37 |
17 | 2016-03-29 11:46:22 | Volkswagen_Scirocco_2_G60 | privat | Angebot | $5,500 | test | coupe | 1990 | manuell | 205 | scirocco | 150,000km | 6 | benzin | volkswagen | nein | 2016-03-29 00:00:00 | 0 | 74821 | 2016-04-05 20:46:26 |
18 | 2016-03-26 19:57:44 | Verkaufen_mein_bmw_e36_320_i_touring | privat | Angebot | $300 | control | bus | 1995 | manuell | 150 | 3er | 150,000km | 0 | benzin | bmw | NaN | 2016-03-26 00:00:00 | 0 | 54329 | 2016-04-02 12:16:41 |
19 | 2016-03-17 13:36:21 | mazda_tribute_2.0_mit_gas_und_tuev_neu_2018 | privat | Angebot | $4,150 | control | suv | 2004 | manuell | 124 | andere | 150,000km | 2 | lpg | mazda | nein | 2016-03-17 00:00:00 | 0 | 40878 | 2016-03-17 14:45:58 |
20 | 2016-03-05 19:57:31 | Audi_A4_Avant_1.9_TDI_*6_Gang*AHK*Klimatronik*... | privat | Angebot | $3,500 | test | kombi | 2003 | manuell | 131 | a4 | 150,000km | 5 | diesel | audi | NaN | 2016-03-05 00:00:00 | 0 | 53913 | 2016-03-07 05:46:46 |
21 | 2016-03-06 19:07:10 | Porsche_911_Carrera_4S_Cabrio | privat | Angebot | $41,500 | test | cabrio | 2004 | manuell | 320 | 911 | 150,000km | 4 | benzin | porsche | nein | 2016-03-06 00:00:00 | 0 | 65428 | 2016-04-05 23:46:19 |
22 | 2016-03-28 20:50:54 | MINI_Cooper_S_Cabrio | privat | Angebot | $25,450 | control | cabrio | 2015 | manuell | 184 | cooper | 10,000km | 1 | benzin | mini | nein | 2016-03-28 00:00:00 | 0 | 44789 | 2016-04-01 06:45:30 |
23 | 2016-03-10 19:55:34 | Peugeot_Boxer_2_2_HDi_120_Ps_9_Sitzer_inkl_Klima | privat | Angebot | $7,999 | control | bus | 2010 | manuell | 120 | NaN | 150,000km | 2 | diesel | peugeot | nein | 2016-03-10 00:00:00 | 0 | 30900 | 2016-03-17 08:45:17 |
24 | 2016-04-03 11:57:02 | BMW_535i_xDrive_Sport_Aut. | privat | Angebot | $48,500 | control | limousine | 2014 | automatik | 306 | 5er | 30,000km | 12 | benzin | bmw | nein | 2016-04-03 00:00:00 | 0 | 22547 | 2016-04-07 13:16:50 |
25 | 2016-03-21 21:56:18 | Ford_escort_kombi_an_bastler_mit_ghia_ausstattung | privat | Angebot | $90 | control | kombi | 1996 | manuell | 116 | NaN | 150,000km | 4 | benzin | ford | ja | 2016-03-21 00:00:00 | 0 | 27574 | 2016-04-01 05:16:49 |
26 | 2016-04-03 22:46:28 | Volkswagen_Polo_Fox | privat | Angebot | $777 | control | kleinwagen | 1992 | manuell | 54 | polo | 125,000km | 2 | benzin | volkswagen | nein | 2016-04-03 00:00:00 | 0 | 38110 | 2016-04-05 23:46:48 |
27 | 2016-03-27 18:45:01 | Hat_einer_Ahnung_mit_Ford_Galaxy_HILFE | privat | Angebot | $0 | control | NaN | 2005 | NaN | 0 | NaN | 150,000km | 0 | NaN | ford | NaN | 2016-03-27 00:00:00 | 0 | 66701 | 2016-03-27 18:45:01 |
28 | 2016-03-19 21:56:19 | MINI_Cooper_D | privat | Angebot | $5,250 | control | kleinwagen | 2007 | manuell | 110 | cooper | 150,000km | 7 | diesel | mini | ja | 2016-03-19 00:00:00 | 0 | 15745 | 2016-04-07 14:58:48 |
29 | 2016-04-02 12:45:44 | Mercedes_Benz_E_320_T_CDI_Avantgarde_DPF7_Sitz... | privat | Angebot | $4,999 | test | kombi | 2004 | automatik | 204 | e_klasse | 150,000km | 10 | diesel | mercedes_benz | nein | 2016-04-02 00:00:00 | 0 | 47638 | 2016-04-02 12:45:44 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
49970 | 2016-03-21 22:47:37 | c4_Grand_Picasso_mit_Automatik_Leder_Navi_Temp... | privat | Angebot | $15,800 | control | bus | 2010 | automatik | 136 | c4 | 60,000km | 4 | diesel | citroen | nein | 2016-03-21 00:00:00 | 0 | 14947 | 2016-04-07 04:17:34 |
49971 | 2016-03-29 14:54:12 | W.Lupo_1.0 | privat | Angebot | $950 | test | kleinwagen | 2001 | manuell | 50 | lupo | 150,000km | 4 | benzin | volkswagen | nein | 2016-03-29 00:00:00 | 0 | 65197 | 2016-03-29 20:41:51 |
49972 | 2016-03-26 22:25:23 | Mercedes_Benz_Vito_115_CDI_Extralang_Aut. | privat | Angebot | $3,300 | control | bus | 2004 | automatik | 150 | vito | 150,000km | 10 | diesel | mercedes_benz | ja | 2016-03-26 00:00:00 | 0 | 65326 | 2016-03-28 11:28:18 |
49973 | 2016-03-27 05:32:39 | Mercedes_Benz_SLK_200_Kompressor | privat | Angebot | $6,000 | control | cabrio | 2004 | manuell | 163 | slk | 150,000km | 11 | benzin | mercedes_benz | nein | 2016-03-27 00:00:00 | 0 | 53567 | 2016-03-27 08:25:24 |
49974 | 2016-03-20 10:52:31 | Golf_1_Cabrio_Tuev_Neu_viele_Extras_alles_eing... | privat | Angebot | $0 | control | cabrio | 1983 | manuell | 70 | golf | 150,000km | 2 | benzin | volkswagen | nein | 2016-03-20 00:00:00 | 0 | 8209 | 2016-03-27 19:48:16 |
49975 | 2016-03-27 20:51:39 | Honda_Jazz_1.3_DSi_i_VTEC_IMA_CVT_Comfort | privat | Angebot | $9,700 | control | kleinwagen | 2012 | automatik | 88 | jazz | 100,000km | 11 | hybrid | honda | nein | 2016-03-27 00:00:00 | 0 | 84385 | 2016-04-05 19:45:34 |
49976 | 2016-03-19 18:56:05 | Audi_80_Avant_2.6_E__Vollausstattung!!_Einziga... | privat | Angebot | $5,900 | test | kombi | 1992 | automatik | 150 | 80 | 150,000km | 12 | benzin | audi | nein | 2016-03-19 00:00:00 | 0 | 36100 | 2016-04-07 06:16:44 |
49977 | 2016-03-31 18:37:18 | Mercedes_Benz_C200_Cdi_W203 | privat | Angebot | $5,500 | control | limousine | 2003 | manuell | 116 | c_klasse | 150,000km | 2 | diesel | mercedes_benz | nein | 2016-03-31 00:00:00 | 0 | 33739 | 2016-04-06 12:16:11 |
49978 | 2016-04-04 10:37:14 | Mercedes_Benz_E_200_Classic | privat | Angebot | $900 | control | limousine | 1996 | automatik | 136 | e_klasse | 150,000km | 9 | benzin | mercedes_benz | ja | 2016-04-04 00:00:00 | 0 | 24405 | 2016-04-06 12:44:20 |
49979 | 2016-03-20 18:38:40 | Volkswagen_Polo_1.6_TDI_Style | privat | Angebot | $11,000 | test | kleinwagen | 2011 | manuell | 90 | polo | 70,000km | 11 | diesel | volkswagen | nein | 2016-03-20 00:00:00 | 0 | 48455 | 2016-04-07 01:45:12 |
49980 | 2016-03-12 10:55:54 | Ford_Escort_Turnier_16V | privat | Angebot | $400 | control | kombi | 1995 | manuell | 105 | escort | 125,000km | 3 | benzin | ford | NaN | 2016-03-12 00:00:00 | 0 | 56218 | 2016-04-06 17:16:49 |
49981 | 2016-03-15 09:38:21 | Opel_Astra_Kombi_mit_Anhaengerkupplung | privat | Angebot | $2,000 | control | kombi | 1998 | manuell | 115 | astra | 150,000km | 12 | benzin | opel | nein | 2016-03-15 00:00:00 | 0 | 86859 | 2016-04-05 17:21:46 |
49982 | 2016-03-29 18:51:08 | Skoda_Fabia_4_Tuerer_Bj:2004__85.000Tkm | privat | Angebot | $1,950 | control | kleinwagen | 2004 | manuell | 0 | fabia | 90,000km | 7 | benzin | skoda | NaN | 2016-03-29 00:00:00 | 0 | 45884 | 2016-03-29 18:51:08 |
49983 | 2016-03-06 12:43:04 | Ford_focus_99 | privat | Angebot | $600 | test | kleinwagen | 1999 | manuell | 101 | focus | 150,000km | 4 | benzin | ford | NaN | 2016-03-06 00:00:00 | 0 | 52477 | 2016-03-09 06:16:08 |
49984 | 2016-03-31 22:48:48 | Student_sucht_ein__Anfaengerauto___ab_2000_BJ_... | privat | Angebot | $0 | test | NaN | 2000 | NaN | 0 | NaN | 150,000km | 0 | NaN | sonstige_autos | NaN | 2016-03-31 00:00:00 | 0 | 12103 | 2016-04-02 19:44:53 |
49985 | 2016-04-02 16:38:23 | Verkaufe_meinen_vw_vento! | privat | Angebot | $1,000 | control | NaN | 1995 | automatik | 0 | NaN | 150,000km | 0 | benzin | volkswagen | NaN | 2016-04-02 00:00:00 | 0 | 30900 | 2016-04-06 15:17:52 |
49986 | 2016-04-04 20:46:02 | Chrysler_300C_3.0_CRD_DPF_Automatik_Voll_Ausst... | privat | Angebot | $15,900 | control | limousine | 2010 | automatik | 218 | 300c | 125,000km | 11 | diesel | chrysler | nein | 2016-04-04 00:00:00 | 0 | 73527 | 2016-04-06 23:16:00 |
49987 | 2016-03-22 20:47:27 | Audi_A3_Limousine_2.0_TDI_DPF_Ambition__NAVI__... | privat | Angebot | $21,990 | control | limousine | 2013 | manuell | 150 | a3 | 50,000km | 11 | diesel | audi | nein | 2016-03-22 00:00:00 | 0 | 94362 | 2016-03-26 22:46:06 |
49988 | 2016-03-28 19:49:51 | BMW_330_Ci | privat | Angebot | $9,550 | control | coupe | 2001 | manuell | 231 | 3er | 150,000km | 10 | benzin | bmw | nein | 2016-03-28 00:00:00 | 0 | 83646 | 2016-04-07 02:17:40 |
49989 | 2016-03-11 19:50:37 | VW_Polo_zum_Ausschlachten_oder_Wiederaufbau | privat | Angebot | $150 | test | kleinwagen | 1997 | manuell | 0 | polo | 150,000km | 5 | benzin | volkswagen | ja | 2016-03-11 00:00:00 | 0 | 21244 | 2016-03-12 10:17:55 |
49990 | 2016-03-21 19:54:19 | Mercedes_Benz_A_200__BlueEFFICIENCY__Urban | privat | Angebot | $17,500 | test | limousine | 2012 | manuell | 156 | a_klasse | 30,000km | 12 | benzin | mercedes_benz | nein | 2016-03-21 00:00:00 | 0 | 58239 | 2016-04-06 22:46:57 |
49991 | 2016-03-06 15:25:19 | Kleinwagen | privat | Angebot | $500 | control | NaN | 2016 | manuell | 0 | twingo | 150,000km | 0 | benzin | renault | NaN | 2016-03-06 00:00:00 | 0 | 61350 | 2016-03-06 18:24:19 |
49992 | 2016-03-10 19:37:38 | Fiat_Grande_Punto_1.4_T_Jet_16V_Sport | privat | Angebot | $4,800 | control | kleinwagen | 2009 | manuell | 120 | andere | 125,000km | 9 | lpg | fiat | nein | 2016-03-10 00:00:00 | 0 | 68642 | 2016-03-13 01:44:51 |
49993 | 2016-03-15 18:47:35 | Audi_A3__1_8l__Silber;_schoenes_Fahrzeug | privat | Angebot | $1,650 | control | kleinwagen | 1997 | manuell | 0 | NaN | 150,000km | 7 | benzin | audi | NaN | 2016-03-15 00:00:00 | 0 | 65203 | 2016-04-06 19:46:53 |
49994 | 2016-03-22 17:36:42 | Audi_A6__S6__Avant_4.2_quattro_eventuell_Tausc... | privat | Angebot | $5,000 | control | kombi | 2001 | automatik | 299 | a6 | 150,000km | 1 | benzin | audi | nein | 2016-03-22 00:00:00 | 0 | 46537 | 2016-04-06 08:16:39 |
49995 | 2016-03-27 14:38:19 | Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon | privat | Angebot | $24,900 | control | limousine | 2011 | automatik | 239 | q5 | 100,000km | 1 | diesel | audi | nein | 2016-03-27 00:00:00 | 0 | 82131 | 2016-04-01 13:47:40 |
49996 | 2016-03-28 10:50:25 | Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... | privat | Angebot | $1,980 | control | cabrio | 1996 | manuell | 75 | astra | 150,000km | 5 | benzin | opel | nein | 2016-03-28 00:00:00 | 0 | 44807 | 2016-04-02 14:18:02 |
49997 | 2016-04-02 14:44:48 | Fiat_500_C_1.2_Dualogic_Lounge | privat | Angebot | $13,200 | test | cabrio | 2014 | automatik | 69 | 500 | 5,000km | 11 | benzin | fiat | nein | 2016-04-02 00:00:00 | 0 | 73430 | 2016-04-04 11:47:27 |
49998 | 2016-03-08 19:25:42 | Audi_A3_2.0_TDI_Sportback_Ambition | privat | Angebot | $22,900 | control | kombi | 2013 | manuell | 150 | a3 | 40,000km | 11 | diesel | audi | nein | 2016-03-08 00:00:00 | 0 | 35683 | 2016-04-05 16:45:07 |
49999 | 2016-03-14 00:42:12 | Opel_Vectra_1.6_16V | privat | Angebot | $1,250 | control | limousine | 1996 | manuell | 101 | vectra | 150,000km | 1 | benzin | opel | nein | 2016-03-13 00:00:00 | 0 | 45897 | 2016-04-06 21:18:48 |
50000 rows × 20 columns
autos.info()
autos.head()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): dateCrawled 50000 non-null object name 50000 non-null object seller 50000 non-null object offerType 50000 non-null object price 50000 non-null object abtest 50000 non-null object vehicleType 44905 non-null object yearOfRegistration 50000 non-null int64 gearbox 47320 non-null object powerPS 50000 non-null int64 model 47242 non-null object odometer 50000 non-null object monthOfRegistration 50000 non-null int64 fuelType 45518 non-null object brand 50000 non-null object notRepairedDamage 40171 non-null object dateCreated 50000 non-null object nrOfPictures 50000 non-null int64 postalCode 50000 non-null int64 lastSeen 50000 non-null object dtypes: int64(5), object(15) memory usage: 7.6+ MB
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
From the info provided, it was observed that the data contains null objects. There are 50,000 entries with 20 columns most of which are strings. There are some columns containing both numeric and non-numeric words. The column names use camelcase instead of Python's snakecase which means we can't just replace spaces with underscores.
autos.columns
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model', 'odometer', 'monthOfRegistration', 'fuelType', 'brand', 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object')
#we will now change a few of our column names and convert to snakecase by replacing with underscore where necessary
column_names = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
'vehicle_type', 'registration_year', 'gear_box', 'power_ps', 'model',
'odometer', 'registration_month', 'fuel_type', 'brand',
'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code',
'last_seen']
autos.columns = column_names
print(autos.head()) #to look at the current state of our dataframe
date_crawled name \ 0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD 1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik 2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United 3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... 4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... seller offer_type price abtest vehicle_type registration_year \ 0 privat Angebot $5,000 control bus 2004 1 privat Angebot $8,500 control limousine 1997 2 privat Angebot $8,990 test limousine 2009 3 privat Angebot $4,350 control kleinwagen 2007 4 privat Angebot $1,350 test kombi 2003 gear_box power_ps model odometer registration_month fuel_type \ 0 manuell 158 andere 150,000km 3 lpg 1 automatik 286 7er 150,000km 6 benzin 2 manuell 102 golf 70,000km 7 benzin 3 automatik 71 fortwo 70,000km 6 benzin 4 manuell 0 focus 150,000km 7 benzin brand unrepaired_damage ad_created nr_of_pictures \ 0 peugeot nein 2016-03-26 00:00:00 0 1 bmw nein 2016-04-04 00:00:00 0 2 volkswagen nein 2016-03-26 00:00:00 0 3 smart nein 2016-03-12 00:00:00 0 4 ford nein 2016-04-01 00:00:00 0 postal_code last_seen 0 79588 2016-04-06 06:45:54 1 71034 2016-04-06 14:45:08 2 35394 2016-04-06 20:15:37 3 33729 2016-03-15 03:16:28 4 39218 2016-04-01 14:38:50
All columns were converted to snake case and name of some columns were changed. This was done to ensure uniformity of data thereby, limiting errors.
autos.describe(include = 'all') #we use include=all to get both categorical and numerical columns
autos["price"].unique()
array(['$5,000', '$8,500', '$8,990', ..., '$385', '$22,200', '$16,995'], dtype=object)
From the description above, I observed that the "price" and "odometer" column have special cases and stored as string instead of float/integer. seller and offer_type columns have same values.
autos["price"] = (autos["price"].str.replace("$","").str.replace(",","").astype(float))
print(autos["price"].dtype)
float64
print(autos["odometer"].unique())
['150,000km' '70,000km' '50,000km' '80,000km' '10,000km' '30,000km' '125,000km' '90,000km' '20,000km' '60,000km' '5,000km' '100,000km' '40,000km']
autos["odometer"] = (autos["odometer"].str.replace("km","")
.str.replace(",","").astype(float))
#let's rename our "odometer" column to "odometer_km"
autos.rename({"odometer":"odometer_km"}, axis = 1, inplace=True)
print(autos.columns)
print(autos["odometer_km"].dtype)
Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest', 'vehicle_type', 'registration_year', 'gear_box', 'power_ps', 'model', 'odometer_km', 'registration_month', 'fuel_type', 'brand', 'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code', 'last_seen'], dtype='object') float64
chained to .head() if there are lots of values. Because Series.value_counts() returns a series, we can use Series.sort_index() with ascending= True or False to view the highest and lowest values with their counts (can also chain to head() here).
print(autos["price"].unique().shape)
print(autos["price"].describe())
print(autos["price"].value_counts().sort_index(ascending=True).head())
print(autos["price"].value_counts().sort_index(ascending=False).head())
#to remove outliers
autos = autos[autos["price"].between(1,27322222)]
print(autos["price"].value_counts().sort_index(ascending=False))
(2357,) count 5.000000e+04 mean 9.840044e+03 std 4.811044e+05 min 0.000000e+00 25% 1.100000e+03 50% 2.950000e+03 75% 7.200000e+03 max 1.000000e+08 Name: price, dtype: float64 0.0 1421 1.0 156 2.0 3 3.0 1 5.0 2 Name: price, dtype: int64 99999999.0 1 27322222.0 1 12345678.0 3 11111111.0 2 10000000.0 1 Name: price, dtype: int64 27322222.0 1 12345678.0 3 11111111.0 2 10000000.0 1 3890000.0 1 1300000.0 1 1234566.0 1 999999.0 2 999990.0 1 350000.0 1 345000.0 1 299000.0 1 295000.0 1 265000.0 1 259000.0 1 250000.0 1 220000.0 1 198000.0 1 197000.0 1 194000.0 1 190000.0 1 180000.0 1 175000.0 1 169999.0 1 169000.0 1 163991.0 1 163500.0 1 155000.0 1 151990.0 1 145000.0 1 ... 70.0 10 66.0 1 65.0 5 60.0 9 59.0 1 55.0 2 50.0 49 49.0 4 47.0 1 45.0 4 40.0 6 35.0 1 30.0 7 29.0 1 25.0 5 20.0 4 18.0 1 17.0 3 15.0 2 14.0 1 13.0 2 12.0 3 11.0 2 10.0 7 9.0 1 8.0 1 5.0 2 3.0 1 2.0 3 1.0 156 Name: price, Length: 2355, dtype: int64
There are 2,357 unique prices for the cars
There are 1,421 cars with $0 price and we should consider removing the rows
The price of the most expensive car 100 million which is very high.
We also found some outliers which are prices above $27,322,222 and below $1 which were removed as seen above.
print(autos["odometer_km"].unique().shape)
print(autos["odometer_km"].describe())
print(autos["odometer_km"].value_counts())
(13,) count 48578.000000 mean 125765.675820 std 39793.193823 min 5000.000000 25% 125000.000000 50% 150000.000000 75% 150000.000000 max 150000.000000 Name: odometer_km, dtype: float64 150000.0 31421 125000.0 5058 100000.0 2116 90000.0 1734 80000.0 1415 70000.0 1217 60000.0 1155 50000.0 1014 5000.0 837 40000.0 816 30000.0 780 20000.0 762 10000.0 253 Name: odometer_km, dtype: int64
In this column, it is observed that there are more vehicles with high mileage than low mileage.
Right now, the date_crawled, last_seen, and ad_created columns are all identified as string values by pandas. Because these three columns are represented as strings, we need to convert the data into a numerical representation so we can understand it quantitatively.
print(autos["date_crawled"]
.str[:10].value_counts(normalize = True, dropna=False).sort_index())
print(autos["ad_created"].str[:10].value_counts(normalize = True, dropna=False).sort_index())
print(autos["last_seen"].str[:10].value_counts(normalize = True, dropna=False).sort_index())
#the series.sort_index() is used to rank date in ascending order
2016-03-05 0.025320 2016-03-06 0.014039 2016-03-07 0.036004 2016-03-08 0.033328 2016-03-09 0.033101 2016-03-10 0.032175 2016-03-11 0.032566 2016-03-12 0.036930 2016-03-13 0.015666 2016-03-14 0.036539 2016-03-15 0.034275 2016-03-16 0.029602 2016-03-17 0.031640 2016-03-18 0.012907 2016-03-19 0.034769 2016-03-20 0.037877 2016-03-21 0.037404 2016-03-22 0.032998 2016-03-23 0.032216 2016-03-24 0.029334 2016-03-25 0.031599 2016-03-26 0.032196 2016-03-27 0.031084 2016-03-28 0.034851 2016-03-29 0.034131 2016-03-30 0.033678 2016-03-31 0.031846 2016-04-01 0.033678 2016-04-02 0.035469 2016-04-03 0.038598 2016-04-04 0.036519 2016-04-05 0.013092 2016-04-06 0.003170 2016-04-07 0.001400 Name: date_crawled, dtype: float64 2015-06-11 0.000021 2015-08-10 0.000021 2015-09-09 0.000021 2015-11-10 0.000021 2015-12-05 0.000021 2015-12-30 0.000021 2016-01-03 0.000021 2016-01-07 0.000021 2016-01-10 0.000041 2016-01-13 0.000021 2016-01-14 0.000021 2016-01-16 0.000021 2016-01-22 0.000021 2016-01-27 0.000062 2016-01-29 0.000021 2016-02-01 0.000021 2016-02-02 0.000041 2016-02-05 0.000041 2016-02-07 0.000021 2016-02-08 0.000021 2016-02-09 0.000021 2016-02-11 0.000021 2016-02-12 0.000041 2016-02-14 0.000041 2016-02-16 0.000021 2016-02-17 0.000021 2016-02-18 0.000041 2016-02-19 0.000062 2016-02-20 0.000041 2016-02-21 0.000062 ... 2016-03-09 0.033163 2016-03-10 0.031887 2016-03-11 0.032896 2016-03-12 0.036766 2016-03-13 0.017004 2016-03-14 0.035181 2016-03-15 0.034007 2016-03-16 0.030117 2016-03-17 0.031290 2016-03-18 0.013586 2016-03-19 0.033678 2016-03-20 0.037939 2016-03-21 0.037610 2016-03-22 0.032813 2016-03-23 0.032052 2016-03-24 0.029273 2016-03-25 0.031743 2016-03-26 0.032257 2016-03-27 0.030981 2016-03-28 0.034975 2016-03-29 0.034069 2016-03-30 0.033493 2016-03-31 0.031887 2016-04-01 0.033678 2016-04-02 0.035139 2016-04-03 0.038845 2016-04-04 0.036889 2016-04-05 0.011816 2016-04-06 0.003253 2016-04-07 0.001256 Name: ad_created, Length: 76, dtype: float64 2016-03-05 0.001070 2016-03-06 0.004323 2016-03-07 0.005393 2016-03-08 0.007431 2016-03-09 0.009613 2016-03-10 0.010663 2016-03-11 0.012372 2016-03-12 0.023797 2016-03-13 0.008893 2016-03-14 0.012598 2016-03-15 0.015871 2016-03-16 0.016448 2016-03-17 0.028079 2016-03-18 0.007349 2016-03-19 0.015830 2016-03-20 0.020647 2016-03-21 0.020647 2016-03-22 0.021368 2016-03-23 0.018527 2016-03-24 0.019762 2016-03-25 0.019206 2016-03-26 0.016798 2016-03-27 0.015645 2016-03-28 0.020874 2016-03-29 0.022356 2016-03-30 0.024764 2016-03-31 0.023797 2016-04-01 0.022788 2016-04-02 0.024929 2016-04-03 0.025197 2016-04-04 0.024476 2016-04-05 0.124768 2016-04-06 0.221808 2016-04-07 0.131912 Name: last_seen, dtype: float64
I observed that for the three columns above crawling is uniform and occurs on a daily basis.
print(autos["registration_year"].describe())
count 48578.000000 mean 2004.753119 std 88.632571 min 1000.000000 25% 1999.000000 50% 2004.000000 75% 2008.000000 max 9999.000000 Name: registration_year, dtype: float64
I observed that the minimum value is 1000, before cars were invented and maximum value is 9999, many years into the future.
print(autos["registration_year"].unique())
[2004 1997 2009 2007 2003 2006 1995 1998 2000 2017 2010 1999 1982 1990 2015 2014 1996 1992 2002 2012 2011 2005 2008 1985 2016 1994 1986 2001 2018 2013 1972 1993 1988 1989 1973 1967 1976 4500 1987 1991 1983 1960 1969 1950 1978 1980 1984 1963 1977 1961 1968 1934 1965 1971 1966 1979 1981 1970 1974 1910 1975 5000 4100 2019 1956 9999 6200 1964 1959 1958 1800 1948 1931 1943 1941 1962 1927 1937 1929 1000 1957 1952 1111 1955 1939 8888 1954 1938 2800 5911 1953 1951 4800 1001 9000]
Because a car can't be first registered after the listing was seen, any vehicle with a registration year above 2016 is definitely inaccurate. Determining the earliest valid year is more difficult. Realistically, it could be somewhere in the first few decades of the 1900s.
Let's count the number of listings with cars that fall outside 1900-2016 interval.
autos = autos[autos["registration_year"].between(1900,2016)]
#print(autos)
print(autos["registration_year"].value_counts(normalize=True).head(20))
print(autos["registration_year"].max())
2000 0.067592 2005 0.062880 1999 0.062066 2004 0.057890 2003 0.057804 2006 0.057205 2001 0.056476 2002 0.053243 1998 0.050608 2007 0.048766 2008 0.047439 2009 0.044676 1997 0.041784 2011 0.034760 2010 0.034032 1996 0.029405 2012 0.028056 1995 0.026300 2016 0.026129 2013 0.017198 Name: registration_year, dtype: float64 2016
It appear that most of the cars were registered in 2000
#print(autos["brand"].unique())
print(autos["brand"].value_counts(normalize=True))
volkswagen 0.211278 bmw 0.110040 opel 0.107577 mercedes_benz 0.096441 audi 0.086546 ford 0.069926 renault 0.047139 peugeot 0.029834 fiat 0.025636 seat 0.018269 skoda 0.016405 nissan 0.015270 mazda 0.015185 smart 0.014157 citroen 0.014028 toyota 0.012700 hyundai 0.010023 sonstige_autos 0.009873 volvo 0.009145 mini 0.008760 mitsubishi 0.008224 honda 0.007839 kia 0.007068 alfa_romeo 0.006639 porsche 0.006125 suzuki 0.005932 chevrolet 0.005697 chrysler 0.003512 dacia 0.002634 daihatsu 0.002506 jeep 0.002270 subaru 0.002142 land_rover 0.002099 saab 0.001649 jaguar 0.001563 daewoo 0.001499 trabant 0.001392 rover 0.001328 lancia 0.001071 lada 0.000578 Name: brand, dtype: float64
The brand data contains a list of car brands. From our analysis, Volkswagen has the highest count hence, more popular. I have decided to aggregate the top 5% brand.
brand_count = autos["brand"].value_counts(normalize=True)
aggregate = brand_count[brand_count > 0.05].index
print(aggregate)
Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford'], dtype='object')
mean_brand_price = {}
for brand in aggregate:
brand_only = autos[autos["brand"] == brand]
mean_price = brand_only["price"].mean()
mean_brand_price[brand] =int(mean_price)
print(mean_brand_price)
mbp = pd.Series(mean_brand_price)
{'ford': 7456, 'bmw': 8571, 'opel': 5432, 'audi': 9336, 'volkswagen': 6729, 'mercedes_benz': 8628}
we aggregated across brands to understand mean price. We observed that in the top 6 brands, there's a distinct price gap.
1.Audi, BMW and Mercedes Benz are more expensive 2.Ford and Opel are less expensive 3.Volkswagen is in between
brand_mean_mileage = {}
for brands in aggregate:
mean_brands = autos[autos["brand"] == brands]
mean_mileage = mean_brands["odometer_km"].mean()
brand_mean_mileage[brands]= int(mean_mileage)
print(brand_mean_mileage)
bmm = pd.Series(brand_mean_mileage)
print(bmm)
{'ford': 124243, 'bmw': 132575, 'opel': 129314, 'audi': 129157, 'volkswagen': 128713, 'mercedes_benz': 130788} audi 129157 bmw 132575 ford 124243 mercedes_benz 130788 opel 129314 volkswagen 128713 dtype: int64
To create a dataframe from the first series object using the dataframe constructor:
df2 = pd.DataFrame(bmm, columns = ["mean_mileage"])
print(df2)
df = pd.DataFrame(mbp,columns = ["mean_price"])
df
mean_mileage audi 129157 bmw 132575 ford 124243 mercedes_benz 130788 opel 129314 volkswagen 128713
mean_price | |
---|---|
audi | 9336 |
bmw | 8571 |
ford | 7456 |
mercedes_benz | 8628 |
opel | 5432 |
volkswagen | 6729 |
We will now combine the two series, to come up with a data frame
df["mean_mileage"] = bmm
df
mean_price | mean_mileage | |
---|---|---|
audi | 9336 | 129157 |
bmw | 8571 | 132575 |
ford | 7456 | 124243 |
mercedes_benz | 8628 | 130788 |
opel | 5432 | 129314 |
volkswagen | 6729 | 128713 |
From the analysis above, it is observed that the most expensive cars have the highest mileage