It was a lazy Saturday morning and I was about to turn my Netflix on when my imaginary boss called.
Imaginary Boss: "Hello, Imaginary Subordinate? I want to buy a used car made in Europe."
Me: "...huh?"
Imaginary Boss: "Yea. Any suggestion?"
Me: "Um, mini cooper?"
Imaginary Boss: "Too cute."
Me: "Erm. Audi?"
Imaginary Boss: "Good choice. What else? I want German cars."
Me: "BMW. Mercedes-Benz. Everybody loves them."
Imaginary Boss: "Benz... hmmm.. I like that. Well, why don't you just do a quick search on eBay and analyze any cool cars you could find there. And DON'T intentionally look for a mini cooper."
Me: "Yessir."
So, first, I need to find the German cars. There's a dataset on Kaggle that lists used cars from eBay Kleinanzeigen, a classifieds section of the German eBay website. Sounds neat!
Because I want to keep things simple but life is tough, I have to deal with another version of this dataset, which is a dirtied sample of 50,000 data points from the full dataset.
Let's go!
import pandas as pd
autos = pd.read_csv('autos.csv')
UnicodeDecodeErrorTraceback (most recent call last) pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_tokens() pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_with_dtype() pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._string_convert() pandas/_libs/parsers.pyx in pandas._libs.parsers._string_box_utf8() UnicodeDecodeError: 'utf-8' codec can't decode byte 0xdc in position 23: invalid continuation byte During handling of the above exception, another exception occurred: UnicodeDecodeErrorTraceback (most recent call last) <ipython-input-2-dcac9b32b3b2> in <module>() 1 import pandas as pd ----> 2 autos = pd.read_csv('autos.csv') /dataquest/system/env/python3/lib/python3.4/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skipfooter, skip_footer, doublequote, delim_whitespace, as_recarray, compact_ints, use_unsigned, low_memory, buffer_lines, memory_map, float_precision) 707 skip_blank_lines=skip_blank_lines) 708 --> 709 return _read(filepath_or_buffer, kwds) 710 711 parser_f.__name__ = name /dataquest/system/env/python3/lib/python3.4/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds) 453 454 try: --> 455 data = parser.read(nrows) 456 finally: 457 parser.close() /dataquest/system/env/python3/lib/python3.4/site-packages/pandas/io/parsers.py in read(self, nrows) 1067 raise ValueError('skipfooter not supported for iteration') 1068 -> 1069 ret = self._engine.read(nrows) 1070 1071 if self.options.get('as_recarray'): /dataquest/system/env/python3/lib/python3.4/site-packages/pandas/io/parsers.py in read(self, nrows) 1837 def read(self, nrows=None): 1838 try: -> 1839 data = self._reader.read(nrows) 1840 except StopIteration: 1841 if self._first_chunk: pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.read() pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory() pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_rows() pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_column_data() pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_tokens() pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_with_dtype() pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._string_convert() pandas/_libs/parsers.pyx in pandas._libs.parsers._string_box_utf8() UnicodeDecodeError: 'utf-8' codec can't decode byte 0xdc in position 23: invalid continuation byte
Ooops. Bad start.
Let's try another encoding.
autos = pd.read_csv('autos.csv', encoding="Latin-1")
Yup. Perfect.
Now, mini cooper Benz, where are you?
autos
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
5 | 2016-03-21 13:47:45 | Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto... | privat | Angebot | $7,900 | test | bus | 2006 | automatik | 150 | voyager | 150,000km | 4 | diesel | chrysler | NaN | 2016-03-21 00:00:00 | 0 | 22962 | 2016-04-06 09:45:21 |
6 | 2016-03-20 17:55:21 | VW_Golf_III_GT_Special_Electronic_Green_Metall... | privat | Angebot | $300 | test | limousine | 1995 | manuell | 90 | golf | 150,000km | 8 | benzin | volkswagen | NaN | 2016-03-20 00:00:00 | 0 | 31535 | 2016-03-23 02:48:59 |
7 | 2016-03-16 18:55:19 | Golf_IV_1.9_TDI_90PS | privat | Angebot | $1,990 | control | limousine | 1998 | manuell | 90 | golf | 150,000km | 12 | diesel | volkswagen | nein | 2016-03-16 00:00:00 | 0 | 53474 | 2016-04-07 03:17:32 |
8 | 2016-03-22 16:51:34 | Seat_Arosa | privat | Angebot | $250 | test | NaN | 2000 | manuell | 0 | arosa | 150,000km | 10 | NaN | seat | nein | 2016-03-22 00:00:00 | 0 | 7426 | 2016-03-26 18:18:10 |
9 | 2016-03-16 13:47:02 | Renault_Megane_Scenic_1.6e_RT_Klimaanlage | privat | Angebot | $590 | control | bus | 1997 | manuell | 90 | megane | 150,000km | 7 | benzin | renault | nein | 2016-03-16 00:00:00 | 0 | 15749 | 2016-04-06 10:46:35 |
10 | 2016-03-15 01:41:36 | VW_Golf_Tuning_in_siber/grau | privat | Angebot | $999 | test | NaN | 2017 | manuell | 90 | NaN | 150,000km | 4 | benzin | volkswagen | nein | 2016-03-14 00:00:00 | 0 | 86157 | 2016-04-07 03:16:21 |
11 | 2016-03-16 18:45:34 | Mercedes_A140_Motorschaden | privat | Angebot | $350 | control | NaN | 2000 | NaN | 0 | NaN | 150,000km | 0 | benzin | mercedes_benz | NaN | 2016-03-16 00:00:00 | 0 | 17498 | 2016-03-16 18:45:34 |
12 | 2016-03-31 19:48:22 | Smart_smart_fortwo_coupe_softouch_pure_MHD_Pan... | privat | Angebot | $5,299 | control | kleinwagen | 2010 | automatik | 71 | fortwo | 50,000km | 9 | benzin | smart | nein | 2016-03-31 00:00:00 | 0 | 34590 | 2016-04-06 14:17:52 |
13 | 2016-03-23 10:48:32 | Audi_A3_1.6_tuning | privat | Angebot | $1,350 | control | limousine | 1999 | manuell | 101 | a3 | 150,000km | 11 | benzin | audi | nein | 2016-03-23 00:00:00 | 0 | 12043 | 2016-04-01 14:17:13 |
14 | 2016-03-23 11:50:46 | Renault_Clio_3__Dynamique_1.2__16_V;_viele_Ver... | privat | Angebot | $3,999 | test | kleinwagen | 2007 | manuell | 75 | clio | 150,000km | 9 | benzin | renault | NaN | 2016-03-23 00:00:00 | 0 | 81737 | 2016-04-01 15:46:47 |
15 | 2016-04-01 12:06:20 | Corvette_C3_Coupe_T_Top_Crossfire_Injection | privat | Angebot | $18,900 | test | coupe | 1982 | automatik | 203 | NaN | 80,000km | 6 | benzin | sonstige_autos | nein | 2016-04-01 00:00:00 | 0 | 61276 | 2016-04-02 21:10:48 |
16 | 2016-03-16 14:59:02 | Opel_Vectra_B_Kombi | privat | Angebot | $350 | test | kombi | 1999 | manuell | 101 | vectra | 150,000km | 5 | benzin | opel | nein | 2016-03-16 00:00:00 | 0 | 57299 | 2016-03-18 05:29:37 |
17 | 2016-03-29 11:46:22 | Volkswagen_Scirocco_2_G60 | privat | Angebot | $5,500 | test | coupe | 1990 | manuell | 205 | scirocco | 150,000km | 6 | benzin | volkswagen | nein | 2016-03-29 00:00:00 | 0 | 74821 | 2016-04-05 20:46:26 |
18 | 2016-03-26 19:57:44 | Verkaufen_mein_bmw_e36_320_i_touring | privat | Angebot | $300 | control | bus | 1995 | manuell | 150 | 3er | 150,000km | 0 | benzin | bmw | NaN | 2016-03-26 00:00:00 | 0 | 54329 | 2016-04-02 12:16:41 |
19 | 2016-03-17 13:36:21 | mazda_tribute_2.0_mit_gas_und_tuev_neu_2018 | privat | Angebot | $4,150 | control | suv | 2004 | manuell | 124 | andere | 150,000km | 2 | lpg | mazda | nein | 2016-03-17 00:00:00 | 0 | 40878 | 2016-03-17 14:45:58 |
20 | 2016-03-05 19:57:31 | Audi_A4_Avant_1.9_TDI_*6_Gang*AHK*Klimatronik*... | privat | Angebot | $3,500 | test | kombi | 2003 | manuell | 131 | a4 | 150,000km | 5 | diesel | audi | NaN | 2016-03-05 00:00:00 | 0 | 53913 | 2016-03-07 05:46:46 |
21 | 2016-03-06 19:07:10 | Porsche_911_Carrera_4S_Cabrio | privat | Angebot | $41,500 | test | cabrio | 2004 | manuell | 320 | 911 | 150,000km | 4 | benzin | porsche | nein | 2016-03-06 00:00:00 | 0 | 65428 | 2016-04-05 23:46:19 |
22 | 2016-03-28 20:50:54 | MINI_Cooper_S_Cabrio | privat | Angebot | $25,450 | control | cabrio | 2015 | manuell | 184 | cooper | 10,000km | 1 | benzin | mini | nein | 2016-03-28 00:00:00 | 0 | 44789 | 2016-04-01 06:45:30 |
23 | 2016-03-10 19:55:34 | Peugeot_Boxer_2_2_HDi_120_Ps_9_Sitzer_inkl_Klima | privat | Angebot | $7,999 | control | bus | 2010 | manuell | 120 | NaN | 150,000km | 2 | diesel | peugeot | nein | 2016-03-10 00:00:00 | 0 | 30900 | 2016-03-17 08:45:17 |
24 | 2016-04-03 11:57:02 | BMW_535i_xDrive_Sport_Aut. | privat | Angebot | $48,500 | control | limousine | 2014 | automatik | 306 | 5er | 30,000km | 12 | benzin | bmw | nein | 2016-04-03 00:00:00 | 0 | 22547 | 2016-04-07 13:16:50 |
25 | 2016-03-21 21:56:18 | Ford_escort_kombi_an_bastler_mit_ghia_ausstattung | privat | Angebot | $90 | control | kombi | 1996 | manuell | 116 | NaN | 150,000km | 4 | benzin | ford | ja | 2016-03-21 00:00:00 | 0 | 27574 | 2016-04-01 05:16:49 |
26 | 2016-04-03 22:46:28 | Volkswagen_Polo_Fox | privat | Angebot | $777 | control | kleinwagen | 1992 | manuell | 54 | polo | 125,000km | 2 | benzin | volkswagen | nein | 2016-04-03 00:00:00 | 0 | 38110 | 2016-04-05 23:46:48 |
27 | 2016-03-27 18:45:01 | Hat_einer_Ahnung_mit_Ford_Galaxy_HILFE | privat | Angebot | $0 | control | NaN | 2005 | NaN | 0 | NaN | 150,000km | 0 | NaN | ford | NaN | 2016-03-27 00:00:00 | 0 | 66701 | 2016-03-27 18:45:01 |
28 | 2016-03-19 21:56:19 | MINI_Cooper_D | privat | Angebot | $5,250 | control | kleinwagen | 2007 | manuell | 110 | cooper | 150,000km | 7 | diesel | mini | ja | 2016-03-19 00:00:00 | 0 | 15745 | 2016-04-07 14:58:48 |
29 | 2016-04-02 12:45:44 | Mercedes_Benz_E_320_T_CDI_Avantgarde_DPF7_Sitz... | privat | Angebot | $4,999 | test | kombi | 2004 | automatik | 204 | e_klasse | 150,000km | 10 | diesel | mercedes_benz | nein | 2016-04-02 00:00:00 | 0 | 47638 | 2016-04-02 12:45:44 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
49970 | 2016-03-21 22:47:37 | c4_Grand_Picasso_mit_Automatik_Leder_Navi_Temp... | privat | Angebot | $15,800 | control | bus | 2010 | automatik | 136 | c4 | 60,000km | 4 | diesel | citroen | nein | 2016-03-21 00:00:00 | 0 | 14947 | 2016-04-07 04:17:34 |
49971 | 2016-03-29 14:54:12 | W.Lupo_1.0 | privat | Angebot | $950 | test | kleinwagen | 2001 | manuell | 50 | lupo | 150,000km | 4 | benzin | volkswagen | nein | 2016-03-29 00:00:00 | 0 | 65197 | 2016-03-29 20:41:51 |
49972 | 2016-03-26 22:25:23 | Mercedes_Benz_Vito_115_CDI_Extralang_Aut. | privat | Angebot | $3,300 | control | bus | 2004 | automatik | 150 | vito | 150,000km | 10 | diesel | mercedes_benz | ja | 2016-03-26 00:00:00 | 0 | 65326 | 2016-03-28 11:28:18 |
49973 | 2016-03-27 05:32:39 | Mercedes_Benz_SLK_200_Kompressor | privat | Angebot | $6,000 | control | cabrio | 2004 | manuell | 163 | slk | 150,000km | 11 | benzin | mercedes_benz | nein | 2016-03-27 00:00:00 | 0 | 53567 | 2016-03-27 08:25:24 |
49974 | 2016-03-20 10:52:31 | Golf_1_Cabrio_Tuev_Neu_viele_Extras_alles_eing... | privat | Angebot | $0 | control | cabrio | 1983 | manuell | 70 | golf | 150,000km | 2 | benzin | volkswagen | nein | 2016-03-20 00:00:00 | 0 | 8209 | 2016-03-27 19:48:16 |
49975 | 2016-03-27 20:51:39 | Honda_Jazz_1.3_DSi_i_VTEC_IMA_CVT_Comfort | privat | Angebot | $9,700 | control | kleinwagen | 2012 | automatik | 88 | jazz | 100,000km | 11 | hybrid | honda | nein | 2016-03-27 00:00:00 | 0 | 84385 | 2016-04-05 19:45:34 |
49976 | 2016-03-19 18:56:05 | Audi_80_Avant_2.6_E__Vollausstattung!!_Einziga... | privat | Angebot | $5,900 | test | kombi | 1992 | automatik | 150 | 80 | 150,000km | 12 | benzin | audi | nein | 2016-03-19 00:00:00 | 0 | 36100 | 2016-04-07 06:16:44 |
49977 | 2016-03-31 18:37:18 | Mercedes_Benz_C200_Cdi_W203 | privat | Angebot | $5,500 | control | limousine | 2003 | manuell | 116 | c_klasse | 150,000km | 2 | diesel | mercedes_benz | nein | 2016-03-31 00:00:00 | 0 | 33739 | 2016-04-06 12:16:11 |
49978 | 2016-04-04 10:37:14 | Mercedes_Benz_E_200_Classic | privat | Angebot | $900 | control | limousine | 1996 | automatik | 136 | e_klasse | 150,000km | 9 | benzin | mercedes_benz | ja | 2016-04-04 00:00:00 | 0 | 24405 | 2016-04-06 12:44:20 |
49979 | 2016-03-20 18:38:40 | Volkswagen_Polo_1.6_TDI_Style | privat | Angebot | $11,000 | test | kleinwagen | 2011 | manuell | 90 | polo | 70,000km | 11 | diesel | volkswagen | nein | 2016-03-20 00:00:00 | 0 | 48455 | 2016-04-07 01:45:12 |
49980 | 2016-03-12 10:55:54 | Ford_Escort_Turnier_16V | privat | Angebot | $400 | control | kombi | 1995 | manuell | 105 | escort | 125,000km | 3 | benzin | ford | NaN | 2016-03-12 00:00:00 | 0 | 56218 | 2016-04-06 17:16:49 |
49981 | 2016-03-15 09:38:21 | Opel_Astra_Kombi_mit_Anhaengerkupplung | privat | Angebot | $2,000 | control | kombi | 1998 | manuell | 115 | astra | 150,000km | 12 | benzin | opel | nein | 2016-03-15 00:00:00 | 0 | 86859 | 2016-04-05 17:21:46 |
49982 | 2016-03-29 18:51:08 | Skoda_Fabia_4_Tuerer_Bj:2004__85.000Tkm | privat | Angebot | $1,950 | control | kleinwagen | 2004 | manuell | 0 | fabia | 90,000km | 7 | benzin | skoda | NaN | 2016-03-29 00:00:00 | 0 | 45884 | 2016-03-29 18:51:08 |
49983 | 2016-03-06 12:43:04 | Ford_focus_99 | privat | Angebot | $600 | test | kleinwagen | 1999 | manuell | 101 | focus | 150,000km | 4 | benzin | ford | NaN | 2016-03-06 00:00:00 | 0 | 52477 | 2016-03-09 06:16:08 |
49984 | 2016-03-31 22:48:48 | Student_sucht_ein__Anfaengerauto___ab_2000_BJ_... | privat | Angebot | $0 | test | NaN | 2000 | NaN | 0 | NaN | 150,000km | 0 | NaN | sonstige_autos | NaN | 2016-03-31 00:00:00 | 0 | 12103 | 2016-04-02 19:44:53 |
49985 | 2016-04-02 16:38:23 | Verkaufe_meinen_vw_vento! | privat | Angebot | $1,000 | control | NaN | 1995 | automatik | 0 | NaN | 150,000km | 0 | benzin | volkswagen | NaN | 2016-04-02 00:00:00 | 0 | 30900 | 2016-04-06 15:17:52 |
49986 | 2016-04-04 20:46:02 | Chrysler_300C_3.0_CRD_DPF_Automatik_Voll_Ausst... | privat | Angebot | $15,900 | control | limousine | 2010 | automatik | 218 | 300c | 125,000km | 11 | diesel | chrysler | nein | 2016-04-04 00:00:00 | 0 | 73527 | 2016-04-06 23:16:00 |
49987 | 2016-03-22 20:47:27 | Audi_A3_Limousine_2.0_TDI_DPF_Ambition__NAVI__... | privat | Angebot | $21,990 | control | limousine | 2013 | manuell | 150 | a3 | 50,000km | 11 | diesel | audi | nein | 2016-03-22 00:00:00 | 0 | 94362 | 2016-03-26 22:46:06 |
49988 | 2016-03-28 19:49:51 | BMW_330_Ci | privat | Angebot | $9,550 | control | coupe | 2001 | manuell | 231 | 3er | 150,000km | 10 | benzin | bmw | nein | 2016-03-28 00:00:00 | 0 | 83646 | 2016-04-07 02:17:40 |
49989 | 2016-03-11 19:50:37 | VW_Polo_zum_Ausschlachten_oder_Wiederaufbau | privat | Angebot | $150 | test | kleinwagen | 1997 | manuell | 0 | polo | 150,000km | 5 | benzin | volkswagen | ja | 2016-03-11 00:00:00 | 0 | 21244 | 2016-03-12 10:17:55 |
49990 | 2016-03-21 19:54:19 | Mercedes_Benz_A_200__BlueEFFICIENCY__Urban | privat | Angebot | $17,500 | test | limousine | 2012 | manuell | 156 | a_klasse | 30,000km | 12 | benzin | mercedes_benz | nein | 2016-03-21 00:00:00 | 0 | 58239 | 2016-04-06 22:46:57 |
49991 | 2016-03-06 15:25:19 | Kleinwagen | privat | Angebot | $500 | control | NaN | 2016 | manuell | 0 | twingo | 150,000km | 0 | benzin | renault | NaN | 2016-03-06 00:00:00 | 0 | 61350 | 2016-03-06 18:24:19 |
49992 | 2016-03-10 19:37:38 | Fiat_Grande_Punto_1.4_T_Jet_16V_Sport | privat | Angebot | $4,800 | control | kleinwagen | 2009 | manuell | 120 | andere | 125,000km | 9 | lpg | fiat | nein | 2016-03-10 00:00:00 | 0 | 68642 | 2016-03-13 01:44:51 |
49993 | 2016-03-15 18:47:35 | Audi_A3__1_8l__Silber;_schoenes_Fahrzeug | privat | Angebot | $1,650 | control | kleinwagen | 1997 | manuell | 0 | NaN | 150,000km | 7 | benzin | audi | NaN | 2016-03-15 00:00:00 | 0 | 65203 | 2016-04-06 19:46:53 |
49994 | 2016-03-22 17:36:42 | Audi_A6__S6__Avant_4.2_quattro_eventuell_Tausc... | privat | Angebot | $5,000 | control | kombi | 2001 | automatik | 299 | a6 | 150,000km | 1 | benzin | audi | nein | 2016-03-22 00:00:00 | 0 | 46537 | 2016-04-06 08:16:39 |
49995 | 2016-03-27 14:38:19 | Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon | privat | Angebot | $24,900 | control | limousine | 2011 | automatik | 239 | q5 | 100,000km | 1 | diesel | audi | nein | 2016-03-27 00:00:00 | 0 | 82131 | 2016-04-01 13:47:40 |
49996 | 2016-03-28 10:50:25 | Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... | privat | Angebot | $1,980 | control | cabrio | 1996 | manuell | 75 | astra | 150,000km | 5 | benzin | opel | nein | 2016-03-28 00:00:00 | 0 | 44807 | 2016-04-02 14:18:02 |
49997 | 2016-04-02 14:44:48 | Fiat_500_C_1.2_Dualogic_Lounge | privat | Angebot | $13,200 | test | cabrio | 2014 | automatik | 69 | 500 | 5,000km | 11 | benzin | fiat | nein | 2016-04-02 00:00:00 | 0 | 73430 | 2016-04-04 11:47:27 |
49998 | 2016-03-08 19:25:42 | Audi_A3_2.0_TDI_Sportback_Ambition | privat | Angebot | $22,900 | control | kombi | 2013 | manuell | 150 | a3 | 40,000km | 11 | diesel | audi | nein | 2016-03-08 00:00:00 | 0 | 35683 | 2016-04-05 16:45:07 |
49999 | 2016-03-14 00:42:12 | Opel_Vectra_1.6_16V | privat | Angebot | $1,250 | control | limousine | 1996 | manuell | 101 | vectra | 150,000km | 1 | benzin | opel | nein | 2016-03-13 00:00:00 | 0 | 45897 | 2016-04-06 21:18:48 |
50000 rows × 20 columns
Ooh look, my mini cooper is waiting! Sorry, Imaginary Boss, I just can't skip irresistible cute stuff. But sure, no worries, I'm a professional, I'll find your Benz rightaway.
OK, now let's familiarize ourselves this data set a little bit. I'm not a big car fan so this is going to be a tough work.
autos.info()
autos.head()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): dateCrawled 50000 non-null object name 50000 non-null object seller 50000 non-null object offerType 50000 non-null object price 50000 non-null object abtest 50000 non-null object vehicleType 44905 non-null object yearOfRegistration 50000 non-null int64 gearbox 47320 non-null object powerPS 50000 non-null int64 model 47242 non-null object odometer 50000 non-null object monthOfRegistration 50000 non-null int64 fuelType 45518 non-null object brand 50000 non-null object notRepairedDamage 40171 non-null object dateCreated 50000 non-null object nrOfPictures 50000 non-null int64 postalCode 50000 non-null int64 lastSeen 50000 non-null object dtypes: int64(5), object(15) memory usage: 7.6+ MB
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
autos.columns
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model', 'odometer', 'monthOfRegistration', 'fuelType', 'brand', 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object')
Oh boy. I don't even know that transmission type is called 'gearbox'.
Anyway, let's change the camelCases into snake_cases to make it more Python-y, then change the column names so my imaginary boss won't get confused if he wants to see the raw data, and gather every information that I can get from this dataset.
autos.columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
'odometer', 'registration_month', 'fuel_type', 'brand',
'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code',
'last_seen']
autos.head()
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
autos.describe(include='all')
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-03-21 16:37:21 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
What's that 'nr_of_pictures' with NaN and 0.0s? And 'seller' and 'offer_type'? Looks suspicious.
autos["nr_of_pictures"].value_counts()
0 50000 Name: nr_of_pictures, dtype: int64
autos["seller"].value_counts()
privat 49999 gewerblich 1 Name: seller, dtype: int64
autos["offer_type"].value_counts()
Angebot 49999 Gesuch 1 Name: offer_type, dtype: int64
Well, I don't need those 3 columns since the value are 0s and mostly 1s.
autos = autos.drop(["nr_of_pictures", "seller", "offer_type"], axis=1)
Let's focus on the numerical data.
There's no numeric values, you said? All are texts because of certain symbols and characters? You're right LOL
How about I just turn them into numeric, especially on the 'price' and 'odometer' columns which consist of a dollar sign, comma, and "km" characters. I'll also change the column names to reflect the unit after I removed them.
autos["price"] = (autos["price"]
.str.replace("$","")
.str.replace(",","")
.astype(int)
)
autos.rename({"price": "price_usd"}, axis=1, inplace=True)
autos["price_usd"].head()
0 5000 1 8500 2 8990 3 4350 4 1350 Name: price_usd, dtype: int64
autos["odometer"] = (autos["odometer"]
.str.replace("km","")
.str.replace(",","")
.astype(int)
)
autos.rename({"odometer": "odometer_km"}, axis=1, inplace=True)
autos["odometer_km"].head()
0 150000 1 150000 2 70000 3 70000 4 150000 Name: odometer_km, dtype: int64
OK. So, what's interesting in 'price' and 'odometer'?
For sure, I want:
And let's see if there's any nonsensical price or mileage in our data set.
print(autos["odometer_km"].unique().shape)
print(autos["odometer_km"].describe())
autos["odometer_km"].value_counts().sort_index(ascending=False)
(13,) count 50000.000000 mean 125732.700000 std 40042.211706 min 5000.000000 25% 125000.000000 50% 150000.000000 75% 150000.000000 max 150000.000000 Name: odometer_km, dtype: float64
150000 32424 125000 5170 100000 2169 90000 1757 80000 1436 70000 1230 60000 1164 50000 1027 40000 819 30000 789 20000 784 10000 264 5000 967 Name: odometer_km, dtype: int64
It's interesting to see the 'value counts' decline are aligned with the 'odometer km'... except for 5000 km, in which the number somehow spiked.
print(autos["price_usd"].unique().shape)
print(autos["price_usd"].describe())
print(autos["price_usd"].value_counts().sort_index(ascending=False).head(20))
autos["price_usd"].value_counts().sort_index(ascending=False).tail(20)
(2357,) count 5.000000e+04 mean 9.840044e+03 std 4.811044e+05 min 0.000000e+00 25% 1.100000e+03 50% 2.950000e+03 75% 7.200000e+03 max 1.000000e+08 Name: price_usd, dtype: float64 99999999 1 27322222 1 12345678 3 11111111 2 10000000 1 3890000 1 1300000 1 1234566 1 999999 2 999990 1 350000 1 345000 1 299000 1 295000 1 265000 1 259000 1 250000 1 220000 1 198000 1 197000 1 Name: price_usd, dtype: int64
35 1 30 7 29 1 25 5 20 4 18 1 17 3 15 2 14 1 13 2 12 3 11 2 10 7 9 1 8 1 5 2 3 1 2 3 1 156 0 1421 Name: price_usd, dtype: int64
There are 1421 cars with 0 price. I wish this number is correct, but stop daydreaming and just remove these data.
I don't shop that often at eBay, but it's an auction site so it makes sense to have $1 prices on the low end. But for the high end prices, I'll set the cutoff at $350,000 because the price seems tripled afterwards.
Let's see the data between $1 - $351,000.
autos = autos[autos["price_usd"].between(1,351000)]
autos["price_usd"].describe()
count 48565.000000 mean 5888.935591 std 9059.854754 min 1.000000 25% 1200.000000 50% 3000.000000 75% 7490.000000 max 350000.000000 Name: price_usd, dtype: float64
Other than prices, does time play an interesting role here?
Let's explore it and see if we can find something interesting.
autos[['date_crawled', 'ad_created', 'last_seen']].head()
date_crawled | ad_created | last_seen | |
---|---|---|---|
0 | 2016-03-26 17:47:46 | 2016-03-26 00:00:00 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | 2016-04-04 00:00:00 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | 2016-03-26 00:00:00 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | 2016-03-12 00:00:00 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | 2016-04-01 00:00:00 | 2016-04-01 14:38:50 |
print(autos['date_crawled']
.str[:10]
.value_counts(normalize=True, dropna=False)
.sort_index()
)
2016-03-05 0.025327 2016-03-06 0.014043 2016-03-07 0.036014 2016-03-08 0.033296 2016-03-09 0.033090 2016-03-10 0.032184 2016-03-11 0.032575 2016-03-12 0.036920 2016-03-13 0.015670 2016-03-14 0.036549 2016-03-15 0.034284 2016-03-16 0.029610 2016-03-17 0.031628 2016-03-18 0.012911 2016-03-19 0.034778 2016-03-20 0.037887 2016-03-21 0.037373 2016-03-22 0.032987 2016-03-23 0.032225 2016-03-24 0.029342 2016-03-25 0.031607 2016-03-26 0.032204 2016-03-27 0.031092 2016-03-28 0.034860 2016-03-29 0.034099 2016-03-30 0.033687 2016-03-31 0.031834 2016-04-01 0.033687 2016-04-02 0.035478 2016-04-03 0.038608 2016-04-04 0.036487 2016-04-05 0.013096 2016-04-06 0.003171 2016-04-07 0.001400 Name: date_crawled, dtype: float64
I don't see much interesting thing here. It's like the there was a daily crawl but that's it, the percentage are quite inconsistent.
print(autos['ad_created']
.str[:10]
.value_counts(normalize=True, dropna=False)
.sort_index()
)
2015-06-11 0.000021 2015-08-10 0.000021 2015-09-09 0.000021 2015-11-10 0.000021 2015-12-05 0.000021 2015-12-30 0.000021 2016-01-03 0.000021 2016-01-07 0.000021 2016-01-10 0.000041 2016-01-13 0.000021 2016-01-14 0.000021 2016-01-16 0.000021 2016-01-22 0.000021 2016-01-27 0.000062 2016-01-29 0.000021 2016-02-01 0.000021 2016-02-02 0.000041 2016-02-05 0.000041 2016-02-07 0.000021 2016-02-08 0.000021 2016-02-09 0.000021 2016-02-11 0.000021 2016-02-12 0.000041 2016-02-14 0.000041 2016-02-16 0.000021 2016-02-17 0.000021 2016-02-18 0.000041 2016-02-19 0.000062 2016-02-20 0.000041 2016-02-21 0.000062 ... 2016-03-09 0.033151 2016-03-10 0.031895 2016-03-11 0.032904 2016-03-12 0.036755 2016-03-13 0.017008 2016-03-14 0.035190 2016-03-15 0.034016 2016-03-16 0.030125 2016-03-17 0.031278 2016-03-18 0.013590 2016-03-19 0.033687 2016-03-20 0.037949 2016-03-21 0.037579 2016-03-22 0.032801 2016-03-23 0.032060 2016-03-24 0.029280 2016-03-25 0.031751 2016-03-26 0.032266 2016-03-27 0.030989 2016-03-28 0.034984 2016-03-29 0.034037 2016-03-30 0.033501 2016-03-31 0.031875 2016-04-01 0.033687 2016-04-02 0.035149 2016-04-03 0.038855 2016-04-04 0.036858 2016-04-05 0.011819 2016-04-06 0.003253 2016-04-07 0.001256 Name: ad_created, Length: 76, dtype: float64
Ooookay, there's like a huge percentage gap there between Feburary and March 2016 in a period of just 2 weeks. Interesting!
print(autos['last_seen']
.str[:10]
.value_counts(normalize=True, dropna=False)
.sort_index()
)
2016-03-05 0.001071 2016-03-06 0.004324 2016-03-07 0.005395 2016-03-08 0.007413 2016-03-09 0.009595 2016-03-10 0.010666 2016-03-11 0.012375 2016-03-12 0.023783 2016-03-13 0.008895 2016-03-14 0.012602 2016-03-15 0.015876 2016-03-16 0.016452 2016-03-17 0.028086 2016-03-18 0.007351 2016-03-19 0.015834 2016-03-20 0.020653 2016-03-21 0.020632 2016-03-22 0.021373 2016-03-23 0.018532 2016-03-24 0.019767 2016-03-25 0.019211 2016-03-26 0.016802 2016-03-27 0.015649 2016-03-28 0.020859 2016-03-29 0.022341 2016-03-30 0.024771 2016-03-31 0.023783 2016-04-01 0.022794 2016-04-02 0.024915 2016-04-03 0.025203 2016-04-04 0.024483 2016-04-05 0.124761 2016-04-06 0.221806 2016-04-07 0.131947 Name: last_seen, dtype: float64
The percentage spikes on the last three days, but I'm not sure why.
autos["registration_year"].describe()
count 48565.000000 mean 2004.755421 std 88.643887 min 1000.000000 25% 1999.000000 50% 2004.000000 75% 2008.000000 max 9999.000000 Name: registration_year, dtype: float64
Oh... I can see some people registered their cars on year 1000 and even 9999. I, for sure, am going to buy cars with any time-traveling features LOL
Again, let's not getting lost in time here, and fix those nonsensical data.
Let's pick a good cutoff point of a realistic and valid year for car registration. How about... 1886 for the minimum value? It's near the birth year of Mercedes-Benz car according to Wikipedia. My imaginary boss would be happy with this. And above 2016 for the maximum value, obviously.
autos = autos[autos["registration_year"].between(1886,2016)]
autos["registration_year"].value_counts(normalize=True).head(10)
2000 0.067608 2005 0.062895 1999 0.062060 2004 0.057904 2003 0.057818 2006 0.057197 2001 0.056468 2002 0.053255 1998 0.050620 2007 0.048778 Name: registration_year, dtype: float64
autos["brand"]. value_counts(normalize=True)
volkswagen 0.211264 bmw 0.110045 opel 0.107581 mercedes_benz 0.096463 audi 0.086566 ford 0.069900 renault 0.047150 peugeot 0.029841 fiat 0.025642 seat 0.018273 skoda 0.016409 nissan 0.015274 mazda 0.015188 smart 0.014160 citroen 0.014010 toyota 0.012703 hyundai 0.010025 sonstige_autos 0.009811 volvo 0.009147 mini 0.008762 mitsubishi 0.008226 honda 0.007840 kia 0.007069 alfa_romeo 0.006641 porsche 0.006127 suzuki 0.005934 chevrolet 0.005698 chrysler 0.003513 dacia 0.002635 daihatsu 0.002506 jeep 0.002271 subaru 0.002142 land_rover 0.002099 saab 0.001649 jaguar 0.001564 daewoo 0.001500 trabant 0.001392 rover 0.001328 lancia 0.001071 lada 0.000578 Name: brand, dtype: float64
Mini spotted. Yeah!
OK, focus. I saw 'mercedes_benz', so I'm good. Seems like German cars dominated the top brands.
Also, this is the first time I heard about these car brands: 'skoda', 'sonstige_autos', 'dacia', 'saab', 'trabant', 'lancia', and 'lada'. So I decided to google them and... I'm totally missing out. These cars are super cool! Check out this pretty lava blue Skoda Superb!!
Well, let's analyze some cars including my cute Mini, that Mercedes-Benz for my imaginary boss, and Skoda my new favorite.
brand_counts = autos["brand"].value_counts(normalize=True)
cool_cars = brand_counts[brand_counts > .0085].index
print(cool_cars)
Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford', 'renault', 'peugeot', 'fiat', 'seat', 'skoda', 'nissan', 'mazda', 'smart', 'citroen', 'toyota', 'hyundai', 'sonstige_autos', 'volvo', 'mini'], dtype='object')
brand_mean_prices = {}
for brand in cool_cars:
brand_only = autos[autos["brand"] == brand]
mean_price = brand_only["price_usd"].mean()
brand_mean_prices[brand] = int(mean_price)
brand_mean_prices
{'audi': 9336, 'bmw': 8332, 'citroen': 3779, 'fiat': 2813, 'ford': 3749, 'hyundai': 5365, 'mazda': 4112, 'mercedes_benz': 8628, 'mini': 10613, 'nissan': 4743, 'opel': 2975, 'peugeot': 3094, 'renault': 2474, 'seat': 4397, 'skoda': 6368, 'smart': 3580, 'sonstige_autos': 12338, 'toyota': 5167, 'volkswagen': 5402, 'volvo': 4946}
My Imaginary Boss is definitely going to kill me if I told him my cute Mini is, like, more expensive than his target Mercedes-Benz.
Anyway, let's also find out the mean mileage for these cars.
brand_mean_mileage = {}
for brand in cool_cars:
brand_only = autos[autos["brand"] == brand]
mean_mileage = brand_only["odometer_km"].mean()
brand_mean_mileage[brand] = int(mean_mileage)
mean_mileage = pd.Series(brand_mean_mileage).sort_values(ascending=False)
mean_prices = pd.Series(brand_mean_prices).sort_values(ascending=False)
cool_cars_info = pd.DataFrame(mean_mileage, columns=['mean_mileage'])
cool_cars_info["mean_prices"] = mean_prices
cool_cars_info
mean_mileage | mean_prices | |
---|---|---|
volvo | 138067 | 4946 |
bmw | 132572 | 8332 |
mercedes_benz | 130788 | 8628 |
opel | 129310 | 2975 |
audi | 129157 | 9336 |
volkswagen | 128707 | 5402 |
renault | 128071 | 2474 |
peugeot | 127153 | 3094 |
mazda | 124464 | 4112 |
ford | 124266 | 3749 |
seat | 121131 | 4397 |
citroen | 119694 | 3779 |
nissan | 118330 | 4743 |
fiat | 117121 | 2813 |
toyota | 115944 | 5167 |
skoda | 110848 | 6368 |
hyundai | 106442 | 5365 |
smart | 99326 | 3580 |
sonstige_autos | 89956 | 12338 |
mini | 88105 | 10613 |
Seems like the top brand cars with higher prices have more mileages (except my Mini, Y U SO EXPENSIVE T_T).
I suppose I'll just give the average price and mileage of the Benz to my imaginary boss, but since he's monolingual who happened to be occasionally curious with raw data, let's translate the German words in this data into English, just to be safe.
autos["vehicle_type"].unique()
array(['bus', 'limousine', 'kleinwagen', 'kombi', nan, 'coupe', 'suv', 'cabrio', 'andere'], dtype=object)
autos["gearbox"].unique()
array(['manuell', 'automatik', nan], dtype=object)
autos["fuel_type"].unique()
array(['lpg', 'benzin', 'diesel', nan, 'cng', 'hybrid', 'elektro', 'andere'], dtype=object)
autos["unrepaired_damage"].unique()
array(['nein', nan, 'ja'], dtype=object)
autos['vehicle_type'] = autos['vehicle_type'].str.replace('kleinwagen', 'small car').str.replace('kombi', 'combi').str.replace('cabrio', 'convertible').str.replace('andere', 'other')
autos['gearbox'] = autos['gearbox'].str.replace('manuell', 'manual').str.replace('automatik', 'automatic')
autos['fuel_type'] = autos['fuel_type'].str.replace('benzin', 'petrol').str.replace('elektro', 'electric').str.replace('andere', 'other')
autos['unrepaired_damage'] = autos['unrepaired_damage'].str.replace('ja', 'yes').str.replace('nein', 'no')
autos.head()
date_crawled | name | price_usd | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | 5000 | control | bus | 2004 | manual | 158 | andere | 150000 | 3 | lpg | peugeot | no | 2016-03-26 00:00:00 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | 8500 | control | limousine | 1997 | automatic | 286 | 7er | 150000 | 6 | petrol | bmw | no | 2016-04-04 00:00:00 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | 8990 | test | limousine | 2009 | manual | 102 | golf | 70000 | 7 | petrol | volkswagen | no | 2016-03-26 00:00:00 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | 4350 | control | small car | 2007 | automatic | 71 | fortwo | 70000 | 6 | petrol | smart | no | 2016-03-12 00:00:00 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | 1350 | test | combi | 2003 | manual | 0 | focus | 150000 | 7 | petrol | ford | no | 2016-04-01 00:00:00 | 39218 | 2016-04-01 14:38:50 |
Cool. Now I know some words in German!
Just for the fun of it, let's also convert the dates to be uniform numeric data, so "2016-03-21"
becomes the integer 20160321
.
autos["date_crawled"] = autos["date_crawled"].str[:10].str.replace("-","").astype(int)
autos["ad_created"] = autos["ad_created"].str[:10].str.replace("-","").astype(int)
autos["last_seen"] = autos["last_seen"].str[:10].str.replace("-","").astype(int)
autos.head()
date_crawled | name | price_usd | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 20160326 | Peugeot_807_160_NAVTECH_ON_BOARD | 5000 | control | bus | 2004 | manual | 158 | andere | 150000 | 3 | lpg | peugeot | no | 20160326 | 79588 | 20160406 |
1 | 20160404 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | 8500 | control | limousine | 1997 | automatic | 286 | 7er | 150000 | 6 | petrol | bmw | no | 20160404 | 71034 | 20160406 |
2 | 20160326 | Volkswagen_Golf_1.6_United | 8990 | test | limousine | 2009 | manual | 102 | golf | 70000 | 7 | petrol | volkswagen | no | 20160326 | 35394 | 20160406 |
3 | 20160312 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | 4350 | control | small car | 2007 | automatic | 71 | fortwo | 70000 | 6 | petrol | smart | no | 20160312 | 33729 | 20160315 |
4 | 20160401 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | 1350 | test | combi | 2003 | manual | 0 | focus | 150000 | 7 | petrol | ford | no | 20160401 | 39218 | 20160401 |
I can also see a combination of brand name and model type separated with underscores in the name
column.
Let's see what's the common combos for these cars, especially the Benz.
brand_model_combo = autos.groupby(["brand","model"]).count()
brand_model_combo
date_crawled | name | price_usd | abtest | vehicle_type | registration_year | gearbox | power_ps | odometer_km | registration_month | fuel_type | unrepaired_damage | ad_created | postal_code | last_seen | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
brand | model | |||||||||||||||
alfa_romeo | 145 | 4 | 4 | 4 | 4 | 2 | 4 | 2 | 4 | 4 | 4 | 2 | 2 | 4 | 4 | 4 |
147 | 80 | 80 | 80 | 80 | 73 | 80 | 80 | 80 | 80 | 80 | 76 | 68 | 80 | 80 | 80 | |
156 | 88 | 88 | 88 | 88 | 84 | 88 | 86 | 88 | 88 | 88 | 84 | 78 | 88 | 88 | 88 | |
159 | 32 | 32 | 32 | 32 | 31 | 32 | 32 | 32 | 32 | 32 | 31 | 28 | 32 | 32 | 32 | |
andere | 60 | 60 | 60 | 60 | 56 | 60 | 57 | 60 | 60 | 60 | 54 | 52 | 60 | 60 | 60 | |
spider | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 27 | 32 | 32 | 32 | |
audi | 100 | 57 | 57 | 57 | 57 | 55 | 57 | 55 | 57 | 57 | 57 | 52 | 46 | 57 | 57 | 57 |
200 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | |
80 | 198 | 198 | 198 | 198 | 179 | 198 | 188 | 198 | 198 | 198 | 179 | 117 | 198 | 198 | 198 | |
90 | 8 | 8 | 8 | 8 | 7 | 8 | 8 | 8 | 8 | 8 | 7 | 5 | 8 | 8 | 8 | |
a1 | 82 | 82 | 82 | 82 | 82 | 82 | 81 | 82 | 82 | 82 | 82 | 81 | 82 | 82 | 82 | |
a2 | 42 | 42 | 42 | 42 | 40 | 42 | 40 | 42 | 42 | 42 | 41 | 36 | 42 | 42 | 42 | |
a3 | 825 | 825 | 825 | 825 | 779 | 825 | 787 | 825 | 825 | 825 | 797 | 696 | 825 | 825 | 825 | |
a4 | 1231 | 1231 | 1231 | 1231 | 1206 | 1231 | 1178 | 1231 | 1231 | 1231 | 1163 | 1023 | 1231 | 1231 | 1231 | |
a5 | 126 | 126 | 126 | 126 | 126 | 126 | 126 | 126 | 126 | 126 | 125 | 117 | 126 | 126 | 126 | |
a6 | 797 | 797 | 797 | 797 | 778 | 797 | 779 | 797 | 797 | 797 | 767 | 699 | 797 | 797 | 797 | |
a8 | 69 | 69 | 69 | 69 | 66 | 69 | 69 | 69 | 69 | 69 | 66 | 61 | 69 | 69 | 69 | |
andere | 216 | 216 | 216 | 216 | 214 | 216 | 215 | 216 | 216 | 216 | 216 | 207 | 216 | 216 | 216 | |
q3 | 28 | 28 | 28 | 28 | 28 | 28 | 28 | 28 | 28 | 28 | 28 | 27 | 28 | 28 | 28 | |
q5 | 62 | 62 | 62 | 62 | 62 | 62 | 62 | 62 | 62 | 62 | 62 | 56 | 62 | 62 | 62 | |
q7 | 40 | 40 | 40 | 40 | 40 | 40 | 40 | 40 | 40 | 40 | 40 | 40 | 40 | 40 | 40 | |
tt | 144 | 144 | 144 | 144 | 140 | 144 | 142 | 144 | 144 | 144 | 141 | 129 | 144 | 144 | 144 | |
bmw | 1er | 521 | 521 | 521 | 521 | 502 | 521 | 509 | 521 | 521 | 521 | 510 | 481 | 521 | 521 | 521 |
3er | 2615 | 2615 | 2615 | 2615 | 2526 | 2615 | 2541 | 2615 | 2615 | 2615 | 2484 | 2207 | 2615 | 2615 | 2615 | |
5er | 1132 | 1132 | 1132 | 1132 | 1103 | 1132 | 1116 | 1132 | 1132 | 1132 | 1082 | 953 | 1132 | 1132 | 1132 | |
6er | 30 | 30 | 30 | 30 | 29 | 30 | 27 | 30 | 30 | 30 | 28 | 23 | 30 | 30 | 30 | |
7er | 126 | 126 | 126 | 126 | 123 | 126 | 125 | 126 | 126 | 126 | 121 | 104 | 126 | 126 | 126 | |
andere | 38 | 38 | 38 | 38 | 34 | 38 | 34 | 38 | 38 | 38 | 31 | 33 | 38 | 38 | 38 | |
i3 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | |
m_reihe | 43 | 43 | 43 | 43 | 43 | 43 | 42 | 43 | 43 | 43 | 42 | 42 | 43 | 43 | 43 | |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
volkswagen | andere | 96 | 96 | 96 | 96 | 90 | 96 | 88 | 96 | 96 | 96 | 86 | 79 | 96 | 96 | 96 |
beetle | 123 | 123 | 123 | 123 | 119 | 123 | 120 | 123 | 123 | 123 | 117 | 117 | 123 | 123 | 123 | |
bora | 100 | 100 | 100 | 100 | 97 | 100 | 100 | 100 | 100 | 100 | 95 | 84 | 100 | 100 | 100 | |
caddy | 204 | 204 | 204 | 204 | 184 | 204 | 197 | 204 | 204 | 204 | 195 | 173 | 204 | 204 | 204 | |
cc | 18 | 18 | 18 | 18 | 17 | 18 | 17 | 18 | 18 | 18 | 17 | 17 | 18 | 18 | 18 | |
eos | 66 | 66 | 66 | 66 | 65 | 66 | 65 | 66 | 66 | 66 | 65 | 63 | 66 | 66 | 66 | |
fox | 82 | 82 | 82 | 82 | 79 | 82 | 79 | 82 | 82 | 82 | 81 | 74 | 82 | 82 | 82 | |
golf | 3707 | 3707 | 3707 | 3707 | 3414 | 3707 | 3545 | 3707 | 3707 | 3707 | 3464 | 2983 | 3707 | 3707 | 3707 | |
jetta | 38 | 38 | 38 | 38 | 33 | 38 | 36 | 38 | 38 | 38 | 34 | 26 | 38 | 38 | 38 | |
kaefer | 57 | 57 | 57 | 57 | 49 | 57 | 51 | 57 | 57 | 57 | 52 | 36 | 57 | 57 | 57 | |
lupo | 322 | 322 | 322 | 322 | 298 | 322 | 307 | 322 | 322 | 322 | 298 | 253 | 322 | 322 | 322 | |
passat | 1349 | 1349 | 1349 | 1349 | 1306 | 1349 | 1307 | 1349 | 1349 | 1349 | 1262 | 1088 | 1349 | 1349 | 1349 | |
phaeton | 31 | 31 | 31 | 31 | 31 | 31 | 31 | 31 | 31 | 31 | 31 | 27 | 31 | 31 | 31 | |
polo | 1609 | 1609 | 1609 | 1609 | 1484 | 1609 | 1539 | 1609 | 1609 | 1609 | 1465 | 1260 | 1609 | 1609 | 1609 | |
scirocco | 85 | 85 | 85 | 85 | 81 | 85 | 83 | 85 | 85 | 85 | 77 | 77 | 85 | 85 | 85 | |
sharan | 222 | 222 | 222 | 222 | 210 | 222 | 194 | 222 | 222 | 222 | 209 | 179 | 222 | 222 | 222 | |
tiguan | 118 | 118 | 118 | 118 | 114 | 118 | 111 | 118 | 118 | 118 | 116 | 110 | 118 | 118 | 118 | |
touareg | 94 | 94 | 94 | 94 | 92 | 94 | 90 | 94 | 94 | 94 | 90 | 86 | 94 | 94 | 94 | |
touran | 433 | 433 | 433 | 433 | 415 | 433 | 403 | 433 | 433 | 433 | 421 | 358 | 433 | 433 | 433 | |
transporter | 674 | 674 | 674 | 674 | 650 | 674 | 638 | 674 | 674 | 674 | 633 | 562 | 674 | 674 | 674 | |
up | 51 | 51 | 51 | 51 | 49 | 51 | 50 | 51 | 51 | 51 | 49 | 46 | 51 | 51 | 51 | |
volvo | 850 | 28 | 28 | 28 | 28 | 27 | 28 | 27 | 28 | 28 | 28 | 28 | 23 | 28 | 28 | 28 |
andere | 82 | 82 | 82 | 82 | 75 | 82 | 77 | 82 | 82 | 82 | 78 | 71 | 82 | 82 | 82 | |
c_reihe | 28 | 28 | 28 | 28 | 27 | 28 | 27 | 28 | 28 | 28 | 27 | 27 | 28 | 28 | 28 | |
s60 | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 15 | 17 | 17 | 17 | |
v40 | 87 | 87 | 87 | 87 | 84 | 87 | 86 | 87 | 87 | 87 | 83 | 74 | 87 | 87 | 87 | |
v50 | 29 | 29 | 29 | 29 | 28 | 29 | 29 | 29 | 29 | 29 | 28 | 25 | 29 | 29 | 29 | |
v60 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | |
v70 | 91 | 91 | 91 | 91 | 87 | 91 | 87 | 91 | 91 | 91 | 84 | 76 | 91 | 91 | 91 | |
xc_reihe | 48 | 48 | 48 | 48 | 48 | 48 | 48 | 48 | 48 | 48 | 48 | 48 | 48 | 48 | 48 |
290 rows × 15 columns
common_combo = brand_model_combo["name"].sort_values(ascending=False)
common_combo
brand model volkswagen golf 3707 bmw 3er 2615 volkswagen polo 1609 opel corsa 1592 volkswagen passat 1349 opel astra 1348 audi a4 1231 mercedes_benz c_klasse 1136 bmw 5er 1132 mercedes_benz e_klasse 958 audi a3 825 a6 797 ford focus 762 fiesta 722 volkswagen transporter 674 renault twingo 615 peugeot 2_reihe 600 smart fortwo 550 opel vectra 544 mercedes_benz a_klasse 539 bmw 1er 521 ford mondeo 479 renault clio 473 mercedes_benz andere 439 volkswagen touran 433 fiat punto 415 opel zafira 394 ford ka 349 renault megane 335 seat ibiza 328 ... land_rover range_rover 9 daewoo andere 9 renault r19 9 audi 90 8 daewoo nubira 8 trabant andere 8 smart andere 7 lada andere 7 chrysler crossfire 6 seat exeo 6 volkswagen amarok 6 dacia lodgy 5 saab 9000 5 lancia delta 5 land_rover range_rover_evoque 5 fiat croma 4 daihatsu materia 4 land_rover andere 4 alfa_romeo 145 4 daihatsu charade 3 lada samara 3 volvo v60 3 dacia andere 2 rover freelander 2 lancia kappa 2 bmw i3 1 rover rangerover 1 ford b_max 1 audi 200 1 rover discovery 1 Name: name, Length: 290, dtype: int64
If my imaginary boss is in the mood to be in the #TeamUnique, he better avoid the c_klass
Benz!
Even though my car knowledge is extremely lacking, my common sense told me that higher mileage means cheaper price.
Let's see whether my common sense is correct.
odometer_group = autos.groupby("odometer_km").count()
odometer_group
date_crawled | name | price_usd | abtest | vehicle_type | registration_year | gearbox | power_ps | model | registration_month | fuel_type | brand | unrepaired_damage | ad_created | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
odometer_km | ||||||||||||||||
5000 | 785 | 785 | 785 | 785 | 586 | 785 | 593 | 785 | 667 | 785 | 567 | 785 | 438 | 785 | 785 | 785 |
10000 | 241 | 241 | 241 | 241 | 229 | 241 | 221 | 241 | 214 | 241 | 222 | 241 | 206 | 241 | 241 | 241 |
20000 | 742 | 742 | 742 | 742 | 714 | 742 | 712 | 742 | 696 | 742 | 704 | 742 | 657 | 742 | 742 | 742 |
30000 | 760 | 760 | 760 | 760 | 738 | 760 | 733 | 760 | 728 | 760 | 729 | 760 | 693 | 760 | 760 | 760 |
40000 | 797 | 797 | 797 | 797 | 780 | 797 | 768 | 797 | 757 | 797 | 767 | 797 | 734 | 797 | 797 | 797 |
50000 | 993 | 993 | 993 | 993 | 964 | 993 | 967 | 993 | 957 | 993 | 955 | 993 | 922 | 993 | 993 | 993 |
60000 | 1128 | 1128 | 1128 | 1128 | 1096 | 1128 | 1100 | 1128 | 1089 | 1128 | 1085 | 1128 | 1031 | 1128 | 1128 | 1128 |
70000 | 1187 | 1187 | 1187 | 1187 | 1148 | 1187 | 1157 | 1187 | 1143 | 1187 | 1144 | 1187 | 1061 | 1187 | 1187 | 1187 |
80000 | 1375 | 1375 | 1375 | 1375 | 1331 | 1375 | 1338 | 1375 | 1338 | 1375 | 1324 | 1375 | 1242 | 1375 | 1375 | 1375 |
90000 | 1673 | 1673 | 1673 | 1673 | 1608 | 1673 | 1618 | 1673 | 1621 | 1673 | 1606 | 1673 | 1482 | 1673 | 1673 | 1673 |
100000 | 2058 | 2058 | 2058 | 2058 | 1964 | 2058 | 1957 | 2058 | 1948 | 2058 | 1939 | 2058 | 1791 | 2058 | 2058 | 2058 |
125000 | 4857 | 4857 | 4857 | 4857 | 4674 | 4857 | 4695 | 4857 | 4700 | 4857 | 4619 | 4857 | 4241 | 4857 | 4857 | 4857 |
150000 | 30085 | 30085 | 30085 | 30085 | 28145 | 30085 | 28712 | 30085 | 28630 | 30085 | 27702 | 30085 | 23876 | 30085 | 30085 | 30085 |
odometer_price = autos.groupby("odometer_km")
odometer_price["price_usd"].mean().sort_values(ascending=False)
odometer_km 10000 20550.867220 20000 18448.477089 30000 16608.836842 40000 15499.568381 50000 13812.173212 60000 12385.004433 70000 10927.182814 80000 9721.947636 5000 8873.515924 90000 8465.025105 100000 8132.697279 125000 6214.022030 150000 3767.927107 Name: price_usd, dtype: float64
Alright! Seems like I'm right haha... The highest mileage got the cheapest price here.
Damaged cars are cheaper than non-damaged cars. That's the norm. But, by how much?
damage_group = autos.groupby("unrepaired_damage").count()
damage_group
date_crawled | name | price_usd | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | ad_created | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
unrepaired_damage | ||||||||||||||||
no | 33834 | 33834 | 33834 | 33834 | 33074 | 33834 | 33175 | 33834 | 32826 | 33834 | 33834 | 32593 | 33834 | 33834 | 33834 | 33834 |
yes | 4540 | 4540 | 4540 | 4540 | 4244 | 4540 | 4381 | 4540 | 4262 | 4540 | 4540 | 4131 | 4540 | 4540 | 4540 | 4540 |
damage_price = autos.groupby("unrepaired_damage")
damage_price_mean = damage_price["price_usd"].mean()
damage_price_mean
unrepaired_damage no 7164.033103 yes 2241.146035 Name: price_usd, dtype: float64
damage_difference = damage_price_mean["no"] - damage_price_mean["yes"]
print("On average, cars with damage are USD {:.2f}".format(damage_difference)
+ " cheaper than their non-damaged counterparts.")
On average, cars with damage are USD 4922.89 cheaper than their non-damaged counterparts.
Well, I'm not sure whether ~$5000 is worth it. I think it depends on the damage too.
Me: "Hello, Imaginary Boss?"
Imaginary Boss: "So, how's the Mini? Did you have fun analyzing cars? I hope you already know what a gearbox is now LOLOLOLOL"
Me: "Wha- huh?! Stop laughing! That's not funny!!"
Imaginary Boss: "I know you're going to get totally distracted with the Mini. And besides, I decided to buy a brand-new Mercedes-Benz, anyway."
Me: "Wait, what?!"
Imaginary Boss: "OK, tell me something interesting from the analysis that you just did. Like, totally interesting. Don't tell me common sense stuff like your Mini is too expensive for a car that is just too cute; BMW, Mercedes-Benz and VW are among the top European car brands, damaged cars are cheaper than non-damaged cars, and how higher mileage could make car price cheaper. No, no, no, don't tell me about all those stuff."
Me: "Erm. OK. How about this? I like Skoda now."
Imaginary Boss: "Whoa. Whoa whoa. You're into that Czech car? Now, that's interesting! I should give you more assignments on things that don't interest you LOL
Me: "Nope, thank you. You duped me. You owe me a Mini for that."
Imaginary Boss: "Okay. I think the convertible model suits you. Or maybe that fancy clubman model. What color do you want?"
Me: "Oh, stop it."
Imaginary Boss: "LOL."