We will use the fundamentals of NumPy and Pandas.
import pandas as pd
import numpy as np
autos = pd.read_csv('autos.csv', encoding='Latin-1')
autos
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
49995 | 2016-03-27 14:38:19 | Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon | privat | Angebot | $24,900 | control | limousine | 2011 | automatik | 239 | q5 | 100,000km | 1 | diesel | audi | nein | 2016-03-27 00:00:00 | 0 | 82131 | 2016-04-01 13:47:40 |
49996 | 2016-03-28 10:50:25 | Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... | privat | Angebot | $1,980 | control | cabrio | 1996 | manuell | 75 | astra | 150,000km | 5 | benzin | opel | nein | 2016-03-28 00:00:00 | 0 | 44807 | 2016-04-02 14:18:02 |
49997 | 2016-04-02 14:44:48 | Fiat_500_C_1.2_Dualogic_Lounge | privat | Angebot | $13,200 | test | cabrio | 2014 | automatik | 69 | 500 | 5,000km | 11 | benzin | fiat | nein | 2016-04-02 00:00:00 | 0 | 73430 | 2016-04-04 11:47:27 |
49998 | 2016-03-08 19:25:42 | Audi_A3_2.0_TDI_Sportback_Ambition | privat | Angebot | $22,900 | control | kombi | 2013 | manuell | 150 | a3 | 40,000km | 11 | diesel | audi | nein | 2016-03-08 00:00:00 | 0 | 35683 | 2016-04-05 16:45:07 |
49999 | 2016-03-14 00:42:12 | Opel_Vectra_1.6_16V | privat | Angebot | $1,250 | control | limousine | 1996 | manuell | 101 | vectra | 150,000km | 1 | benzin | opel | nein | 2016-03-13 00:00:00 | 0 | 45897 | 2016-04-06 21:18:48 |
50000 rows × 20 columns
We can see that this dataset contains 20 columns, and 15 of them are strings. Also, it can be seen that 5 columns out of 20 has Null values. One thing to mention is that column names use camelCase (like variables) instead of snake_case, so one of our job is to replace capital letters with underscores(_).
My first analysis
We can see that 5 columns out of 20 has Null values. For example, NotRepairedDamage, fuelType, vehicleType etc.
According to first five rows, we can see that all the data are in German and a name column has a long string in its input.
autos.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 dateCrawled 50000 non-null object 1 name 50000 non-null object 2 seller 50000 non-null object 3 offerType 50000 non-null object 4 price 50000 non-null object 5 abtest 50000 non-null object 6 vehicleType 44905 non-null object 7 yearOfRegistration 50000 non-null int64 8 gearbox 47320 non-null object 9 powerPS 50000 non-null int64 10 model 47242 non-null object 11 odometer 50000 non-null object 12 monthOfRegistration 50000 non-null int64 13 fuelType 45518 non-null object 14 brand 50000 non-null object 15 notRepairedDamage 40171 non-null object 16 dateCreated 50000 non-null object 17 nrOfPictures 50000 non-null int64 18 postalCode 50000 non-null int64 19 lastSeen 50000 non-null object dtypes: int64(5), object(15) memory usage: 7.6+ MB
autos.head()
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
We are changing column names to snake_case to make it easier to read.
autos.columns
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model', 'odometer', 'monthOfRegistration', 'fuelType', 'brand', 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object')
new_columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
'odometer', 'registration_month', 'fuel_type', 'brand',
'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code',
'last_seen']
autos.columns = new_columns
autos.head()
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
autos.describe(include='all')
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-04-04 16:40:33 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
According to autos.describe(), we can see that seller and offer_type has only 2 options and one candidate can be dropped. Some columns show NaN values in unique, top and freq.
First, we are going to change the price and odometer columns by removing characters from values and converting it to int.
autos['price'] = autos['price'].str.replace('$', '')
autos['price'] = autos['price'].str.replace(',', '').astype(int)
autos['odometer'] = autos['odometer'].str.replace('km', '')
autos['odometer'] = autos['odometer'].str.replace(',', '').astype(int)
autos.rename(columns={'odometer':'odometer_km'}, inplace=True)
autos
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | 5000 | control | bus | 2004 | manuell | 158 | andere | 150000 | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | 8500 | control | limousine | 1997 | automatik | 286 | 7er | 150000 | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | 8990 | test | limousine | 2009 | manuell | 102 | golf | 70000 | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | 4350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70000 | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | 1350 | test | kombi | 2003 | manuell | 0 | focus | 150000 | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
49995 | 2016-03-27 14:38:19 | Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon | privat | Angebot | 24900 | control | limousine | 2011 | automatik | 239 | q5 | 100000 | 1 | diesel | audi | nein | 2016-03-27 00:00:00 | 0 | 82131 | 2016-04-01 13:47:40 |
49996 | 2016-03-28 10:50:25 | Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... | privat | Angebot | 1980 | control | cabrio | 1996 | manuell | 75 | astra | 150000 | 5 | benzin | opel | nein | 2016-03-28 00:00:00 | 0 | 44807 | 2016-04-02 14:18:02 |
49997 | 2016-04-02 14:44:48 | Fiat_500_C_1.2_Dualogic_Lounge | privat | Angebot | 13200 | test | cabrio | 2014 | automatik | 69 | 500 | 5000 | 11 | benzin | fiat | nein | 2016-04-02 00:00:00 | 0 | 73430 | 2016-04-04 11:47:27 |
49998 | 2016-03-08 19:25:42 | Audi_A3_2.0_TDI_Sportback_Ambition | privat | Angebot | 22900 | control | kombi | 2013 | manuell | 150 | a3 | 40000 | 11 | diesel | audi | nein | 2016-03-08 00:00:00 | 0 | 35683 | 2016-04-05 16:45:07 |
49999 | 2016-03-14 00:42:12 | Opel_Vectra_1.6_16V | privat | Angebot | 1250 | control | limousine | 1996 | manuell | 101 | vectra | 150000 | 1 | benzin | opel | nein | 2016-03-13 00:00:00 | 0 | 45897 | 2016-04-06 21:18:48 |
50000 rows × 20 columns
Let's remove outliers (unrealistically high or low values) from our dataset, starting from odometer_km column.
autos['odometer_km'].describe()
count 50000.000000 mean 125732.700000 std 40042.211706 min 5000.000000 25% 125000.000000 50% 150000.000000 75% 150000.000000 max 150000.000000 Name: odometer_km, dtype: float64
autos['odometer_km'].unique().shape
(13,)
autos['odometer_km'].value_counts()
150000 32424 125000 5170 100000 2169 90000 1757 80000 1436 70000 1230 60000 1164 50000 1027 5000 967 40000 819 30000 789 20000 784 10000 264 Name: odometer_km, dtype: int64
autos['odometer_km'].sort_index()
0 150000 1 150000 2 70000 3 70000 4 150000 ... 49995 100000 49996 150000 49997 5000 49998 40000 49999 150000 Name: odometer_km, Length: 50000, dtype: int64
What we see is that odometer has 13 unique values and almost 65% of the cars went 150000 km.
autos['price'].unique().shape
(2357,)
autos['price'].describe().round(2)
count 50000.00 mean 9840.04 std 481104.38 min 0.00 25% 1100.00 50% 2950.00 75% 7200.00 max 99999999.00 Name: price, dtype: float64
autos.loc[autos['price'] > 150000]
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
514 | 2016-03-17 09:53:08 | Ford_Focus_Turnier_1.6_16V_Style | privat | Angebot | 999999 | test | kombi | 2009 | manuell | 101 | focus | 125000 | 4 | benzin | ford | nein | 2016-03-17 00:00:00 | 0 | 12205 | 2016-04-06 07:17:35 |
2897 | 2016-03-12 21:50:57 | Escort_MK_1_Hundeknochen_zum_umbauen_auf_RS_2000 | privat | Angebot | 11111111 | test | limousine | 1973 | manuell | 48 | escort | 50000 | 3 | benzin | ford | nein | 2016-03-12 00:00:00 | 0 | 94469 | 2016-03-12 22:45:27 |
7814 | 2016-04-04 11:53:31 | Ferrari_F40 | privat | Angebot | 1300000 | control | coupe | 1992 | NaN | 0 | NaN | 50000 | 12 | NaN | sonstige_autos | nein | 2016-04-04 00:00:00 | 0 | 60598 | 2016-04-05 11:34:11 |
10500 | 2016-03-17 12:56:38 | Porsche_991 | privat | Angebot | 155000 | test | coupe | 2013 | NaN | 476 | 911 | 20000 | 11 | NaN | porsche | nein | 2016-03-17 00:00:00 | 0 | 90768 | 2016-03-26 23:16:41 |
11137 | 2016-03-29 23:52:57 | suche_maserati_3200_gt_Zustand_unwichtig_laufe... | privat | Angebot | 10000000 | control | coupe | 1960 | manuell | 368 | NaN | 100000 | 1 | benzin | sonstige_autos | nein | 2016-03-29 00:00:00 | 0 | 73033 | 2016-04-06 21:18:11 |
11433 | 2016-03-30 21:44:21 | Audi_R8_V10_plus_5.2_FSI_quattro_S_tronic_Akra... | privat | Angebot | 175000 | test | coupe | 2016 | automatik | 610 | andere | 5000 | 1 | benzin | audi | nein | 2016-03-30 00:00:00 | 0 | 94244 | 2016-04-05 11:46:33 |
12682 | 2016-03-28 22:48:01 | Porsche_GT3_RS__PCCB__Lift___grosser_Exklusiv_... | privat | Angebot | 265000 | control | coupe | 2016 | automatik | 500 | 911 | 5000 | 3 | benzin | porsche | nein | 2016-03-28 00:00:00 | 0 | 70193 | 2016-04-05 03:44:51 |
14715 | 2016-03-30 08:37:24 | Rolls_Royce_Phantom_Drophead_Coupe | privat | Angebot | 345000 | control | cabrio | 2012 | automatik | 460 | NaN | 20000 | 8 | benzin | sonstige_autos | nein | 2016-03-30 00:00:00 | 0 | 73525 | 2016-04-07 00:16:26 |
17140 | 2016-03-05 17:39:22 | Mercedes_Benz_SLS_AMG_Roadster_AMG_SPEEDSHIFT_DCT | privat | Angebot | 180000 | test | cabrio | 2012 | automatik | 571 | andere | 50000 | 2 | benzin | mercedes_benz | nein | 2016-03-05 00:00:00 | 0 | 14193 | 2016-03-20 19:44:36 |
18509 | 2016-03-27 18:43:02 | Porsche_991_GT3 | privat | Angebot | 169000 | control | coupe | 2013 | automatik | 476 | 911 | 20000 | 9 | benzin | porsche | nein | 2016-03-27 00:00:00 | 0 | 83629 | 2016-04-05 15:15:29 |
20351 | 2016-03-22 17:47:59 | BMW_Z8_roadster | privat | Angebot | 190000 | control | cabrio | 2000 | manuell | 400 | z_reihe | 60000 | 9 | benzin | bmw | nein | 2016-03-22 00:00:00 | 0 | 14473 | 2016-04-02 11:44:28 |
22673 | 2016-03-05 18:40:13 | Porsche_911_991_GT3_LED_Licht_Chrono_Navi_Club... | privat | Angebot | 163991 | test | coupe | 2014 | automatik | 476 | 911 | 10000 | 6 | benzin | porsche | nein | 2016-03-05 00:00:00 | 0 | 51702 | 2016-03-09 16:45:30 |
22947 | 2016-03-22 12:54:19 | Bmw_530d_zum_ausschlachten | privat | Angebot | 1234566 | control | kombi | 1999 | automatik | 190 | NaN | 150000 | 2 | diesel | bmw | NaN | 2016-03-22 00:00:00 | 0 | 17454 | 2016-04-02 03:17:32 |
24384 | 2016-03-21 13:57:51 | Schlachte_Golf_3_gt_tdi | privat | Angebot | 11111111 | test | NaN | 1995 | NaN | 0 | NaN | 150000 | 0 | NaN | volkswagen | NaN | 2016-03-21 00:00:00 | 0 | 18519 | 2016-03-21 14:40:18 |
27371 | 2016-03-09 15:45:47 | Fiat_Punto | privat | Angebot | 12345678 | control | NaN | 2017 | NaN | 95 | punto | 150000 | 0 | NaN | fiat | NaN | 2016-03-09 00:00:00 | 0 | 96110 | 2016-03-09 15:45:47 |
28090 | 2016-03-08 19:50:06 | Tesla_Model_X_P90D_Signature_Sondermodel__Neuw... | privat | Angebot | 194000 | test | suv | 2016 | automatik | 0 | NaN | 5000 | 3 | elektro | sonstige_autos | nein | 2016-03-08 00:00:00 | 0 | 97990 | 2016-03-16 04:15:31 |
32840 | 2016-03-22 08:55:27 | Mercedes_Benz_G_500_7G_TRONIC_Cabrio_Desingno_... | privat | Angebot | 169999 | control | cabrio | 2012 | automatik | 387 | g_klasse | 40000 | 3 | benzin | mercedes_benz | nein | 2016-03-22 00:00:00 | 0 | 42655 | 2016-04-05 20:47:17 |
33638 | 2016-03-24 12:54:19 | Porsche_911_GT3_RS | privat | Angebot | 151990 | test | coupe | 2007 | manuell | 415 | 911 | 20000 | 10 | benzin | porsche | nein | 2016-03-24 00:00:00 | 0 | 40479 | 2016-04-07 03:46:46 |
34723 | 2016-03-23 16:37:29 | Porsche_Porsche_911/930_Turbo_3.0__deutsche_Au... | privat | Angebot | 299000 | test | coupe | 1977 | manuell | 260 | 911 | 100000 | 7 | benzin | porsche | nein | 2016-03-23 00:00:00 | 0 | 61462 | 2016-04-06 16:44:50 |
35923 | 2016-04-03 07:56:23 | Porsche_911_Targa_Exclusive_Edition__1_von_15_... | privat | Angebot | 295000 | test | cabrio | 2015 | automatik | 400 | 911 | 5000 | 6 | benzin | porsche | nein | 2016-04-03 00:00:00 | 0 | 74078 | 2016-04-03 08:56:20 |
36818 | 2016-03-27 18:37:37 | Porsche_991 | privat | Angebot | 350000 | control | coupe | 2016 | manuell | 500 | 911 | 5000 | 3 | benzin | porsche | nein | 2016-03-27 00:00:00 | 0 | 70499 | 2016-03-27 18:37:37 |
37585 | 2016-03-29 11:38:54 | Volkswagen_Jetta_GT | privat | Angebot | 999990 | test | limousine | 1985 | manuell | 111 | jetta | 150000 | 12 | benzin | volkswagen | ja | 2016-03-29 00:00:00 | 0 | 50997 | 2016-03-29 11:38:54 |
37840 | 2016-03-21 10:50:12 | Porsche_997 | privat | Angebot | 220000 | test | coupe | 2008 | manuell | 415 | 911 | 30000 | 7 | benzin | porsche | nein | 2016-03-21 00:00:00 | 0 | 69198 | 2016-04-06 04:46:14 |
38299 | 2016-03-28 22:25:25 | Glas_BMW_mit_Wasser | privat | Angebot | 250000 | test | NaN | 2015 | NaN | 0 | x_reihe | 5000 | 0 | NaN | bmw | NaN | 2016-03-28 00:00:00 | 0 | 60489 | 2016-03-28 22:25:25 |
39377 | 2016-03-08 23:53:51 | Tausche_volvo_v40_gegen_van | privat | Angebot | 12345678 | control | NaN | 2018 | manuell | 95 | v40 | 150000 | 6 | NaN | volvo | nein | 2016-03-08 00:00:00 | 0 | 14542 | 2016-04-06 23:17:31 |
39705 | 2016-03-22 14:58:27 | Tausch_gegen_gleichwertiges | privat | Angebot | 99999999 | control | limousine | 1999 | automatik | 224 | s_klasse | 150000 | 9 | benzin | mercedes_benz | NaN | 2016-03-22 00:00:00 | 0 | 73525 | 2016-04-06 05:15:30 |
40918 | 2016-03-20 18:40:05 | Porsche_911_991_GT3_RS | privat | Angebot | 198000 | test | coupe | 2015 | automatik | 500 | 911 | 5000 | 6 | benzin | porsche | nein | 2016-03-20 00:00:00 | 0 | 51491 | 2016-03-21 21:46:36 |
42221 | 2016-03-08 20:39:05 | Leasinguebernahme | privat | Angebot | 27322222 | control | limousine | 2014 | manuell | 163 | c4 | 40000 | 2 | diesel | citroen | NaN | 2016-03-08 00:00:00 | 0 | 76532 | 2016-03-08 20:39:05 |
43049 | 2016-03-21 19:53:52 | 2_VW_Busse_T3 | privat | Angebot | 999999 | test | bus | 1981 | manuell | 70 | transporter | 150000 | 1 | benzin | volkswagen | NaN | 2016-03-21 00:00:00 | 0 | 99880 | 2016-03-28 17:18:28 |
43668 | 2016-03-16 18:47:26 | Porsche_993/911_Turbo_WLS_II_Exclusive_S_deuts... | privat | Angebot | 197000 | control | coupe | 1998 | manuell | 450 | 911 | 150000 | 3 | NaN | porsche | nein | 2016-03-16 00:00:00 | 0 | 46147 | 2016-04-07 02:44:47 |
45387 | 2016-03-18 16:41:22 | Porsche_911_GT3__Clubsp._Lift_nur_bis_25.03._v... | privat | Angebot | 163500 | test | coupe | 2015 | automatik | 476 | 911 | 10000 | 9 | benzin | porsche | nein | 2016-03-18 00:00:00 | 0 | 74078 | 2016-03-18 16:41:22 |
47337 | 2016-04-05 10:25:38 | BMW_Z8_roadster | privat | Angebot | 259000 | test | cabrio | 2001 | manuell | 400 | z_reihe | 20000 | 6 | benzin | bmw | nein | 2016-04-05 00:00:00 | 0 | 61462 | 2016-04-05 12:07:32 |
47598 | 2016-03-31 18:56:54 | Opel_Vectra_B_1_6i_16V_Facelift_Tuning_Showcar... | privat | Angebot | 12345678 | control | limousine | 2001 | manuell | 101 | vectra | 150000 | 3 | benzin | opel | nein | 2016-03-31 00:00:00 | 0 | 4356 | 2016-03-31 18:56:54 |
47634 | 2016-04-04 21:25:21 | Ferrari_FXX | privat | Angebot | 3890000 | test | coupe | 2006 | NaN | 799 | NaN | 5000 | 7 | NaN | sonstige_autos | nein | 2016-04-04 00:00:00 | 0 | 60313 | 2016-04-05 12:07:37 |
Right after we took the price values above 150000, it can be seen that most of the rows has bizarre values. For example, 39705th row has 99999999 as a price. According to Simple Germany, used cars in average cost 18750 euros. Let's get rid of those rows. Who sells his/her own car free? So, let's see the values between 100 and 150000 dollars.
autos = autos[autos['price'].between(100, 150000)]
Something strange happens in registration_year, it's impossible that car can be registrated in 1000 or in 9999. Let's have the years' range of 1900-2016.
autos['registration_year'].describe()
count 48204.000000 mean 2004.728964 std 87.915364 min 1000.000000 25% 1999.000000 50% 2004.000000 75% 2008.000000 max 9999.000000 Name: registration_year, dtype: float64
autos = autos[autos['registration_year'].between(1900, 2016)]
autos['registration_year'].describe()
count 46332.000000 mean 2002.936933 std 7.124766 min 1910.000000 25% 1999.000000 50% 2003.000000 75% 2008.000000 max 2016.000000 Name: registration_year, dtype: float64
autos['registration_month'].describe()
count 46332.000000 mean 5.846996 std 3.661500 min 0.000000 25% 3.000000 50% 6.000000 75% 9.000000 max 12.000000 Name: registration_month, dtype: float64
autos[autos['registration_month']==0].shape
(3865, 20)
As you can see, we have the 4296 cars with the month value of 0. Let's erase these values.
autos = autos[~(autos['registration_month'] == 0)]
autos['registration_month'].describe()
count 42467.000000 mean 6.379141 std 3.351426 min 1.000000 25% 3.000000 50% 6.000000 75% 9.000000 max 12.000000 Name: registration_month, dtype: float64
Now, we are going to convert string values in date_crawled, last_seen, ad_created to numerical data. 25 July, don't now what to do with these columns
autos[['date_crawled', 'ad_created', 'last_seen']][0:5]
date_crawled | ad_created | last_seen | |
---|---|---|---|
0 | 2016-03-26 17:47:46 | 2016-03-26 00:00:00 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | 2016-04-04 00:00:00 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | 2016-03-26 00:00:00 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | 2016-03-12 00:00:00 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | 2016-04-01 00:00:00 | 2016-04-01 14:38:50 |
autos['date_crawled'].value_counts(normalize=True, dropna=False).unique()
array([7.06430876e-05, 4.70953917e-05, 2.35476959e-05])
autos['date_crawled'].sort_values().unique()
array(['2016-03-05 14:06:30', '2016-03-05 14:06:40', '2016-03-05 14:07:21', ..., '2016-04-07 14:36:44', '2016-04-07 14:36:55', '2016-04-07 14:36:56'], dtype=object)