This dataset was originally scraped as a compilation of data on used cars from eBay Kleinanzeigen, a classifieds section of the German eBay website.
With a few modifications, the dataset in use here contains 50,000 data points sampled from that originally web-scraped dataset.
The aim of this project is to clean the data and analyze the included used car listings.
Column name | Description |
---|---|
"dateCrawled" | When this ad was first crawled. All field-values are taken from this date. |
"name" | Name of the car |
"seller" | Whether the seller is private or a dealer. |
"offerType" | The type of listing |
"price" | The price on the ad to sell the car. |
"abtest" | Whether the listing is included in an A/B test. |
"vehicleType" | The vehicle type |
"yearOfRegistration" | The year in which the car was first registered |
"gearbox" | The transmission type |
"powerPS" | The power of the car in PS |
"model" | The car model name |
"kilometer" | How many kilometers the car has driven. |
"monthOfRegistration" | The month in which the car was first registered |
"fuelType" | What type of fuel the car uses |
"brand" | The brand of the car |
"notRepairedDamage" | Whether the car has damage that is not yet repaired |
"dateCreated" | The date on which the eBay listing was created |
"nrOfPictures" | The number of pictures in the ad |
"postalCode" | The postal code for the location of the vehicle |
"lastSeen" | When the crawler last saw this ad online. |
import numpy as np
import pandas as pd
autos = pd.read_csv('autos.csv')
# A Brief look at the dataset as loaded into pandas
autos.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offerType            50000 non-null  object
 4   price                50000 non-null  object
 5   abtest               50000 non-null  object
 6   vehicleType          44905 non-null  object
 7   yearOfRegistration   50000 non-null  int64 
 8   gearbox              47320 non-null  object
 9   powerPS              50000 non-null  int64 
 10  model                47242 non-null  object
 11  odometer             50000 non-null  object
 12  monthOfRegistration  50000 non-null  int64 
 13  fuelType             45518 non-null  object
 14  brand                50000 non-null  object
 15  notRepairedDamage    40171 non-null  object
 16  dateCreated          50000 non-null  object
 17  nrOfPictures         50000 non-null  int64 
 18  postalCode           50000 non-null  int64 
 19  lastSeen             50000 non-null  object
dtypes: int64(5), object(15)
memory usage: 7.6+ MB
autos.head()
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
Column name | Description |
---|---|
"dateCrawled" | Stored as an 'object' (string) type instead of a datetime type |
"name" | The names are hard to read as they are joined with underscores and mixed with extra listing details |
"seller" | Values are in German ('privat') |
"offerType" | NA |
"price" | Stored as an object (string) instead of a numeric type |
"abtest" | NA |
"vehicleType" | Contains null values & column should be renamed |
"yearOfRegistration" | Contains implausible years that need validation & column should be renamed |
"gearbox" | Values are in German ('manuell', 'automatik') & contains null values |
"powerPS" | NA |
"model" | Contains null values |
"kilometer" | Named 'odometer' in the dataset & values are stored as strings (e.g. '150,000km') instead of a numeric type |
"monthOfRegistration" | Column should be renamed |
"fuelType" | Contains null values |
"brand" | NA |
"notRepairedDamage" | Values are in German ('nein') & contains null values |
"dateCreated" | Stored as an object instead of a datetime type |
"nrOfPictures" | Column should be renamed |
"postalCode" | NA |
"lastSeen" | Stored as an object instead of a datetime type |
Renaming these columns helps create more structure and readability as we parse through the dataset.
auto_column_copy = list(autos.columns)
# First, some specific renaming of particular columns
auto_column_copy[7] = 'registrationYear'
auto_column_copy[12] = 'registrationMonth'
auto_column_copy[15] = 'unrepairedDamage'
auto_column_copy[-4] = 'adCreated'
auto_column_copy
['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'registrationYear', 'gearbox', 'powerPS', 'model', 'odometer', 'registrationMonth', 'fuelType', 'brand', 'unrepairedDamage', 'adCreated', 'nrOfPictures', 'postalCode', 'lastSeen']
Converting all column names to snake_case
# Defining a function that uses list comprehension to convert a single string entered into snake case
# and then merges this list of string characters thereafter using the ''.join() function.
# The .lstrip() function then strips away any leading underscores ('_') from the word
# in cases where the entered string may have begun with a capital letter.
def snake_case(col_name):
return ''.join(['_'+ i.lower() if i.isupper() else i for i in col_name]).lstrip('_')
new_column = [snake_case(name) for name in auto_column_copy]
# One adjustment for the column labelled 'power_p_s'
new_column[9] = 'power_ps'
new_column
['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest', 'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model', 'odometer', 'registration_month', 'fuel_type', 'brand', 'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code', 'last_seen']
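An equivalent way to get the same mapping is a single `re.sub` call — an alternative sketch, not the implementation used above:

```python
import re

# Insert '_' before every uppercase letter that is not at the start of the
# string, then lowercase everything
def snake_case(col_name):
    return re.sub(r'(?<!^)(?=[A-Z])', '_', col_name).lower()

print(snake_case('yearOfRegistration'))  # year_of_registration
print(snake_case('powerPS'))             # power_p_s
```

Note that this reproduces the same 'power_p_s' quirk as the list-comprehension version, so the manual adjustment to 'power_ps' is still needed.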
# Resetting the dataset column names to the new column names and checking the head
autos.columns = new_column
autos.head()
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
Initially we look for the following:

- Text columns where all or almost all values are the same. These can often be dropped as they don't contain useful information for analysis.
- Numeric data stored as text, which can be cleaned and converted to the correct datatype.
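The second case can be handled with pandas' vectorized string methods. A minimal sketch on toy values in this dataset's formats (passing `regex=False` so '$' and ',' are treated as literal characters):

```python
import pandas as pd

# Toy values shaped like the 'price' and 'odometer' columns
prices = pd.Series(['$5,000', '$8,500', '$1,350'])
odometer = pd.Series(['150,000km', '70,000km'])

# Strip the non-numeric characters, then convert to integers
clean_prices = (prices.str.replace('$', '', regex=False)
                      .str.replace(',', '', regex=False)
                      .astype(int))
clean_odometer = (odometer.str.replace('km', '', regex=False)
                          .str.replace(',', '', regex=False)
                          .astype(int))

print(clean_prices.tolist())    # [5000, 8500, 1350]
print(clean_odometer.tolist())  # [150000, 70000]
```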
# Looking at initial descriptive statistics for each column in the dataset
autos.describe(include ='all')
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-04-02 15:49:30 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
# Taking a quick look at individual columns
autos['odometer'].value_counts(dropna = False)
150,000km    32424
125,000km     5170
100,000km     2169
90,000km      1757
80,000km      1436
70,000km      1230
60,000km      1164
50,000km      1027
5,000km        967
40,000km       819
30,000km       789
20,000km       784
10,000km       264
Name: odometer, dtype: int64
autos['price'].head()
0    $5,000
1    $8,500
2    $8,990
3    $4,350
4    $1,350
Name: price, dtype: object
From this brief exploration we can see that the 'price' and 'odometer' columns hold numeric values stored as text. For each column we will:

- remove any non-numeric characters,
- convert the column to a numeric dtype.
# autos['price'] = autos['price'].str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype(int)
autos['price'].head()
0    5000
1    8500
2    8990
3    4350
4    1350
Name: price, dtype: int32
# autos['odometer'] = autos['odometer'].str.replace(',', '', regex=False).str.replace('km', '', regex=False).astype(int)
# autos.rename({'odometer': "odometer_km"}, axis = 1, inplace = True)
autos['odometer_km'].value_counts()
150000    32424
125000     5170
100000     2169
90000      1757
80000      1436
70000      1230
60000      1164
50000      1027
5000        967
40000       819
30000       789
20000       784
10000       264
Name: odometer_km, dtype: int64
autos['price'].nunique()
2357
autos['price'].describe()
count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64
Here we can see that the price column has a positive (right) skew: its maximum value (100,000,000) is roughly 10^4 times greater than the 75th-percentile value of 7,200.
autos['odometer_km'].nunique()
13
autos['odometer_km'].value_counts()
150000    32424
125000     5170
100000     2169
90000      1757
80000      1436
70000      1230
60000      1164
50000      1027
5000        967
40000       819
30000       789
20000       784
10000       264
Name: odometer_km, dtype: int64
autos['odometer_km'].describe()
count     50000.000000
mean     125732.700000
std       40042.211706
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64
Here we can see that the 'odometer_km' column has a negative (left) skew: its minimum value (5,000 km) sits far below the 25th-percentile value of 125,000 km, so three quarters of the cars have at least that mileage.
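These skew directions can be quantified with `Series.skew()`. A sketch on small synthetic samples shaped like the two columns (not the dataset itself); a positive result indicates right skew, a negative result left skew:

```python
import pandas as pd

# Toy samples mimicking the shape of the two distributions
price_like = pd.Series([0, 1100, 2950, 7200, 100_000_000])        # long right tail
odometer_like = pd.Series([5000, 125000, 150000, 150000, 150000])  # long left tail

print(price_like.skew() > 0)     # True: right/positive skew, like 'price'
print(odometer_like.skew() < 0)  # True: left/negative skew, like 'odometer_km'
```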
# Filtering out extreme outliers in the price column and checking the new statistical distribution
autos.loc[autos['price'].between(0,30000), 'price'].describe()
count    49206.000000
mean      5025.773483
std       5679.154441
min          0.000000
25%       1100.000000
50%       2850.000000
75%       6900.000000
max      30000.000000
Name: price, dtype: float64
autos.loc[autos['price'].between(0,30000), 'price'].nunique()
1985
# Filtering out outliers in the odometer_km column and checking its statistical distribution
autos.loc[autos['odometer_km'].between(10000,150000), 'odometer_km'].describe()
count     49033.000000
mean     128113.719332
std       36631.381785
min       10000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64
autos.loc[autos['odometer_km'].between(10000,150000), 'odometer_km'].nunique()
12
# Keeping only the values inside the accepted ranges; through index alignment,
# values outside them become NaN
# autos['odometer_km'] = autos[autos['odometer_km'].between(10000,150000)]['odometer_km']
# autos['price'] = autos[autos['price'].between(0,30000)]['price']
# The new DataFrame
autos.describe()
price | registration_year | power_ps | odometer_km | registration_month | nr_of_pictures | postal_code | |
---|---|---|---|---|---|---|---|
count | 49206.000000 | 50000.000000 | 50000.000000 | 49033.000000 | 50000.000000 | 50000.0 | 50000.000000 |
mean | 5025.773483 | 2005.073280 | 116.355920 | 128113.719332 | 5.723360 | 0.0 | 50813.627300 |
std | 5679.154441 | 105.712813 | 209.216627 | 36631.381785 | 3.711984 | 0.0 | 25779.747957 |
min | 0.000000 | 1000.000000 | 0.000000 | 10000.000000 | 0.000000 | 0.0 | 1067.000000 |
25% | 1100.000000 | 1999.000000 | 70.000000 | 125000.000000 | 3.000000 | 0.0 | 30451.000000 |
50% | 2850.000000 | 2003.000000 | 105.000000 | 150000.000000 | 6.000000 | 0.0 | 49577.000000 |
75% | 6900.000000 | 2008.000000 | 150.000000 | 150000.000000 | 9.000000 | 0.0 | 71540.000000 |
max | 30000.000000 | 9999.000000 | 17700.000000 | 150000.000000 | 12.000000 | 0.0 | 99998.000000 |
autos.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   date_crawled        50000 non-null  object 
 1   name                50000 non-null  object 
 2   seller              50000 non-null  object 
 3   offer_type          50000 non-null  object 
 4   price               49206 non-null  float64
 5   abtest              50000 non-null  object 
 6   vehicle_type        44905 non-null  object 
 7   registration_year   50000 non-null  int64  
 8   gearbox             47320 non-null  object 
 9   power_ps            50000 non-null  int64  
 10  model               47242 non-null  object 
 11  odometer_km         49033 non-null  float64
 12  registration_month  50000 non-null  int64  
 13  fuel_type           45518 non-null  object 
 14  brand               50000 non-null  object 
 15  unrepaired_damage   40171 non-null  object 
 16  ad_created          50000 non-null  object 
 17  nr_of_pictures      50000 non-null  int64  
 18  postal_code         50000 non-null  int64  
 19  last_seen           50000 non-null  object 
dtypes: float64(2), int64(5), object(13)
memory usage: 7.6+ MB
On the initial read-in of the data, the columns 'date_crawled', 'last_seen' and 'ad_created' are all identified as string values by pandas.
Because these three columns are represented as strings, we need to convert the data into a numerical representation so we can understand it quantitatively.
The other two time columns, 'registration_month' and 'registration_year', are already in a numerical format, so their distributions can be understood using the Series.describe() method.
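An alternative to slicing date strings is to parse them with `pd.to_datetime`, after which numeric components are available through the `.dt` accessor. A sketch on sample values in this dataset's format:

```python
import pandas as pd

# Sample timestamps in the 'YYYY-MM-DD HH:MM:SS' format used by this dataset
dates = pd.Series(['2016-03-26 17:47:46', '2016-04-04 13:38:56'])

# Parse the strings into datetime64 values
parsed = pd.to_datetime(dates)

print(parsed.dt.year.tolist())   # [2016, 2016]
print(parsed.dt.month.tolist())  # [3, 4]
print(parsed.dt.date.min())      # earliest date in the sample
```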
autos.loc[0:5, ['date_crawled', 'ad_created', 'last_seen']]
date_crawled | ad_created | last_seen | |
---|---|---|---|
0 | 2016-03-26 17:47:46 | 2016-03-26 00:00:00 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | 2016-04-04 00:00:00 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | 2016-03-26 00:00:00 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | 2016-03-12 00:00:00 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | 2016-04-01 00:00:00 | 2016-04-01 14:38:50 |
5 | 2016-03-21 13:47:45 | 2016-03-21 00:00:00 | 2016-04-06 09:45:21 |
# INVESTIGATING THE DATE_CRAWLED COLUMN
# To include missing values in the distribution and to use percentages instead of counts we include the
# 'normalize = True' and 'dropna = False' keyword parameters
date_crawled = autos['date_crawled'].str[:10].value_counts(normalize= True, dropna = False)
date_crawled.sort_index()
2016-03-05    0.02538
2016-03-06    0.01394
2016-03-07    0.03596
2016-03-08    0.03330
2016-03-09    0.03322
2016-03-10    0.03212
2016-03-11    0.03248
2016-03-12    0.03678
2016-03-13    0.01556
2016-03-14    0.03662
2016-03-15    0.03398
2016-03-16    0.02950
2016-03-17    0.03152
2016-03-18    0.01306
2016-03-19    0.03490
2016-03-20    0.03782
2016-03-21    0.03752
2016-03-22    0.03294
2016-03-23    0.03238
2016-03-24    0.02910
2016-03-25    0.03174
2016-03-26    0.03248
2016-03-27    0.03104
2016-03-28    0.03484
2016-03-29    0.03418
2016-03-30    0.03362
2016-03-31    0.03192
2016-04-01    0.03380
2016-04-02    0.03540
2016-04-03    0.03868
2016-04-04    0.03652
2016-04-05    0.01310
2016-04-06    0.00318
2016-04-07    0.00142
Name: date_crawled, dtype: float64
# INVESTIGATING THE AD_CREATED COLUMN
# To include missing values in the distribution and to use percentages instead of counts we include the
# 'normalize = True' and 'dropna = False' keyword parameters.
ad_created = autos['ad_created'].str[:10].value_counts(normalize= True, dropna = False)
ad_created.sort_index()
2015-06-11    0.00002
2015-08-10    0.00002
2015-09-09    0.00002
2015-11-10    0.00002
2015-12-05    0.00002
               ...
2016-04-03    0.03892
2016-04-04    0.03688
2016-04-05    0.01184
2016-04-06    0.00326
2016-04-07    0.00128
Name: ad_created, Length: 76, dtype: float64
# INVESTIGATING THE LAST_SEEN COLUMN
# To include missing values in the distribution and to use percentages instead of counts we include the
# 'normalize = True' and 'dropna = False' keyword parameters
last_seen = autos['last_seen'].str[:10].value_counts(normalize= True, dropna = False)
last_seen.sort_index()
2016-03-05    0.00108
2016-03-06    0.00442
2016-03-07    0.00536
2016-03-08    0.00760
2016-03-09    0.00986
2016-03-10    0.01076
2016-03-11    0.01252
2016-03-12    0.02382
2016-03-13    0.00898
2016-03-14    0.01280
2016-03-15    0.01588
2016-03-16    0.01644
2016-03-17    0.02792
2016-03-18    0.00742
2016-03-19    0.01574
2016-03-20    0.02070
2016-03-21    0.02074
2016-03-22    0.02158
2016-03-23    0.01858
2016-03-24    0.01956
2016-03-25    0.01920
2016-03-26    0.01696
2016-03-27    0.01602
2016-03-28    0.02086
2016-03-29    0.02234
2016-03-30    0.02484
2016-03-31    0.02384
2016-04-01    0.02310
2016-04-02    0.02490
2016-04-03    0.02536
2016-04-04    0.02462
2016-04-05    0.12428
2016-04-06    0.22100
2016-04-07    0.13092
Name: last_seen, dtype: float64
sorted(autos['registration_year'].unique())
[1000, 1001, 1111, 1500, 1800, 1910, 1927, 1929, 1931, 1934, 1937, 1938, 1939, 1941, 1943, 1948, 1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959, 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2800, 4100, 4500, 4800, 5000, 5911, 6200, 8888, 9000, 9996, 9999]
autos['registration_year'].describe()
count    50000.000000
mean      2005.073280
std        105.712813
min       1000.000000
25%       1999.000000
50%       2003.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64
From this exploration of the registration_year column, some odd values are evident: years such as 1000 or 1111 predate the automobile entirely, while values like 9999 lie far in the future.
Our Solution
Because a car cannot be first registered after its listing was seen, any vehicle with a registration year above 2016 can be deduced as inaccurate: the ad_created column, which represents the date of listing, only contains dates up to 2016.
Determining the earliest valid year is more difficult. Realistically, it could be somewhere in the first few decades of the 1900s. In an attempt to validate this we can count the number of listings that fall outside the 1900-2016 interval and see if it is safe to remove those rows entirely or if they require custom logic.
# Counting the listings whose registration year falls outside the 1900-2016 interval
autos.loc[~(autos['registration_year'].between(1900, 2016))].describe()
price | registration_year | power_ps | odometer_km | registration_month | nr_of_pictures | postal_code | |
---|---|---|---|---|---|---|---|
count | 1963.000000 | 1972.000000 | 1972.000000 | 1916.000000 | 1972.000000 | 1972.0 | 1972.000000 |
mean | 3367.427407 | 2060.353448 | 98.954361 | 133987.473904 | 4.641988 | 0.0 | 47836.475152 |
std | 3955.896136 | 528.204087 | 426.692546 | 32032.014822 | 3.912620 | 0.0 | 25302.249083 |
min | 0.000000 | 1000.000000 | 0.000000 | 10000.000000 | 0.000000 | 0.0 | 1067.000000 |
25% | 900.000000 | 2017.000000 | 0.000000 | 125000.000000 | 1.000000 | 0.0 | 27577.500000 |
50% | 1900.000000 | 2017.000000 | 75.000000 | 150000.000000 | 4.000000 | 0.0 | 46006.500000 |
75% | 4390.000000 | 2018.000000 | 120.000000 | 150000.000000 | 8.000000 | 0.0 | 66127.750000 |
max | 29699.000000 | 9999.000000 | 16011.000000 | 150000.000000 | 12.000000 | 0.0 | 99974.000000 |
indices_to_remove = autos.loc[~(autos['registration_year'].between(1900, 2016))].index
indices_to_remove
Int64Index([ 10, 55, 65, 68, 84, 113, 164, 197, 253, 348, ... 49662, 49689, 49696, 49731, 49770, 49796, 49841, 49880, 49910, 49935], dtype='int64', length=1972)
Removing rows that fall outside the acceptable range of values for the 'registration_year' column
autos_19_2016 = autos.drop(indices_to_remove)
autos_19_2016.describe(include= 'all')
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 48028 | 48028 | 48028 | 48028 | 47243.000000 | 48028 | 44903 | 48028.00000 | 45604 | 48028.000000 | 45560 | 47117.000000 | 48028.000000 | 44301 | 48028 | 39040 | 48028 | 48028.0 | 48028.000000 | 48028 |
unique | 46375 | 37019 | 2 | 2 | NaN | 2 | 8 | NaN | 2 | NaN | 244 | NaN | NaN | 7 | 40 | 2 | 74 | NaN | NaN | 38102 |
top | 2016-04-02 11:37:04 | Ford_Fiesta | privat | Angebot | NaN | test | limousine | NaN | manuell | NaN | golf | NaN | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 76 | 48027 | 48027 | NaN | 24756 | 12859 | NaN | 35560 | NaN | 3815 | NaN | NaN | 29248 | 10188 | 34255 | 1878 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | 5094.679635 | NaN | NaN | 2002.80351 | NaN | 117.070417 | NaN | 127874.864699 | 5.767760 | NaN | NaN | NaN | NaN | 0.0 | 50935.867327 | NaN |
std | NaN | NaN | NaN | NaN | 5729.225244 | NaN | NaN | 7.31085 | NaN | 195.151278 | NaN | 36786.714694 | 3.696802 | NaN | NaN | NaN | NaN | 0.0 | 25792.079828 | NaN |
min | NaN | NaN | NaN | NaN | 0.000000 | NaN | NaN | 1910.00000 | NaN | 0.000000 | NaN | 10000.000000 | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | 1100.000000 | NaN | NaN | 1999.00000 | NaN | 71.000000 | NaN | 125000.000000 | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30459.000000 | NaN |
50% | NaN | NaN | NaN | NaN | 2900.000000 | NaN | NaN | 2003.00000 | NaN | 107.000000 | NaN | 150000.000000 | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49696.000000 | NaN |
75% | NaN | NaN | NaN | NaN | 6999.000000 | NaN | NaN | 2008.00000 | NaN | 150.000000 | NaN | 150000.000000 | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71665.000000 | NaN |
max | NaN | NaN | NaN | NaN | 30000.000000 | NaN | NaN | 2016.00000 | NaN | 17700.000000 | NaN | 150000.000000 | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
autos_19_2016['registration_year'].value_counts(normalize = True).sort_index()
1910    0.000187
1927    0.000021
1929    0.000021
1931    0.000021
1934    0.000042
          ...
2012    0.027546
2013    0.016782
2014    0.013867
2015    0.008308
2016    0.027401
Name: registration_year, Length: 78, dtype: float64
# autos_19_2016['ad_created'].str[:10].value_counts(normalize = True).sort_index()
autos_19_2016['ad_created'].str[:10].unique()
array(['2016-03-26', '2016-04-04', '2016-03-12', '2016-04-01', '2016-03-21', '2016-03-20', '2016-03-16', '2016-03-22', '2016-03-31', '2016-03-23', '2016-03-29', '2016-03-17', '2016-03-05', '2016-03-06', '2016-03-28', '2016-03-10', '2016-04-03', '2016-03-27', '2016-03-19', '2016-04-02', '2016-03-14', '2016-03-15', '2016-04-05', '2016-03-11', '2016-03-07', '2016-03-08', '2016-03-09', '2016-03-25', '2016-03-18', '2016-03-30', '2016-03-24', '2016-03-13', '2016-04-06', '2016-03-04', '2016-04-07', '2016-02-24', '2016-03-02', '2016-02-29', '2016-01-03', '2015-11-10', '2016-03-01', '2016-02-28', '2016-02-19', '2016-03-03', '2016-02-25', '2016-02-23', '2016-02-14', '2016-02-12', '2016-01-27', '2016-02-05', '2016-02-01', '2016-02-02', '2016-02-17', '2016-02-21', '2016-02-09', '2015-08-10', '2015-06-11', '2016-02-11', '2016-01-14', '2016-01-10', '2016-02-08', '2015-12-05', '2016-02-27', '2016-02-22', '2016-01-13', '2015-09-09', '2016-02-20', '2016-01-16', '2015-12-30', '2016-02-07', '2016-01-07', '2016-02-26', '2016-02-18', '2016-02-16'], dtype=object)
autos_19_2016['last_seen'].str[:10].value_counts(normalize = True).sort_index()
2016-03-05    0.001083
2016-03-06    0.004206
2016-03-07    0.005351
2016-03-08    0.007662
2016-03-09    0.009994
2016-03-10    0.010785
2016-03-11    0.012514
2016-03-12    0.023757
2016-03-13    0.008766
2016-03-14    0.012826
2016-03-15    0.016011
2016-03-16    0.016241
2016-03-17    0.027942
2016-03-18    0.007308
2016-03-19    0.015553
2016-03-20    0.020634
2016-03-21    0.020696
2016-03-22    0.021071
2016-03-23    0.018406
2016-03-24    0.019489
2016-03-25    0.018926
2016-03-26    0.016948
2016-03-27    0.016032
2016-03-28    0.020717
2016-03-29    0.022112
2016-03-30    0.024715
2016-03-31    0.023715
2016-04-01    0.023257
2016-04-02    0.024652
2016-04-03    0.025319
2016-04-04    0.024236
2016-04-05    0.124885
2016-04-06    0.222433
2016-04-07    0.131756
Name: last_seen, dtype: float64
autos_19_2016['date_crawled'].str[:10].value_counts(normalize = True).sort_index()
2016-03-05    0.025256
2016-03-06    0.014075
2016-03-07    0.036146
2016-03-08    0.033522
2016-03-09    0.033356
2016-03-10    0.032190
2016-03-11    0.032356
2016-03-12    0.036708
2016-03-13    0.015741
2016-03-14    0.036395
2016-03-15    0.034022
2016-03-16    0.029379
2016-03-17    0.031711
2016-03-18    0.012972
2016-03-19    0.034792
2016-03-20    0.037957
2016-03-21    0.037499
2016-03-22    0.032814
2016-03-23    0.032335
2016-03-24    0.029254
2016-03-25    0.031607
2016-03-26    0.032294
2016-03-27    0.030753
2016-03-28    0.034605
2016-03-29    0.034188
2016-03-30    0.033751
2016-03-31    0.031877
2016-04-01    0.033918
2016-04-02    0.035479
2016-04-03    0.038873
2016-04-04    0.036645
2016-04-05    0.012992
2016-04-06    0.003102
2016-04-07    0.001437
Name: date_crawled, dtype: float64
autos_19_2016.loc[(autos_19_2016['price'].notnull()) & (autos_19_2016['odometer_km'].notnull())].isnull().sum()
date_crawled             0
name                     0
seller                   0
offer_type               0
price                    0
abtest                   0
vehicle_type          2840
registration_year        0
gearbox               2138
power_ps                 0
model                 2261
odometer_km              0
registration_month       0
fuel_type             3412
brand                    0
unrepaired_damage     8506
ad_created               0
nr_of_pictures           0
postal_code              0
last_seen                0
dtype: int64
Observations after removing rows outside of registration_year 1900-2016

With the exception of the registration columns, the other three date columns (date_crawled, last_seen and ad_created) still mainly comprise car listings from 2015-2016.
RECOMMENDATION: Consider trimming the dataset to only include registration_years 2000-2016 instead
# CLEANING UP FURTHER BY MINIMISING NULL VALUES
# Removing rows with null values for the 'price' and 'odometer_km' columns
# Also removing the 'unrepaired_damage' column as there is an abundance of null values
autos_19_2016_cleaner = autos_19_2016.loc[(autos_19_2016['price'].notnull()) & (autos_19_2016['odometer_km'].notnull())].copy()
autos_19_2016_cleaner.drop(['unrepaired_damage'], axis = 1, inplace = True)
autos_19_2016_cleaner.info()
When working on data with cars, it's natural to explore variations across different car brands. We can use aggregation to understand the brand column.
Reminder on the process of aggregation:

1. Identify the unique values we want to aggregate by.
2. Create an empty dictionary to store our aggregate data.
3. Loop over the unique values, and for each one:
   - subset the dataframe by that unique value,
   - calculate the mean of whichever column we're interested in,
   - assign the unique value and its mean to the dictionary as a key/value pair.
# Step 1 - Identifying the number of unique brand values
autos_19_2016_cleaner['brand'].nunique()
40
# Step 2 - Exploring the unique brand values
autos_19_2016_cleaner['brand'].value_counts(normalize = True)
volkswagen        0.214218
opel              0.109535
bmw               0.109125
mercedes_benz     0.094528
audi              0.085040
ford              0.070615
renault           0.048105
peugeot           0.030273
fiat              0.026047
seat              0.018522
skoda             0.016279
mazda             0.015352
nissan            0.015287
smart             0.014296
citroen           0.014209
toyota            0.012722
hyundai           0.010091
volvo             0.009487
sonstige_autos    0.009013
mini              0.008711
mitsubishi        0.008194
honda             0.007935
kia               0.007310
alfa_romeo        0.006684
suzuki            0.006016
chevrolet         0.005520
chrysler          0.003666
porsche           0.003019
dacia             0.002609
daihatsu          0.002523
subaru            0.002221
jeep              0.002135
saab              0.001660
land_rover        0.001617
jaguar            0.001531
daewoo            0.001509
rover             0.001380
trabant           0.001337
lancia            0.001100
lada              0.000582
Name: brand, dtype: float64
# Step 3 - Selecting the top 20 brands
auto_brands_20 = autos_19_2016_cleaner['brand'].value_counts().head(20).index
auto_brands_20
Index(['volkswagen', 'opel', 'bmw', 'mercedes_benz', 'audi', 'ford', 'renault', 'peugeot', 'fiat', 'seat', 'skoda', 'mazda', 'nissan', 'smart', 'citroen', 'toyota', 'hyundai', 'volvo', 'sonstige_autos', 'mini'], dtype='object')
# Step 4 - Finding the aggregate mean_selling_price for cars in each of the top 20 brands
brand_mean_prices = {}
for brand in auto_brands_20:
mean_price = round(autos_19_2016_cleaner.loc[autos_19_2016_cleaner['brand'] == brand, 'price'].mean())
# Creating a new key_value pair with this mean price
brand_mean_prices[brand] = mean_price
brand_mean_prices
{'volkswagen': 5038, 'opel': 2859, 'bmw': 7247, 'mercedes_benz': 7357, 'audi': 7773, 'ford': 3407, 'renault': 2319, 'peugeot': 3044, 'fiat': 2674, 'seat': 4265, 'skoda': 6271, 'mazda': 3815, 'nissan': 4607, 'smart': 3561, 'citroen': 3621, 'toyota': 4989, 'hyundai': 5282, 'volvo': 4781, 'sonstige_autos': 7417, 'mini': 10334}
# Step 5 - Sorting the values in the brand dictionary according to the highest mean price
brand_mean_prices = dict(sorted(brand_mean_prices.items(), key = lambda data: data[1], reverse = True ))
brand_mean_prices
{'mini': 10334, 'audi': 7773, 'sonstige_autos': 7417, 'mercedes_benz': 7357, 'bmw': 7247, 'skoda': 6271, 'hyundai': 5282, 'volkswagen': 5038, 'toyota': 4989, 'volvo': 4781, 'nissan': 4607, 'seat': 4265, 'mazda': 3815, 'citroen': 3621, 'smart': 3561, 'ford': 3407, 'peugeot': 3044, 'opel': 2859, 'fiat': 2674, 'renault': 2319}
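As an aside, the loop-and-dictionary pattern in Steps 4-5 can be expressed more concisely with `groupby`. A sketch on toy data (column names match the notebook's; the values are made up):

```python
import pandas as pd

# Toy listings with the notebook's 'brand' and 'price' column names
df = pd.DataFrame({
    'brand': ['bmw', 'bmw', 'audi', 'opel'],
    'price': [7000, 7500, 7800, 2800],
})

# Group by brand, take the mean price, round it, and sort descending
mean_prices = (df.groupby('brand')['price']
                 .mean()
                 .round()
                 .sort_values(ascending=False))

print(mean_prices.to_dict())  # {'audi': 7800.0, 'bmw': 7250.0, 'opel': 2800.0}
```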
Observation From Results

From the exploration and brief statistical aggregation conducted, it can be seen that the brand mini has the highest mean selling price among the top 20 brands in our dataset, followed by audi, sonstige_autos, mercedes_benz and bmw.
For the top 20 brands, we would like to use another aggregation to understand the average mileage for those cars and whether there is any visible link with the mean price.
Using these two aggregated Series objects, we can then combine them into a single dataframe (with a shared index).
# Step 1 - Calculating the mean mileage of the top 20 car brands
brand_mean_mileages = {}
for brand in auto_brands_20:
mean_mileage = round(autos_19_2016_cleaner.loc[autos_19_2016_cleaner['brand'] == brand, 'odometer_km'].mean())
# Creating a new key_value pair with this mean mileage
brand_mean_mileages[brand] = mean_mileage
# Step 2 - Sorting the values in the brand dictionary according to the highest mean mileage
brand_mean_mileages = dict(sorted(brand_mean_mileages.items(), key = lambda data: data[1], reverse = True ))
brand_mean_mileages
{'volvo': 139568, 'bmw': 136564, 'mercedes_benz': 134396, 'audi': 133746, 'opel': 131987, 'volkswagen': 131497, 'renault': 130547, 'peugeot': 128355, 'mazda': 127170, 'ford': 126649, 'seat': 123463, 'nissan': 121051, 'citroen': 121002, 'fiat': 119694, 'toyota': 117314, 'skoda': 113040, 'hyundai': 107596, 'sonstige_autos': 103481, 'smart': 100309, 'mini': 90829}
# Step 3 - Converting both aggregate dictionaries into Series objects using the pd.Series() constructor
mean_prices_top20 = pd.Series(brand_mean_prices)
mean_milieage_top20 = pd.Series(brand_mean_mileages)
# Step 4 - Creating a dataframe using the mean_prices series
top_20_brands = pd.DataFrame(mean_prices_top20, columns=['mean_price'])
top_20_brands
mean_price | |
---|---|
mini | 10334 |
audi | 7773 |
sonstige_autos | 7417 |
mercedes_benz | 7357 |
bmw | 7247 |
skoda | 6271 |
hyundai | 5282 |
volkswagen | 5038 |
toyota | 4989 |
volvo | 4781 |
nissan | 4607 |
seat | 4265 |
mazda | 3815 |
citroen | 3621 |
smart | 3561 |
ford | 3407 |
peugeot | 3044 |
opel | 2859 |
fiat | 2674 |
renault | 2319 |
# Step 5 - Appending the mean mileage data from the other series as a column of this new dataframe
top_20_brands['mean_milieage'] = mean_milieage_top20
top_20_brands.sort_values('mean_milieage', ascending = False )
mean_price | mean_milieage | |
---|---|---|
volvo | 4781 | 139568 |
bmw | 7247 | 136564 |
mercedes_benz | 7357 | 134396 |
audi | 7773 | 133746 |
opel | 2859 | 131987 |
volkswagen | 5038 | 131497 |
renault | 2319 | 130547 |
peugeot | 3044 | 128355 |
mazda | 3815 | 127170 |
ford | 3407 | 126649 |
seat | 4265 | 123463 |
nissan | 4607 | 121051 |
citroen | 3621 | 121002 |
fiat | 2674 | 119694 |
toyota | 4989 | 117314 |
skoda | 6271 | 113040 |
hyundai | 5282 | 107596 |
sonstige_autos | 7417 | 103481 |
smart | 3561 | 100309 |
mini | 10334 | 90829 |
# Exploring the two aggregates together to determine the relationship, if any, between price and mileage
print(top_20_brands['mean_price'].idxmax())
print(top_20_brands['mean_price'].idxmin())
mini
renault
print(top_20_brands['mean_milieage'].idxmax())
print(top_20_brands['mean_milieage'].idxmin())
volvo
mini
Conclusion

With the exception of the 'mini' brand, mean price and mean mileage appear to be mildly correlated: brands such as bmw, mercedes_benz, audi, opel and volkswagen (the top 5 by listing count) tend to be higher in both mean price and mean mileage.
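To put a number on this claimed correlation, we could compute the Pearson correlation between the two aggregate columns (values transcribed from the top-20 table above; `mean_mileage` is spelled conventionally here):

```python
import pandas as pd

# The aggregated top-20 values from the table above
top_20 = pd.DataFrame({
    'mean_price': [10334, 7773, 7417, 7357, 7247, 6271, 5282, 5038, 4989, 4781,
                   4607, 4265, 3815, 3621, 3561, 3407, 3044, 2859, 2674, 2319],
    'mean_mileage': [90829, 133746, 103481, 134396, 136564, 113040, 107596, 131497,
                     117314, 139568, 121051, 123463, 127170, 121002, 100309, 126649,
                     128355, 131987, 119694, 130547],
}, index=['mini', 'audi', 'sonstige_autos', 'mercedes_benz', 'bmw', 'skoda',
          'hyundai', 'volkswagen', 'toyota', 'volvo', 'nissan', 'seat', 'mazda',
          'citroen', 'smart', 'ford', 'peugeot', 'opel', 'fiat', 'renault'])

# Pearson correlation, with and without the 'mini' outlier
r_all = top_20['mean_price'].corr(top_20['mean_mileage'])
r_no_mini = top_20.drop('mini')['mean_price'].corr(top_20.drop('mini')['mean_mileage'])
print(round(r_all, 3), round(r_no_mini, 3))
```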