The aim of this project is to clean and analyze the included used car listings.
The dataset was originally scraped and uploaded to Kaggle. From this original scraped dataset, 50,000 data points were sampled and the data was 'dirtied' by Dataquest for the purposes of this project.
The data dictionary provided with the data is as follows:

- `dateCrawled` - When this ad was first crawled. All field values are taken from this date.
- `name` - Name of the car.
- `seller` - Whether the seller is private or a dealer.
- `offerType` - The type of listing.
- `price` - The price on the ad to sell the car.
- `abtest` - Whether the listing is included in an A/B test.
- `vehicleType` - The vehicle type.
- `yearOfRegistration` - The year in which the car was first registered.
- `gearbox` - The transmission type.
- `powerPS` - The power of the car in PS.
- `model` - The car model name.
- `kilometer` - How many kilometers the car has driven.
- `monthOfRegistration` - The month in which the car was first registered.
- `fuelType` - What type of fuel the car uses.
- `brand` - The brand of the car.
- `notRepairedDamage` - If the car has damage which is not yet repaired.
- `dateCreated` - The date on which the eBay listing was created.
- `nrOfPictures` - The number of pictures in the ad.
- `postalCode` - The postal code for the location of the vehicle.
- `lastSeenOnline` - When the crawler saw this ad last online.
# import libraries
import numpy as np
import pandas as pd

# read dataset into a dataframe object, trying various common encodings
try:
    autos = pd.read_csv('/content/drive/My Drive/Datasets/autos.csv')
    print("Read file: 'UTF-8'")
except UnicodeDecodeError:
    try:
        autos = pd.read_csv('/content/drive/My Drive/Datasets/autos.csv', encoding='Latin-1')
        print("Read file: 'Latin-1'")
    except UnicodeDecodeError:
        try:
            autos = pd.read_csv('/content/drive/My Drive/Datasets/autos.csv', encoding='Windows-1252')
            print("Read file: 'Windows-1252'")
        except UnicodeDecodeError:
            print('Cannot read file: other encoding')
Read file: 'Latin-1'
# a notebook renders a truncated view of a pandas object; note: print(df) will not produce this rendered view
autos
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
49995 | 2016-03-27 14:38:19 | Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon | privat | Angebot | $24,900 | control | limousine | 2011 | automatik | 239 | q5 | 100,000km | 1 | diesel | audi | nein | 2016-03-27 00:00:00 | 0 | 82131 | 2016-04-01 13:47:40 |
49996 | 2016-03-28 10:50:25 | Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... | privat | Angebot | $1,980 | control | cabrio | 1996 | manuell | 75 | astra | 150,000km | 5 | benzin | opel | nein | 2016-03-28 00:00:00 | 0 | 44807 | 2016-04-02 14:18:02 |
49997 | 2016-04-02 14:44:48 | Fiat_500_C_1.2_Dualogic_Lounge | privat | Angebot | $13,200 | test | cabrio | 2014 | automatik | 69 | 500 | 5,000km | 11 | benzin | fiat | nein | 2016-04-02 00:00:00 | 0 | 73430 | 2016-04-04 11:47:27 |
49998 | 2016-03-08 19:25:42 | Audi_A3_2.0_TDI_Sportback_Ambition | privat | Angebot | $22,900 | control | kombi | 2013 | manuell | 150 | a3 | 40,000km | 11 | diesel | audi | nein | 2016-03-08 00:00:00 | 0 | 35683 | 2016-04-05 16:45:07 |
49999 | 2016-03-14 00:42:12 | Opel_Vectra_1.6_16V | privat | Angebot | $1,250 | control | limousine | 1996 | manuell | 101 | vectra | 150,000km | 1 | benzin | opel | nein | 2016-03-13 00:00:00 | 0 | 45897 | 2016-04-06 21:18:48 |
50000 rows × 20 columns
autos.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offerType            50000 non-null  object
 4   price                50000 non-null  object
 5   abtest               50000 non-null  object
 6   vehicleType          44905 non-null  object
 7   yearOfRegistration   50000 non-null  int64 
 8   gearbox              47320 non-null  object
 9   powerPS              50000 non-null  int64 
 10  model                47242 non-null  object
 11  odometer             50000 non-null  object
 12  monthOfRegistration  50000 non-null  int64 
 13  fuelType             45518 non-null  object
 14  brand                50000 non-null  object
 15  notRepairedDamage    40171 non-null  object
 16  dateCreated          50000 non-null  object
 17  nrOfPictures         50000 non-null  int64 
 18  postalCode           50000 non-null  int64 
 19  lastSeen             50000 non-null  object
dtypes: int64(5), object(15)
memory usage: 7.6+ MB
Initial observations:

- The `vehicleType`, `gearbox`, `model`, `fuelType` and `notRepairedDamage` columns contain missing values.
- Columns are stored as `object` or `int64` types. Some columns can be cast from `object` to a numeric or boolean dtype.

Let's convert the column names from camelcase to snakecase and reword some of the column names based on the data dictionary to be more descriptive.
!pip install stringcase
import stringcase
print(autos.columns, '\n')
#create copy of df.columns attribute
new_columns = autos.columns.copy()
#use stringcase.snakecase function to change all column names to snakecase
new_columns = list(map(stringcase.snakecase, new_columns))
print(new_columns)
Requirement already satisfied: stringcase in /usr/local/lib/python3.6/dist-packages (1.2.0)

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object') 

['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest', 'vehicle_type', 'year_of_registration', 'gearbox', 'power_p_s', 'model', 'odometer', 'month_of_registration', 'fuel_type', 'brand', 'not_repaired_damage', 'date_created', 'nr_of_pictures', 'postal_code', 'last_seen']
#rename some columns
# yearOfRegistration to registration_year
# monthOfRegistration to registration_month
# notRepairedDamage to unrepaired_damage
# dateCreated to ad_created
new_columns[new_columns.index('year_of_registration')] = 'registration_year'
new_columns[new_columns.index('month_of_registration')] = 'registration_month'
new_columns[new_columns.index('not_repaired_damage')] = 'unrepaired_damage'
new_columns[new_columns.index('date_created')] = 'ad_created'
print(new_columns, '\n')
autos.columns = new_columns
['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest', 'vehicle_type', 'registration_year', 'gearbox', 'power_p_s', 'model', 'odometer', 'registration_month', 'fuel_type', 'brand', 'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code', 'last_seen']
Changes to the column names have been made for consistency, readability and brevity of code, and because snake case is a beautiful Python convention!
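If installing `stringcase` isn't an option, the same camelcase-to-snakecase conversion can be done with the standard library alone. A minimal sketch (the helper name `to_snake_case` is my own, not part of the project):

```python
import re

def to_snake_case(name):
    # insert an underscore before every uppercase letter (except at the
    # start of the string), then lowercase the whole name
    return re.sub(r'(?<!^)(?=[A-Z])', '_', name).lower()

columns = ['dateCrawled', 'offerType', 'powerPS', 'yearOfRegistration']
print([to_snake_case(c) for c in columns])
```

Note that, like `stringcase.snakecase`, this turns `powerPS` into `power_p_s`, since each capital gets its own underscore.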
Let's do some further exploring, Christopher Columbus style...
autos.describe(include='all')
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_p_s | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-04-04 16:40:33 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
Further observations:

- The `seller` and `offer_type` columns each contain only two unique values, and in each column all but one entry are the same!
- The `price` column and the `odometer` column contain text values and would more usefully store numeric values.
- The `unrepaired_damage` column contains only two non-null unique entries, so entries could more usefully be converted to boolean values.
- The `vehicle_type`, `gearbox`, `model`, `fuel_type` and `unrepaired_damage` columns each contain missing values.
- The `nr_of_pictures` column contains numeric values, but all values are 0.0.
- The `registration_year` column has a `max` entry of 9999.000000 and a `min` entry of 1000.000000.

print(autos.price.unique())
print(autos.price.unique().shape[0])
print()
print(autos.odometer.unique())
print(autos.odometer.unique().shape[0])
['$5,000' '$8,500' '$8,990' ... '$385' '$22,200' '$16,995']
2357

['150,000km' '70,000km' '50,000km' '80,000km' '10,000km' '30,000km'
 '125,000km' '90,000km' '20,000km' '60,000km' '5,000km' '100,000km'
 '40,000km']
13
# cast price and odometer columns to numeric dtype
# regex=False makes it explicit that '$' is a literal character, not a regex anchor
autos['price'] = (autos['price']
                  .str.replace('$', '', regex=False)
                  .str.replace(',', '', regex=False)
                  .astype(int)
                  )
autos['odometer'] = (autos['odometer']
                     .str.replace('km', '', regex=False)
                     .str.replace(',', '', regex=False)
                     .astype(int)
                     )
# rename the converted columns to record their units
autos.rename(columns={'price': 'price_$', 'odometer': 'odometer_km'}, inplace=True)
Let's examine the values in the `price_$` and `odometer_km` columns more closely:
# descriptive statistics for price_$ column
autos['price_$'].describe()
count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price_$, dtype: float64
From the invocation of the `df.describe()` method earlier, we know there are 2357 unique values in the column. The maximum value is $100 million and the minimum is $0. The interquartile range runs from $1100 to $7200. The mean is around $9840, but it has been inflated by some very large values in the dataset.
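The way a single extreme listing drags the mean far above typical values, while the median stays put, can be seen on a toy example (hypothetical numbers echoing the quartiles above):

```python
import numpy as np

# toy prices with one extreme outlier, echoing the $100 million listing
prices = np.array([1_100, 2_950, 5_000, 7_200, 100_000_000])
print('mean:  ', np.mean(prices))   # pulled far above typical values
print('median:', np.median(prices)) # robust to the outlier
```

This is why the median (50th percentile) is the more trustworthy 'typical price' here.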
# value counts for most common values and for highest and lowest values in price_$ column
print(autos['price_$'].value_counts().head(20))
print()
print(autos['price_$'].value_counts().sort_index(ascending=False).head(20))
print()
print(autos['price_$'].value_counts().sort_index(ascending=True).head(20))
0 1421 500 781 1500 734 2500 643 1000 639 1200 639 600 531 800 498 3500 498 2000 460 999 434 750 433 900 420 650 419 850 410 700 395 4500 394 300 384 2200 382 950 379 Name: price_$, dtype: int64 99999999 1 27322222 1 12345678 3 11111111 2 10000000 1 3890000 1 1300000 1 1234566 1 999999 2 999990 1 350000 1 345000 1 299000 1 295000 1 265000 1 259000 1 250000 1 220000 1 198000 1 197000 1 Name: price_$, dtype: int64 0 1421 1 156 2 3 3 1 5 2 8 1 9 1 10 7 11 2 12 3 13 2 14 1 15 2 17 3 18 1 20 4 25 5 29 1 30 7 35 1 Name: price_$, dtype: int64
There are 1421 ads with an asking price of $0, around 2.8% of the dataset. We will drop these rows, as legitimate eBay listings tend to have a starting price of at least $1 in an auction. We will keep rows with a value of $1 or more, since there are a number of entries of $35 or below. We will remove rows with prices over $350,000: these values seem excessively high for a car, and the values climb fairly consistently up to this point before jumping to a small number of widely spread-out values.

Many of the values appear to have been rounded. This could be because sellers have to choose from a set of preset options, but given that there are 2357 unique entries, it is more likely due to people's natural tendency to round prices.
# .copy() gives an independent dataframe, avoiding SettingWithCopyWarning on later assignments
autos = autos.loc[autos['price_$'].between(1, 350000), :].copy()
# descriptive statistics for price_$ column with outliers removed
autos['price_$'].describe()
count     48565.000000
mean       5888.935591
std        9059.854754
min           1.000000
25%        1200.000000
50%        3000.000000
75%        7490.000000
max      350000.000000
Name: price_$, dtype: float64
From earlier we know there are only 13 unique values in the odometer_km
column.
print(autos['odometer_km'].value_counts().head(20))
print()
150000    31414
125000     5057
100000     2115
90000      1734
80000      1415
70000      1217
60000      1155
50000      1012
5000        836
40000       815
30000       780
20000       762
10000       253
Name: odometer_km, dtype: int64
All these values appear to have been rounded, so there was likely a limited selection of options for this field on the eBay site. Over 60% of listings have the maximum value of 150,000 km. A significant number of cars may actually have more mileage than this, with 150,000 km simply the highest option available; however, this is fairly high mileage for a car, so we can assume most were close to this value. We will not remove any rows after examining this column.
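The 'over 60%' figure falls straight out of a normalized comparison against the cap. A sketch on a small stand-in series (hypothetical values, since the real column isn't reloaded here):

```python
import pandas as pd

# hypothetical stand-in for autos['odometer_km']
odometer_km = pd.Series([150000] * 13 + [125000] * 4 + [50000] * 3)

# fraction of listings sitting exactly at the 150,000 km cap
share_at_cap = (odometer_km == 150000).mean()
print(f"{share_at_cap:.0%} of listings sit at the 150,000 km cap")
```

On the real column the same expression gives 31414 / 48565, roughly 65%.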
# drop the seller, offer_type and nr_of_pictures columns from autos dataframe
autos.drop(columns=['seller', 'offer_type', 'nr_of_pictures'], inplace=True, errors='ignore')
Converting the `unrepaired_damage` column to bool dtype

print(autos.unrepaired_damage.unique())
['nein' nan 'ja']
# cast unrepaired_damage column to boolean dtype
# note: astype(bool) maps the remaining NaN entries to True, so cars with
# missing damage information end up flagged as damaged - a caveat to keep in mind
autos['unrepaired_damage'] = (autos['unrepaired_damage']
                              .map({'nein': False, 'ja': True}, na_action='ignore')
                              .astype(bool)
                              )
print(autos.columns)
Index(['date_crawled', 'name', 'price_$', 'abtest', 'vehicle_type', 'registration_year', 'gearbox', 'power_p_s', 'model', 'odometer_km', 'registration_month', 'fuel_type', 'brand', 'unrepaired_damage', 'ad_created', 'postal_code', 'last_seen'], dtype='object')
There are 5 columns that should represent date values:

- `date_crawled` - from crawler
- `registration_year`
- `registration_month`
- `ad_created`
- `last_seen` - from crawler

autos.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 48565 entries, 0 to 49999
Data columns (total 17 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   date_crawled        48565 non-null  object
 1   name                48565 non-null  object
 2   price_$             48565 non-null  int64 
 3   abtest              48565 non-null  object
 4   vehicle_type        43979 non-null  object
 5   registration_year   48565 non-null  int64 
 6   gearbox             46222 non-null  object
 7   power_p_s           48565 non-null  int64 
 8   model               46107 non-null  object
 9   odometer_km         48565 non-null  int64 
 10  registration_month  48565 non-null  int64 
 11  fuel_type           44535 non-null  object
 12  brand               48565 non-null  object
 13  unrepaired_damage   48565 non-null  bool  
 14  ad_created          48565 non-null  object
 15  postal_code         48565 non-null  int64 
 16  last_seen           48565 non-null  object
dtypes: bool(1), int64(6), object(10)
memory usage: 6.3+ MB
# explore columns with values currently stored as strings
autos.loc[:, ['date_crawled', 'ad_created', 'last_seen']].head()
date_crawled | ad_created | last_seen | |
---|---|---|---|
0 | 2016-03-26 17:47:46 | 2016-03-26 00:00:00 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | 2016-04-04 00:00:00 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | 2016-03-26 00:00:00 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | 2016-03-12 00:00:00 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | 2016-04-01 00:00:00 | 2016-04-01 14:38:50 |
Entries in all these columns begin with a datestamp in the format 'yyyy-mm-dd'.
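Because the strings follow a standard timestamp layout, these columns could also be cast to a proper datetime dtype with `pd.to_datetime`. A small sketch on sample values (slicing off the date part mirrors the `str[:10]` approach used in this project):

```python
import pandas as pd

# sample timestamps in the same format as date_crawled / ad_created / last_seen
dates = pd.Series(['2016-03-26 17:47:46', '2016-04-04 13:38:56'])

parsed = pd.to_datetime(dates)  # full timestamps parse to datetime64
print(parsed.dt.date)           # date part, equivalent to slicing str[:10]
```

Working with datetime64 would unlock `.dt` accessors (year, month, day) for free, though string slicing is sufficient for the distributions below.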
# distribution of date values in 'date_crawled' column
print(autos['date_crawled']
      .str[:10]
      .value_counts(normalize=True, dropna=False)
      .sort_index()
      )
2016-03-05 0.025327 2016-03-06 0.014043 2016-03-07 0.036014 2016-03-08 0.033296 2016-03-09 0.033090 2016-03-10 0.032184 2016-03-11 0.032575 2016-03-12 0.036920 2016-03-13 0.015670 2016-03-14 0.036549 2016-03-15 0.034284 2016-03-16 0.029610 2016-03-17 0.031628 2016-03-18 0.012911 2016-03-19 0.034778 2016-03-20 0.037887 2016-03-21 0.037373 2016-03-22 0.032987 2016-03-23 0.032225 2016-03-24 0.029342 2016-03-25 0.031607 2016-03-26 0.032204 2016-03-27 0.031092 2016-03-28 0.034860 2016-03-29 0.034099 2016-03-30 0.033687 2016-03-31 0.031834 2016-04-01 0.033687 2016-04-02 0.035478 2016-04-03 0.038608 2016-04-04 0.036487 2016-04-05 0.013096 2016-04-06 0.003171 2016-04-07 0.001400 Name: date_crawled, dtype: float64
Dates in the `date_crawled` column range between 2016-03-05 and 2016-04-07. There is a very uniform spread of dates within this range up to 2016-04-05.
# distribution of date values in 'ad_created' column
print(autos['ad_created']
      .str[:10]
      .value_counts(normalize=True, dropna=False)
      .sort_index()
      )
2015-06-11 0.000021 2015-08-10 0.000021 2015-09-09 0.000021 2015-11-10 0.000021 2015-12-05 0.000021 ... 2016-04-03 0.038855 2016-04-04 0.036858 2016-04-05 0.011819 2016-04-06 0.003253 2016-04-07 0.001256 Name: ad_created, Length: 76, dtype: float64
# view first half in ascending order
print(autos['ad_created']
      .str[:10]
      .value_counts(normalize=True, dropna=False)
      .sort_index()
      .head(38)
      )
2015-06-11 0.000021 2015-08-10 0.000021 2015-09-09 0.000021 2015-11-10 0.000021 2015-12-05 0.000021 2015-12-30 0.000021 2016-01-03 0.000021 2016-01-07 0.000021 2016-01-10 0.000041 2016-01-13 0.000021 2016-01-14 0.000021 2016-01-16 0.000021 2016-01-22 0.000021 2016-01-27 0.000062 2016-01-29 0.000021 2016-02-01 0.000021 2016-02-02 0.000041 2016-02-05 0.000041 2016-02-07 0.000021 2016-02-08 0.000021 2016-02-09 0.000021 2016-02-11 0.000021 2016-02-12 0.000041 2016-02-14 0.000041 2016-02-16 0.000021 2016-02-17 0.000021 2016-02-18 0.000041 2016-02-19 0.000062 2016-02-20 0.000041 2016-02-21 0.000062 2016-02-22 0.000021 2016-02-23 0.000082 2016-02-24 0.000041 2016-02-25 0.000062 2016-02-26 0.000041 2016-02-27 0.000124 2016-02-28 0.000206 2016-02-29 0.000165 Name: ad_created, dtype: float64
# view first half in descending order
print(autos['ad_created']
      .str[:10]
      .value_counts(normalize=True, dropna=False)
      .sort_index(ascending=False)
      .head(38)
      )
2016-04-07 0.001256 2016-04-06 0.003253 2016-04-05 0.011819 2016-04-04 0.036858 2016-04-03 0.038855 2016-04-02 0.035149 2016-04-01 0.033687 2016-03-31 0.031875 2016-03-30 0.033501 2016-03-29 0.034037 2016-03-28 0.034984 2016-03-27 0.030989 2016-03-26 0.032266 2016-03-25 0.031751 2016-03-24 0.029280 2016-03-23 0.032060 2016-03-22 0.032801 2016-03-21 0.037579 2016-03-20 0.037949 2016-03-19 0.033687 2016-03-18 0.013590 2016-03-17 0.031278 2016-03-16 0.030125 2016-03-15 0.034016 2016-03-14 0.035190 2016-03-13 0.017008 2016-03-12 0.036755 2016-03-11 0.032904 2016-03-10 0.031895 2016-03-09 0.033151 2016-03-08 0.033316 2016-03-07 0.034737 2016-03-06 0.015320 2016-03-05 0.022897 2016-03-04 0.001483 2016-03-03 0.000865 2016-03-02 0.000103 2016-03-01 0.000103 Name: ad_created, dtype: float64
There is a large range of `ad_created` dates, and many dates have a very low relative frequency. Most of the dates fall in the one-month period from 2016-03-05 to 2016-04-05. All dates outside this range have a far lower relative frequency than those inside.
# distribution of date values in 'last_seen' column
print(autos['last_seen']
      .str[:10]
      .value_counts(normalize=True, dropna=False)
      .sort_index()
      )
2016-03-05 0.001071 2016-03-06 0.004324 2016-03-07 0.005395 2016-03-08 0.007413 2016-03-09 0.009595 2016-03-10 0.010666 2016-03-11 0.012375 2016-03-12 0.023783 2016-03-13 0.008895 2016-03-14 0.012602 2016-03-15 0.015876 2016-03-16 0.016452 2016-03-17 0.028086 2016-03-18 0.007351 2016-03-19 0.015834 2016-03-20 0.020653 2016-03-21 0.020632 2016-03-22 0.021373 2016-03-23 0.018532 2016-03-24 0.019767 2016-03-25 0.019211 2016-03-26 0.016802 2016-03-27 0.015649 2016-03-28 0.020859 2016-03-29 0.022341 2016-03-30 0.024771 2016-03-31 0.023783 2016-04-01 0.022794 2016-04-02 0.024915 2016-04-03 0.025203 2016-04-04 0.024483 2016-04-05 0.124761 2016-04-06 0.221806 2016-04-07 0.131947 Name: last_seen, dtype: float64
The dates in this column allow us to infer when a car was sold, as presumably this is why a listing is removed.

The latest 3 dates account for around 47% of the values in the `last_seen` column, 6-10x the proportion of values from the preceding days. This is more likely due to the crawler ending its operational period than to a sudden spike in sales on these days.
# explore the 'registration_year' numeric column values
autos['registration_year'].describe()
count    48565.000000
mean      2004.755421
std         88.643887
min       1000.000000
25%       1999.000000
50%       2004.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64
This column allows us to infer the age of a car. The min value is 1000 and the max value is 9999, which do not make sense! We need to remove the 'Flintstones-mobiles' and the 'quantum-teleportation-mobiles'!
Because a car can't be first registered after the listing was seen, any vehicle with a registration year above 2016 is definitely inaccurate. Determining the earliest valid year is more difficult. Realistically, it could be somewhere in the first few decades of the 1900s.
Let's count the number of listings with cars that fall outside the 1900 - 2016 interval and see if it's safe to remove those rows entirely, or if we need more custom logic.
(~autos['registration_year'].between(1900, 2016)).sum() / (autos.shape[0])
0.038793369710697
As we lose less than 4% of the rows by removing those with a `registration_year` value outside this range, dropping them from the dataframe seems like a sensible action to take.
autos = autos.loc[autos['registration_year'].between(1900, 2016), :]
top_reg_years = autos['registration_year'].value_counts(normalize=True).head(20)
print(top_reg_years.sort_index(ascending=False))
print(top_reg_years.sort_index(ascending=False).sum())
2016    0.026135
2013    0.017202
2012    0.028063
2011    0.034768
2010    0.034040
2009    0.044665
2008    0.047450
2007    0.048778
2006    0.057197
2005    0.062895
2004    0.057904
2003    0.057818
2002    0.053255
2001    0.056468
2000    0.067608
1999    0.062060
1998    0.050620
1997    0.041794
1996    0.029412
1995    0.026285
Name: registration_year, dtype: float64

0.9044150725134423
The 20 most common registration years, all since 1995, account for over 90% of the cars.
# value counts in 'brand' column
brand_rel_freqs = autos['brand'].value_counts(normalize=True)
print(brand_rel_freqs)
volkswagen 0.211264 bmw 0.110045 opel 0.107581 mercedes_benz 0.096463 audi 0.086566 ford 0.069900 renault 0.047150 peugeot 0.029841 fiat 0.025642 seat 0.018273 skoda 0.016409 nissan 0.015274 mazda 0.015188 smart 0.014160 citroen 0.014010 toyota 0.012703 hyundai 0.010025 sonstige_autos 0.009811 volvo 0.009147 mini 0.008762 mitsubishi 0.008226 honda 0.007840 kia 0.007069 alfa_romeo 0.006641 porsche 0.006127 suzuki 0.005934 chevrolet 0.005698 chrysler 0.003513 dacia 0.002635 daihatsu 0.002506 jeep 0.002271 subaru 0.002142 land_rover 0.002099 saab 0.001649 jaguar 0.001564 daewoo 0.001500 trabant 0.001392 rover 0.001328 lancia 0.001071 lada 0.000578 Name: brand, dtype: float64
I am going to focus on the top 10 brands in this list and calculate the mean price for cars belonging to each of these brands.
# create dict storing average price for cars of each brand
brand_average_price = {}
for b in brand_rel_freqs.index[:10]:
    value = autos.loc[autos['brand'] == b, 'price_$'].mean()
    brand_average_price[b] = int(value)
print(sorted(brand_average_price.items(), key=lambda x: x[1], reverse=True))
[('audi', 9336), ('mercedes_benz', 8628), ('bmw', 8332), ('volkswagen', 5402), ('seat', 4397), ('ford', 3749), ('peugeot', 3094), ('opel', 2975), ('fiat', 2813), ('renault', 2474)]
Of the ten most popular brands in the dataset, the most expensive on average are Audi, Mercedes-Benz and BMW. There is a significant price gap between these brands and the rest, with only Volkswagen and Seat in between. Note that Volkswagen is by far the most popular brand, with over 20% of the listings, almost twice the proportion of the second most popular, BMW. This makes it clear, without further calculation, that Volkswagen has the largest 'total listing price' of all brands in the dataset.
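If we did want to verify that claim numerically, a groupby-sum gives total listing price per brand directly. The mechanics on a toy stand-in dataframe (hypothetical prices, not the real listings):

```python
import pandas as pd

# toy stand-in for the autos dataframe
toy = pd.DataFrame({
    'brand':   ['volkswagen', 'volkswagen', 'bmw', 'audi'],
    'price_$': [5000, 6000, 8000, 9000],
})

# total listing price per brand; idxmax picks the brand with the largest total
totals = toy.groupby('brand')['price_$'].sum()
print(totals.idxmax())
```

Run against `autos`, the same two lines would confirm Volkswagen's lead.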
# create dict storing average mileage for cars of each brand
brand_average_mileage = {}
for b in brand_rel_freqs.index[:10]:
    value = autos.loc[autos['brand'] == b, 'odometer_km'].mean()
    brand_average_mileage[b] = int(value)
print(sorted(brand_average_mileage.items(), key=lambda x: x[1], reverse=True))
[('bmw', 132572), ('mercedes_benz', 130788), ('opel', 129310), ('audi', 129157), ('volkswagen', 128707), ('renault', 128071), ('peugeot', 127153), ('ford', 124266), ('seat', 121131), ('fiat', 117121)]
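The two dict-building loops above can be collapsed into a single groupby aggregation, which computes both means per brand in one pass. A sketch on a toy stand-in dataframe (values chosen so the bmw/audi means match the real results above):

```python
import pandas as pd

# toy stand-in for autos; values chosen to echo the bmw/audi means
toy = pd.DataFrame({
    'brand':       ['bmw', 'bmw', 'audi', 'audi'],
    'price_$':     [8000, 8664, 9000, 9672],
    'odometer_km': [150000, 115144, 125000, 133314],
})

# mean price and mean mileage per brand, ordered by mileage
means = (toy.groupby('brand')[['price_$', 'odometer_km']]
            .mean()
            .sort_values('odometer_km', ascending=False))
print(means)
```

On the real dataframe, `autos.groupby('brand')[['price_$', 'odometer_km']].mean()` (restricted to the top 10 brands) would replace both loops and the dict-to-Series conversion below.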
# convert dicts to series objects
average_price_series = pd.Series(brand_average_price)
average_mileage_series = pd.Series(brand_average_mileage)
# create dataframe from series objects and order rows by average mileage values
frame = {'mean_odometer_km':average_mileage_series, 'mean_price_$':average_price_series}
means_df = pd.DataFrame(frame).sort_values(by=['mean_odometer_km'], axis=0, ascending=False)
means_df
mean_odometer_km | mean_price_$ | |
---|---|---|
bmw | 132572 | 8332 |
mercedes_benz | 130788 | 8628 |
opel | 129310 | 2975 |
audi | 129157 | 9336 |
volkswagen | 128707 | 5402 |
renault | 128071 | 2474 |
peugeot | 127153 | 3094 |
ford | 124266 | 3749 |
seat | 121131 | 4397 |
fiat | 117121 | 2813 |
From the rendered dataframe we can see that average mileage varies much less by brand than average price: every brand's average mileage is within 10% of the top brand's. The four most expensive brands on average are also in the top five for average mileage, which hints at a trend of more expensive vehicles having higher mileage.
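That hinted-at trend can be put on a number with a correlation coefficient over the ten brand means (values copied from the table above; the correlation is only over 10 points, so treat it as indicative rather than conclusive):

```python
import pandas as pd

# brand means copied from the table above
means_df = pd.DataFrame({
    'mean_odometer_km': [132572, 130788, 129310, 129157, 128707,
                         128071, 127153, 124266, 121131, 117121],
    'mean_price_$':     [8332, 8628, 2975, 9336, 5402,
                         2474, 3094, 3749, 4397, 2813],
}, index=['bmw', 'mercedes_benz', 'opel', 'audi', 'volkswagen',
          'renault', 'peugeot', 'ford', 'seat', 'fiat'])

# Pearson correlation between mean mileage and mean price
corr = means_df['mean_odometer_km'].corr(means_df['mean_price_$'])
print(round(corr, 2))
```

A moderately positive coefficient here is consistent with the observation that the pricier brands also sit near the top of the mileage ranking.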
autos.to_csv('/content/drive/My Drive/Datasets/autos_cleaned.csv', index=False)