In this project, we'll be working on a dataset of used cars from eBay Kleinanzeigen, a classifieds section of the German eBay website.
The version of the dataset we are working with is a 50,000-row sample prepared by Dataquest, which also simulated a less-cleaned version of the data.
You can find the original dataset here.
The data dictionary provided with the data is as follows:

- dateCrawled - When this ad was first crawled. All field values are taken from this date.
- name - Name of the car.
- seller - Whether the seller is private or a dealer.
- offerType - The type of listing.
- price - The price on the ad to sell the car.
- abtest - Whether the listing is included in an A/B test.
- vehicleType - The vehicle type.
- yearOfRegistration - The year in which the car was first registered.
- gearbox - The transmission type.
- powerPS - The power of the car in PS.
- model - The car model name.
- kilometer - How many kilometers the car has driven.
- monthOfRegistration - The month in which the car was first registered.
- fuelType - What type of fuel the car uses.
- brand - The brand of the car.
- notRepairedDamage - If the car has damage which is not yet repaired.
- dateCreated - The date on which the eBay listing was created.
- nrOfPictures - The number of pictures in the ad.
- postalCode - The postal code for the location of the vehicle.
- lastSeen - When the crawler last saw this ad online.

Average mileage / price ratio on major car brands.
Main Brands | Mean Mileage (km) | Mean Price ($) |
---|---|---|
bmw | 132,572 | 8,332 |
mercedes_benz | 130,788 | 8,628 |
opel | 129,310 | 2,975 |
audi | 129,157 | 9,336 |
volkswagen | 128,707 | 5,402 |
ford | 124,266 | 3,749 |
Top 10 - Most common brand/model combinations.
Position | Brand | Model | Frequency (%) |
---|---|---|---|
1 | volkswagen | golf | 7.94 |
2 | bmw | 3er | 5.60 |
3 | volkswagen | polo | 3.45 |
4 | opel | corsa | 3.41 |
5 | volkswagen | passat | 2.89 |
6 | opel | astra | 2.89 |
7 | audi | a4 | 2.64 |
8 | mercedes_benz | c_klasse | 2.43 |
9 | bmw | 5er | 2.42 |
10 | mercedes_benz | e_klasse | 2.05 |
Average price based on data grouped by mileage.
Class N° | Odometer Interval (km) | Frequency | AVG Mileage (km) | AVG Price ($) |
---|---|---|---|---|
1 | (4999.999, 13530.0] | 1,026 | 6,174 | 11,616 |
2 | (13530.0, 22060.0] | 742 | 20,000 | 18,448 |
3 | (22060.0, 30590.0] | 760 | 30,000 | 16,608 |
4 | (30590.0, 39120.0] | 0 | 0 | 0 |
5 | (39120.0, 47650.0] | 797 | 40,000 | 15,499 |
6 | (47650.0, 56180.0] | 993 | 50,000 | 13,812 |
7 | (56180.0, 64710.0] | 1,128 | 60,000 | 12,385 |
8 | (64710.0, 73240.0] | 1,187 | 70,000 | 10,927 |
9 | (73240.0, 81770.0] | 1,375 | 80,000 | 9,721 |
10 | (81770.0, 90300.0] | 1,673 | 90,000 | 8,465 |
11 | (90300.0, 98830.0] | 0 | 0 | 0 |
12 | (98830.0, 107360.0] | 2,058 | 100,000 | 8,132 |
13 | (107360.0, 115890.0] | 0 | 0 | 0 |
14 | (115890.0, 124420.0] | 0 | 0 | 0 |
15 | (124420.0, 132950.0] | 4,857 | 125,000 | 6,214 |
16 | (132950.0, 141480.0] | 0 | 0 | 0 |
17 | (141480.0, 150010.0] | 30,085 | 150,000 | 3,767 |
Average price of cars with and without unrepaired damage.
Unrepaired Damage | AVG Price ($) |
---|---|
No | 7,164 |
Yes | 2,241 |
Let's start by importing the libraries we need and reading the dataset into pandas.
import numpy as np
import pandas as pd
autos = pd.read_csv('autos.csv', encoding='Latin-1')
autos.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offerType            50000 non-null  object
 4   price                50000 non-null  object
 5   abtest               50000 non-null  object
 6   vehicleType          44905 non-null  object
 7   yearOfRegistration   50000 non-null  int64
 8   gearbox              47320 non-null  object
 9   powerPS              50000 non-null  int64
 10  model                47242 non-null  object
 11  odometer             50000 non-null  object
 12  monthOfRegistration  50000 non-null  int64
 13  fuelType             45518 non-null  object
 14  brand                50000 non-null  object
 15  notRepairedDamage    40171 non-null  object
 16  dateCreated          50000 non-null  object
 17  nrOfPictures         50000 non-null  int64
 18  postalCode           50000 non-null  int64
 19  lastSeen             50000 non-null  object
dtypes: int64(5), object(15)
memory usage: 7.6+ MB
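The `encoding='Latin-1'` argument above is needed because the file isn't valid UTF-8. As an aside, that trial-and-error can be made explicit with a small helper (a hypothetical sketch, not part of the original notebook):

```python
import pandas as pd

def read_csv_with_fallback(path, encodings=('utf-8', 'latin-1', 'windows-1252')):
    # Try each encoding in order and return the first successful parse.
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc)
        except UnicodeDecodeError:
            continue
    raise ValueError('None of the tried encodings worked for ' + path)
```

Note that `latin-1` maps every possible byte, so a fallback list ending in it always terminates.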
autos.head()
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
Observations: the column names use camelCase rather than Python's preferred snake_case. We'll modify the column names to make working with the data easier. First, we print an array with the names of the columns in the dataset.
print(autos.columns)
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model', 'odometer', 'monthOfRegistration', 'fuelType', 'brand', 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object')
# Modifying column names
autos.columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'ab_test',
'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
'odometer', 'registration_month', 'fuel_type', 'brand',
'unrepaired_damage', 'ad_created', 'num_photos', 'postal_code',
'last_seen']
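The renaming above was done by hand. The purely mechanical camelCase-to-snake_case part could be sketched with a regex (a hypothetical helper; it can't produce semantic renames such as `yearOfRegistration` → `registration_year`, and initialisms like `powerPS` come out as `power_p_s`):

```python
import re

def camel_to_snake(name):
    # Insert an underscore before every uppercase letter (except at the
    # start of the string), then lowercase the whole name.
    return re.sub(r'(?<!^)(?=[A-Z])', '_', name).lower()
```

For example, `camel_to_snake('dateCrawled')` yields `'date_crawled'`.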
autos.head()
date_crawled | name | seller | offer_type | price | ab_test | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | num_photos | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
We explore the data to detect areas that need to be cleaned.
autos.describe(include='all')
date_crawled | name | seller | offer_type | price | ab_test | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | num_photos | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-03-10 15:36:24 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
Observations:
- The seller and offer_type columns are text columns in which almost all values are identical.
- num_photos seems to have all of its values equal to zero.
- The price and odometer columns should be cleaned and converted to the int data type.

Let's explore the seller, offer_type and num_photos columns.
autos['seller'].value_counts()
privat        49999
gewerblich        1
Name: seller, dtype: int64
autos['offer_type'].value_counts()
Angebot    49999
Gesuch         1
Name: offer_type, dtype: int64
autos['num_photos'].value_counts()
0    50000
Name: num_photos, dtype: int64
We observe that the num_photos column contains only the value 0. We'll drop this column, along with the other two that are dominated by a single value.
autos = autos.drop(['num_photos', 'seller', 'offer_type'], axis=1)
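The three columns were identified by inspecting `value_counts()` one at a time; the same check could be automated with a sketch like the following (`near_constant_columns` is a hypothetical helper, and the 0.99 threshold is an arbitrary choice):

```python
import pandas as pd

def near_constant_columns(df, threshold=0.99):
    # Flags columns in which a single value accounts for at least
    # `threshold` of all rows (value_counts sorts descending, so the
    # first entry is the most common value's share).
    flagged = []
    for col in df.columns:
        top_share = df[col].value_counts(normalize=True, dropna=False).iloc[0]
        if top_share >= threshold:
            flagged.append(col)
    return flagged
```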
Now, we clean the price and odometer columns and convert them to the int data type.
autos['price'] = (autos['price']
                  .str.replace("$", "", regex=False)
                  .str.replace(",", "", regex=False)
                  .astype(int)
                  )
print("PRICE - FIRST FIVE VALUES")
autos['price'].head()
PRICE - FIRST FIVE VALUES
0    5000
1    8500
2    8990
3    4350
4    1350
Name: price, dtype: int64
autos['odometer'] = (autos['odometer']
                     .str.replace("km", "", regex=False)
                     .str.replace(",", "", regex=False)
                     .astype(int)
                     )
# Renaming the column "odometer" to "odometer_km"
autos.rename({'odometer':'odometer_km'}, axis=1, inplace=True)
print("ODOMETER (KM) - FIRST FIVE VALUES")
autos['odometer_km'].head()
ODOMETER (KM) - FIRST FIVE VALUES
0    150000
1    150000
2     70000
3     70000
4    150000
Name: odometer_km, dtype: int64
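As an aside, the chained replaces could also be written as a single regex extraction; a minimal sketch on sample values in the same "150,000km" format:

```python
import pandas as pd

# Strip the thousands separator, then pull out the digits with a regex.
s = pd.Series(['150,000km', '70,000km', '5,000km'])
km = s.str.replace(',', '', regex=False).str.extract(r'(\d+)')[0].astype(int)
```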
Before continuing, we'll implement some functions that allow us to have a better view of the values. This will make it easier for us to explore and analyze the data.
# Returns the number (as a string) in thousands-separated format
def sep_format(value):
    return '{:,}'.format(value)

# Returns the number (as a string) rounded to "n" decimal places, with thousands separators
def float_format(value, n=2):
    value = round(value, n)
    return '{:,}'.format(value)
# This function will make it easier for us to get the details of a particular column
def get_col_details(df, col_name, round_vals=2, show_unique_num=True, show_describe=True,
                    show_val_counts=True, sort_index=None, sort_values=None, normalize=False,
                    ascending=False, bins=None, dropna=True, return_val_counts=False):
    unique_vals = df[col_name].unique().shape[0]  # Gets the number of unique values
    # Removes the underscores from the column name to use it as the results title
    col = col_name.replace('_', ' ').title()
    # Checks if we want to show the number of unique values
    if show_unique_num:
        print(col, '- Unique Values: {:,}'.format(unique_vals), '\n')
    # Checks if we want to show the statistics of the column
    if show_describe:
        col_desc = df[col_name].describe()
        # Formats the results
        col_desc = col_desc.apply(float_format, args=[round_vals])
        print(col + ' - Statistics:', col_desc, sep='\n')
    # Checks if we want to show the frequency table of the column
    if show_val_counts:
        col_val_counts = df[col_name].value_counts(normalize=normalize, ascending=ascending,
                                                   bins=bins, dropna=dropna)
        title_freq = 'Frequency'
        # Checks if we want to show the frequencies as percentages
        if normalize:
            col_val_counts *= 100  # Converts each proportion to a percentage
            title_freq += ' (%)'
        # Sorts the index (ascending if True, descending if False)
        if sort_index is not None:
            col_val_counts = col_val_counts.sort_index(ascending=sort_index)
        # Sorts the values (ascending if True, descending if False)
        if sort_values is not None:
            col_val_counts = col_val_counts.sort_values(ascending=sort_values)
        # Gets the indices of the table and converts them to a Series
        index_series = pd.Series(col_val_counts.index)
        # If the indices are of float data type, rounds them and applies the thousands format
        if index_series.dtype == float:
            index_series = index_series.apply(float_format, args=[round_vals])
        # If the indices are of int data type, applies the thousands format
        elif index_series.dtype == int:
            index_series = index_series.apply(sep_format)
        # Gets the values of the table and converts them to a Series
        freq_series = pd.Series(col_val_counts.values)
        total_head, total_tail = 0, 0
        n = None
        # If there are 40 or fewer unique values, sums all the values in the table
        if unique_vals <= 40:
            total_head = freq_series.sum()
        # If there are 80 or fewer unique values, sums the first and last 20 values in the table
        elif unique_vals <= 80:
            n = 20
            total_head = freq_series.head(20).sum()
            total_tail = freq_series.tail(20).sum()
        # If there are more than 80 unique values, sums the first and last 30 values in the table
        else:
            n = 30
            total_head = freq_series.head(30).sum()
            total_tail = freq_series.tail(30).sum()
        # If the frequencies will be returned, just rounds the values without applying any format
        if return_val_counts:
            if freq_series.dtype == float:
                freq_series = freq_series.round(decimals=round_vals)
        else:  # Otherwise, rounds the values and applies the thousands format (as strings)
            if freq_series.dtype == float:
                freq_series = freq_series.apply(float_format, args=[round_vals])
            elif freq_series.dtype == int:
                freq_series = freq_series.apply(sep_format)
        # Converts the indices and frequencies to a DataFrame
        dfreq = pd.DataFrame({col: index_series, title_freq: freq_series})
        # If "n" is None, prints the entire table
        if n is None:
            print('', dfreq, '-' * 45,
                  'Total frequency: {}'.format(float_format(total_head, round_vals)), sep='\n')
        else:  # Otherwise, prints head(n) and tail(n)
            freq_head = dfreq.head(n)
            print('\nFIRST {} VALUES'.format(n), freq_head, '-' * 45, sep='\n')
            print('Total frequency:', float_format(total_head, round_vals), '\n')
            freq_tail = dfreq.tail(n)
            print('\nLAST {} VALUES'.format(n), freq_tail, '-' * 45, sep='\n')
            print('Total frequency:', float_format(total_tail, round_vals), '\n')
        # Returns the table(s) if requested
        if return_val_counts:
            if n is None:
                return dfreq
            else:
                return freq_head, freq_tail
Now, we start by exploring the values of the column odometer_km
get_col_details(autos, 'odometer_km')
Odometer Km - Unique Values: 13 Odometer Km - Statistics: count 50,000.0 mean 125,732.7 std 40,042.21 min 5,000.0 25% 125,000.0 50% 150,000.0 75% 150,000.0 max 150,000.0 Name: odometer_km, dtype: object Odometer Km Frequency 0 150,000 32,424 1 125,000 5,170 2 100,000 2,169 3 90,000 1,757 4 80,000 1,436 5 70,000 1,230 6 60,000 1,164 7 50,000 1,027 8 5,000 967 9 40,000 819 10 30,000 789 11 20,000 784 12 10,000 264 --------------------------------------------- Total frequency: 50,000
We can see that cars with higher mileage predominate.

Next, we'll review the values of the price column:
get_col_details(autos, 'price', sort_index=False)
Price - Unique Values: 2,357 Price - Statistics: count 50,000.0 mean 9,840.04 std 481,104.38 min 0.0 25% 1,100.0 50% 2,950.0 75% 7,200.0 max 99,999,999.0 Name: price, dtype: object FIRST 30 VALUES Price Frequency 0 99,999,999 1 1 27,322,222 1 2 12,345,678 3 3 11,111,111 2 4 10,000,000 1 5 3,890,000 1 6 1,300,000 1 7 1,234,566 1 8 999,999 2 9 999,990 1 10 350,000 1 11 345,000 1 12 299,000 1 13 295,000 1 14 265,000 1 15 259,000 1 16 250,000 1 17 220,000 1 18 198,000 1 19 197,000 1 20 194,000 1 21 190,000 1 22 180,000 1 23 175,000 1 24 169,999 1 25 169,000 1 26 163,991 1 27 163,500 1 28 155,000 1 29 151,990 1 --------------------------------------------- Total frequency: 34 LAST 30 VALUES Price Frequency 2327 66 1 2328 65 5 2329 60 9 2330 59 1 2331 55 2 2332 50 49 2333 49 4 2334 47 1 2335 45 4 2336 40 6 2337 35 1 2338 30 7 2339 29 1 2340 25 5 2341 20 4 2342 18 1 2343 17 3 2344 15 2 2345 14 1 2346 13 2 2347 12 3 2348 11 2 2349 10 7 2350 9 1 2351 8 1 2352 5 2 2353 3 1 2354 2 3 2355 1 156 2356 0 1,421 --------------------------------------------- Total frequency: 1,706
The results show several cars listed under \$50, including 1,421 listed at \$0, which represent 2.84% of all cars listed. We'll discard these rows later.

At the high end of the price list, 14 listings are priced at roughly \$1 million or more. Below those records, the values become more uniform.

We know that eBay is an auction website, so there may be items where the opening bid is \$1. We'll keep the \$1 items, but we'll remove everything above \$350,000, since prices increase steadily up to that number and then jump to unrealistic values.
autos = autos[autos['price'].between(1, 351000)]
get_col_details(autos, 'price', show_val_counts=False)
Price - Unique Values: 2,346 Price - Statistics: count 48,565.0 mean 5,888.94 std 9,059.85 min 1.0 25% 1,200.0 50% 3,000.0 75% 7,490.0 max 350,000.0 Name: price, dtype: object
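The 1-350,000 bounds above were chosen by inspecting the sorted prices. A more mechanical alternative is trimming by quantiles; a sketch (`trim_by_quantiles` is a hypothetical helper, and the 1%/99% cutoffs are an assumption, not the notebook's choice):

```python
import pandas as pd

def trim_by_quantiles(series, lower=0.01, upper=0.99):
    # Keep only values between the given lower and upper quantiles.
    lo, hi = series.quantile([lower, upper])
    return series[series.between(lo, hi)]
```

A quantile rule adapts to the data automatically, but unlike the manual cutoffs it would also drop legitimate extreme listings.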
There are 5 columns representing date values. Some of these columns were created by the crawler, others come from the website itself.
- date_crawled: added by the crawler
- last_seen: added by the crawler
- ad_created: from the website
- registration_month: from the website
- registration_year: from the website

The date_crawled, last_seen and ad_created columns are stored as strings, while registration_month and registration_year are stored as the int data type.

We start by exploring the date_crawled, last_seen and ad_created columns.
autos[['date_crawled', 'ad_created', 'last_seen']][:5]
date_crawled | ad_created | last_seen | |
---|---|---|---|
0 | 2016-03-26 17:47:46 | 2016-03-26 00:00:00 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | 2016-04-04 00:00:00 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | 2016-03-26 00:00:00 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | 2016-03-12 00:00:00 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | 2016-04-01 00:00:00 | 2016-04-01 14:38:50 |
We note that the first 10 characters represent the date (e.g., 2016-03-12). To understand the date range, we'll extract only the date values, generate a distribution, and then sort it by index or frequency.

We'll start with the date_crawled column.
autos['only_date_crawled'] = autos['date_crawled'].str[:10]
get_col_details(autos, 'only_date_crawled', sort_index=True, show_describe=False, normalize=True, dropna=False)
Only Date Crawled - Unique Values: 34 Only Date Crawled Frequency (%) 0 2016-03-05 2.53 1 2016-03-06 1.4 2 2016-03-07 3.6 3 2016-03-08 3.33 4 2016-03-09 3.31 5 2016-03-10 3.22 6 2016-03-11 3.26 7 2016-03-12 3.69 8 2016-03-13 1.57 9 2016-03-14 3.65 10 2016-03-15 3.43 11 2016-03-16 2.96 12 2016-03-17 3.16 13 2016-03-18 1.29 14 2016-03-19 3.48 15 2016-03-20 3.79 16 2016-03-21 3.74 17 2016-03-22 3.3 18 2016-03-23 3.22 19 2016-03-24 2.93 20 2016-03-25 3.16 21 2016-03-26 3.22 22 2016-03-27 3.11 23 2016-03-28 3.49 24 2016-03-29 3.41 25 2016-03-30 3.37 26 2016-03-31 3.18 27 2016-04-01 3.37 28 2016-04-02 3.55 29 2016-04-03 3.86 30 2016-04-04 3.65 31 2016-04-05 1.31 32 2016-04-06 0.32 33 2016-04-07 0.14 --------------------------------------------- Total frequency: 100.0
The site was crawled every day from March 5 through the first week of April 2016. The distribution of listings crawled each day is roughly uniform.
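String slicing works here because the timestamps are ISO-formatted, but the same daily distribution can also be computed with proper datetimes; a minimal sketch on a few sample timestamps:

```python
import pandas as pd

# Parse full timestamps, keep only the calendar date, then build the
# percentage distribution sorted chronologically.
s = pd.Series(['2016-03-26 17:47:46', '2016-04-04 13:38:56', '2016-03-26 18:57:24'])
dates = pd.to_datetime(s).dt.date
dist = dates.value_counts(normalize=True).sort_index() * 100
```

Parsing to datetimes would also make later arithmetic (e.g. ad lifetime = last_seen - ad_created) straightforward.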
Next, we'll review the column last_seen
autos['last_seen_date'] = autos['last_seen'].str[:10]
get_col_details(autos, 'last_seen_date', sort_index=True, show_describe=False, normalize=True, dropna=False)
Last Seen Date - Unique Values: 34 Last Seen Date Frequency (%) 0 2016-03-05 0.11 1 2016-03-06 0.43 2 2016-03-07 0.54 3 2016-03-08 0.74 4 2016-03-09 0.96 5 2016-03-10 1.07 6 2016-03-11 1.24 7 2016-03-12 2.38 8 2016-03-13 0.89 9 2016-03-14 1.26 10 2016-03-15 1.59 11 2016-03-16 1.65 12 2016-03-17 2.81 13 2016-03-18 0.74 14 2016-03-19 1.58 15 2016-03-20 2.07 16 2016-03-21 2.06 17 2016-03-22 2.14 18 2016-03-23 1.85 19 2016-03-24 1.98 20 2016-03-25 1.92 21 2016-03-26 1.68 22 2016-03-27 1.56 23 2016-03-28 2.09 24 2016-03-29 2.23 25 2016-03-30 2.48 26 2016-03-31 2.38 27 2016-04-01 2.28 28 2016-04-02 2.49 29 2016-04-03 2.52 30 2016-04-04 2.45 31 2016-04-05 12.48 32 2016-04-06 22.18 33 2016-04-07 13.19 --------------------------------------------- Total frequency: 100.0
We note that the last three days contain a disproportionate number of "last seen" values, approximately 8x (on average) those of the previous days. These spikes are most likely related to the crawling period ending and don't necessarily indicate massive car sales.
Now, we'll review the column ad_created
autos['ad_created_date'] = autos['ad_created'].str[:10]
get_col_details(autos, 'ad_created_date', round_vals=4, sort_index=True, show_describe=False, normalize=True, dropna=False)
Ad Created Date - Unique Values: 76 FIRST 20 VALUES Ad Created Date Frequency (%) 0 2015-06-11 0.0021 1 2015-08-10 0.0021 2 2015-09-09 0.0021 3 2015-11-10 0.0021 4 2015-12-05 0.0021 5 2015-12-30 0.0021 6 2016-01-03 0.0021 7 2016-01-07 0.0021 8 2016-01-10 0.0041 9 2016-01-13 0.0021 10 2016-01-14 0.0021 11 2016-01-16 0.0021 12 2016-01-22 0.0021 13 2016-01-27 0.0062 14 2016-01-29 0.0021 15 2016-02-01 0.0021 16 2016-02-02 0.0041 17 2016-02-05 0.0041 18 2016-02-07 0.0021 19 2016-02-08 0.0021 --------------------------------------------- Total frequency: 0.0515 LAST 20 VALUES Ad Created Date Frequency (%) 56 2016-03-19 3.3687 57 2016-03-20 3.7949 58 2016-03-21 3.7579 59 2016-03-22 3.2801 60 2016-03-23 3.206 61 2016-03-24 2.928 62 2016-03-25 3.1751 63 2016-03-26 3.2266 64 2016-03-27 3.0989 65 2016-03-28 3.4984 66 2016-03-29 3.4037 67 2016-03-30 3.3501 68 2016-03-31 3.1875 69 2016-04-01 3.3687 70 2016-04-02 3.5149 71 2016-04-03 3.8855 72 2016-04-04 3.6858 73 2016-04-05 1.1819 74 2016-04-06 0.3253 75 2016-04-07 0.1256 --------------------------------------------- Total frequency: 59.3637
Most of the ad creation dates fall within 1-2 months of the crawl date, but we did notice some older values, the oldest being around 9 months old.
Let's check the column registration_year
get_col_details(autos, 'registration_year', show_val_counts=False)
Registration Year - Unique Values: 95 Registration Year - Statistics: count 48,565.0 mean 2,004.76 std 88.64 min 1,000.0 25% 1,999.0 50% 2,004.0 75% 2,008.0 max 9,999.0 Name: registration_year, dtype: object
The registration year of the car tells us how old it is. However, we notice some implausible values: the minimum is 1000, long before cars were invented, and the maximum is 9999, many years in the future.

We'll consider only those cars whose registration years fall within 1900-2016, since cars began to be mass-produced at the beginning of the 20th century.
Before performing the filter, we determine what percentage of the dataset have invalid values in this column:
total_cars = autos.shape[0]
total_wrong_reg_year = total_cars - autos['registration_year'].between(1900, 2016).sum()
perc_wrong_reg_year = (total_wrong_reg_year / total_cars) * 100
print("Percentage of wrong data in 'registration_year' column: {:.2f}".format(perc_wrong_reg_year))
Percentage of wrong data in 'registration_year' column: 3.88
Since these records only represent 3.88% of our data, we'll delete these rows.
# Filters the data
autos = autos[autos['registration_year'].between(1900, 2016)]
get_col_details(autos, 'registration_year', round_vals=4, sort_index=True, show_describe=False, normalize=True)
Registration Year - Unique Values: 78 FIRST 20 VALUES Registration Year Frequency (%) 0 1,910 0.0107 1 1,927 0.0021 2 1,929 0.0021 3 1,931 0.0021 4 1,934 0.0043 5 1,937 0.0086 6 1,938 0.0021 7 1,939 0.0021 8 1,941 0.0043 9 1,943 0.0021 10 1,948 0.0021 11 1,950 0.0064 12 1,951 0.0043 13 1,952 0.0021 14 1,953 0.0021 15 1,954 0.0043 16 1,955 0.0043 17 1,956 0.0086 18 1,957 0.0043 19 1,958 0.0086 --------------------------------------------- Total frequency: 0.0878 LAST 20 VALUES Registration Year Frequency (%) 58 1,997 4.1794 59 1,998 5.062 60 1,999 6.206 61 2,000 6.7608 62 2,001 5.6468 63 2,002 5.3255 64 2,003 5.7818 65 2,004 5.7904 66 2,005 6.2895 67 2,006 5.7197 68 2,007 4.8778 69 2,008 4.745 70 2,009 4.4665 71 2,010 3.404 72 2,011 3.4768 73 2,012 2.8063 74 2,013 1.7202 75 2,014 1.4203 76 2,015 0.8397 77 2,016 2.6135 --------------------------------------------- Total frequency: 87.1318
We can see that most of the vehicles were registered for the first time in the last 20 years or so (87.13%).
brand_fq = get_col_details(autos, 'brand', round_vals=3, show_describe=False, normalize=True, return_val_counts=True)
Brand - Unique Values: 40 Brand Frequency (%) 0 volkswagen 21.126 1 bmw 11.004 2 opel 10.758 3 mercedes_benz 9.646 4 audi 8.657 5 ford 6.990 6 renault 4.715 7 peugeot 2.984 8 fiat 2.564 9 seat 1.827 10 skoda 1.641 11 nissan 1.527 12 mazda 1.519 13 smart 1.416 14 citroen 1.401 15 toyota 1.270 16 hyundai 1.003 17 sonstige_autos 0.981 18 volvo 0.915 19 mini 0.876 20 mitsubishi 0.823 21 honda 0.784 22 kia 0.707 23 alfa_romeo 0.664 24 porsche 0.613 25 suzuki 0.593 26 chevrolet 0.570 27 chrysler 0.351 28 dacia 0.263 29 daihatsu 0.251 30 jeep 0.227 31 subaru 0.214 32 land_rover 0.210 33 saab 0.165 34 jaguar 0.156 35 daewoo 0.150 36 trabant 0.139 37 rover 0.133 38 lancia 0.107 39 lada 0.058 --------------------------------------------- Total frequency: 100.0
We note that German manufacturers occupy the top five spots, accounting for 61.19% of overall listings. Volkswagen is the most popular brand, with roughly as many cars for sale as the next two brands combined.

We also note that many brands don't have a significant percentage of listings, so we'll limit our analysis to brands that represent more than 5% of total listings.
common_brands = brand_fq[brand_fq['Frequency (%)'] > 5]['Brand']
print('Principal Brands:\n', common_brands, sep='\n')
Principal Brands: 0 volkswagen 1 bmw 2 opel 3 mercedes_benz 4 audi 5 ford Name: Brand, dtype: object
Before continuing, we'll implement a function that will allow us to obtain the average value of a column based on a specific field.
def get_col_mean_by(df, col, mean_col, unique_values):
    dict_mean_vals = {}
    for value in unique_values:
        only_value = df[df[col] == value]
        mean = only_value[mean_col].mean()
        dict_mean_vals[value] = int(mean)
    # Sorts the dictionary by values
    dict_mean_vals = dict(sorted(dict_mean_vals.items(), key=lambda x: x[1], reverse=True))
    # Converts the dictionary to a Series (applies the thousands format)
    series = pd.Series(dict_mean_vals).apply(sep_format)
    return series
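For reference, the loop in this function is equivalent to a pandas groupby; a sketch on toy data (the sample brands and prices are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'brand': ['bmw', 'bmw', 'ford'],
                   'price': [8000, 9000, 4000]})

# Group by brand, average the price, truncate to int, sort descending.
mean_prices = (df.groupby('brand')['price']
                 .mean()
                 .astype(int)
                 .sort_values(ascending=False))
```

The groupby version also avoids a separate filtering pass per unique value.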
Now with the function already implemented, we can get the average prices for the main brands.
brand_mean_prices = get_col_mean_by(autos, 'brand', 'price', common_brands.values)
print('Top Brands - Mean Price:\n', *brand_mean_prices.items(), sep='\n')
Top Brands - Mean Price: ('audi', '9,336') ('mercedes_benz', '8,628') ('bmw', '8,332') ('volkswagen', '5,402') ('ford', '3,749') ('opel', '2,975')
Of the top brands, we can classify the results based on price: audi, mercedes_benz and bmw are the most expensive; ford and opel are the least expensive; and volkswagen sits in between, which may help explain its popularity.

Next, we'll explore the average mileage for the major car brands.
brand_mean_mileages = get_col_mean_by(autos, 'brand', 'odometer_km', common_brands.values)
print('Top Brands - Mean Mileage:\n', *brand_mean_mileages.items(), sep='\n')
Top Brands - Mean Mileage: ('bmw', '132,572') ('mercedes_benz', '130,788') ('opel', '129,310') ('audi', '129,157') ('volkswagen', '128,707') ('ford', '124,266')
We note that the German brands occupy the five highest mileage values, averaging 130,107 km; ford's average is 4.49% below that figure.
Now, let's see if within the major brands there is any relationship between average mileage and average price.
brand_mileage_info = pd.DataFrame(brand_mean_mileages, columns=['Mean Mileage'])
brand_mileage_info['Mean Price'] = brand_mean_prices
brand_mileage_info
Mean Mileage | Mean Price | |
---|---|---|
bmw | 132,572 | 8,332 |
mercedes_benz | 130,788 | 8,628 |
opel | 129,310 | 2,975 |
audi | 129,157 | 9,336 |
volkswagen | 128,707 | 5,402 |
ford | 124,266 | 3,749 |
Within the list of main brands there is a slight tendency for the more expensive vehicles to have higher mileage, although opel (high mileage, low price) is a notable exception.

Next, we'll explore the values of the model column.
get_col_details(autos, 'model', sort_values=True, show_describe=False)
Model - Unique Values: 245 FIRST 30 VALUES Model Frequency 0 rangerover 1 1 b_max 1 2 200 1 3 i3 1 4 kappa 2 5 charade 3 6 samara 3 7 v60 3 8 materia 4 9 croma 4 10 145 4 11 9000 5 12 range_rover_evoque 5 13 delta 5 14 lodgy 5 15 crossfire 6 16 exeo 6 17 amarok 6 18 nubira 8 19 90 8 20 r19 9 21 range_rover 9 22 move 9 23 mii 10 24 300c 10 25 lybra 10 26 lanos 10 27 gl 10 28 terios 10 29 sirion 11 --------------------------------------------- Total frequency: 179 LAST 30 VALUES Model Frequency 214 megane 335 215 ka 349 216 zafira 394 217 punto 415 218 touran 433 219 clio 473 220 mondeo 479 221 3_reihe 486 222 1er 521 223 a_klasse 539 224 vectra 544 225 fortwo 550 226 2_reihe 600 227 twingo 615 228 transporter 674 229 fiesta 722 230 focus 762 231 a6 797 232 a3 825 233 e_klasse 958 234 5er 1,132 235 c_klasse 1,136 236 a4 1,231 237 astra 1,348 238 passat 1,349 239 corsa 1,592 240 polo 1,609 241 3er 2,615 242 andere 3,373 243 golf 3,707 --------------------------------------------- Total frequency: 30,563
Now, we'll create a new column whose values are the concatenation of the brand and model columns. Then we'll examine its values.
autos['brand_model'] = autos['brand'] + ' / ' + autos['model']
get_col_details(autos, 'brand_model', sort_values=True, show_describe=False)
Brand Model - Unique Values: 291 FIRST 30 VALUES Brand Model Frequency 0 bmw / i3 1 1 ford / b_max 1 2 rover / rangerover 1 3 audi / 200 1 4 rover / discovery 1 5 rover / freelander 2 6 lancia / kappa 2 7 dacia / andere 2 8 daihatsu / charade 3 9 volvo / v60 3 10 lada / samara 3 11 fiat / croma 4 12 daihatsu / materia 4 13 alfa_romeo / 145 4 14 land_rover / andere 4 15 land_rover / range_rover_evoque 5 16 saab / 9000 5 17 dacia / lodgy 5 18 lancia / delta 5 19 volkswagen / amarok 6 20 seat / exeo 6 21 chrysler / crossfire 6 22 lada / andere 7 23 smart / andere 7 24 daewoo / nubira 8 25 audi / 90 8 26 trabant / andere 8 27 renault / r19 9 28 daihatsu / move 9 29 daewoo / andere 9 --------------------------------------------- Total frequency: 139 LAST 30 VALUES Brand Model Frequency 260 seat / ibiza 328 261 renault / megane 335 262 ford / ka 349 263 opel / zafira 394 264 fiat / punto 415 265 volkswagen / touran 433 266 mercedes_benz / andere 439 267 renault / clio 473 268 ford / mondeo 479 269 bmw / 1er 521 270 mercedes_benz / a_klasse 539 271 opel / vectra 544 272 smart / fortwo 550 273 peugeot / 2_reihe 600 274 renault / twingo 615 275 volkswagen / transporter 674 276 ford / fiesta 722 277 ford / focus 762 278 audi / a6 797 279 audi / a3 825 280 mercedes_benz / e_klasse 958 281 bmw / 5er 1,132 282 mercedes_benz / c_klasse 1,136 283 audi / a4 1,231 284 opel / astra 1,348 285 volkswagen / passat 1,349 286 opel / corsa 1,592 287 volkswagen / polo 1,609 288 bmw / 3er 2,615 289 volkswagen / golf 3,707 --------------------------------------------- Total frequency: 27,471
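As an aside, the brand/model frequencies could also be computed directly from the two original columns, without the concatenated helper column; a sketch on toy data:

```python
import pandas as pd

df = pd.DataFrame({'brand': ['volkswagen', 'volkswagen', 'bmw'],
                   'model': ['golf', 'golf', '3er']})

# Count each (brand, model) pair directly, most frequent first.
combo_counts = df.groupby(['brand', 'model']).size().sort_values(ascending=False)
```

This yields a MultiIndex of (brand, model) pairs rather than a single combined string.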
The list of the last 30 values represents 58.73% of the total cars listed. The volkswagen / golf combination is the most common, 41.76% more frequent than the second-place combination, bmw / 3er.
The last 30 values have an average frequency of 916. For our purpose, we'll only consider the values that exceed this average; conveniently, the last 10 values in the list are exactly the ones that meet this criterion.
We'll skip our previously implemented function here, since we only need the 10 highest values.
# value_counts() already sorts in descending order
top_brand_model = autos['brand_model'].value_counts().head(10)
print('Top Ten - Most Common [Brand / Model] Cars:\n')
for position, bm in enumerate(top_brand_model.index, start=1):
    print(position, bm)
Top Ten - Most Common [Brand / Model] Cars: 1 volkswagen / golf 2 bmw / 3er 3 volkswagen / polo 4 opel / corsa 5 volkswagen / passat 6 opel / astra 7 audi / a4 8 mercedes_benz / c_klasse 9 bmw / 5er 10 mercedes_benz / e_klasse
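Instead of hardcoding the top 10, the average-frequency criterion itself can be expressed directly as a filter. A sketch on a toy frequency series standing in for `autos['brand_model'].value_counts()`:

```python
import pandas as pd

# Toy frequency counts standing in for
# autos['brand_model'].value_counts() from the notebook
counts = pd.Series({'volkswagen / golf': 3707, 'bmw / 3er': 2615,
                    'volkswagen / polo': 1609, 'seat / exeo': 6})

# Keep only the combinations whose frequency exceeds the mean frequency
threshold = counts.mean()
top = counts[counts > threshold].sort_values(ascending=False)
print(top.index.tolist())
```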
We'll divide the values in the odometer_km column into groups, using Sturges' rule to calculate the number of classes that our frequency table should contain.
# Total number of rows
n = autos.shape[0]
# Calculate the range of the data
min_odometer = autos['odometer_km'].min()
max_odometer = autos['odometer_km'].max()
range_ = max_odometer - min_odometer
# Sturges' rule: k = 1 + 3.32 * log10(n)
k = 1 + 3.32 * np.log10(n)
k = round(k)
# Approximate size of each class interval
h = range_ / k
h = round(h, -1)  # Round to the nearest ten
# Build the list with the limits of each class
i = 0
interval = min_odometer
odometer_intervals = []
while i <= k:
    odometer_intervals.append(int(interval))
    interval += h
    i += 1
print('Odometer (km) - Limits for the bins:\n', *odometer_intervals, sep='\n')
Odometer (km) - Limits for the bins: 5000 13530 22060 30590 39120 47650 56180 64710 73240 81770 90300 98830 107360 115890 124420 132950 141480 150010
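As an aside, NumPy can compute Sturges bins directly via `np.histogram_bin_edges(..., bins='sturges')`. Note it uses the exact base-2 form of the rule (bin count derived from log2(n) + 1) rather than the base-10 approximation above, so its edges differ slightly. A sketch on toy data:

```python
import numpy as np

# Toy odometer readings standing in for autos['odometer_km']
odometer = np.array([5000, 40000, 90000, 125000, 150000] * 20)

# 'sturges' derives the bin count from log2(n) + 1
edges = np.histogram_bin_edges(odometer, bins='sturges')
print(len(edges) - 1, 'bins')
```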
With the limits list obtained, we can now build the frequency table.
mileage_group = get_col_details(autos, 'odometer_km', sort_index=True, show_unique_num=False,
show_describe=False, bins=odometer_intervals, return_val_counts=True)
Odometer Km Frequency 0 (4999.999, 13530.0] 1026 1 (13530.0, 22060.0] 742 2 (22060.0, 30590.0] 760 3 (30590.0, 39120.0] 0 4 (39120.0, 47650.0] 797 5 (47650.0, 56180.0] 993 6 (56180.0, 64710.0] 1128 7 (64710.0, 73240.0] 1187 8 (73240.0, 81770.0] 1375 9 (81770.0, 90300.0] 1673 10 (90300.0, 98830.0] 0 11 (98830.0, 107360.0] 2058 12 (107360.0, 115890.0] 0 13 (115890.0, 124420.0] 0 14 (124420.0, 132950.0] 4857 15 (132950.0, 141480.0] 0 16 (141480.0, 150010.0] 30085 --------------------------------------------- Total frequency: 46,681
As a result, we got a DataFrame with the frequency table. Next, we'll get the lists of average mileage and average prices for each class and add them to the DataFrame.
avg_prices = []
avg_mileages = []
for index in range(len(odometer_intervals) - 1):
    lower_limit = odometer_intervals[index]
    upper_limit = odometer_intervals[index + 1]
    only_class = autos[(autos['odometer_km'] >= lower_limit) & (autos['odometer_km'] < upper_limit)]
    mean_price = only_class['price'].mean()
    mean_mileage = only_class['odometer_km'].mean()
    # Empty classes produce NaN means; treat them as 0
    if pd.isna(mean_price):
        mean_price = 0
    if pd.isna(mean_mileage):
        mean_mileage = 0
    avg_prices.append(int(mean_price))
    avg_mileages.append(int(mean_mileage))
# Convert the lists into Series
avg_prices_series = pd.Series(avg_prices)
avg_mileages_series = pd.Series(avg_mileages)
# Format the Series and add them to the frequency table
mileage_group['AVG Mileage'] = avg_mileages_series.apply(sep_format)
mileage_group['AVG Price'] = avg_prices_series.apply(sep_format)
# Format the frequency values
mileage_group['Frequency'] = mileage_group['Frequency'].apply(sep_format)
print(mileage_group)
Odometer Km Frequency AVG Mileage AVG Price 0 (4999.999, 13530.0] 1,026 6,174 11,616 1 (13530.0, 22060.0] 742 20,000 18,448 2 (22060.0, 30590.0] 760 30,000 16,608 3 (30590.0, 39120.0] 0 0 0 4 (39120.0, 47650.0] 797 40,000 15,499 5 (47650.0, 56180.0] 993 50,000 13,812 6 (56180.0, 64710.0] 1,128 60,000 12,385 7 (64710.0, 73240.0] 1,187 70,000 10,927 8 (73240.0, 81770.0] 1,375 80,000 9,721 9 (81770.0, 90300.0] 1,673 90,000 8,465 10 (90300.0, 98830.0] 0 0 0 11 (98830.0, 107360.0] 2,058 100,000 8,132 12 (107360.0, 115890.0] 0 0 0 13 (115890.0, 124420.0] 0 0 0 14 (124420.0, 132950.0] 4,857 125,000 6,214 15 (132950.0, 141480.0] 0 0 0 16 (141480.0, 150010.0] 30,085 150,000 3,767
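The per-class averages can also be obtained in one pass with `pd.cut` plus `groupby`, avoiding the manual loop. A sketch on a toy DataFrame with the notebook's column names:

```python
import pandas as pd

# Toy listings standing in for the autos DataFrame
autos_toy = pd.DataFrame({
    'odometer_km': [10000, 20000, 125000, 150000, 150000],
    'price': [12000, 18000, 6000, 4000, 3500],
})

# Assign each row to a class, then aggregate per class;
# observed=False keeps empty classes in the result
bins = [5000, 50000, 100000, 150000]
classes = pd.cut(autos_toy['odometer_km'], bins=bins, include_lowest=True)
summary = autos_toy.groupby(classes, observed=False)['price'].agg(['count', 'mean'])
print(summary)
```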
With the exception of the second class, the average mileage is inversely related to the average price: as the average mileage increases, the average price decreases.
This is expected in the used-car market, since cars with higher mileage usually sell for a lower price.
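This inverse relationship can be quantified with `Series.corr` (Pearson by default); in the notebook, `autos['odometer_km'].corr(autos['price'])` would give the actual figure. A sketch on toy mileage/price pairs mirroring the pattern in the table above:

```python
import pandas as pd

# Toy pairs loosely following the table above
mileage = pd.Series([20000, 40000, 60000, 90000, 125000, 150000])
price = pd.Series([18448, 15499, 12385, 8465, 6214, 3767])

# Pearson correlation: values near -1 indicate a strong inverse relationship
r = mileage.corr(price)
print(round(r, 3))
```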
Let's start by exploring the values of the unrepaired_damage column.
get_col_details(autos, 'unrepaired_damage', show_describe=False, dropna=False)
Unrepaired Damage - Unique Values: 3 Unrepaired Damage Frequency 0 nein 33,834 1 NaN 8,307 2 ja 4,540 --------------------------------------------- Total frequency: 46,681
We see that, in addition to the null values, the remaining values are in German. We'll translate these values and replace them.
autos['unrepaired_damage'] = (autos['unrepaired_damage']
                              .str.replace('nein', 'No')
                              .str.replace('ja', 'Yes'))
get_col_details(autos, 'unrepaired_damage', show_describe=False, dropna=False)
Unrepaired Damage - Unique Values: 3 Unrepaired Damage Frequency 0 No 33,834 1 NaN 8,307 2 Yes 4,540 --------------------------------------------- Total frequency: 46,681
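An equivalent translation can be done in a single call with `Series.replace` and a mapping dict, which matches whole values and leaves NaN entries untouched. A sketch on a toy Series:

```python
import pandas as pd

# Toy stand-in for the unrepaired_damage column
damage = pd.Series(['nein', 'ja', None, 'nein'])

# A dict maps whole values; NaN entries pass through unchanged
translated = damage.replace({'nein': 'No', 'ja': 'Yes'})
print(translated.tolist())
```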
Now, we calculate the average price for each value.
For our purpose we'll ignore the null values.
unrepaired_damage_avg_price = get_col_mean_by(autos, 'unrepaired_damage', 'price', ['No', 'Yes'])
unrep_price_info = pd.DataFrame(unrepaired_damage_avg_price, columns=['AVG Price'])
unrep_price_info.index.name = 'Unrepaired Damage'
unrep_price_info
Unrepaired Damage | AVG Price |
---|---|
No | 7,164 |
Yes | 2,241 |
As we expected, cars with unrepaired damage sell for considerably lower prices on average.
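Without the custom `get_col_mean_by` helper, the same comparison is a one-line `groupby`, which drops the NaN group by default, matching the choice made above. A sketch on toy data:

```python
import pandas as pd

# Toy stand-in for the autos DataFrame
autos_toy = pd.DataFrame({
    'unrepaired_damage': ['No', 'No', 'Yes', 'Yes', None],
    'price': [8000, 6000, 2500, 2000, 5000],
})

# groupby excludes the NaN group by default
avg_price = autos_toy.groupby('unrepaired_damage')['price'].mean()
print(avg_price)
```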