In this project, we'll be working on a dataset of used cars from eBay Kleinanzeigen, a classifieds section of the German eBay website.
The version of the dataset we are working with is a 50,000-row sample prepared by Dataquest, which also simulated a less-cleaned version of the data.
You can find the original dataset here.
The data dictionary provided with the data is as follows:

- dateCrawled - When this ad was first crawled. All field values are taken from this date.
- name - Name of the car.
- seller - Whether the seller is private or a dealer.
- offerType - The type of listing.
- price - The price on the ad to sell the car.
- abtest - Whether the listing is included in an A/B test.
- vehicleType - The vehicle type.
- yearOfRegistration - The year in which the car was first registered.
- gearbox - The transmission type.
- powerPS - The power of the car in PS.
- model - The car model name.
- kilometer - How many kilometers the car has driven.
- monthOfRegistration - The month in which the car was first registered.
- fuelType - What type of fuel the car uses.
- brand - The brand of the car.
- notRepairedDamage - If the car has damage which is not yet repaired.
- dateCreated - The date on which the eBay listing was created.
- nrOfPictures - The number of pictures in the ad.
- postalCode - The postal code for the location of the vehicle.
- lastSeen - When the crawler last saw this ad online.

Average mileage / price ratio on major car brands.
Main Brands | Mean Mileage (km) | Mean Price ($) |
---|---|---|
bmw | 132,572 | 8,332 |
mercedes_benz | 130,788 | 8,628 |
opel | 129,310 | 2,975 |
audi | 129,157 | 9,336 |
volkswagen | 128,707 | 5,402 |
ford | 124,266 | 3,749 |
Top 10 - Most common brand/model combinations.
Position | Brand | Model | Frequency (%) |
---|---|---|---|
1 | volkswagen | golf | 7.94 |
2 | bmw | 3er | 5.60 |
3 | volkswagen | polo | 3.45 |
4 | opel | corsa | 3.41 |
5 | volkswagen | passat | 2.89 |
6 | opel | astra | 2.89 |
7 | audi | a4 | 2.64 |
8 | mercedes_benz | c_klasse | 2.43 |
9 | bmw | 5er | 2.42 |
10 | mercedes_benz | e_klasse | 2.05 |
Average price based on data grouped by mileage.
Class N° | Odometer Interval (km) | Frequency | AVG Mileage (km) | AVG Price ($) |
---|---|---|---|---|
1 | (4999.999, 13530.0] | 1,026 | 6,174 | 11,616 |
2 | (13530.0, 22060.0] | 742 | 20,000 | 18,448 |
3 | (22060.0, 30590.0] | 760 | 30,000 | 16,608 |
4 | (30590.0, 39120.0] | 0 | 0 | 0 |
5 | (39120.0, 47650.0] | 797 | 40,000 | 15,499 |
6 | (47650.0, 56180.0] | 993 | 50,000 | 13,812 |
7 | (56180.0, 64710.0] | 1,128 | 60,000 | 12,385 |
8 | (64710.0, 73240.0] | 1,187 | 70,000 | 10,927 |
9 | (73240.0, 81770.0] | 1,375 | 80,000 | 9,721 |
10 | (81770.0, 90300.0] | 1,673 | 90,000 | 8,465 |
11 | (90300.0, 98830.0] | 0 | 0 | 0 |
12 | (98830.0, 107360.0] | 2,058 | 100,000 | 8,132 |
13 | (107360.0, 115890.0] | 0 | 0 | 0 |
14 | (115890.0, 124420.0] | 0 | 0 | 0 |
15 | (124420.0, 132950.0] | 4,857 | 125,000 | 6,214 |
16 | (132950.0, 141480.0] | 0 | 0 | 0 |
17 | (141480.0, 150010.0] | 30,085 | 150,000 | 3,767 |
Average price of cars with and without unrepaired damage.
Unrepaired Damage | AVG Price ($) |
---|---|
No | 7,164 |
Yes | 2,241 |
Let's start by importing the libraries we need and reading the dataset into pandas.
import numpy as np
import pandas as pd
autos = pd.read_csv('autos.csv', encoding='Latin-1')
autos.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offerType            50000 non-null  object
 4   price                50000 non-null  object
 5   abtest               50000 non-null  object
 6   vehicleType          44905 non-null  object
 7   yearOfRegistration   50000 non-null  int64
 8   gearbox              47320 non-null  object
 9   powerPS              50000 non-null  int64
 10  model                47242 non-null  object
 11  odometer             50000 non-null  object
 12  monthOfRegistration  50000 non-null  int64
 13  fuelType             45518 non-null  object
 14  brand                50000 non-null  object
 15  notRepairedDamage    40171 non-null  object
 16  dateCreated          50000 non-null  object
 17  nrOfPictures         50000 non-null  int64
 18  postalCode           50000 non-null  int64
 19  lastSeen             50000 non-null  object
dtypes: int64(5), object(15)
memory usage: 7.6+ MB
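The `encoding='Latin-1'` argument above is needed because the file isn't valid UTF-8. As an aside, that trial-and-error can be made explicit with a small helper (a hypothetical sketch, not part of the original notebook):

```python
import pandas as pd

def read_csv_with_fallback(path, encodings=('utf-8', 'latin-1', 'windows-1252')):
    # Try each encoding in order and return the first successful parse.
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc)
        except UnicodeDecodeError:
            continue
    raise ValueError('None of the tried encodings worked for ' + path)
```

Note that `latin-1` maps every possible byte, so a fallback list ending in it always terminates.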
autos.head()
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
Observations: the column names use camelCase rather than Python's preferred snake_case. We'll modify the column names to make working with the data easier. First, we print an array with the names of the columns in the dataset.
print(autos.columns)
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model', 'odometer', 'monthOfRegistration', 'fuelType', 'brand', 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object')
# Modifying column names
autos.columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'ab_test',
'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
'odometer', 'registration_month', 'fuel_type', 'brand',
'unrepaired_damage', 'ad_created', 'num_photos', 'postal_code',
'last_seen']
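The renaming above was done by hand. The purely mechanical camelCase-to-snake_case part could be sketched with a regex (a hypothetical helper; it can't produce semantic renames such as `yearOfRegistration` → `registration_year`, and initialisms like `powerPS` come out as `power_p_s`):

```python
import re

def camel_to_snake(name):
    # Insert an underscore before every uppercase letter (except at the
    # start of the string), then lowercase the whole name.
    return re.sub(r'(?<!^)(?=[A-Z])', '_', name).lower()
```

For example, `camel_to_snake('dateCrawled')` yields `'date_crawled'`.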
autos.head()
date_crawled | name | seller | offer_type | price | ab_test | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | num_photos | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
We explore the data to detect areas that need to be cleaned.
autos.describe(include='all')
date_crawled | name | seller | offer_type | price | ab_test | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | num_photos | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-03-10 15:36:24 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
Observations:
- The seller and offer_type columns are text columns in which almost all values are identical.
- num_photos seems to have all of its values equal to zero.
- The price and odometer columns should be cleaned and converted to the int data type.

Let's explore the seller, offer_type and num_photos columns.
autos['seller'].value_counts()
privat        49999
gewerblich        1
Name: seller, dtype: int64
autos['offer_type'].value_counts()
Angebot    49999
Gesuch         1
Name: offer_type, dtype: int64
autos['num_photos'].value_counts()
0    50000
Name: num_photos, dtype: int64
We observe that the num_photos column contains only the value 0. We'll drop this column, along with the other two that are dominated by a single value.
autos = autos.drop(['num_photos', 'seller', 'offer_type'], axis=1)
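The three columns were identified by inspecting `value_counts()` one at a time; the same check could be automated with a sketch like the following (`near_constant_columns` is a hypothetical helper, and the 0.99 threshold is an arbitrary choice):

```python
import pandas as pd

def near_constant_columns(df, threshold=0.99):
    # Flags columns in which a single value accounts for at least
    # `threshold` of all rows (value_counts sorts descending, so the
    # first entry is the most common value's share).
    flagged = []
    for col in df.columns:
        top_share = df[col].value_counts(normalize=True, dropna=False).iloc[0]
        if top_share >= threshold:
            flagged.append(col)
    return flagged
```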
Now, we clean the price and odometer columns and convert them to the int data type.
autos['price'] = (autos['price']
                  .str.replace("$", "", regex=False)
                  .str.replace(",", "", regex=False)
                  .astype(int)
                  )
print("PRICE - FIRST FIVE VALUES")
autos['price'].head()
PRICE - FIRST FIVE VALUES
0    5000
1    8500
2    8990
3    4350
4    1350
Name: price, dtype: int64
autos['odometer'] = (autos['odometer']
                     .str.replace("km", "", regex=False)
                     .str.replace(",", "", regex=False)
                     .astype(int)
                     )
# Renaming the column "odometer" to "odometer_km"
autos.rename({'odometer':'odometer_km'}, axis=1, inplace=True)
print("ODOMETER (KM) - FIRST FIVE VALUES")
autos['odometer_km'].head()
ODOMETER (KM) - FIRST FIVE VALUES
0    150000
1    150000
2     70000
3     70000
4    150000
Name: odometer_km, dtype: int64
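As an aside, the chained replaces could also be written as a single regex extraction; a minimal sketch on sample values in the same "150,000km" format:

```python
import pandas as pd

# Strip the thousands separator, then pull out the digits with a regex.
s = pd.Series(['150,000km', '70,000km', '5,000km'])
km = s.str.replace(',', '', regex=False).str.extract(r'(\d+)')[0].astype(int)
```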
Before continuing, we'll implement some functions that allow us to have a better view of the values. This will make it easier for us to explore and analyze the data.
# Returns the number (as a string) in thousands-separated format
def sep_format(value):
    return '{:,}'.format(value)

# Returns the number (as a string) rounded to "n" decimal places, with thousands separators
def float_format(value, n=2):
    value = round(value, n)
    return '{:,}'.format(value)
# This function will make it easier for us to get the details of a particular column
def get_col_details(df, col_name, round_vals=2, show_unique_num=True, show_describe=True,
                    show_val_counts=True, sort_index=None, sort_values=None, normalize=False,
                    ascending=False, bins=None, dropna=True, return_val_counts=False):
    unique_vals = df[col_name].unique().shape[0]  # Gets the number of unique values
    # Removes the underscores from the column name to use it as the results title
    col = col_name.replace('_', ' ').title()
    # Checks if we want to show the number of unique values
    if show_unique_num:
        print(col, '- Unique Values: {:,}'.format(unique_vals), '\n')
    # Checks if we want to show the statistics of the column
    if show_describe:
        col_desc = df[col_name].describe()
        # Formats the results
        col_desc = col_desc.apply(float_format, args=[round_vals])
        print(col + ' - Statistics:', col_desc, sep='\n')
    # Checks if we want to show the frequency table of the column
    if show_val_counts:
        col_val_counts = df[col_name].value_counts(normalize=normalize, ascending=ascending,
                                                   bins=bins, dropna=dropna)
        title_freq = 'Frequency'
        # Checks if we want to show the frequencies as percentages
        if normalize:
            col_val_counts *= 100  # Converts each proportion to a percentage
            title_freq += ' (%)'
        # Sorts the index (ascending if True, descending if False)
        if sort_index is not None:
            col_val_counts = col_val_counts.sort_index(ascending=sort_index)
        # Sorts the values (ascending if True, descending if False)
        if sort_values is not None:
            col_val_counts = col_val_counts.sort_values(ascending=sort_values)
        # Gets the indices of the table and converts them to a Series
        index_series = pd.Series(col_val_counts.index)
        # If the indices are of float data type, rounds them and applies the thousands format
        if index_series.dtype == float:
            index_series = index_series.apply(float_format, args=[round_vals])
        # If the indices are of int data type, applies the thousands format
        elif index_series.dtype == int:
            index_series = index_series.apply(sep_format)
        # Gets the values of the table and converts them to a Series
        freq_series = pd.Series(col_val_counts.values)
        total_head, total_tail = 0, 0
        n = None
        # If there are 40 or fewer unique values, sums all the values in the table
        if unique_vals <= 40:
            total_head = freq_series.sum()
        # If there are 80 or fewer unique values, sums the first and last 20 values in the table
        elif unique_vals <= 80:
            n = 20
            total_head = freq_series.head(20).sum()
            total_tail = freq_series.tail(20).sum()
        # If there are more than 80 unique values, sums the first and last 30 values in the table
        else:
            n = 30
            total_head = freq_series.head(30).sum()
            total_tail = freq_series.tail(30).sum()
        # If the frequencies will be returned, just rounds the values without applying any format
        if return_val_counts:
            if freq_series.dtype == float:
                freq_series = freq_series.round(decimals=round_vals)
        else:  # Otherwise, rounds the values and applies the thousands format (as strings)
            if freq_series.dtype == float:
                freq_series = freq_series.apply(float_format, args=[round_vals])
            elif freq_series.dtype == int:
                freq_series = freq_series.apply(sep_format)
        # Converts the indices and frequencies to a DataFrame
        dfreq = pd.DataFrame({col: index_series, title_freq: freq_series})
        # If "n" is None, prints the entire table
        if n is None:
            print('', dfreq, '-' * 45,
                  'Total frequency: {}'.format(float_format(total_head, round_vals)), sep='\n')
        else:  # Otherwise, prints head(n) and tail(n)
            freq_head = dfreq.head(n)
            print('\nFIRST {} VALUES'.format(n), freq_head, '-' * 45, sep='\n')
            print('Total frequency:', float_format(total_head, round_vals), '\n')
            freq_tail = dfreq.tail(n)
            print('\nLAST {} VALUES'.format(n), freq_tail, '-' * 45, sep='\n')
            print('Total frequency:', float_format(total_tail, round_vals), '\n')
        # Returns the table(s) if requested
        if return_val_counts:
            if n is None:
                return dfreq
            else:
                return freq_head, freq_tail
Now, we start by exploring the values of the column odometer_km
get_col_details(autos, 'odometer_km')
Odometer Km - Unique Values: 13 Odometer Km - Statistics: count 50,000.0 mean 125,732.7 std 40,042.21 min 5,000.0 25% 125,000.0 50% 150,000.0 75% 150,000.0 max 150,000.0 Name: odometer_km, dtype: object Odometer Km Frequency 0 150,000 32,424 1 125,000 5,170 2 100,000 2,169 3 90,000 1,757 4 80,000 1,436 5 70,000 1,230 6 60,000 1,164 7 50,000 1,027 8 5,000 967 9 40,000 819 10 30,000 789 11 20,000 784 12 10,000 264 --------------------------------------------- Total frequency: 50,000
We can see that cars with higher mileage predominate.

Next, we'll review the values of the price column:
get_col_details(autos, 'price', sort_index=False)
Price - Unique Values: 2,357 Price - Statistics: count 50,000.0 mean 9,840.04 std 481,104.38 min 0.0 25% 1,100.0 50% 2,950.0 75% 7,200.0 max 99,999,999.0 Name: price, dtype: object FIRST 30 VALUES Price Frequency 0 99,999,999 1 1 27,322,222 1 2 12,345,678 3 3 11,111,111 2 4 10,000,000 1 5 3,890,000 1 6 1,300,000 1 7 1,234,566 1 8 999,999 2 9 999,990 1 10 350,000 1 11 345,000 1 12 299,000 1 13 295,000 1 14 265,000 1 15 259,000 1 16 250,000 1 17 220,000 1 18 198,000 1 19 197,000 1 20 194,000 1 21 190,000 1 22 180,000 1 23 175,000 1 24 169,999 1 25 169,000 1 26 163,991 1 27 163,500 1 28 155,000 1 29 151,990 1 --------------------------------------------- Total frequency: 34 LAST 30 VALUES Price Frequency 2327 66 1 2328 65 5 2329 60 9 2330 59 1 2331 55 2 2332 50 49 2333 49 4 2334 47 1 2335 45 4 2336 40 6 2337 35 1 2338 30 7 2339 29 1 2340 25 5 2341 20 4 2342 18 1 2343 17 3 2344 15 2 2345 14 1 2346 13 2 2347 12 3 2348 11 2 2349 10 7 2350 9 1 2351 8 1 2352 5 2 2353 3 1 2354 2 3 2355 1 156 2356 0 1,421 --------------------------------------------- Total frequency: 1,706
The results show several cars listed under \$50, including 1,421 listed at \$0, which represent 2.84% of all cars listed. We'll discard these rows later.

At the high end of the price list, 14 listings are priced at roughly \$1 million or more. Below those records, the values become more uniform.

We know that eBay is an auction website, so there may be items where the opening bid is \$1. We'll keep the \$1 items, but we'll remove everything above \$350,000, since prices increase steadily up to that number and then jump to unrealistic values.
autos = autos[autos['price'].between(1, 351000)]
get_col_details(autos, 'price', show_val_counts=False)
Price - Unique Values: 2,346 Price - Statistics: count 48,565.0 mean 5,888.94 std 9,059.85 min 1.0 25% 1,200.0 50% 3,000.0 75% 7,490.0 max 350,000.0 Name: price, dtype: object
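The 1-350,000 bounds above were chosen by inspecting the sorted prices. A more mechanical alternative is trimming by quantiles; a sketch (`trim_by_quantiles` is a hypothetical helper, and the 1%/99% cutoffs are an assumption, not the notebook's choice):

```python
import pandas as pd

def trim_by_quantiles(series, lower=0.01, upper=0.99):
    # Keep only values between the given lower and upper quantiles.
    lo, hi = series.quantile([lower, upper])
    return series[series.between(lo, hi)]
```

A quantile rule adapts to the data automatically, but unlike the manual cutoffs it would also drop legitimate extreme listings.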
There are 5 columns representing date values. Some of these columns were created by the crawler, others come from the website itself.
- date_crawled: added by the crawler
- last_seen: added by the crawler
- ad_created: from the website
- registration_month: from the website
- registration_year: from the website

The date_crawled, last_seen and ad_created columns are stored as strings, while registration_month and registration_year are stored as the int data type.

We start by exploring the date_crawled, last_seen and ad_created columns.
autos[['date_crawled', 'ad_created', 'last_seen']][:5]
date_crawled | ad_created | last_seen | |
---|---|---|---|
0 | 2016-03-26 17:47:46 | 2016-03-26 00:00:00 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | 2016-04-04 00:00:00 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | 2016-03-26 00:00:00 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | 2016-03-12 00:00:00 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | 2016-04-01 00:00:00 | 2016-04-01 14:38:50 |
We note that the first 10 characters represent the date (e.g., 2016-03-12). To understand the date range, we'll extract only the date values, generate a distribution, and then sort it by index or frequency.

We'll start with the date_crawled column.
autos['only_date_crawled'] = autos['date_crawled'].str[:10]
get_col_details(autos, 'only_date_crawled', sort_index=True, show_describe=False, normalize=True, dropna=False)
Only Date Crawled - Unique Values: 34 Only Date Crawled Frequency (%) 0 2016-03-05 2.53 1 2016-03-06 1.4 2 2016-03-07 3.6 3 2016-03-08 3.33 4 2016-03-09 3.31 5 2016-03-10 3.22 6 2016-03-11 3.26 7 2016-03-12 3.69 8 2016-03-13 1.57 9 2016-03-14 3.65 10 2016-03-15 3.43 11 2016-03-16 2.96 12 2016-03-17 3.16 13 2016-03-18 1.29 14 2016-03-19 3.48 15 2016-03-20 3.79 16 2016-03-21 3.74 17 2016-03-22 3.3 18 2016-03-23 3.22 19 2016-03-24 2.93 20 2016-03-25 3.16 21 2016-03-26 3.22 22 2016-03-27 3.11 23 2016-03-28 3.49 24 2016-03-29 3.41 25 2016-03-30 3.37 26 2016-03-31 3.18 27 2016-04-01 3.37 28 2016-04-02 3.55 29 2016-04-03 3.86 30 2016-04-04 3.65 31 2016-04-05 1.31 32 2016-04-06 0.32 33 2016-04-07 0.14 --------------------------------------------- Total frequency: 100.0
The site was crawled every day from March 5 through the first week of April 2016. The distribution of listings crawled each day is roughly uniform.
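String slicing works here because the timestamps are ISO-formatted, but the same daily distribution can also be computed with proper datetimes; a minimal sketch on a few sample timestamps:

```python
import pandas as pd

# Parse full timestamps, keep only the calendar date, then build the
# percentage distribution sorted chronologically.
s = pd.Series(['2016-03-26 17:47:46', '2016-04-04 13:38:56', '2016-03-26 18:57:24'])
dates = pd.to_datetime(s).dt.date
dist = dates.value_counts(normalize=True).sort_index() * 100
```

Parsing to datetimes would also make later arithmetic (e.g. ad lifetime = last_seen - ad_created) straightforward.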
Next, we'll review the column last_seen
autos['last_seen_date'] = autos['last_seen'].str[:10]
get_col_details(autos, 'last_seen_date', sort_index=True, show_describe=False, normalize=True, dropna=False)
Last Seen Date - Unique Values: 34 Last Seen Date Frequency (%) 0 2016-03-05 0.11 1 2016-03-06 0.43 2 2016-03-07 0.54 3 2016-03-08 0.74 4 2016-03-09 0.96 5 2016-03-10 1.07 6 2016-03-11 1.24 7 2016-03-12 2.38 8 2016-03-13 0.89 9 2016-03-14 1.26 10 2016-03-15 1.59 11 2016-03-16 1.65 12 2016-03-17 2.81 13 2016-03-18 0.74 14 2016-03-19 1.58 15 2016-03-20 2.07 16 2016-03-21 2.06 17 2016-03-22 2.14 18 2016-03-23 1.85 19 2016-03-24 1.98 20 2016-03-25 1.92 21 2016-03-26 1.68 22 2016-03-27 1.56 23 2016-03-28 2.09 24 2016-03-29 2.23 25 2016-03-30 2.48 26 2016-03-31 2.38 27 2016-04-01 2.28 28 2016-04-02 2.49 29 2016-04-03 2.52 30 2016-04-04 2.45 31 2016-04-05 12.48 32 2016-04-06 22.18 33 2016-04-07 13.19 --------------------------------------------- Total frequency: 100.0
We note that the last three days contain a disproportionate number of "last seen" values, approximately 8x (on average) those of the previous days. These spikes are most likely related to the crawling period ending and don't necessarily indicate massive car sales.
Now, we'll review the column ad_created
autos['ad_created_date'] = autos['ad_created'].str[:10]
get_col_details(autos, 'ad_created_date', round_vals=4, sort_index=True, show_describe=False, normalize=True, dropna=False)
Ad Created Date - Unique Values: 76 FIRST 20 VALUES Ad Created Date Frequency (%) 0 2015-06-11 0.0021 1 2015-08-10 0.0021 2 2015-09-09 0.0021 3 2015-11-10 0.0021 4 2015-12-05 0.0021 5 2015-12-30 0.0021 6 2016-01-03 0.0021 7 2016-01-07 0.0021 8 2016-01-10 0.0041 9 2016-01-13 0.0021 10 2016-01-14 0.0021 11 2016-01-16 0.0021 12 2016-01-22 0.0021 13 2016-01-27 0.0062 14 2016-01-29 0.0021 15 2016-02-01 0.0021 16 2016-02-02 0.0041 17 2016-02-05 0.0041 18 2016-02-07 0.0021 19 2016-02-08 0.0021 --------------------------------------------- Total frequency: 0.0515 LAST 20 VALUES Ad Created Date Frequency (%) 56 2016-03-19 3.3687 57 2016-03-20 3.7949 58 2016-03-21 3.7579 59 2016-03-22 3.2801 60 2016-03-23 3.206 61 2016-03-24 2.928 62 2016-03-25 3.1751 63 2016-03-26 3.2266 64 2016-03-27 3.0989 65 2016-03-28 3.4984 66 2016-03-29 3.4037 67 2016-03-30 3.3501 68 2016-03-31 3.1875 69 2016-04-01 3.3687 70 2016-04-02 3.5149 71 2016-04-03 3.8855 72 2016-04-04 3.6858 73 2016-04-05 1.1819 74 2016-04-06 0.3253 75 2016-04-07 0.1256 --------------------------------------------- Total frequency: 59.3637
Most of the ad creation dates fall within 1-2 months of the crawl date, but we did notice some older values, the oldest being around 9 months old.
Let's check the column registration_year
get_col_details(autos, 'registration_year', show_val_counts=False)
Registration Year - Unique Values: 95 Registration Year - Statistics: count 48,565.0 mean 2,004.76 std 88.64 min 1,000.0 25% 1,999.0 50% 2,004.0 75% 2,008.0 max 9,999.0 Name: registration_year, dtype: object
The registration year of the car tells us how old it is. However, we notice some implausible values: the minimum is 1000, long before cars were invented, and the maximum is 9999, many years in the future.

We'll consider only those cars whose registration years fall within 1900-2016, since cars began to be mass-produced at the beginning of the 20th century.
Before performing the filter, we determine what percentage of the dataset have invalid values in this column:
total_cars = autos.shape[0]
total_wrong_reg_year = total_cars - autos['registration_year'].between(1900, 2016).sum()
perc_wrong_reg_year = (total_wrong_reg_year / total_cars) * 100
print("Percentage of wrong data in 'registration_year' column: {:.2f}".format(perc_wrong_reg_year))
Percentage of wrong data in 'registration_year' column: 3.88
Since these records only represent 3.88% of our data, we'll delete these rows.
# Filters the data
autos = autos[autos['registration_year'].between(1900, 2016)]
get_col_details(autos, 'registration_year', round_vals=4, sort_index=True, show_describe=False, normalize=True)
Registration Year - Unique Values: 78 FIRST 20 VALUES Registration Year Frequency (%) 0 1,910 0.0107 1 1,927 0.0021 2 1,929 0.0021 3 1,931 0.0021 4 1,934 0.0043 5 1,937 0.0086 6 1,938 0.0021 7 1,939 0.0021 8 1,941 0.0043 9 1,943 0.0021 10 1,948 0.0021 11 1,950 0.0064 12 1,951 0.0043 13 1,952 0.0021 14 1,953 0.0021 15 1,954 0.0043 16 1,955 0.0043 17 1,956 0.0086 18 1,957 0.0043 19 1,958 0.0086 --------------------------------------------- Total frequency: 0.0878 LAST 20 VALUES Registration Year Frequency (%) 58 1,997 4.1794 59 1,998 5.062 60 1,999 6.206 61 2,000 6.7608 62 2,001 5.6468 63 2,002 5.3255 64 2,003 5.7818 65 2,004 5.7904 66 2,005 6.2895 67 2,006 5.7197 68 2,007 4.8778 69 2,008 4.745 70 2,009 4.4665 71 2,010 3.404 72 2,011 3.4768 73 2,012 2.8063 74 2,013 1.7202 75 2,014 1.4203 76 2,015 0.8397 77 2,016 2.6135 --------------------------------------------- Total frequency: 87.1318
We can see that most of the vehicles were registered for the first time in the last 20 years or so (87.13%).
brand_fq = get_col_details(autos, 'brand', round_vals=3, show_describe=False, normalize=True, return_val_counts=True)
Brand - Unique Values: 40 Brand Frequency (%) 0 volkswagen 21.126 1 bmw 11.004 2 opel 10.758 3 mercedes_benz 9.646 4 audi 8.657 5 ford 6.990 6 renault 4.715 7 peugeot 2.984 8 fiat 2.564 9 seat 1.827 10 skoda 1.641 11 nissan 1.527 12 mazda 1.519 13 smart 1.416 14 citroen 1.401 15 toyota 1.270 16 hyundai 1.003 17 sonstige_autos 0.981 18 volvo 0.915 19 mini 0.876 20 mitsubishi 0.823 21 honda 0.784 22 kia 0.707 23 alfa_romeo 0.664 24 porsche 0.613 25 suzuki 0.593 26 chevrolet 0.570 27 chrysler 0.351 28 dacia 0.263 29 daihatsu 0.251 30 jeep 0.227 31 subaru 0.214 32 land_rover 0.210 33 saab 0.165 34 jaguar 0.156 35 daewoo 0.150 36 trabant 0.139 37 rover 0.133 38 lancia 0.107 39 lada 0.058 --------------------------------------------- Total frequency: 100.0
We note that German manufacturers occupy the top five spots, accounting for 61.19% of overall listings. Volkswagen is the most popular brand, with roughly as many cars for sale as the next two brands combined.

We also note that many brands don't have a significant percentage of listings, so we'll limit our analysis to brands that represent more than 5% of total listings.
common_brands = brand_fq[brand_fq['Frequency (%)'] > 5]['Brand']
print('Principal Brands:\n', common_brands, sep='\n')
Principal Brands: 0 volkswagen 1 bmw 2 opel 3 mercedes_benz 4 audi 5 ford Name: Brand, dtype: object
Before continuing, we'll implement a function that will allow us to obtain the average value of a column based on a specific field.
def get_col_mean_by(df, col, mean_col, unique_values):
    dict_mean_vals = {}
    for value in unique_values:
        only_value = df[df[col] == value]
        mean = only_value[mean_col].mean()
        dict_mean_vals[value] = int(mean)
    # Sorts the dictionary by values
    dict_mean_vals = dict(sorted(dict_mean_vals.items(), key=lambda x: x[1], reverse=True))
    # Converts the dictionary to a Series (applies the thousands format)
    series = pd.Series(dict_mean_vals).apply(sep_format)
    return series
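For reference, the loop in this function is equivalent to a pandas groupby; a sketch on toy data (the sample brands and prices are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'brand': ['bmw', 'bmw', 'ford'],
                   'price': [8000, 9000, 4000]})

# Group by brand, average the price, truncate to int, sort descending.
mean_prices = (df.groupby('brand')['price']
                 .mean()
                 .astype(int)
                 .sort_values(ascending=False))
```

The groupby version also avoids a separate filtering pass per unique value.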
Now with the function already implemented, we can get the average prices for the main brands.
brand_mean_prices = get_col_mean_by(autos, 'brand', 'price', common_brands.values)
print('Top Brands - Mean Price:\n', *brand_mean_prices.items(), sep='\n')
Top Brands - Mean Price: ('audi', '9,336') ('mercedes_benz', '8,628') ('bmw', '8,332') ('volkswagen', '5,402') ('ford', '3,749') ('opel', '2,975')
Of the top brands, we can classify the results based on price: audi, mercedes_benz and bmw are the most expensive; ford and opel are the least expensive; and volkswagen sits in between, which may help explain its popularity.

Next, we'll explore the average mileage for the major car brands.
brand_mean_mileages = get_col_mean_by(autos, 'brand', 'odometer_km', common_brands.values)
print('Top Brands - Mean Mileage:\n', *brand_mean_mileages.items(), sep='\n')
Top Brands - Mean Mileage: ('bmw', '132,572') ('mercedes_benz', '130,788') ('opel', '129,310') ('audi', '129,157') ('volkswagen', '128,707') ('ford', '124,266')
We note that the German brands occupy the five highest mileage values, averaging 130,107 km; ford's average is 4.49% below that figure.
Now, let's see if within the major brands there is any relationship between average mileage and average price.
brand_mileage_info = pd.DataFrame(brand_mean_mileages, columns=['Mean Mileage'])
brand_mileage_info['Mean Price'] = brand_mean_prices
brand_mileage_info
Mean Mileage | Mean Price | |
---|---|---|
bmw | 132,572 | 8,332 |
mercedes_benz | 130,788 | 8,628 |
opel | 129,310 | 2,975 |
audi | 129,157 | 9,336 |
volkswagen | 128,707 | 5,402 |
ford | 124,266 | 3,749 |
Within the list of main brands there is a slight tendency for the more expensive vehicles to have higher mileage, although opel (high mileage, low price) is a notable exception.

Next, we'll explore the values of the model column.
get_col_details(autos, 'model', sort_values=True, show_describe=False)
Model - Unique Values: 245 FIRST 30 VALUES Model Frequency 0 rangerover 1 1 b_max 1 2 200 1 3 i3 1 4 kappa 2 5 charade 3 6 samara 3 7 v60 3 8 materia 4 9 croma 4 10 145 4 11 9000 5 12 range_rover_evoque 5 13 delta 5 14 lodgy 5 15 crossfire 6 16 exeo 6 17 amarok 6 18 nubira 8 19 90 8 20 r19 9 21 range_rover 9 22 move 9 23 mii 10 24 300c 10 25 lybra 10 26 lanos 10 27 gl 10 28 terios 10 29 sirion 11 --------------------------------------------- Total frequency: 179 LAST 30 VALUES Model Frequency 214 megane 335 215 ka 349 216 zafira 394 217 punto 415 218 touran 433 219 clio 473 220 mondeo 479 221 3_reihe 486 222 1er 521 223 a_klasse 539 224 vectra 544 225 fortwo 550 226 2_reihe 600 227 twingo 615 228 transporter 674 229 fiesta 722 230 focus 762 231 a6 797 232 a3 825 233 e_klasse 958 234 5er 1,132 235 c_klasse 1,136 236 a4 1,231 237 astra 1,348 238 passat 1,349 239 corsa 1,592 240 polo 1,609 241 3er 2,615 242 andere 3,373 243 golf 3,707 --------------------------------------------- Total frequency: 30,563
Now, we'll create a new column whose values are the concatenation of the brand and model columns. Then we'll examine its values.
autos['brand_model'] = autos['brand'] + ' / ' + autos['model']
get_col_details(autos, 'brand_model', sort_values=True, show_describe=False)
Brand Model - Unique Values: 291 FIRST 30 VALUES Brand Model Frequency 0 bmw / i3 1 1 ford / b_max 1 2 rover / rangerover 1 3 audi / 200 1 4 rover / discovery 1 5 rover / freelander 2 6 lancia / kappa 2 7 dacia / andere 2 8 daihatsu / charade 3 9 volvo / v60 3 10 lada / samara 3 11 fiat / croma 4 12 daihatsu / materia 4 13 alfa_romeo / 145 4 14 land_rover / andere 4 15 land_rover / range_rover_evoque 5 16 saab / 9000 5 17 dacia / lodgy 5 18 lancia / delta 5 19 volkswagen / amarok 6 20 seat / exeo 6 21 chrysler / crossfire 6 22 lada / andere 7 23 smart / andere 7 24 daewoo / nubira 8 25 audi / 90 8 26 trabant / andere 8 27 renault / r19 9 28 daihatsu / move 9 29 daewoo / andere 9 --------------------------------------------- Total frequency: 139 LAST 30 VALUES Brand Model Frequency 260 seat / ibiza 328 261 renault / megane 335 262 ford / ka 349 263 opel / zafira 394 264 fiat / punto 415 265 volkswagen / touran 433 266 mercedes_benz / andere 439 267 renault / clio 473 268 ford / mondeo 479 269 bmw / 1er 521 270 mercedes_benz / a_klasse 539 271 opel / vectra 544 272 smart / fortwo 550 273 peugeot / 2_reihe 600 274 renault / twingo 615 275 volkswagen / transporter 674 276 ford / fiesta 722 277 ford / focus 762 278 audi / a6 797 279 audi / a3 825 280 mercedes_benz / e_klasse 958 281 bmw / 5er 1,132 282 mercedes_benz / c_klasse 1,136 283 audi / a4 1,231 284 opel / astra 1,348 285 volkswagen / passat 1,349 286 opel / corsa 1,592 287 volkswagen / polo 1,609 288 bmw / 3er 2,615 289 volkswagen / golf 3,707 --------------------------------------------- Total frequency: 27,471
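As an aside, the brand/model frequencies could also be computed directly from the two original columns, without the concatenated helper column; a sketch on toy data:

```python
import pandas as pd

df = pd.DataFrame({'brand': ['volkswagen', 'volkswagen', 'bmw'],
                   'model': ['golf', 'golf', '3er']})

# Count each (brand, model) pair directly, most frequent first.
combo_counts = df.groupby(['brand', 'model']).size().sort_values(ascending=False)
```

This yields a MultiIndex of (brand, model) pairs rather than a single combined string.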
The list of the last 30 values represents 58.73% of the total cars listed. The volkswagen / golf combination is the most common, 41.76% more frequent than the second-place combination, bmw / 3er.
The last 30 values have an average frequency of 916. For our purpose, we'll only consider the values that exceed this average; conveniently, the last 10 values in the list are exactly the ones that meet this criterion.
We'll skip our previously implemented function here, since we only need the 10 highest values.
# value_counts() already sorts in descending order
top_brand_model = autos['brand_model'].value_counts().head(10)
print('Top Ten - Most Common [Brand / Model] Cars:\n')
for position, bm in enumerate(top_brand_model.index, start=1):
    print(position, bm)
Top Ten - Most Common [Brand / Model] Cars: 1 volkswagen / golf 2 bmw / 3er 3 volkswagen / polo 4 opel / corsa 5 volkswagen / passat 6 opel / astra 7 audi / a4 8 mercedes_benz / c_klasse 9 bmw / 5er 10 mercedes_benz / e_klasse
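Instead of hardcoding the top 10, the average-frequency criterion itself can be expressed directly as a filter. A sketch on a toy frequency series standing in for `autos['brand_model'].value_counts()`:

```python
import pandas as pd

# Toy frequency counts standing in for
# autos['brand_model'].value_counts() from the notebook
counts = pd.Series({'volkswagen / golf': 3707, 'bmw / 3er': 2615,
                    'volkswagen / polo': 1609, 'seat / exeo': 6})

# Keep only the combinations whose frequency exceeds the mean frequency
threshold = counts.mean()
top = counts[counts > threshold].sort_values(ascending=False)
print(top.index.tolist())
```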
We'll divide the values in the odometer_km column into groups, using Sturges' rule to calculate the number of classes that our frequency table should contain.
# Total number of rows
n = autos.shape[0]
# Calculate the range of the data
min_odometer = autos['odometer_km'].min()
max_odometer = autos['odometer_km'].max()
range_ = max_odometer - min_odometer
# Sturges' rule: k = 1 + 3.32 * log10(n)
k = 1 + 3.32 * np.log10(n)
k = round(k)
# Approximate size of each class interval
h = range_ / k
h = round(h, -1)  # Round to the nearest ten
# Build the list with the limits of each class
i = 0
interval = min_odometer
odometer_intervals = []
while i <= k:
    odometer_intervals.append(int(interval))
    interval += h
    i += 1
print('Odometer (km) - Limits for the bins:\n', *odometer_intervals, sep='\n')
Odometer (km) - Limits for the bins: 5000 13530 22060 30590 39120 47650 56180 64710 73240 81770 90300 98830 107360 115890 124420 132950 141480 150010
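As an aside, NumPy can compute Sturges bins directly via `np.histogram_bin_edges(..., bins='sturges')`. Note it uses the exact base-2 form of the rule (bin count derived from log2(n) + 1) rather than the base-10 approximation above, so its edges differ slightly. A sketch on toy data:

```python
import numpy as np

# Toy odometer readings standing in for autos['odometer_km']
odometer = np.array([5000, 40000, 90000, 125000, 150000] * 20)

# 'sturges' derives the bin count from log2(n) + 1
edges = np.histogram_bin_edges(odometer, bins='sturges')
print(len(edges) - 1, 'bins')
```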
With the limits list obtained, we can now build the frequency table.
mileage_group = get_col_details(autos, 'odometer_km', sort_index=True, show_unique_num=False,
show_describe=False, bins=odometer_intervals, return_val_counts=True)
Odometer Km Frequency 0 (4999.999, 13530.0] 1026 1 (13530.0, 22060.0] 742 2 (22060.0, 30590.0] 760 3 (30590.0, 39120.0] 0 4 (39120.0, 47650.0] 797 5 (47650.0, 56180.0] 993 6 (56180.0, 64710.0] 1128 7 (64710.0, 73240.0] 1187 8 (73240.0, 81770.0] 1375 9 (81770.0, 90300.0] 1673 10 (90300.0, 98830.0] 0 11 (98830.0, 107360.0] 2058 12 (107360.0, 115890.0] 0 13 (115890.0, 124420.0] 0 14 (124420.0, 132950.0] 4857 15 (132950.0, 141480.0] 0 16 (141480.0, 150010.0] 30085 --------------------------------------------- Total frequency: 46,681
As a result, we got a DataFrame with the frequency table. Next, we'll get the lists of average mileage and average prices for each class and add them to the DataFrame.
avg_prices = []
avg_mileages = []
for index in range(len(odometer_intervals) - 1):
    lower_limit = odometer_intervals[index]
    upper_limit = odometer_intervals[index + 1]
    only_class = autos[(autos['odometer_km'] >= lower_limit) & (autos['odometer_km'] < upper_limit)]
    mean_price = only_class['price'].mean()
    mean_mileage = only_class['odometer_km'].mean()
    # Empty classes produce NaN means; treat them as 0
    if pd.isna(mean_price):
        mean_price = 0
    if pd.isna(mean_mileage):
        mean_mileage = 0
    avg_prices.append(int(mean_price))
    avg_mileages.append(int(mean_mileage))
# Convert the lists into Series
avg_prices_series = pd.Series(avg_prices)
avg_mileages_series = pd.Series(avg_mileages)
# Format the Series and add them to the frequency table
mileage_group['AVG Mileage'] = avg_mileages_series.apply(sep_format)
mileage_group['AVG Price'] = avg_prices_series.apply(sep_format)
# Format the frequency values
mileage_group['Frequency'] = mileage_group['Frequency'].apply(sep_format)
print(mileage_group)
Odometer Km Frequency AVG Mileage AVG Price 0 (4999.999, 13530.0] 1,026 6,174 11,616 1 (13530.0, 22060.0] 742 20,000 18,448 2 (22060.0, 30590.0] 760 30,000 16,608 3 (30590.0, 39120.0] 0 0 0 4 (39120.0, 47650.0] 797 40,000 15,499 5 (47650.0, 56180.0] 993 50,000 13,812 6 (56180.0, 64710.0] 1,128 60,000 12,385 7 (64710.0, 73240.0] 1,187 70,000 10,927 8 (73240.0, 81770.0] 1,375 80,000 9,721 9 (81770.0, 90300.0] 1,673 90,000 8,465 10 (90300.0, 98830.0] 0 0 0 11 (98830.0, 107360.0] 2,058 100,000 8,132 12 (107360.0, 115890.0] 0 0 0 13 (115890.0, 124420.0] 0 0 0 14 (124420.0, 132950.0] 4,857 125,000 6,214 15 (132950.0, 141480.0] 0 0 0 16 (141480.0, 150010.0] 30,085 150,000 3,767
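The per-class averages can also be obtained in one pass with `pd.cut` plus `groupby`, avoiding the manual loop. A sketch on a toy DataFrame with the notebook's column names:

```python
import pandas as pd

# Toy listings standing in for the autos DataFrame
autos_toy = pd.DataFrame({
    'odometer_km': [10000, 20000, 125000, 150000, 150000],
    'price': [12000, 18000, 6000, 4000, 3500],
})

# Assign each row to a class, then aggregate per class;
# observed=False keeps empty classes in the result
bins = [5000, 50000, 100000, 150000]
classes = pd.cut(autos_toy['odometer_km'], bins=bins, include_lowest=True)
summary = autos_toy.groupby(classes, observed=False)['price'].agg(['count', 'mean'])
print(summary)
```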
With the exception of the second class, the average mileage is inversely related to the average price: as the average mileage increases, the average price decreases.
This is expected in the used-car market, since cars with higher mileage usually sell for a lower price.
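This inverse relationship can be quantified with `Series.corr` (Pearson by default); in the notebook, `autos['odometer_km'].corr(autos['price'])` would give the actual figure. A sketch on toy mileage/price pairs mirroring the pattern in the table above:

```python
import pandas as pd

# Toy pairs loosely following the table above
mileage = pd.Series([20000, 40000, 60000, 90000, 125000, 150000])
price = pd.Series([18448, 15499, 12385, 8465, 6214, 3767])

# Pearson correlation: values near -1 indicate a strong inverse relationship
r = mileage.corr(price)
print(round(r, 3))
```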
Let's start by exploring the values of the unrepaired_damage column.
get_col_details(autos, 'unrepaired_damage', show_describe=False, dropna=False)
Unrepaired Damage - Unique Values: 3 Unrepaired Damage Frequency 0 nein 33,834 1 NaN 8,307 2 ja 4,540 --------------------------------------------- Total frequency: 46,681
We see that, in addition to the null values, the remaining values are in German. We'll translate these values and replace them.
autos['unrepaired_damage'] = (autos['unrepaired_damage']
                              .str.replace('nein', 'No')
                              .str.replace('ja', 'Yes'))
get_col_details(autos, 'unrepaired_damage', show_describe=False, dropna=False)
Unrepaired Damage - Unique Values: 3 Unrepaired Damage Frequency 0 No 33,834 1 NaN 8,307 2 Yes 4,540 --------------------------------------------- Total frequency: 46,681
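An equivalent translation can be done in a single call with `Series.replace` and a mapping dict, which matches whole values and leaves NaN entries untouched. A sketch on a toy Series:

```python
import pandas as pd

# Toy stand-in for the unrepaired_damage column
damage = pd.Series(['nein', 'ja', None, 'nein'])

# A dict maps whole values; NaN entries pass through unchanged
translated = damage.replace({'nein': 'No', 'ja': 'Yes'})
print(translated.tolist())
```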
Now, we calculate the average price for each value.
For our purpose we'll ignore the null values.
unrepaired_damage_avg_price = get_col_mean_by(autos, 'unrepaired_damage', 'price', ['No', 'Yes'])
unrep_price_info = pd.DataFrame(unrepaired_damage_avg_price, columns=['AVG Price'])
unrep_price_info.index.name = 'Unrepaired Damage'
unrep_price_info
Unrepaired Damage | AVG Price |
---|---|
No | 7,164 |
Yes | 2,241 |
As we expected, cars with unrepaired damage sell for considerably lower prices on average.
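Without the custom `get_col_mean_by` helper, the same comparison is a one-line `groupby`, which drops the NaN group by default, matching the choice made above. A sketch on toy data:

```python
import pandas as pd

# Toy stand-in for the autos DataFrame
autos_toy = pd.DataFrame({
    'unrepaired_damage': ['No', 'No', 'Yes', 'Yes', None],
    'price': [8000, 6000, 2500, 2000, 5000],
})

# groupby excludes the NaN group by default
avg_price = autos_toy.groupby('unrepaired_damage')['price'].mean()
print(avg_price)
```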