EBay is an American e-commerce websites where users can place their wares on sale. EBay car sales include sales from individual sellers and car dealers. Thus, investigating the car sales that take place on eBay will help shed light on consumer behaviour and demand.
The original dataset was uploaded on Kaggle but can now be found on data.world. The data comprises of 50,000 points and the following columns:
Column Name | Details |
---|---|
dateCrawled | When this ad was first crawled. All field-values are taken from this date |
name | Name of the car |
seller | Whether the seller is private or a dealer |
offerType | The type of listing |
price | The price on the ad to sell the car |
abtest | Whether the listing is included in an A/B test |
vehicleType | The vehicle Type |
yearOfRegistration | The year in which the car was first registered |
gearbox | The transmission type |
powerPS | The power of the car in PS |
model | The car model name |
kilometer | How many kilometers the car has driven |
monthOfRegistration | The month in which the car was first registered |
fuelType | What type of fuel the car uses |
brand | The brand of the car |
notRepairedDamage | If the car has a damage which is not yet repaired |
dateCreated | The date on which the eBay listing was created |
nrOfPictures | The number of pictures in the ad |
postalCode | The postal code for the location of the vehicle |
lastSeenOnline | When the crawler saw this ad last online |
#import numpy and pandas
import numpy as np
import pandas as pd
#read csv file
autos = pd.read_csv('autos.csv', encoding = 'Latin-1')
#initial data exploration
autos.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): dateCrawled 50000 non-null object name 50000 non-null object seller 50000 non-null object offerType 50000 non-null object price 50000 non-null object abtest 50000 non-null object vehicleType 44905 non-null object yearOfRegistration 50000 non-null int64 gearbox 47320 non-null object powerPS 50000 non-null int64 model 47242 non-null object odometer 50000 non-null object monthOfRegistration 50000 non-null int64 fuelType 45518 non-null object brand 50000 non-null object notRepairedDamage 40171 non-null object dateCreated 50000 non-null object nrOfPictures 50000 non-null int64 postalCode 50000 non-null int64 lastSeen 50000 non-null object dtypes: int64(5), object(15) memory usage: 7.6+ MB
autos.head()
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
autos.describe(include = 'all')
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-03-21 20:37:19 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
autos.dtypes
dateCrawled object name object seller object offerType object price object abtest object vehicleType object yearOfRegistration int64 gearbox object powerPS int64 model object odometer object monthOfRegistration int64 fuelType object brand object notRepairedDamage object dateCreated object nrOfPictures int64 postalCode int64 lastSeen object dtype: object
At a first glance, it can be observed that the column names are in camelcase instead of snakecase. This makes analysis and readability a bit more difficult. Thus, all column names will be transformed to snake case. Some of the names, (like yearOfRegistration) are too long and will be shortened (to registration_year) for easy readability.
Additionally, there are some columns like price and odometer which have extra characters that render the dtype of the column as object. However, after those extra characters are removed, the column dtypes can be changed to numeric ones. TThe date columns ad_created, last_seen and date_crawled are identified as strings and will need to be converted to numeric representation for easy analysis.
Finally, there are null values in the vehicleType, gearbox, model, fuelType, notRepairedDamage columns which will need to be appropriately dealt with.
old_column_names = autos.columns
# create a list for new column names
new_column_names = []
changed_names = {
'yearOfRegistration' : 'registration_year',
'monthOfRegistration' : 'registration_month',
'notRepairedDamage' : 'unrepaired_damage',
'dateCreated' : 'ad_created',
'powerPS' : 'power_ps',
'nrOfPictures': 'pictures_number'
}
# transform old column names add them to new list
for column in old_column_names:
if column in changed_names:
column = changed_names[column]
column_l = list(column)
column_l2 = column_l[:]
for i in range(len(column_l2)):
if column_l2[i].isupper():
column_l.insert (i, '_')
column = ''.join(column_l)
column = column.lower()
new_column_names.append(column)
# assign new names to columns
autos.columns = new_column_names
autos.head()
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | pictures_number | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
As mentioned previously, the odometer and price columns need to be converted to numeric data types. However, to ensure that the right operations are being conducted the unique values of each will be analysed.
print(autos['price'].unique())
print(autos['odometer'].unique())
['$5,000' '$8,500' '$8,990' ... '$385' '$22,200' '$16,995'] ['150,000km' '70,000km' '50,000km' '80,000km' '10,000km' '30,000km' '125,000km' '90,000km' '20,000km' '60,000km' '5,000km' '100,000km' '40,000km']
It can be observed that the non-numeric characters for the price column are '$' and ',' and that of the *odometer column are ',' and 'km'. The odometer column can be converted to integer values. However, the price column will be converted into a float, since price is sometimes quoted with decimals.
#remove all non-numeric characters for odometer and price columns
autos['price'] = autos['price'].str.replace('$', '')
autos['price'] = autos['price'].str.replace(',', '')
autos['price'] = autos['price'].astype(float)
autos['odometer'] = autos['odometer'].str.replace('km', '')
autos['odometer'] = autos['odometer'].str.replace(',', '')
autos['odometer'] = autos['odometer'].astype(int)
autos.rename({'odometer':'odometer_km'}, axis = 1, inplace = True)
autos.head()
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | pictures_number | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | 5000.0 | control | bus | 2004 | manuell | 158 | andere | 150000 | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | 8500.0 | control | limousine | 1997 | automatik | 286 | 7er | 150000 | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | 8990.0 | test | limousine | 2009 | manuell | 102 | golf | 70000 | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | 4350.0 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70000 | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | 1350.0 | test | kombi | 2003 | manuell | 0 | focus | 150000 | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
These columns will be further analysed for any unrealistic minimum/maxiumum values and those outliers will be removed.
# explore odometer_km column values
print(autos['odometer_km'].describe())
print('\n')
print(autos['odometer_km'].value_counts())
count 50000.000000 mean 125732.700000 std 40042.211706 min 5000.000000 25% 125000.000000 50% 150000.000000 75% 150000.000000 max 150000.000000 Name: odometer_km, dtype: float64 150000 32424 125000 5170 100000 2169 90000 1757 80000 1436 70000 1230 60000 1164 50000 1027 5000 967 40000 819 30000 789 20000 784 10000 264 Name: odometer_km, dtype: int64
# explore price column values
print("Summary of values in price column")
print(autos['price'].describe())
print('\n')
print('Most common values')
print(autos['price'].value_counts().head())
print('\n')
print('Least common values')
print(autos['price'].value_counts().tail())
print('\n')
print('Cars more expensive than $4 million')
print(autos.loc[(autos['price'] > 4000000), 'price'])
print('\n')
Summary of values in price column count 5.000000e+04 mean 9.840044e+03 std 4.811044e+05 min 0.000000e+00 25% 1.100000e+03 50% 2.950000e+03 75% 7.200000e+03 max 1.000000e+08 Name: price, dtype: float64 Most common values 0.0 1421 500.0 781 1500.0 734 2500.0 643 1200.0 639 Name: price, dtype: int64 Least common values 6202.0 1 18310.0 1 898.0 1 11240.0 1 789.0 1 Name: price, dtype: int64 Cars more expensive than $4 million 2897 11111111.0 11137 10000000.0 24384 11111111.0 27371 12345678.0 39377 12345678.0 39705 99999999.0 42221 27322222.0 47598 12345678.0 Name: price, dtype: float64
Looking at the column summary for the price column, both the minimum and maximum values are a bit unrealistic. While there are some cars that are sold for extremely low prices, it is highly unlikely that a car was sold for nothing. On the other hand, the most expensive vehicle sold in ebay up until 2012 was $3.3million and in 2016 no car was sold on ebay over $1 million. Thus, although there is little to no information for the years 2013-2015, it can perhaps be estimated that no car was sold above $4million dollars.
Thus, cars that are less than $50, and those more than $4 million will be removed.
# remove outliers in price column
outlier_bool = autos[~(autos['price'].between(50,4000000))]
outlier_indices = outlier_bool.index
autos = autos.drop(outlier_indices)
print(autos['price'].describe())
count 4.835300e+04 mean 6.109634e+03 std 2.283463e+04 min 5.000000e+01 25% 1.250000e+03 50% 3.000000e+03 75% 7.499000e+03 max 3.890000e+06 Name: price, dtype: float64
After removing the outliers the mean reduced by 37.9%. Additionally the max price reduced from $100 million to $3.6 million which is more in line with the data we have concerning expensive car sales on eBay.
There are 5 columns that have date values: date crawled, last seen, ad_created, registration_month and registratin_year. Of the 5, date_crawled and last_seen and ad_created are stored as strings. However, in order to analyse them, they need to be converted to numeric representation.
# create percentage distribution for dates in date_crawled
date_crawled_percents = autos['date_crawled'].str[:10].value_counts(normalize = True, dropna = False)
date_crawled_percents.sort_index(inplace = True)
print(date_crawled_percents)
2016-03-05 0.025355 2016-03-06 0.014063 2016-03-07 0.036027 2016-03-08 0.033255 2016-03-09 0.033049 2016-03-10 0.032221 2016-03-11 0.032552 2016-03-12 0.036978 2016-03-13 0.015656 2016-03-14 0.036668 2016-03-15 0.034289 2016-03-16 0.029491 2016-03-17 0.031601 2016-03-18 0.012905 2016-03-19 0.034744 2016-03-20 0.037785 2016-03-21 0.037309 2016-03-22 0.032966 2016-03-23 0.032263 2016-03-24 0.029429 2016-03-25 0.031498 2016-03-26 0.032263 2016-03-27 0.031084 2016-03-28 0.034889 2016-03-29 0.034124 2016-03-30 0.033710 2016-03-31 0.031828 2016-04-01 0.033731 2016-04-02 0.035530 2016-04-03 0.038571 2016-04-04 0.036564 2016-04-05 0.013050 2016-04-06 0.003164 2016-04-07 0.001386 Name: date_crawled, dtype: float64
From the data above it can be seen that the web scraper was active for one month from March to April 2016. With the maxiumum amount of scraping done on 3rd April 2016.
# create percentage distribution for dates in last_seen
last_seen_percents = autos['last_seen'].str[:10].value_counts(normalize = True, dropna = False)
last_seen_percents.sort_index(inplace = True)
print(last_seen_percents)
2016-03-05 0.001075 2016-03-06 0.004322 2016-03-07 0.005418 2016-03-08 0.007342 2016-03-09 0.009637 2016-03-10 0.010609 2016-03-11 0.012388 2016-03-12 0.023742 2016-03-13 0.008872 2016-03-14 0.012636 2016-03-15 0.015883 2016-03-16 0.016462 2016-03-17 0.028085 2016-03-18 0.007321 2016-03-19 0.015800 2016-03-20 0.020619 2016-03-21 0.020599 2016-03-22 0.021405 2016-03-23 0.018592 2016-03-24 0.019771 2016-03-25 0.019172 2016-03-26 0.016772 2016-03-27 0.015573 2016-03-28 0.020867 2016-03-29 0.022336 2016-03-30 0.024735 2016-03-31 0.023825 2016-04-01 0.022873 2016-04-02 0.024900 2016-04-03 0.025169 2016-04-04 0.024507 2016-04-05 0.124997 2016-04-06 0.221620 2016-04-07 0.132070 Name: last_seen, dtype: float64
The last seen dates mirror those of the web scraper activity which makes sense, since the web scraper could only judge the last seen dates of ads when it was active.
# create percentage distribution for dates in ad_created
ad_created_percents = autos['ad_created'].str[:10].value_counts(normalize = True, dropna = False)
ad_created_percents.sort_index(inplace = True)
print(ad_created_percents)
2015-06-11 0.000021 2015-08-10 0.000021 2015-09-09 0.000021 2015-11-10 0.000021 2015-12-05 0.000021 2015-12-30 0.000021 2016-01-03 0.000021 2016-01-07 0.000021 2016-01-10 0.000041 2016-01-13 0.000021 2016-01-14 0.000021 2016-01-16 0.000021 2016-01-22 0.000021 2016-01-27 0.000062 2016-01-29 0.000021 2016-02-01 0.000021 2016-02-02 0.000041 2016-02-05 0.000041 2016-02-07 0.000021 2016-02-08 0.000021 2016-02-09 0.000021 2016-02-11 0.000021 2016-02-12 0.000041 2016-02-14 0.000041 2016-02-16 0.000021 2016-02-17 0.000021 2016-02-18 0.000041 2016-02-19 0.000062 2016-02-20 0.000041 2016-02-21 0.000062 ... 2016-03-09 0.033131 2016-03-10 0.031932 2016-03-11 0.032883 2016-03-12 0.036792 2016-03-13 0.017021 2016-03-14 0.035303 2016-03-15 0.034021 2016-03-16 0.029988 2016-03-17 0.031270 2016-03-18 0.013588 2016-03-19 0.033628 2016-03-20 0.037847 2016-03-21 0.037536 2016-03-22 0.032780 2016-03-23 0.032097 2016-03-24 0.029367 2016-03-25 0.031622 2016-03-26 0.032325 2016-03-27 0.031001 2016-03-28 0.034993 2016-03-29 0.034083 2016-03-30 0.033524 2016-03-31 0.031870 2016-04-01 0.033710 2016-04-02 0.035220 2016-04-03 0.038819 2016-04-04 0.036937 2016-04-05 0.011768 2016-04-06 0.003247 2016-04-07 0.001241 Name: ad_created, Length: 76, dtype: float64
The earliest car ad that was picked up by the scraper was in 2015. This is in line with the scraper's active times, since most cars posted before June 2015 would probably have been sold by the March-April 2016.
autos['registration_year'].describe()
autos['registration_year'].unique()
array([2004, 1997, 2009, 2007, 2003, 2006, 1995, 1998, 2000, 2017, 2010, 1999, 1982, 1990, 2015, 2014, 1996, 1992, 2002, 2012, 2011, 2005, 2008, 1985, 2016, 1994, 1986, 2001, 2018, 2013, 1972, 1993, 1988, 1989, 1973, 1967, 1976, 4500, 1987, 1991, 1983, 1960, 1969, 1978, 1980, 1984, 1963, 1977, 1961, 1968, 1934, 1965, 1971, 1966, 1979, 1981, 1970, 1974, 1975, 4100, 2019, 1956, 9999, 6200, 1964, 1959, 1958, 1800, 1948, 1931, 1943, 1941, 1950, 1962, 1927, 1937, 1929, 1000, 1910, 5000, 1957, 1952, 1111, 1955, 1939, 8888, 1954, 1938, 2800, 5911, 1953, 1951, 4800, 1001, 9000])
The minimum and maximum years are wildly unrealistic. They year 9999 is far into the future and by 1000 cars had not even been invented. The first car to be registered in the United States was in 1901, thus this will be taken as the minimum year, and since the year the web scraper was active was 2016, any years after that will be discarded.
# remove outliers in registration_year column
autos = autos[(autos['registration_year'].between(1901, 2016))]
autos['registration_year'].unique()
array([2004, 1997, 2009, 2007, 2003, 2006, 1995, 1998, 2000, 2010, 1999, 1982, 1990, 2015, 2014, 1996, 1992, 2002, 2012, 2011, 2005, 2008, 1985, 2016, 1994, 1986, 2001, 2013, 1972, 1993, 1988, 1989, 1973, 1967, 1976, 1987, 1991, 1983, 1960, 1969, 1978, 1980, 1984, 1963, 1977, 1961, 1968, 1934, 1965, 1971, 1966, 1979, 1981, 1970, 1974, 1975, 1956, 1964, 1959, 1958, 1948, 1931, 1943, 1941, 1950, 1962, 1927, 1937, 1929, 1910, 1957, 1952, 1955, 1939, 1954, 1938, 1953, 1951])
#calculate percentage distribution of values in registration_year
autos['registration_year'].value_counts(normalize = True)
2000 0.067127 2005 0.062846 1999 0.062071 2004 0.058069 2003 0.057983 2006 0.057467 2001 0.056628 2002 0.053314 1998 0.050582 2007 0.048925 2008 0.047591 2009 0.044794 1997 0.041739 2011 0.034854 2010 0.034187 1996 0.029347 2012 0.028163 2016 0.025969 1995 0.025926 2013 0.017234 2014 0.014243 1994 0.013533 1993 0.009122 2015 0.008262 1992 0.007939 1991 0.007294 1990 0.007229 1989 0.003679 1988 0.002883 1985 0.002044 ... 1966 0.000473 1976 0.000452 1969 0.000409 1975 0.000387 1965 0.000366 1964 0.000258 1963 0.000172 1961 0.000129 1959 0.000129 1962 0.000086 1956 0.000086 1937 0.000086 1958 0.000086 1955 0.000043 1954 0.000043 1957 0.000043 1951 0.000043 1910 0.000043 1934 0.000043 1941 0.000043 1927 0.000022 1929 0.000022 1950 0.000022 1931 0.000022 1948 0.000022 1938 0.000022 1939 0.000022 1953 0.000022 1943 0.000022 1952 0.000022 Name: registration_year, Length: 78, dtype: float64
The most popular registration year, for car sold on eBay was 2000. Cars registered in 2000 comprised about 6.7% while those in 2005 closely followed with 6.2%. Clearly 10-16 year old cars sell the best on eBay. This is reinforced by the fact that the top 5 years ranged from 1999 to 2005. On the other hand, vintage cars (50+ years) sold the least, with the last ten spots being occupied by cars registered before 1953.
It would be interesting to figure out which brands sell the most expensive cars and if that aligns with the traditional views held on the prestige of those brands.
import operator
#calculate 6 most common brands
common_brands = list(autos['brand'].value_counts()[:6].index)
#function to calculate mean for each brand
def mean_calculator(brand_list, column):
'''This function takes in a list of brands and a relevant common name(as a string)
and calculates by brand, the mean of the values pertaining to it in the specified
column. It returns a dictionary which maps each brand to the relevant mean'''
brand_mean = {}
for br in brand_list:
brand_mean[br] = autos.loc[autos['brand'] == br, column].mean()
return brand_mean
#calculate mean price for each brand
brand_mean_price = mean_calculator(common_brands, 'price')
print(brand_mean_price)
{'bmw': 8616.716885149677, 'mercedes_benz': 8663.059308807135, 'audi': 9373.771428571428, 'ford': 4068.368275438057, 'volkswagen': 5627.416675150158, 'opel': 2996.0850210547424}
As expected, Audi, Mercedes Benz & BMW have the most expesive cars. This suggests that e-commerce sales mirror real-life prestige enjoyed by those brands. However, in the top 6 most common brands there's a lost of variation in mean price. The mean price of an opel is 68% lower than that of an audi (most expensive). However, even compared to ford which is the the 2nd least expensive common car, the price of an opel is 38% lower. In order to investigate the variance in price, the average mileage of each of these brands will be investigated in order to see if this is the main factor affecting the price variability.
#calculate mean mileage for each brand
brand_mean_mileage = mean_calculator(common_brands, 'odometer_km')
#convert both dictionaries to series objects
bmp_series = pd.Series(brand_mean_price)
bmm_series = pd.Series(brand_mean_mileage)
price_mileage = pd.DataFrame(bmp_series, columns=['mean_price'])
price_mileage['mean_mileage'] = bmm_series
print(price_mileage)
mean_price mean_mileage audi 9373.771429 129254.658385 bmw 8616.716885 132655.057719 ford 4068.368275 124296.034430 mercedes_benz 8663.059309 130923.076923 opel 2996.085021 129440.545418 volkswagen 5627.416675 128775.323221
There seems to be no clear pattern between mean price and mean mileage this indicates that the most expensive cars sell for more money due to some other factors, such as perceived quality or durability.