Exploring eBay Car Sales Data
In this project, the aim is to clean the data and analyze the included used car listings. In so doing, I will also become familiar with some of the unique benefits Jupyter Notebook provides when working with pandas.
import pandas as pd
import numpy as np
autos = pd.read_csv('autos.csv', encoding='Latin-1')
autos.info()
autos.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
dateCrawled            50000 non-null object
name                   50000 non-null object
seller                 50000 non-null object
offerType              50000 non-null object
price                  50000 non-null object
abtest                 50000 non-null object
vehicleType            44905 non-null object
yearOfRegistration     50000 non-null int64
gearbox                47320 non-null object
powerPS                50000 non-null int64
model                  47242 non-null object
odometer               50000 non-null object
monthOfRegistration    50000 non-null int64
fuelType               45518 non-null object
brand                  50000 non-null object
notRepairedDamage      40171 non-null object
dateCreated            50000 non-null object
nrOfPictures           50000 non-null int64
postalCode             50000 non-null int64
lastSeen               50000 non-null object
dtypes: int64(5), object(15)
memory usage: 7.6+ MB
 | dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
From the output of the DataFrame.info() method above, we can see that the dataset has 20 columns, five of which contain null values (vehicleType, gearbox, model, fuelType and notRepairedDamage). The columns hold two data types: object and int64.
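A quick way to surface only the columns with missing values is to sum the boolean mask from isnull(). Here is a minimal sketch using a small stand-in frame with hypothetical values, since the real CSV is not bundled here:

```python
import pandas as pd

# Tiny stand-in for the real listings (hypothetical values).
autos = pd.DataFrame({
    'vehicleType': ['bus', None, 'limousine'],
    'gearbox': ['manuell', 'automatik', None],
    'price': ['$5,000', '$8,500', '$1,350'],
})

# Count missing values per column, keeping only columns that have any.
null_counts = autos.isnull().sum()
print(null_counts[null_counts > 0])
```

On the full dataset, the same two lines reproduce the five columns with nulls listed above.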
autos.columns
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model', 'odometer', 'monthOfRegistration', 'fuelType', 'brand', 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object')
autos.columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
'odometer', 'registration_month', 'fuel_type', 'brand',
'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code',
'last_seen']
autos.head()
 | date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
From observation, the column names use camelCase instead of Python's preferred snake_case. I converted them to snake_case, adjusting a few names (e.g. yearOfRegistration to registration_year) to make them easier to understand.
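The mechanical part of that conversion can also be done with a small helper instead of typing the list by hand. This is only a sketch: it handles the camelCase-to-snake_case step, while renames like yearOfRegistration to registration_year (where words were also reordered) would still need an explicit mapping.

```python
import re

def to_snake_case(name):
    """Insert an underscore before each uppercase letter that follows
    a lowercase letter or digit, then lowercase everything."""
    return re.sub(r'(?<=[a-z0-9])([A-Z])', r'_\1', name).lower()

print(to_snake_case('notRepairedDamage'))  # -> not_repaired_damage
print(to_snake_case('powerPS'))            # -> power_ps
```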
autos.describe(include='all')
autos['odometer'].value_counts()
150,000km    32424
125,000km     5170
100,000km     2169
90,000km      1757
80,000km      1436
70,000km      1230
60,000km      1164
50,000km      1027
5,000km        967
40,000km       819
30,000km       789
20,000km       784
10,000km       264
Name: odometer, dtype: int64
Taking a close look at the dataset, we can see that the price and odometer columns are stored as object (strings containing '$', 'km' and thousands separators) instead of integers, so the non-numeric characters need to be stripped before converting.
# regex=False treats '$' literally (as a regex it would match end-of-string)
autos['price'] = autos['price'].str.replace('$', '', regex=False).str.replace(',', '', regex=False)
autos['price'] = autos['price'].astype(int)
autos['odometer'] = autos['odometer'].str.replace('km', '', regex=False).str.replace(',', '', regex=False)
autos['odometer'] = autos['odometer'].astype(int)
autos.rename({'odometer': 'odometer_km'}, axis=1, inplace=True)
autos['price'].value_counts().sort_index(ascending=False).head(10)
99999999    1
27322222    1
12345678    3
11111111    2
10000000    1
3890000     1
1300000     1
1234566     1
999999      2
999990      1
Name: price, dtype: int64
autos_mod = autos[autos['price'].between(100,500000)]
After analysing the odometer_km and price columns, the odometer_km values look plausible given how long a car may have been owned and how far it has been driven. The price column, on the other hand, contains outliers: a used car is unlikely to cost less than $100 or more than $500,000. The rows within that range are selected and saved as autos_mod.
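One detail worth noting about the filter above: Series.between is inclusive on both ends by default, so listings priced exactly $100 or $500,000 are kept. A minimal sketch with hypothetical prices:

```python
import pandas as pd

# Hypothetical prices including the kinds of outliers seen above.
prices = pd.Series([0, 50, 100, 5000, 500000, 99999999])

# between() is inclusive at both bounds, so 100 and 500000 survive
# while the $0 junk and placeholder values are dropped.
kept = prices[prices.between(100, 500000)]
print(kept.tolist())  # -> [100, 5000, 500000]
```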
autos['date_crawled'].str[:10].value_counts(normalize=True, dropna=False).sort_index()
2016-03-05    0.02538
2016-03-06    0.01394
2016-03-07    0.03596
2016-03-08    0.03330
2016-03-09    0.03322
2016-03-10    0.03212
2016-03-11    0.03248
2016-03-12    0.03678
2016-03-13    0.01556
2016-03-14    0.03662
2016-03-15    0.03398
2016-03-16    0.02950
2016-03-17    0.03152
2016-03-18    0.01306
2016-03-19    0.03490
2016-03-20    0.03782
2016-03-21    0.03752
2016-03-22    0.03294
2016-03-23    0.03238
2016-03-24    0.02910
2016-03-25    0.03174
2016-03-26    0.03248
2016-03-27    0.03104
2016-03-28    0.03484
2016-03-29    0.03418
2016-03-30    0.03362
2016-03-31    0.03192
2016-04-01    0.03380
2016-04-02    0.03540
2016-04-03    0.03868
2016-04-04    0.03652
2016-04-05    0.01310
2016-04-06    0.00318
2016-04-07    0.00142
Name: date_crawled, dtype: float64
From the distribution above, we can see that the data was crawled between 5 March 2016 and 7 April 2016, with the last two days having the least amount of data crawled.
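The string slicing above works, but parsing the column once with pd.to_datetime would make the .dt accessors available for further date arithmetic. A minimal sketch with two hypothetical timestamps in the same format as the column:

```python
import pandas as pd

# Hypothetical timestamps in the same format as date_crawled.
dates = pd.Series(['2016-03-26 17:47:46', '2016-04-07 06:45:54'])

# Parse once; .dt then exposes dates, days, etc. without slicing.
parsed = pd.to_datetime(dates)
print(parsed.min().date(), parsed.max().date())
```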
autos['ad_created'].str[:10].value_counts(normalize=True, dropna=False).sort_index()
2015-06-11    0.00002
2015-08-10    0.00002
2015-09-09    0.00002
2015-11-10    0.00002
2015-12-05    0.00002
2015-12-30    0.00002
2016-01-03    0.00002
2016-01-07    0.00002
2016-01-10    0.00004
2016-01-13    0.00002
2016-01-14    0.00002
2016-01-16    0.00002
2016-01-22    0.00002
2016-01-27    0.00006
2016-01-29    0.00002
2016-02-01    0.00002
2016-02-02    0.00004
2016-02-05    0.00004
2016-02-07    0.00002
2016-02-08    0.00002
2016-02-09    0.00004
2016-02-11    0.00002
2016-02-12    0.00006
2016-02-14    0.00004
2016-02-16    0.00002
2016-02-17    0.00002
2016-02-18    0.00004
2016-02-19    0.00006
2016-02-20    0.00004
2016-02-21    0.00006
...
2016-03-09    0.03324
2016-03-10    0.03186
2016-03-11    0.03278
2016-03-12    0.03662
2016-03-13    0.01692
2016-03-14    0.03522
2016-03-15    0.03374
2016-03-16    0.03000
2016-03-17    0.03120
2016-03-18    0.01372
2016-03-19    0.03384
2016-03-20    0.03786
2016-03-21    0.03772
2016-03-22    0.03280
2016-03-23    0.03218
2016-03-24    0.02908
2016-03-25    0.03188
2016-03-26    0.03256
2016-03-27    0.03090
2016-03-28    0.03496
2016-03-29    0.03414
2016-03-30    0.03344
2016-03-31    0.03192
2016-04-01    0.03380
2016-04-02    0.03508
2016-04-03    0.03892
2016-04-04    0.03688
2016-04-05    0.01184
2016-04-06    0.00326
2016-04-07    0.00128
Name: ad_created, Length: 76, dtype: float64
Analyzing the dates the listings were created shows that they range from 11 June 2015 to 7 April 2016. The majority of the listings were created during the period in which the data was crawled.
autos['last_seen'].str[:10].value_counts(normalize=True, dropna=False).sort_index()
2016-03-05    0.00108
2016-03-06    0.00442
2016-03-07    0.00536
2016-03-08    0.00760
2016-03-09    0.00986
2016-03-10    0.01076
2016-03-11    0.01252
2016-03-12    0.02382
2016-03-13    0.00898
2016-03-14    0.01280
2016-03-15    0.01588
2016-03-16    0.01644
2016-03-17    0.02792
2016-03-18    0.00742
2016-03-19    0.01574
2016-03-20    0.02070
2016-03-21    0.02074
2016-03-22    0.02158
2016-03-23    0.01858
2016-03-24    0.01956
2016-03-25    0.01920
2016-03-26    0.01696
2016-03-27    0.01602
2016-03-28    0.02086
2016-03-29    0.02234
2016-03-30    0.02484
2016-03-31    0.02384
2016-04-01    0.02310
2016-04-02    0.02490
2016-04-03    0.02536
2016-04-04    0.02462
2016-04-05    0.12428
2016-04-06    0.22100
2016-04-07    0.13092
Name: last_seen, dtype: float64
Exploring the date the crawler last saw each ad online, we observe that the dates also span 5 March 2016 to 7 April 2016, with the majority of ads last seen in the final three days. That spike likely reflects the crawl ending rather than an actual surge in cars being sold on those days.
autos['registration_year'].describe()
count    50000.000000
mean      2005.073280
std        105.712813
min       1000.000000
25%       1999.000000
50%       2003.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64
The description above shows a minimum registration year of 1000 and a maximum of 9999, both of which are clearly erroneous: cars did not exist before the 1900s, and a car cannot be registered after the data was crawled in 2016.
autos['registration_year'].value_counts().sort_index()
1000       1
1001       1
1111       1
1500       1
1800       2
1910       9
1927       1
1929       1
1931       1
1934       2
1937       4
1938       1
1939       1
1941       2
1943       1
1948       1
1950       3
1951       2
1952       1
1953       1
1954       2
1955       2
1956       5
1957       2
1958       4
1959       7
1960      34
1961       6
1962       4
1963       9
...
2001    2703
2002    2533
2003    2727
2004    2737
2005    3015
2006    2708
2007    2304
2008    2231
2009    2098
2010    1597
2011    1634
2012    1323
2013     806
2014     666
2015     399
2016    1316
2017    1453
2018     492
2019       3
2800       1
4100       1
4500       1
4800       1
5000       4
5911       1
6200       1
8888       1
9000       2
9996       1
9999       4
Name: registration_year, Length: 97, dtype: int64
A closer look at the registration years suggests sensible bounds: any year before 1920 or after 2016 (the year the data was crawled) is treated as invalid and removed from the autos_mod dataset:
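Before dropping rows it is worth checking how many listings the bounds would remove. A minimal sketch with a handful of hypothetical years, using the same between() filter inverted:

```python
import pandas as pd

# Hypothetical registration years mimicking the outliers above.
years = pd.Series([1000, 1910, 1999, 2016, 2018, 9999])

# Rows falling outside the accepted 1920-2016 window.
outside = ~years.between(1920, 2016)
print(outside.sum(), 'rows would be dropped')
```

On the real data this check confirms only a small fraction of rows is lost, so filtering is safe.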
autos_mod = autos_mod[autos_mod['registration_year'].between(1920, 2016)]
autos_mod['registration_year'].value_counts(normalize=True).sort_values(ascending=False).head(30)
2000    0.066969
2005    0.062805
1999    0.062114
2004    0.058231
2003    0.058101
2006    0.057562
2001    0.056721
2002    0.053441
1998    0.050485
2007    0.049040
2008    0.047681
2009    0.044876
1997    0.041532
2011    0.034908
2010    0.034239
1996    0.029234
2012    0.028220
2016    0.025933
1995    0.025739
2013    0.017282
2014    0.014283
1994    0.013506
1993    0.009061
2015    0.008198
1992    0.007918
1991    0.007292
1990    0.007163
1989    0.003689
1988    0.002869
1985    0.002006
Name: registration_year, dtype: float64
After removing the values outside the upper and lower bounds, it can be seen that the majority of cars were registered between 1994 and 2016.
brands = autos_mod['brand'].value_counts(normalize=True).sort_values(ascending=False).head(20)
brands
volkswagen        0.211413
bmw               0.110183
opel              0.107228
mercedes_benz     0.096656
audi              0.086775
ford              0.069838
renault           0.047055
peugeot           0.029860
fiat              0.025609
seat              0.018252
skoda             0.016419
nissan            0.015340
mazda             0.015232
smart             0.014196
citroen           0.014045
toyota            0.012794
hyundai           0.010011
sonstige_autos    0.009536
volvo             0.009126
mini              0.008803
Name: brand, dtype: float64
The dataset contains 39 unique car brands in total, so let us aggregate over the top 20 brands by share of listings. We start by assigning those top 20 brands to brands, as above.
avg_price_per_brand = {}

for brand in brands.index:
    selected_brand = autos_mod[autos_mod['brand'] == brand]
    mean = selected_brand['price'].mean()
    avg_price_per_brand[brand] = int(mean)

sorted(avg_price_per_brand.items(), key=lambda kv: kv[1], reverse=True)
[('sonstige_autos', 12784), ('mini', 10639), ('audi', 9380), ('mercedes_benz', 8672), ('bmw', 8381), ('skoda', 6409), ('volkswagen', 5436), ('hyundai', 5411), ('toyota', 5167), ('volvo', 4993), ('nissan', 4756), ('seat', 4433), ('mazda', 4129), ('citroen', 3796), ('ford', 3779), ('smart', 3596), ('peugeot', 3113), ('opel', 3006), ('fiat', 2836), ('renault', 2496)]
From the aggregation above, we can see that sonstige_autos, mini and audi have the highest average prices, while opel, fiat and renault have the lowest. The high average for sonstige_autos ("other cars") may be a result of its lower share of listings.
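The dictionary loop above can also be expressed in a single groupby step, which avoids filtering the frame once per brand. A minimal sketch on a toy frame (hypothetical rows; the real autos_mod works the same way):

```python
import pandas as pd

# Toy listings standing in for autos_mod (hypothetical values).
autos_mod = pd.DataFrame({
    'brand': ['bmw', 'bmw', 'ford', 'ford'],
    'price': [9000, 7000, 4000, 3000],
})

# One groupby replaces the manual dictionary loop.
avg_price = (autos_mod.groupby('brand')['price']
             .mean()
             .sort_values(ascending=False))
print(avg_price)
```

The dictionary approach is kept in the notebook because it makes the aggregation logic explicit, but groupby is the idiomatic pandas equivalent.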
Let's calculate the average mileage per brand to see if there is a correlation between the price and the mileage. We will start by aggregating using a dictionary.
avg_mileage_per_brand = {}

for brand in brands.index:
    selected_row = autos_mod[autos_mod['brand'] == brand]
    mean = selected_row['odometer_km'].mean()
    avg_mileage_per_brand[brand] = int(mean)

sorted(avg_mileage_per_brand.items(), key=lambda kv: kv[1], reverse=True)
[('volvo', 138581), ('bmw', 132695), ('mercedes_benz', 131025), ('opel', 129380), ('audi', 129245), ('volkswagen', 128799), ('renault', 128337), ('peugeot', 127127), ('mazda', 124553), ('ford', 124277), ('seat', 121536), ('citroen', 119554), ('nissan', 118326), ('fiat', 116950), ('toyota', 115944), ('skoda', 110906), ('hyundai', 106885), ('smart', 99734), ('sonstige_autos', 90395), ('mini', 88308)]
bap_series = pd.Series(avg_price_per_brand)
bam_series = pd.Series(avg_mileage_per_brand)
bam_series
audi              129245
bmw               132695
citroen           119554
fiat              116950
ford              124277
hyundai           106885
mazda             124553
mercedes_benz     131025
mini               88308
nissan            118326
opel              129380
peugeot           127127
renault           128337
seat              121536
skoda             110906
smart              99734
sonstige_autos     90395
toyota            115944
volkswagen        128799
volvo             138581
dtype: int64
bpm_df = pd.DataFrame(bap_series, columns=['average_price'])
bpm_df['average_mileage'] = bam_series
bpm_df.sort_values('average_price', ascending=False)
average_price | average_mileage | |
---|---|---|
sonstige_autos | 12784 | 90395 |
mini | 10639 | 88308 |
audi | 9380 | 129245 |
mercedes_benz | 8672 | 131025 |
bmw | 8381 | 132695 |
skoda | 6409 | 110906 |
volkswagen | 5436 | 128799 |
hyundai | 5411 | 106885 |
toyota | 5167 | 115944 |
volvo | 4993 | 138581 |
nissan | 4756 | 118326 |
seat | 4433 | 121536 |
mazda | 4129 | 124553 |
citroen | 3796 | 119554 |
ford | 3779 | 124277 |
smart | 3596 | 99734 |
peugeot | 3113 | 127127 |
opel | 3006 | 129380 |
fiat | 2836 | 116950 |
renault | 2496 | 128337 |
bpm_df.sort_values('average_mileage', ascending=False)
average_price | average_mileage | |
---|---|---|
volvo | 4993 | 138581 |
bmw | 8381 | 132695 |
mercedes_benz | 8672 | 131025 |
opel | 3006 | 129380 |
audi | 9380 | 129245 |
volkswagen | 5436 | 128799 |
renault | 2496 | 128337 |
peugeot | 3113 | 127127 |
mazda | 4129 | 124553 |
ford | 3779 | 124277 |
seat | 4433 | 121536 |
citroen | 3796 | 119554 |
nissan | 4756 | 118326 |
fiat | 2836 | 116950 |
toyota | 5167 | 115944 |
skoda | 6409 | 110906 |
hyundai | 5411 | 106885 |
smart | 3596 | 99734 |
sonstige_autos | 12784 | 90395 |
mini | 10639 | 88308 |
From the analysis above, average price shows no direct correlation with average mileage: the premium brands (audi, bmw, mercedes_benz) combine high prices with high mileage, while the two most expensive groups (sonstige_autos and mini) have the lowest mileage of all.
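That reading can be backed with a number. The sketch below rebuilds part of the brand-level table from the printed values above (a subset of the 20 brands, so the coefficient will differ slightly from the full table) and computes the Pearson correlation between the two columns:

```python
import pandas as pd

# Subset of the brand-level averages printed above.
bpm_df = pd.DataFrame({
    'average_price':   [12784, 10639, 9380, 8672, 8381, 2836, 2496],
    'average_mileage': [90395, 88308, 129245, 131025, 132695, 116950, 128337],
}, index=['sonstige_autos', 'mini', 'audi', 'mercedes_benz', 'bmw',
          'fiat', 'renault'])

# Pearson correlation; values near 0 support the "no direct
# correlation" conclusion.
r = bpm_df['average_price'].corr(bpm_df['average_mileage'])
print(round(r, 3))
```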