In this project I will explore the vast used car data spanning from June 2015 to April 2016, available at eBay Kleinanzeigen, to answer the following questions:
To accomplish this I will do the following:
import pandas as pd
import numpy as np
autos = pd.read_csv('autos.csv', encoding='Latin-1')
print(autos.info())
print(autos.head())
print(autos.tail())
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): dateCrawled 50000 non-null object name 50000 non-null object seller 50000 non-null object offerType 50000 non-null object price 50000 non-null object abtest 50000 non-null object vehicleType 44905 non-null object yearOfRegistration 50000 non-null int64 gearbox 47320 non-null object powerPS 50000 non-null int64 model 47242 non-null object odometer 50000 non-null object monthOfRegistration 50000 non-null int64 fuelType 45518 non-null object brand 50000 non-null object notRepairedDamage 40171 non-null object dateCreated 50000 non-null object nrOfPictures 50000 non-null int64 postalCode 50000 non-null int64 lastSeen 50000 non-null object dtypes: int64(5), object(15) memory usage: 7.6+ MB None dateCrawled name \ 0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD 1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik 2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United 3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... 4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... seller offerType price abtest vehicleType yearOfRegistration \ 0 privat Angebot $5,000 control bus 2004 1 privat Angebot $8,500 control limousine 1997 2 privat Angebot $8,990 test limousine 2009 3 privat Angebot $4,350 control kleinwagen 2007 4 privat Angebot $1,350 test kombi 2003 gearbox powerPS model odometer monthOfRegistration fuelType \ 0 manuell 158 andere 150,000km 3 lpg 1 automatik 286 7er 150,000km 6 benzin 2 manuell 102 golf 70,000km 7 benzin 3 automatik 71 fortwo 70,000km 6 benzin 4 manuell 0 focus 150,000km 7 benzin brand notRepairedDamage dateCreated nrOfPictures \ 0 peugeot nein 2016-03-26 00:00:00 0 1 bmw nein 2016-04-04 00:00:00 0 2 volkswagen nein 2016-03-26 00:00:00 0 3 smart nein 2016-03-12 00:00:00 0 4 ford nein 2016-04-01 00:00:00 0 postalCode lastSeen 0 79588 2016-04-06 06:45:54 1 71034 2016-04-06 14:45:08 2 35394 2016-04-06 20:15:37 3 33729 2016-03-15 03:16:28 4 39218 2016-04-01 14:38:50 dateCrawled name \ 49995 2016-03-27 14:38:19 Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon 49996 2016-03-28 10:50:25 Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... 49997 2016-04-02 14:44:48 Fiat_500_C_1.2_Dualogic_Lounge 49998 2016-03-08 19:25:42 Audi_A3_2.0_TDI_Sportback_Ambition 49999 2016-03-14 00:42:12 Opel_Vectra_1.6_16V seller offerType price abtest vehicleType yearOfRegistration \ 49995 privat Angebot $24,900 control limousine 2011 49996 privat Angebot $1,980 control cabrio 1996 49997 privat Angebot $13,200 test cabrio 2014 49998 privat Angebot $22,900 control kombi 2013 49999 privat Angebot $1,250 control limousine 1996 gearbox powerPS model odometer monthOfRegistration fuelType \ 49995 automatik 239 q5 100,000km 1 diesel 49996 manuell 75 astra 150,000km 5 benzin 49997 automatik 69 500 5,000km 11 benzin 49998 manuell 150 a3 40,000km 11 diesel 49999 manuell 101 vectra 150,000km 1 benzin brand notRepairedDamage dateCreated nrOfPictures postalCode \ 49995 audi nein 2016-03-27 00:00:00 0 82131 49996 opel nein 2016-03-28 00:00:00 0 44807 49997 fiat nein 2016-04-02 00:00:00 0 73430 49998 audi nein 2016-03-08 00:00:00 0 35683 49999 opel nein 2016-03-13 00:00:00 0 45897 lastSeen 49995 2016-04-01 13:47:40 49996 2016-04-02 14:18:02 49997 2016-04-04 11:47:27 49998 2016-04-05 16:45:07 49999 2016-04-06 21:18:48
autos.describe()
yearOfRegistration | powerPS | monthOfRegistration | nrOfPictures | postalCode | |
---|---|---|---|---|---|
count | 50000.000000 | 50000.000000 | 50000.000000 | 50000.0 | 50000.000000 |
mean | 2005.073280 | 116.355920 | 5.723360 | 0.0 | 50813.627300 |
std | 105.712813 | 209.216627 | 3.711984 | 0.0 | 25779.747957 |
min | 1000.000000 | 0.000000 | 0.000000 | 0.0 | 1067.000000 |
25% | 1999.000000 | 70.000000 | 3.000000 | 0.0 | 30451.000000 |
50% | 2003.000000 | 105.000000 | 6.000000 | 0.0 | 49577.000000 |
75% | 2008.000000 | 150.000000 | 9.000000 | 0.0 | 71540.000000 |
max | 9999.000000 | 17700.000000 | 12.000000 | 0.0 | 99998.000000 |
In this brief exploration of the data I can see the following:
notRepairedDamage
has 40171 non-null)Below I will do the following:
autos.columns
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model', 'odometer', 'monthOfRegistration', 'fuelType', 'brand', 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object')
autos_copy = autos.copy()
columns = autos_copy.columns
mapping_columns = {
'dateCrawled':'date_crawled',
'name':'name',
'seller':'seller',
'offerType':'offer_type',
'price':'price',
'abtest':'abtest',
'vehicleType':'vehicle_type',
'yearOfRegistration':'registration_year',
'gearbox':'gearbox',
'powerPS':'power_ps',
'model':'model',
'odometer':'odometer',
'monthOfRegistration':'registration_month',
'fuelType':'fuel_type',
'brand':'brand',
'notRepairedDamage':'unrepaired_damage',
'dateCreated':'ad_created',
'nrOfPictures':'nr_of_pictures',
'postalCode':'postal_code',
'lastSeen':'last_seen'
}
autos_copy.columns = pd.Series(columns).map(mapping_columns)
autos_copy.head()
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
autos.columns = autos_copy.columns
autos.head()
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
autos.describe(include='all')
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-03-09 11:54:38 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
autos['offer_type'].value_counts()
Angebot 49999 Gesuch 1 Name: offer_type, dtype: int64
From my previous exploration of the price
column as described by the autos.info()
output, I can see that the data is expressed as strings.
I will further explore the price
data looking for the following:
I will clean that data by doing the following:
renaming the column, and/or dropping values as nan
, if necessary
It will also be important to determine if the data uses one unit of measure to convert the values into integers
autos['price'].describe()
count 50000 unique 2357 top $0 freq 1421 Name: price, dtype: object
print(autos['price'].dtype)
print(autos['price'].unique())
unique_price = autos['price'].unique()
print(unique_price)
object ['$5,000' '$8,500' '$8,990' ... '$385' '$22,200' '$16,995'] ['$5,000' '$8,500' '$8,990' ... '$385' '$22,200' '$16,995']
autos['price'] = autos["price"].str.replace('$','')
autos['price'] = autos["price"].str.replace(',','')
unique_price = autos["price"].unique()
print(unique_price)
['5000' '8500' '8990' ... '385' '22200' '16995']
autos['price'] = autos['price'].astype(int)
dtypes = autos.dtypes
price_unique = autos['price'].unique()
print(price_unique)
[ 5000 8500 8990 ... 385 22200 16995]
autos['price'].unique().shape
(2357,)
autos['price'].describe()
count 5.000000e+04 mean 9.840044e+03 std 4.811044e+05 min 0.000000e+00 25% 1.100000e+03 50% 2.950000e+03 75% 7.200000e+03 max 1.000000e+08 Name: price, dtype: float64
print(autos["price"].value_counts().sort_index().head(10))
print("\n")
print(autos["price"].value_counts().sort_index().tail(10))
0 1421 1 156 2 3 3 1 5 2 8 1 9 1 10 7 11 2 12 3 Name: price, dtype: int64 999990 1 999999 2 1234566 1 1300000 1 3890000 1 10000000 1 11111111 2 12345678 3 27322222 1 99999999 1 Name: price, dtype: int64
price_bool = (autos["price"] < 500 ) | (autos["price"] > 250000)
autos.loc[price_bool,"price"] = np.nan
print(autos["price"].unique().shape)
autos["price"].describe()
(2203,)
count 45091.000000 mean 6281.293074 std 8601.170170 min 500.000000 25% 1500.000000 50% 3500.000000 75% 7900.000000 max 250000.000000 Name: price, dtype: float64
autos['price'].value_counts().head()
500.0 781 1500.0 734 2500.0 643 1200.0 639 1000.0 639 Name: price, dtype: int64
autos['price'].sort_index(ascending=True).head()
0 5000.0 1 8500.0 2 8990.0 3 4350.0 4 1350.0 Name: price, dtype: float64
autos['price'].sort_index(ascending=False).head()
49999 1250.0 49998 22900.0 49997 13200.0 49996 1980.0 49995 24900.0 Name: price, dtype: float64
From my exploration of the price
column, I can see the following:
From my previous exploration of the odometer
column as described by the autos.info()
output, I can see that the data is expressed as strings as was the price
data above.
I will further explore the odometer
data looking for the following:
I will clean that data by doing the following:
renaming the column, and/or dropping values, if necessary
It will also be important to determine if the data uses one unit of measure to convert the values into integers
autos['odometer'].describe()
count 50000 unique 13 top 150,000km freq 32424 Name: odometer, dtype: object
print(autos['odometer'].dtype)
print(autos['odometer'].unique())
unique_price = autos['odometer'].unique()
print(unique_price)
object ['150,000km' '70,000km' '50,000km' '80,000km' '10,000km' '30,000km' '125,000km' '90,000km' '20,000km' '60,000km' '5,000km' '100,000km' '40,000km'] ['150,000km' '70,000km' '50,000km' '80,000km' '10,000km' '30,000km' '125,000km' '90,000km' '20,000km' '60,000km' '5,000km' '100,000km' '40,000km']
autos['odometer'] = autos["odometer"].str.replace('km','')
autos['odometer'] = autos["odometer"].str.replace(',','')
unique_odometer = autos["odometer"].unique()
print(unique_odometer)
['150000' '70000' '50000' '80000' '10000' '30000' '125000' '90000' '20000' '60000' '5000' '100000' '40000']
autos['odometer'] = autos['odometer'].astype(int)
dtypes = autos.dtypes
odometer_unique = autos['odometer'].unique()
print(odometer_unique)
[150000 70000 50000 80000 10000 30000 125000 90000 20000 60000 5000 100000 40000]
autos_copy2 = autos.rename(columns={"odometer": "odometer_km"})
autos = autos_copy2
autos.head()
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | 5000.0 | control | bus | 2004 | manuell | 158 | andere | 150000 | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | 8500.0 | control | limousine | 1997 | automatik | 286 | 7er | 150000 | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | 8990.0 | test | limousine | 2009 | manuell | 102 | golf | 70000 | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | 4350.0 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70000 | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | 1350.0 | test | kombi | 2003 | manuell | 0 | focus | 150000 | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
autos['odometer_km'].unique().shape
(13,)
autos['odometer_km'].describe()
count 50000.000000 mean 125732.700000 std 40042.211706 min 5000.000000 25% 125000.000000 50% 150000.000000 75% 150000.000000 max 150000.000000 Name: odometer_km, dtype: float64
autos['odometer_km'].value_counts().head(13)
150000 32424 125000 5170 100000 2169 90000 1757 80000 1436 70000 1230 60000 1164 50000 1027 5000 967 40000 819 30000 789 20000 784 10000 264 Name: odometer_km, dtype: int64
autos['odometer_km'].sort_index(ascending=True).head()
0 150000 1 150000 2 70000 3 70000 4 150000 Name: odometer_km, dtype: int64
autos['odometer_km'].sort_index(ascending=False).head()
49999 150000 49998 40000 49997 5000 49996 150000 49995 100000 Name: odometer_km, dtype: int64
From my exploration of the odometer_km
column, I can see the following:
int
information and renamed the column to specify the unit measureFrom my previous exploration of the date_crawled
ad_created
last_seen
columns as described by the autos.info()
output, I can see that the data is expressed as strings as was the odometer_km
data above.
I will further explore the date data looking for the following:
I will clean that data by doing the following:
renaming the column, and/or dropping values, if necessary
It will also be important to find a way to represent the dara quatitatively to better be able to explore and analyze it
autos[['date_crawled','ad_created','last_seen']][0:5]
date_crawled | ad_created | last_seen | |
---|---|---|---|
0 | 2016-03-26 17:47:46 | 2016-03-26 00:00:00 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | 2016-04-04 00:00:00 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | 2016-03-26 00:00:00 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | 2016-03-12 00:00:00 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | 2016-04-01 00:00:00 | 2016-04-01 14:38:50 |
autos['date_crawled'].str[:10].value_counts(normalize=True, dropna=False).head(10)
2016-04-03 0.03868 2016-03-20 0.03782 2016-03-21 0.03752 2016-03-12 0.03678 2016-03-14 0.03662 2016-04-04 0.03652 2016-03-07 0.03596 2016-04-02 0.03540 2016-03-19 0.03490 2016-03-28 0.03484 Name: date_crawled, dtype: float64
autos['date_crawled'].str[:10].value_counts(normalize=True, dropna=False).sort_index().head()
2016-03-05 0.02538 2016-03-06 0.01394 2016-03-07 0.03596 2016-03-08 0.03330 2016-03-09 0.03322 Name: date_crawled, dtype: float64
autos['date_crawled'].str[:10].value_counts(normalize=True, dropna=False).sort_index(ascending=False).head()
2016-04-07 0.00142 2016-04-06 0.00318 2016-04-05 0.01310 2016-04-04 0.03652 2016-04-03 0.03868 Name: date_crawled, dtype: float64
autos['date_crawled'].str[:10].sort_index(ascending=True).head()
0 2016-03-26 1 2016-04-04 2 2016-03-26 3 2016-03-12 4 2016-04-01 Name: date_crawled, dtype: object
autos['date_crawled'].str[:10].sort_index(ascending=False).head()
49999 2016-03-14 49998 2016-03-08 49997 2016-04-02 49996 2016-03-28 49995 2016-03-27 Name: date_crawled, dtype: object
From my exploration of the date_crawled
column, I can see the following:
autos['ad_created'].str[:10].value_counts(normalize=True, dropna=False).head(10)
2016-04-03 0.03892 2016-03-20 0.03786 2016-03-21 0.03772 2016-04-04 0.03688 2016-03-12 0.03662 2016-03-14 0.03522 2016-04-02 0.03508 2016-03-28 0.03496 2016-03-07 0.03474 2016-03-29 0.03414 Name: ad_created, dtype: float64
autos['ad_created'].str[:10].value_counts(normalize=True, dropna=False).sort_index().head()
2015-06-11 0.00002 2015-08-10 0.00002 2015-09-09 0.00002 2015-11-10 0.00002 2015-12-05 0.00002 Name: ad_created, dtype: float64
autos['ad_created'].str[:10].value_counts(normalize=True, dropna=False).sort_index(ascending=False).head()
2016-04-07 0.00128 2016-04-06 0.00326 2016-04-05 0.01184 2016-04-04 0.03688 2016-04-03 0.03892 Name: ad_created, dtype: float64
autos['ad_created'].str[:10].sort_index(ascending=True).head()
0 2016-03-26 1 2016-04-04 2 2016-03-26 3 2016-03-12 4 2016-04-01 Name: ad_created, dtype: object
autos['ad_created'].str[:10].sort_index(ascending=False).head()
49999 2016-03-13 49998 2016-03-08 49997 2016-04-02 49996 2016-03-28 49995 2016-03-27 Name: ad_created, dtype: object
From my exploration of the ad_created
column, I can see the following:
autos['last_seen'].str[:10].value_counts(normalize=True, dropna=False).head(10)
2016-04-06 0.22100 2016-04-07 0.13092 2016-04-05 0.12428 2016-03-17 0.02792 2016-04-03 0.02536 2016-04-02 0.02490 2016-03-30 0.02484 2016-04-04 0.02462 2016-03-31 0.02384 2016-03-12 0.02382 Name: last_seen, dtype: float64
autos['last_seen'].str[:10].value_counts(normalize=True, dropna=False).sort_index().head()
2016-03-05 0.00108 2016-03-06 0.00442 2016-03-07 0.00536 2016-03-08 0.00760 2016-03-09 0.00986 Name: last_seen, dtype: float64
autos['last_seen'].str[:10].value_counts(normalize=True, dropna=False).sort_index(ascending=False).head()
2016-04-07 0.13092 2016-04-06 0.22100 2016-04-05 0.12428 2016-04-04 0.02462 2016-04-03 0.02536 Name: last_seen, dtype: float64
autos['last_seen'].str[:10].sort_index(ascending=True).head()
0 2016-04-06 1 2016-04-06 2 2016-04-06 3 2016-03-15 4 2016-04-01 Name: last_seen, dtype: object
autos['last_seen'].str[:10].sort_index(ascending=False).head()
49999 2016-04-06 49998 2016-04-05 49997 2016-04-04 49996 2016-04-02 49995 2016-04-01 Name: last_seen, dtype: object
From my exploration of the last_seen
column, I can see the following:
last_seen
dates hold more proportion of the last views in more recent dates.autos['registration_year'].describe()
count 50000.000000 mean 2005.073280 std 105.712813 min 1000.000000 25% 1999.000000 50% 2003.000000 75% 2008.000000 max 9999.000000 Name: registration_year, dtype: float64
From my exploration of the registration_year
column, I can see the following:
Both are unrealistic numbers. I will limit registration to the time where cars were invented in the early 1900s, to 2016, the latest a car must've been registered to have made it accurately listed on the ad.
print(autos.loc[(autos["registration_year"] > 2016 ) | (autos["registration_year"] < 1900) , "registration_year"])
10 2017 55 2017 65 2017 68 2017 84 2018 113 2017 164 2018 197 2017 253 2017 348 2017 390 2018 438 2017 453 4500 454 2017 457 2017 477 2017 545 2017 548 2017 557 2017 568 2017 577 2017 582 2017 654 2017 740 2017 754 2017 758 2017 765 2017 802 2018 871 2017 889 2017 ... 49154 2017 49178 2018 49185 2019 49218 2018 49245 2017 49259 2017 49261 2017 49262 2018 49266 2017 49283 1001 49342 2017 49343 2017 49347 2017 49354 2018 49389 2017 49411 2018 49522 2017 49557 2017 49561 2017 49653 2017 49662 2017 49689 2017 49696 2017 49731 2017 49770 2018 49796 2017 49841 2017 49880 2017 49910 9000 49935 2017 Name: registration_year, Length: 1972, dtype: int64
autos.loc[(autos["registration_year"] > 2016 ) | (autos["registration_year"] < 1900) , "registration_year"] = np.nan
autos["registration_year"].value_counts(normalize=True,dropna=True).head(10)
2000.0 0.069834 2005.0 0.062776 1999.0 0.062464 2004.0 0.056988 2003.0 0.056779 2006.0 0.056384 2001.0 0.056280 2002.0 0.052740 1998.0 0.051074 2007.0 0.047972 Name: registration_year, dtype: float64
autos['registration_year'].value_counts(normalize=True, dropna=False)
2000.0 0.06708 2005.0 0.06030 1999.0 0.06000 2004.0 0.05474 2003.0 0.05454 2006.0 0.05416 2001.0 0.05406 2002.0 0.05066 1998.0 0.04906 2007.0 0.04608 2008.0 0.04462 2009.0 0.04196 1997.0 0.04056 NaN 0.03944 2011.0 0.03268 2010.0 0.03194 1996.0 0.02888 2012.0 0.02646 2016.0 0.02632 1995.0 0.02626 2013.0 0.01612 2014.0 0.01332 1994.0 0.01320 1993.0 0.00890 2015.0 0.00798 1990.0 0.00790 1992.0 0.00782 1991.0 0.00712 1989.0 0.00362 1988.0 0.00284 ... 1977.0 0.00044 1966.0 0.00044 1975.0 0.00038 1969.0 0.00038 1965.0 0.00034 1964.0 0.00024 1910.0 0.00018 1963.0 0.00018 1959.0 0.00014 1961.0 0.00012 1956.0 0.00010 1958.0 0.00008 1962.0 0.00008 1937.0 0.00008 1950.0 0.00006 1951.0 0.00004 1954.0 0.00004 1957.0 0.00004 1941.0 0.00004 1955.0 0.00004 1934.0 0.00004 1953.0 0.00002 1952.0 0.00002 1948.0 0.00002 1939.0 0.00002 1938.0 0.00002 1931.0 0.00002 1929.0 0.00002 1927.0 0.00002 1943.0 0.00002 Name: registration_year, Length: 79, dtype: float64
autos['registration_year'].describe()
count 48028.00000 mean 2002.80351 std 7.31085 min 1910.00000 25% 1999.00000 50% 2003.00000 75% 2008.00000 max 2016.00000 Name: registration_year, dtype: float64
From my new exploration of the registration_year
column, I can see the following:
Given that the statistics for this data wasn't affected significantly and represents a more realistic presence of cars in history, I will continue my analysis given the boudaries I've set for registration year.
I will begin my analysis of this data by aggregating the brand
, price
and milage
columns to explore the kinds of cars that are available, their prices relative to each other, and their usage to date. Analyzing this information will better inform any potential purchase.
autos[['brand','price','odometer_km']][0:10]
brand | price | odometer_km | |
---|---|---|---|
0 | peugeot | 5000.0 | 150000 |
1 | bmw | 8500.0 | 150000 |
2 | volkswagen | 8990.0 | 70000 |
3 | smart | 4350.0 | 70000 |
4 | ford | 1350.0 | 150000 |
5 | chrysler | 7900.0 | 150000 |
6 | volkswagen | NaN | 150000 |
7 | volkswagen | 1990.0 | 150000 |
8 | seat | NaN | 150000 |
9 | renault | 590.0 | 150000 |
autos["brand"].describe()
count 50000 unique 40 top volkswagen freq 10687 Name: brand, dtype: object
autos["brand"].notnull().value_counts()
True 50000 Name: brand, dtype: int64
autos['brand'].unique()
array(['peugeot', 'bmw', 'volkswagen', 'smart', 'ford', 'chrysler', 'seat', 'renault', 'mercedes_benz', 'audi', 'sonstige_autos', 'opel', 'mazda', 'porsche', 'mini', 'toyota', 'dacia', 'nissan', 'jeep', 'saab', 'volvo', 'mitsubishi', 'jaguar', 'fiat', 'skoda', 'subaru', 'kia', 'citroen', 'chevrolet', 'hyundai', 'honda', 'daewoo', 'suzuki', 'trabant', 'land_rover', 'alfa_romeo', 'lada', 'rover', 'daihatsu', 'lancia'], dtype=object)
autos['brand'].value_counts()
volkswagen 10687 opel 5461 bmw 5429 mercedes_benz 4734 audi 4283 ford 3479 renault 2404 peugeot 1456 fiat 1308 seat 941 skoda 786 mazda 757 nissan 754 smart 701 citroen 701 toyota 617 sonstige_autos 546 hyundai 488 volvo 457 mini 424 mitsubishi 406 honda 399 kia 356 alfa_romeo 329 porsche 294 suzuki 293 chevrolet 283 chrysler 181 dacia 129 daihatsu 128 jeep 110 subaru 109 land_rover 99 saab 80 daewoo 79 trabant 78 jaguar 77 rover 69 lancia 57 lada 31 Name: brand, dtype: int64
autos["brand"].value_counts(normalize = True).head(10)
volkswagen 0.21374 opel 0.10922 bmw 0.10858 mercedes_benz 0.09468 audi 0.08566 ford 0.06958 renault 0.04808 peugeot 0.02912 fiat 0.02616 seat 0.01882 Name: brand, dtype: float64
From my new exploration of the brand
column, I can see the following:
To further explore the top 10 brands, I will aggregate their price
data averages.
autos["brand"].value_counts().index[0:10]
Index(['volkswagen', 'opel', 'bmw', 'mercedes_benz', 'audi', 'ford', 'renault', 'peugeot', 'fiat', 'seat'], dtype='object')
brand_10_price = {}
for index in autos["brand"].value_counts().index[0:10]:
subset = autos.loc[autos["brand"] == index,"price"]
brand_price = subset.sum()/subset.count()
brand_10_price[index] = int(brand_price)
for index in sorted(brand_10_price,key=brand_10_price.get, reverse = True):
print(index,":",brand_10_price[index])
audi : 9484 mercedes_benz : 8670 bmw : 8465 volkswagen : 5701 seat : 4709 ford : 4267 opel : 3348 peugeot : 3329 fiat : 3211 renault : 2760
From my new exploration of the top 10 brand
and price
aggregation, I can see the following:
Below, I will explore the brand
and odometer_km
averages in aggregate.
brand_10_milage = {}
for index in autos["brand"].value_counts().index[0:10]:
subset = autos.loc[autos["brand"] == index,"odometer_km"]
brand_milage = subset.sum()/subset.count()
brand_10_milage[index] = int(brand_milage)
for index in sorted(brand_10_milage,key=brand_10_milage.get, reverse = True):
print(index,":",brand_10_milage[index])
bmw : 132521 mercedes_benz : 130886 audi : 129643 opel : 129298 volkswagen : 128955 renault : 128223 peugeot : 127352 ford : 124131 seat : 122061 fiat : 117037
This display of both aggregated series objects has a few limitations:
brand
, of both series objects so we can easily make visual comparisonsAs a result, I will combine the data from both series objects into a single dataframe, with a shared index, and display the dataframe directly.
bmp_series = pd.Series(brand_10_price)
print(bmp_series)
volkswagen 5701 opel 3348 bmw 8465 mercedes_benz 8670 audi 9484 ford 4267 renault 2760 peugeot 3329 fiat 3211 seat 4709 dtype: int64
df = pd.DataFrame(bmp_series, columns=['mean_price'])
print(df)
mean_price volkswagen 5701 opel 3348 bmw 8465 mercedes_benz 8670 audi 9484 ford 4267 renault 2760 peugeot 3329 fiat 3211 seat 4709
bmd_km_series = pd.Series(brand_10_milage)
print(bmd_km_series)
volkswagen 128955 opel 129298 bmw 132521 mercedes_benz 130886 audi 129643 ford 124131 renault 128223 peugeot 127352 fiat 117037 seat 122061 dtype: int64
df['mean_km'] = bmd_km_series
print(df)
mean_price mean_km volkswagen 5701 128955 opel 3348 129298 bmw 8465 132521 mercedes_benz 8670 130886 audi 9484 129643 ford 4267 124131 renault 2760 128223 peugeot 3329 127352 fiat 3211 117037 seat 4709 122061
Interestingly, the top 3 most expensive cars on average are also those with the most amount of milage on average.
Brand | Price (dollars) | Milage (km) |
---|---|---|
Audi | 9484 | 129643 |
Mercedes | 8670 | 130886 |
BMW | 8465 | 132521 |
I have briefly analyzed the vast eBay car listing data available at eBay Kleinanzeigen. My analysis is the follwing:
On average, the top 10 most expensive car listings are from the following brands:
Brand | Price (dollars) |
---|---|
audi | 9484 |
mercedes_benz | 8670 |
bmw | 8465 |
volkswagen | 5701 |
seat | 4709 |
ford | 4267 |
opel | 3348 |
peugeot | 3329 |
fiat | 3211 |
renault | 2760 |
It is important to note that 9 out of the top 10, are European brands. The American brand Ford joins the top 10 most expensive listings at number 6 with an average of 4,267 dollars. This data appears to correlate with the cost of new cars for these brands. As a result, some important questions to ask are the following:
On average, the top 10 most expensive car listings had the following average milage per brand:
Brand | Milage (km) |
---|---|
bmw | 132521 |
mercedes_benz | 130886 |
audi | 129643 |
opel | 129298 |
volkswagen | 128955 |
renault | 128223 |
peugeot | 127352 |
ford | 124131 |
seat | 122061 |
fiat | 117037 |
As I mentioned above, the top 3 of those expenive car brands with most milage on average also listed the highest prices. Overall, the range of milage of these 10 brands was 15,484 km, ranging from 117,037 km to 132,521 km. The cars listing the most milage on average don't list lower prices. The oposite is true. As a result, there is no clear correlation between car milage and prices from my analysis of the top 10 most expensive brands. This provides an opportunity for further analysis to ask the following:
This data correlates with the assumption that typically expensive car brands will sell at higher prices than less expensive brands, and in this case despite small variations in milage averages. This data does not showcase a correlation with higher car milage averages and lower car prices. This may be because the milage averages in my analysis are not significant enough to showcase differences in price listings. As a reault, it asks for further analysis to potentially find correlations in this assumption.
This data does provide a potential buyer with the opportunity to make an educated purchase on eBay, assuming that the data is fully accurate. A potential buyer may want to further explore the questions I have provided in this conclusion as well as look further into the repair data available, to make a more educated decision.