This is a dataset of scraped data from ebay website with some interesting features about every fairly used and available vehicle for sale as at the time the spider crawled the ebay website to collect the informations.
import pandas as pd
import numpy as np
autos = pd.read_csv('autos.csv', encoding='Latin-1')
The dataset consist of 50000 rows and 20 columns. There are 2 distinct data types in the original dataset which are the object(15) and int64(5) data types in the column section.
autos.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 dateCrawled 50000 non-null object 1 name 50000 non-null object 2 seller 50000 non-null object 3 offerType 50000 non-null object 4 price 50000 non-null object 5 abtest 50000 non-null object 6 vehicleType 44905 non-null object 7 yearOfRegistration 50000 non-null int64 8 gearbox 47320 non-null object 9 powerPS 50000 non-null int64 10 model 47242 non-null object 11 odometer 50000 non-null object 12 monthOfRegistration 50000 non-null int64 13 fuelType 45518 non-null object 14 brand 50000 non-null object 15 notRepairedDamage 40171 non-null object 16 dateCreated 50000 non-null object 17 nrOfPictures 50000 non-null int64 18 postalCode 50000 non-null int64 19 lastSeen 50000 non-null object dtypes: int64(5), object(15) memory usage: 7.6+ MB
For a quick overview of the dataset, we use the DataFrame.head() to check the first 5 rows of the dataset by default. We can quickly see how the values and columns are structured.
autos.head()
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
autos.columns.to_series().groupby(autos.dtypes).groups
{dtype('int64'): Index(['yearOfRegistration', 'powerPS', 'monthOfRegistration', 'nrOfPictures', 'postalCode'], dtype='object'), dtype('O'): Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'gearbox', 'model', 'odometer', 'fuelType', 'brand', 'notRepairedDamage', 'dateCreated', 'lastSeen'], dtype='object')}
autos.isnull().sum()
dateCrawled 0 name 0 seller 0 offerType 0 price 0 abtest 0 vehicleType 5095 yearOfRegistration 0 gearbox 2680 powerPS 0 model 2758 odometer 0 monthOfRegistration 0 fuelType 4482 brand 0 notRepairedDamage 9829 dateCreated 0 nrOfPictures 0 postalCode 0 lastSeen 0 dtype: int64
The vehicleType, gearbox, model, fuelType and notRepairedDamage all indicate the presense of Null or invalid data entries.
autos.columns
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model', 'odometer', 'monthOfRegistration', 'fuelType', 'brand', 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object')
The original dataset column were recorded using the very common camelCase format. However, in our analysis, we prefer the snakeCase by separating each word by an underscore. We do this by making a copy of the columns and make a manual editing of each appropriate column name
auto_column = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
'vehicle_type', 'registration_year', 'gearbox', 'power_PS', 'model',
'odometer', 'registration_month', 'fuel_type', 'brand',
'unrepaired_damage', 'ad_created', 'picture_numb', 'postal_code',
'last_seen']
autos.columns = auto_column
autos.drop('picture_numb', axis=1, inplace=True)
autos['price'] = autos['price'].str.replace('$', '')
autos['price'] = autos['price'].str.replace(',', '')
autos['price'] = autos['price'].astype(int)
autos['odometer'] = autos['odometer'].str.replace('km','')
autos['odometer'] = autos['odometer'].str.replace(',','')
autos['odometer'] = autos['odometer'].astype(int)
autos.rename({'odometer':'odometer_km'}, axis=1, inplace=True)
autos.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 19 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 date_crawled 50000 non-null object 1 name 50000 non-null object 2 seller 50000 non-null object 3 offer_type 50000 non-null object 4 price 50000 non-null int64 5 abtest 50000 non-null object 6 vehicle_type 44905 non-null object 7 registration_year 50000 non-null int64 8 gearbox 47320 non-null object 9 power_PS 50000 non-null int64 10 model 47242 non-null object 11 odometer_km 50000 non-null int64 12 registration_month 50000 non-null int64 13 fuel_type 45518 non-null object 14 brand 50000 non-null object 15 unrepaired_damage 40171 non-null object 16 ad_created 50000 non-null object 17 postal_code 50000 non-null int64 18 last_seen 50000 non-null object dtypes: int64(6), object(13) memory usage: 7.2+ MB
Perform a brief statistical preview of the dataset which include both numerical and categorical data types. This can summarilly be done using the DataFrame.describe() to look at basic statistical summary. It can be seen from the cell below that, only the numerical columns show up in our description by default. The most common statistical estimates here include the mean, min, max, 25%, 50% and 75% of each numerical column. The picture_numb column seems redundant and its a likely candidate to be dropped as most of its descriptions are 0s. The registration_year column has a strange value at the min and max estimate. The registration_month and power_PS also have invalid entry in the min estimate.
autos.describe()
price | registration_year | power_PS | odometer_km | registration_month | postal_code | |
---|---|---|---|---|---|---|
count | 5.000000e+04 | 50000.000000 | 50000.000000 | 50000.000000 | 50000.000000 | 50000.000000 |
mean | 9.840044e+03 | 2005.073280 | 116.355920 | 125732.700000 | 5.723360 | 50813.627300 |
std | 4.811044e+05 | 105.712813 | 209.216627 | 40042.211706 | 3.711984 | 25779.747957 |
min | 0.000000e+00 | 1000.000000 | 0.000000 | 5000.000000 | 0.000000 | 1067.000000 |
25% | 1.100000e+03 | 1999.000000 | 70.000000 | 125000.000000 | 3.000000 | 30451.000000 |
50% | 2.950000e+03 | 2003.000000 | 105.000000 | 150000.000000 | 6.000000 | 49577.000000 |
75% | 7.200000e+03 | 2008.000000 | 150.000000 | 150000.000000 | 9.000000 | 71540.000000 |
max | 1.000000e+08 | 9999.000000 | 17700.000000 | 150000.000000 | 12.000000 | 99998.000000 |
autos['price'].value_counts()
0 1421 500 781 1500 734 2500 643 1000 639 ... 20790 1 8970 1 846 1 2895 1 33980 1 Name: price, Length: 2357, dtype: int64
The price column has 1421 $0 entries. This obviously seems like an error. To perform analysis with this column, it is logical to get rid of such erroneous entries.
autos = autos.loc[autos['price']>0]
autos['price'].describe()
count 4.857900e+04 mean 1.012788e+04 std 4.880873e+05 min 1.000000e+00 25% 1.200000e+03 50% 3.000000e+03 75% 7.490000e+03 max 1.000000e+08 Name: price, dtype: float64
autos = (autos[autos['price']
.between(1,350000, inclusive=True)]
.sort_values(by=['price','brand'], ascending=False)
)
autos.head(10)
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_PS | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
36818 | 2016-03-27 18:37:37 | Porsche_991 | privat | Angebot | 350000 | control | coupe | 2016 | manuell | 500 | 911 | 5000 | 3 | benzin | porsche | nein | 2016-03-27 00:00:00 | 70499 | 2016-03-27 18:37:37 |
14715 | 2016-03-30 08:37:24 | Rolls_Royce_Phantom_Drophead_Coupe | privat | Angebot | 345000 | control | cabrio | 2012 | automatik | 460 | NaN | 20000 | 8 | benzin | sonstige_autos | nein | 2016-03-30 00:00:00 | 73525 | 2016-04-07 00:16:26 |
34723 | 2016-03-23 16:37:29 | Porsche_Porsche_911/930_Turbo_3.0__deutsche_Au... | privat | Angebot | 299000 | test | coupe | 1977 | manuell | 260 | 911 | 100000 | 7 | benzin | porsche | nein | 2016-03-23 00:00:00 | 61462 | 2016-04-06 16:44:50 |
35923 | 2016-04-03 07:56:23 | Porsche_911_Targa_Exclusive_Edition__1_von_15_... | privat | Angebot | 295000 | test | cabrio | 2015 | automatik | 400 | 911 | 5000 | 6 | benzin | porsche | nein | 2016-04-03 00:00:00 | 74078 | 2016-04-03 08:56:20 |
12682 | 2016-03-28 22:48:01 | Porsche_GT3_RS__PCCB__Lift___grosser_Exklusiv_... | privat | Angebot | 265000 | control | coupe | 2016 | automatik | 500 | 911 | 5000 | 3 | benzin | porsche | nein | 2016-03-28 00:00:00 | 70193 | 2016-04-05 03:44:51 |
47337 | 2016-04-05 10:25:38 | BMW_Z8_roadster | privat | Angebot | 259000 | test | cabrio | 2001 | manuell | 400 | z_reihe | 20000 | 6 | benzin | bmw | nein | 2016-04-05 00:00:00 | 61462 | 2016-04-05 12:07:32 |
38299 | 2016-03-28 22:25:25 | Glas_BMW_mit_Wasser | privat | Angebot | 250000 | test | NaN | 2015 | NaN | 0 | x_reihe | 5000 | 0 | NaN | bmw | NaN | 2016-03-28 00:00:00 | 60489 | 2016-03-28 22:25:25 |
37840 | 2016-03-21 10:50:12 | Porsche_997 | privat | Angebot | 220000 | test | coupe | 2008 | manuell | 415 | 911 | 30000 | 7 | benzin | porsche | nein | 2016-03-21 00:00:00 | 69198 | 2016-04-06 04:46:14 |
40918 | 2016-03-20 18:40:05 | Porsche_911_991_GT3_RS | privat | Angebot | 198000 | test | coupe | 2015 | automatik | 500 | 911 | 5000 | 6 | benzin | porsche | nein | 2016-03-20 00:00:00 | 51491 | 2016-03-21 21:46:36 |
43668 | 2016-03-16 18:47:26 | Porsche_993/911_Turbo_WLS_II_Exclusive_S_deuts... | privat | Angebot | 197000 | control | coupe | 1998 | manuell | 450 | 911 | 150000 | 3 | NaN | porsche | nein | 2016-03-16 00:00:00 | 46147 | 2016-04-07 02:44:47 |
autos['odometer_km'].unique().shape
(13,)
autos['odometer_km'].describe()
count 48565.000000 mean 125770.101925 std 39788.636804 min 5000.000000 25% 125000.000000 50% 150000.000000 75% 150000.000000 max 150000.000000 Name: odometer_km, dtype: float64
autos['odometer_km'].value_counts().sort_index(ascending=False)
150000 31414 125000 5057 100000 2115 90000 1734 80000 1415 70000 1217 60000 1155 50000 1012 40000 815 30000 780 20000 762 10000 253 5000 836 Name: odometer_km, dtype: int64
autos.shape
(48565, 19)
autos = (autos[autos['odometer_km']
.between(4900,155000, inclusive=True)]
)
The odometer_km column has a minimum and maximum values of 5000 and 150000 km respectively. These are realistic values and hence we can conclude that there are no outliers here. Majority of the vehicles have covered 150000 km
autos.describe()
price | registration_year | power_PS | odometer_km | registration_month | postal_code | |
---|---|---|---|---|---|---|
count | 48565.000000 | 48565.000000 | 48565.000000 | 48565.000000 | 48565.000000 | 48565.000000 |
mean | 5888.935591 | 2004.755421 | 117.197158 | 125770.101925 | 5.782251 | 50975.745207 |
std | 9059.854754 | 88.643887 | 200.649618 | 39788.636804 | 3.685595 | 25746.968398 |
min | 1.000000 | 1000.000000 | 0.000000 | 5000.000000 | 0.000000 | 1067.000000 |
25% | 1200.000000 | 1999.000000 | 71.000000 | 125000.000000 | 3.000000 | 30657.000000 |
50% | 3000.000000 | 2004.000000 | 107.000000 | 150000.000000 | 6.000000 | 49716.000000 |
75% | 7490.000000 | 2008.000000 | 150.000000 | 150000.000000 | 9.000000 | 71665.000000 |
max | 350000.000000 | 9999.000000 | 17700.000000 | 150000.000000 | 12.000000 | 99998.000000 |
((round((autos['ad_created'].str[:10]
.value_counts(normalize=True,dropna=False)*100),6)
.sort_index(ascending=True))
)
2015-06-11 0.002059 2015-08-10 0.002059 2015-09-09 0.002059 2015-11-10 0.002059 2015-12-05 0.002059 ... 2016-04-03 3.885514 2016-04-04 3.685782 2016-04-05 1.181921 2016-04-06 0.325337 2016-04-07 0.125605 Name: ad_created, Length: 76, dtype: float64
((round((autos['date_crawled'].str[:10]
.value_counts(normalize=True, dropna=False)*100),6)
.sort_index(ascending=True))
)
2016-03-05 2.532688 2016-03-06 1.404304 2016-03-07 3.601359 2016-03-08 3.329558 2016-03-09 3.308967 2016-03-10 3.218367 2016-03-11 3.257490 2016-03-12 3.691959 2016-03-13 1.566972 2016-03-14 3.654896 2016-03-15 3.428395 2016-03-16 2.960980 2016-03-17 3.162772 2016-03-18 1.291053 2016-03-19 3.477813 2016-03-20 3.788737 2016-03-21 3.737259 2016-03-22 3.298672 2016-03-23 3.222485 2016-03-24 2.934212 2016-03-25 3.160712 2016-03-26 3.220426 2016-03-27 3.109235 2016-03-28 3.486050 2016-03-29 3.409863 2016-03-30 3.368681 2016-03-31 3.183363 2016-04-01 3.368681 2016-04-02 3.547823 2016-04-03 3.860805 2016-04-04 3.648718 2016-04-05 1.309585 2016-04-06 0.317101 2016-04-07 0.140019 Name: date_crawled, dtype: float64
((round((autos['last_seen'].str[:10]
.value_counts(normalize=True, dropna=False)*100),6)
.sort_index(ascending=True))
.head(10)
)
2016-03-05 0.107073 2016-03-06 0.432410 2016-03-07 0.539483 2016-03-08 0.741275 2016-03-09 0.959539 2016-03-10 1.066612 2016-03-11 1.237517 2016-03-12 2.378256 2016-03-13 0.889529 2016-03-14 1.260167 Name: last_seen, dtype: float64
There are 34 unique values in the date_crawled and last_seen column. The last_seen and date_crawled have similarities. The ads were created a few days before they began to be crawled. These actions began on the 2016-03-07. The last_seen column began it's actions a few days after their creation and crawling.
autos['registration_year'].describe()
count 48565.000000 mean 2004.755421 std 88.643887 min 1000.000000 25% 1999.000000 50% 2004.000000 75% 2008.000000 max 9999.000000 Name: registration_year, dtype: float64
The mean of the registration_year is pegged at 2002. The maximum year of registration is unrealistic(9999). The earliest year of vehicle registration which is the oldest vehicle was in the year 1000 - this was even before the invention of the first car.
(autos['registration_year'].value_counts()
.sort_values(ascending=False)
)
2000 3156 2005 2936 1999 2897 2004 2703 2003 2699 ... 1953 1 4500 1 1939 1 1001 1 1952 1 Name: registration_year, Length: 95, dtype: int64
autos = (autos.loc[autos['registration_year'].between(1900,2016)]
.sort_values(by='registration_year')
)
autos.shape
(46681, 19)
There are 1884 values outside the limit of our assumed registration year. The earliest year being 1910 and the latest year being 2016.
(round((autos['registration_year']
.value_counts(normalize=True)*100),5)
.sort_values(ascending=False)
)
2000 6.76078 2005 6.28950 1999 6.20595 2004 5.79036 2003 5.78180 ... 1931 0.00214 1929 0.00214 1943 0.00214 1953 0.00214 1952 0.00214 Name: registration_year, Length: 78, dtype: float64
The year 2000 saw the highest percentage of vehicle registration. followed by year 2005.
brand_unique = autos['brand'].unique()
brand_unique
array(['sonstige_autos', 'trabant', 'opel', 'renault', 'ford', 'bmw', 'mercedes_benz', 'volkswagen', 'porsche', 'alfa_romeo', 'seat', 'land_rover', 'chevrolet', 'citroen', 'skoda', 'volvo', 'fiat', 'peugeot', 'chrysler', 'jaguar', 'audi', 'lancia', 'mini', 'jeep', 'honda', 'toyota', 'lada', 'nissan', 'saab', 'mitsubishi', 'mazda', 'suzuki', 'daihatsu', 'subaru', 'rover', 'smart', 'hyundai', 'kia', 'daewoo', 'dacia'], dtype=object)
car_cou = {}
for car in brand_unique:
car_num = autos.loc[autos['brand']==car,'brand'].count()
car_cou[car] = car_num
auto_num_count = pd.Series(car_cou)
limit = auto_num_count.sort_values(ascending=False)[:10]
ind = list(limit.index)
ind
['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford', 'renault', 'peugeot', 'fiat', 'seat']
The top 10 vehicles on Ebay site are volkswagen, opel, bmw, mercedes_benz, audi, ford, renault, peugeot, fiat, seat.
car_dict = {}
for car in ind:
if car in brand_unique:
car_mean = (autos.loc[autos['brand']==car,'price']
.sort_values(ascending=False).mean()
)
car_dict[car] = int(car_mean)
else:
break
#auto_price_mean = pd.Series(car_dict)
#auto_price_mean.sort_values(ascending=False)
price_mean = pd.Series(car_dict)
price_mean.sort_values(ascending=False)
audi 9336 mercedes_benz 8628 bmw 8332 volkswagen 5402 seat 4397 ford 3749 peugeot 3094 opel 2975 fiat 2813 renault 2474 dtype: int64
There are 40 unique brands in our dataset. VolkWagen is the brand with the highest ad on the site, while lada brand has the least value on the platform. Audi, Mercedes_benz and BMW are the expensive top brand on Ebay, Volkswagen is next while Opel and Ford are less expensive
car_mil_dict = {}
for car in ind:
if car in brand_unique:
car_mil_mean = (autos.loc[autos['brand']==car
,'odometer_km']
.sort_values(ascending=False)
.mean()
)
car_mil_dict[car] = int(car_mil_mean)
else:
break
mile_mean = pd.Series(car_mil_dict)
mile_mean.sort_values(ascending=False)
bmw 132572 mercedes_benz 130788 opel 129310 audi 129157 volkswagen 128707 renault 128071 peugeot 127153 ford 124266 seat 121131 fiat 117121 dtype: int64
df = pd.DataFrame(price_mean,columns=['mean_price'])
df['mileage_mean'] = mile_mean
df.sort_values(by='mean_price', ascending=False)
mean_price | mileage_mean | |
---|---|---|
audi | 9336 | 129157 |
mercedes_benz | 8628 | 130788 |
bmw | 8332 | 132572 |
volkswagen | 5402 | 128707 |
seat | 4397 | 121131 |
ford | 3749 | 124266 |
peugeot | 3094 | 127153 |
opel | 2975 | 129310 |
fiat | 2813 | 117121 |
renault | 2474 | 128071 |
Audi, Mercedes Benz and BMW are the most expensive brands. However, these brands covered most distance on average in their reverse order. Opel, Fiat and renault are the lowest brands in the top 10 brand with Fiat making the least distance coverage. Assumptions can be made from this observation that, the top brands cover a higher distance than their cheaper counterparts.