This is a dataset of scraped data from ebay website with some interesting features about every fairly used and available vehicle for sale as at the time the spider crawled the ebay website to collect the informations.

In [1]:

import pandas as pd
import numpy as np

In [2]:

autos = pd.read_csv('autos.csv', encoding='Latin-1')

The dataset consist of 50000 rows and 20 columns. There are 2 distinct data types in the original dataset which are the object(15) and int64(5) data types in the column section.

In [3]:

autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offerType            50000 non-null  object
 4   price                50000 non-null  object
 5   abtest               50000 non-null  object
 6   vehicleType          44905 non-null  object
 7   yearOfRegistration   50000 non-null  int64 
 8   gearbox              47320 non-null  object
 9   powerPS              50000 non-null  int64 
 10  model                47242 non-null  object
 11  odometer             50000 non-null  object
 12  monthOfRegistration  50000 non-null  int64 
 13  fuelType             45518 non-null  object
 14  brand                50000 non-null  object
 15  notRepairedDamage    40171 non-null  object
 16  dateCreated          50000 non-null  object
 17  nrOfPictures         50000 non-null  int64 
 18  postalCode           50000 non-null  int64 
 19  lastSeen             50000 non-null  object
dtypes: int64(5), object(15)
memory usage: 7.6+ MB

For a quick overview of the dataset, we use the DataFrame.head() to check the first 5 rows of the dataset by default. We can quickly see how the values and columns are structured.

In [4]:

autos.head()

Out[4]:

	dateCrawled	name	seller	offerType	price	abtest	vehicleType	yearOfRegistration	gearbox	powerPS	model	odometer	monthOfRegistration	fuelType	brand	notRepairedDamage	dateCreated	postalCode	lastSeen
0	2016-03-26 17:47:46	Peugeot_807_160_NAVTECH_ON_BOARD	privat	Angebot	$5,000	control	bus	2004	manuell	158	andere	150,000km	3	lpg	peugeot	nein	2016-03-26 00:00:00	79588	2016-04-06 06:45:54
1	2016-04-04 13:38:56	BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik	privat	Angebot	$8,500	control	limousine	1997	automatik	286	7er	150,000km	6	benzin	bmw	nein	2016-04-04 00:00:00	71034	2016-04-06 14:45:08
2	2016-03-26 18:57:24	Volkswagen_Golf_1.6_United	privat	Angebot	$8,990	test	limousine	2009	manuell	102	golf	70,000km	7	benzin	volkswagen	nein	2016-03-26 00:00:00	35394	2016-04-06 20:15:37
3	2016-03-12 16:58:10	Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...	privat	Angebot	$4,350	control	kleinwagen	2007	automatik	71	fortwo	70,000km	6	benzin	smart	nein	2016-03-12 00:00:00	33729	2016-03-15 03:16:28
4	2016-04-01 14:38:50	Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...	privat	Angebot	$1,350	test	kombi	2003	manuell	0	focus	150,000km	7	benzin	ford	nein	2016-04-01 00:00:00	39218	2016-04-01 14:38:50

In [5]:

autos.columns.to_series().groupby(autos.dtypes).groups

Out[5]:

{dtype('int64'): Index(['yearOfRegistration', 'powerPS', 'monthOfRegistration', 'nrOfPictures',
        'postalCode'],
       dtype='object'),
 dtype('O'): Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
        'vehicleType', 'gearbox', 'model', 'odometer', 'fuelType', 'brand',
        'notRepairedDamage', 'dateCreated', 'lastSeen'],
       dtype='object')}

In [6]:

autos.isnull().sum()

Out[6]:

dateCrawled               0
name                      0
seller                    0
offerType                 0
price                     0
abtest                    0
vehicleType            5095
yearOfRegistration        0
gearbox                2680
powerPS                   0
model                  2758
odometer                  0
monthOfRegistration       0
fuelType               4482
brand                     0
notRepairedDamage      9829
dateCreated               0
nrOfPictures              0
postalCode                0
lastSeen                  0
dtype: int64

The vehicleType, gearbox, model, fuelType and notRepairedDamage all indicate the presense of Null or invalid data entries.

In [7]:

autos.columns

Out[7]:

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')

The original dataset column were recorded using the very common camelCase format. However, in our analysis, we prefer the snakeCase by separating each word by an underscore. We do this by making a copy of the columns and make a manual editing of each appropriate column name

In [8]:

auto_column = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'registration_year', 'gearbox', 'power_PS', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'picture_numb', 'postal_code',
       'last_seen']

autos.columns = auto_column

In [9]:

autos.drop('picture_numb', axis=1, inplace=True)

In [10]:

autos['price'] = autos['price'].str.replace('$', '')
autos['price'] = autos['price'].str.replace(',', '')
autos['price'] = autos['price'].astype(int)

autos['odometer'] = autos['odometer'].str.replace('km','')
autos['odometer'] = autos['odometer'].str.replace(',','')
autos['odometer'] = autos['odometer'].astype(int)

In [11]:

autos.rename({'odometer':'odometer_km'}, axis=1, inplace=True)

In [12]:

autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 19 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   date_crawled        50000 non-null  object
 1   name                50000 non-null  object
 2   seller              50000 non-null  object
 3   offer_type          50000 non-null  object
 4   price               50000 non-null  int64 
 5   abtest              50000 non-null  object
 6   vehicle_type        44905 non-null  object
 7   registration_year   50000 non-null  int64 
 8   gearbox             47320 non-null  object
 9   power_PS            50000 non-null  int64 
 10  model               47242 non-null  object
 11  odometer_km         50000 non-null  int64 
 12  registration_month  50000 non-null  int64 
 13  fuel_type           45518 non-null  object
 14  brand               50000 non-null  object
 15  unrepaired_damage   40171 non-null  object
 16  ad_created          50000 non-null  object
 17  postal_code         50000 non-null  int64 
 18  last_seen           50000 non-null  object
dtypes: int64(6), object(13)
memory usage: 7.2+ MB

Perform a brief statistical preview of the dataset which include both numerical and categorical data types. This can summarilly be done using the DataFrame.describe() to look at basic statistical summary. It can be seen from the cell below that, only the numerical columns show up in our description by default. The most common statistical estimates here include the mean, min, max, 25%, 50% and 75% of each numerical column. The picture_numb column seems redundant and its a likely candidate to be dropped as most of its descriptions are 0s. The registration_year column has a strange value at the min and max estimate. The registration_month and power_PS also have invalid entry in the min estimate.

In [13]:

autos.describe()

Out[13]:

	price	registration_year	power_PS	odometer_km	registration_month	postal_code
count	5.000000e+04	50000.000000	50000.000000	50000.000000	50000.000000	50000.000000
mean	9.840044e+03	2005.073280	116.355920	125732.700000	5.723360	50813.627300
std	4.811044e+05	105.712813	209.216627	40042.211706	3.711984	25779.747957
min	0.000000e+00	1000.000000	0.000000	5000.000000	0.000000	1067.000000
25%	1.100000e+03	1999.000000	70.000000	125000.000000	3.000000	30451.000000
50%	2.950000e+03	2003.000000	105.000000	150000.000000	6.000000	49577.000000
75%	7.200000e+03	2008.000000	150.000000	150000.000000	9.000000	71540.000000
max	1.000000e+08	9999.000000	17700.000000	150000.000000	12.000000	99998.000000

In [14]:

autos['price'].value_counts()

Out[14]:

0        1421
500       781
1500      734
2500      643
1000      639
         ... 
20790       1
8970        1
846         1
2895        1
33980       1
Name: price, Length: 2357, dtype: int64

The price column has 1421 $0 entries. This obviously seems like an error. To perform analysis with this column, it is logical to get rid of such erroneous entries.

In [15]:

autos = autos.loc[autos['price']>0]

In [16]:

autos['price'].describe()

Out[16]:

count    4.857900e+04
mean     1.012788e+04
std      4.880873e+05
min      1.000000e+00
25%      1.200000e+03
50%      3.000000e+03
75%      7.490000e+03
max      1.000000e+08
Name: price, dtype: float64

In [17]:

autos = (autos[autos['price']
             .between(1,350000, inclusive=True)]
             .sort_values(by=['price','brand'], ascending=False)
            )

In [18]:

autos.head(10)

Out[18]:

	date_crawled	name	seller	offer_type	price	abtest	vehicle_type	registration_year	gearbox	power_PS	model	odometer_km	registration_month	fuel_type	brand	unrepaired_damage	ad_created	postal_code	last_seen
36818	2016-03-27 18:37:37	Porsche_991	privat	Angebot	350000	control	coupe	2016	manuell	500	911	5000	3	benzin	porsche	nein	2016-03-27 00:00:00	70499	2016-03-27 18:37:37
14715	2016-03-30 08:37:24	Rolls_Royce_Phantom_Drophead_Coupe	privat	Angebot	345000	control	cabrio	2012	automatik	460	NaN	20000	8	benzin	sonstige_autos	nein	2016-03-30 00:00:00	73525	2016-04-07 00:16:26
34723	2016-03-23 16:37:29	Porsche_Porsche_911/930_Turbo_3.0__deutsche_Au...	privat	Angebot	299000	test	coupe	1977	manuell	260	911	100000	7	benzin	porsche	nein	2016-03-23 00:00:00	61462	2016-04-06 16:44:50
35923	2016-04-03 07:56:23	Porsche_911_Targa_Exclusive_Edition__1_von_15_...	privat	Angebot	295000	test	cabrio	2015	automatik	400	911	5000	6	benzin	porsche	nein	2016-04-03 00:00:00	74078	2016-04-03 08:56:20
12682	2016-03-28 22:48:01	Porsche_GT3_RS__PCCB__Lift___grosser_Exklusiv_...	privat	Angebot	265000	control	coupe	2016	automatik	500	911	5000	3	benzin	porsche	nein	2016-03-28 00:00:00	70193	2016-04-05 03:44:51
47337	2016-04-05 10:25:38	BMW_Z8_roadster	privat	Angebot	259000	test	cabrio	2001	manuell	400	z_reihe	20000	6	benzin	bmw	nein	2016-04-05 00:00:00	61462	2016-04-05 12:07:32
38299	2016-03-28 22:25:25	Glas_BMW_mit_Wasser	privat	Angebot	250000	test	NaN	2015	NaN	0	x_reihe	5000	0	NaN	bmw	NaN	2016-03-28 00:00:00	60489	2016-03-28 22:25:25
37840	2016-03-21 10:50:12	Porsche_997	privat	Angebot	220000	test	coupe	2008	manuell	415	911	30000	7	benzin	porsche	nein	2016-03-21 00:00:00	69198	2016-04-06 04:46:14
40918	2016-03-20 18:40:05	Porsche_911_991_GT3_RS	privat	Angebot	198000	test	coupe	2015	automatik	500	911	5000	6	benzin	porsche	nein	2016-03-20 00:00:00	51491	2016-03-21 21:46:36
43668	2016-03-16 18:47:26	Porsche_993/911_Turbo_WLS_II_Exclusive_S_deuts...	privat	Angebot	197000	control	coupe	1998	manuell	450	911	150000	3	NaN	porsche	nein	2016-03-16 00:00:00	46147	2016-04-07 02:44:47

In [19]:

autos['odometer_km'].unique().shape

Out[19]:

(13,)

In [20]:

autos['odometer_km'].describe()

Out[20]:

count     48565.000000
mean     125770.101925
std       39788.636804
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64

In [21]:

autos['odometer_km'].value_counts().sort_index(ascending=False)

Out[21]:

150000    31414
125000     5057
100000     2115
90000      1734
80000      1415
70000      1217
60000      1155
50000      1012
40000       815
30000       780
20000       762
10000       253
5000        836
Name: odometer_km, dtype: int64

In [22]:

autos.shape

Out[22]:

(48565, 19)

In [23]:

autos = (autos[autos['odometer_km']
                   .between(4900,155000, inclusive=True)]
            )

The odometer_km column has a minimum and maximum values of 5000 and 150000 km respectively. These are realistic values and hence we can conclude that there are no outliers here. Majority of the vehicles have covered 150000 km

In [24]:

autos.describe()

Out[24]:

	price	registration_year	power_PS	odometer_km	registration_month	postal_code
count	48565.000000	48565.000000	48565.000000	48565.000000	48565.000000	48565.000000
mean	5888.935591	2004.755421	117.197158	125770.101925	5.782251	50975.745207
std	9059.854754	88.643887	200.649618	39788.636804	3.685595	25746.968398
min	1.000000	1000.000000	0.000000	5000.000000	0.000000	1067.000000
25%	1200.000000	1999.000000	71.000000	125000.000000	3.000000	30657.000000
50%	3000.000000	2004.000000	107.000000	150000.000000	6.000000	49716.000000
75%	7490.000000	2008.000000	150.000000	150000.000000	9.000000	71665.000000
max	350000.000000	9999.000000	17700.000000	150000.000000	12.000000	99998.000000

In [25]:

((round((autos['ad_created'].str[:10]
                 .value_counts(normalize=True,dropna=False)*100),6)
                 .sort_index(ascending=True))

)

Out[25]:

2015-06-11    0.002059
2015-08-10    0.002059
2015-09-09    0.002059
2015-11-10    0.002059
2015-12-05    0.002059
                ...   
2016-04-03    3.885514
2016-04-04    3.685782
2016-04-05    1.181921
2016-04-06    0.325337
2016-04-07    0.125605
Name: ad_created, Length: 76, dtype: float64

In [26]:

((round((autos['date_crawled'].str[:10]
         .value_counts(normalize=True, dropna=False)*100),6)
         .sort_index(ascending=True))
)

Out[26]:

2016-03-05    2.532688
2016-03-06    1.404304
2016-03-07    3.601359
2016-03-08    3.329558
2016-03-09    3.308967
2016-03-10    3.218367
2016-03-11    3.257490
2016-03-12    3.691959
2016-03-13    1.566972
2016-03-14    3.654896
2016-03-15    3.428395
2016-03-16    2.960980
2016-03-17    3.162772
2016-03-18    1.291053
2016-03-19    3.477813
2016-03-20    3.788737
2016-03-21    3.737259
2016-03-22    3.298672
2016-03-23    3.222485
2016-03-24    2.934212
2016-03-25    3.160712
2016-03-26    3.220426
2016-03-27    3.109235
2016-03-28    3.486050
2016-03-29    3.409863
2016-03-30    3.368681
2016-03-31    3.183363
2016-04-01    3.368681
2016-04-02    3.547823
2016-04-03    3.860805
2016-04-04    3.648718
2016-04-05    1.309585
2016-04-06    0.317101
2016-04-07    0.140019
Name: date_crawled, dtype: float64

In [27]:

((round((autos['last_seen'].str[:10]
                 .value_counts(normalize=True, dropna=False)*100),6)
                 .sort_index(ascending=True))
                 .head(10)
)

Out[27]:

2016-03-05    0.107073
2016-03-06    0.432410
2016-03-07    0.539483
2016-03-08    0.741275
2016-03-09    0.959539
2016-03-10    1.066612
2016-03-11    1.237517
2016-03-12    2.378256
2016-03-13    0.889529
2016-03-14    1.260167
Name: last_seen, dtype: float64

There are 34 unique values in the date_crawled and last_seen column. The last_seen and date_crawled have similarities. The ads were created a few days before they began to be crawled. These actions began on the 2016-03-07. The last_seen column began it's actions a few days after their creation and crawling.

In [28]:

autos['registration_year'].describe()

Out[28]:

count    48565.000000
mean      2004.755421
std         88.643887
min       1000.000000
25%       1999.000000
50%       2004.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

The mean of the registration_year is pegged at 2002. The maximum year of registration is unrealistic(9999). The earliest year of vehicle registration which is the oldest vehicle was in the year 1000 - this was even before the invention of the first car.

In [29]:

(autos['registration_year'].value_counts()
         .sort_values(ascending=False)
)

Out[29]:

2000    3156
2005    2936
1999    2897
2004    2703
2003    2699
        ... 
1953       1
4500       1
1939       1
1001       1
1952       1
Name: registration_year, Length: 95, dtype: int64

In [30]:

autos = (autos.loc[autos['registration_year'].between(1900,2016)]
         .sort_values(by='registration_year')
        )

autos.shape

Out[30]:

(46681, 19)

There are 1884 values outside the limit of our assumed registration year. The earliest year being 1910 and the latest year being 2016.

In [31]:

(round((autos['registration_year']
        .value_counts(normalize=True)*100),5)
        .sort_values(ascending=False)
)

Out[31]:

2000    6.76078
2005    6.28950
1999    6.20595
2004    5.79036
2003    5.78180
         ...   
1931    0.00214
1929    0.00214
1943    0.00214
1953    0.00214
1952    0.00214
Name: registration_year, Length: 78, dtype: float64

The year 2000 saw the highest percentage of vehicle registration. followed by year 2005.

In [32]:

brand_unique = autos['brand'].unique()
brand_unique

Out[32]:

array(['sonstige_autos', 'trabant', 'opel', 'renault', 'ford', 'bmw',
       'mercedes_benz', 'volkswagen', 'porsche', 'alfa_romeo', 'seat',
       'land_rover', 'chevrolet', 'citroen', 'skoda', 'volvo', 'fiat',
       'peugeot', 'chrysler', 'jaguar', 'audi', 'lancia', 'mini', 'jeep',
       'honda', 'toyota', 'lada', 'nissan', 'saab', 'mitsubishi', 'mazda',
       'suzuki', 'daihatsu', 'subaru', 'rover', 'smart', 'hyundai', 'kia',
       'daewoo', 'dacia'], dtype=object)

In [33]:

car_cou = {}


for car in brand_unique:
    car_num = autos.loc[autos['brand']==car,'brand'].count()
    car_cou[car] = car_num

auto_num_count = pd.Series(car_cou)
limit = auto_num_count.sort_values(ascending=False)[:10]
ind  = list(limit.index)
ind

Out[33]:

['volkswagen',
 'bmw',
 'opel',
 'mercedes_benz',
 'audi',
 'ford',
 'renault',
 'peugeot',
 'fiat',
 'seat']

The top 10 vehicles on Ebay site are volkswagen, opel, bmw, mercedes_benz, audi, ford, renault, peugeot, fiat, seat.

In [34]:

car_dict = {}


for car in ind:
    if car in brand_unique:
        car_mean = (autos.loc[autos['brand']==car,'price']
                    .sort_values(ascending=False).mean()
                   )
        car_dict[car] = int(car_mean)
    else:
        break

In [35]:

#auto_price_mean = pd.Series(car_dict)
#auto_price_mean.sort_values(ascending=False)
price_mean = pd.Series(car_dict)
price_mean.sort_values(ascending=False)

Out[35]:

audi             9336
mercedes_benz    8628
bmw              8332
volkswagen       5402
seat             4397
ford             3749
peugeot          3094
opel             2975
fiat             2813
renault          2474
dtype: int64

There are 40 unique brands in our dataset. VolkWagen is the brand with the highest ad on the site, while lada brand has the least value on the platform. Audi, Mercedes_benz and BMW are the expensive top brand on Ebay, Volkswagen is next while Opel and Ford are less expensive

In [36]:

car_mil_dict = {}


for car in ind:
    if car in brand_unique:
        car_mil_mean = (autos.loc[autos['brand']==car
                                    ,'odometer_km']
                                    .sort_values(ascending=False)
                                    .mean()
                       )
        car_mil_dict[car] = int(car_mil_mean)
    else:
        break

In [37]:

mile_mean = pd.Series(car_mil_dict)
mile_mean.sort_values(ascending=False)

Out[37]:

bmw              132572
mercedes_benz    130788
opel             129310
audi             129157
volkswagen       128707
renault          128071
peugeot          127153
ford             124266
seat             121131
fiat             117121
dtype: int64

In [38]:

df = pd.DataFrame(price_mean,columns=['mean_price'])
df['mileage_mean'] = mile_mean
df.sort_values(by='mean_price', ascending=False)

Out[38]:

	mean_price	mileage_mean
audi	9336	129157
mercedes_benz	8628	130788
bmw	8332	132572
volkswagen	5402	128707
seat	4397	121131
ford	3749	124266
peugeot	3094	127153
opel	2975	129310
fiat	2813	117121
renault	2474	128071

Audi, Mercedes Benz and BMW are the most expensive brands. However, these brands covered most distance on average in their reverse order. Opel, Fiat and renault are the lowest brands in the top 10 brand with Fiat making the least distance coverage. Assumptions can be made from this observation that, the top brands cover a higher distance than their cheaper counterparts.