Here in this Project I will try to clean the 50000 sample used cars as it was crawled from Ebay-German Website.

We will do the following to our sample: 1- Clean the Data and analyze the included used car listing.

2-And we will familiarize our self with the benefits of jupyter notebooks brings when working with Pandas library.

In [1]:

import numpy as np
import pandas as pd

autos=pd.read_csv('autos.csv', encoding='Latin-1')

In [2]:

autos

Out[2]:

	dateCrawled	name	seller	offerType	price	abtest	vehicleType	yearOfRegistration	gearbox	powerPS	model	odometer	monthOfRegistration	fuelType	brand	notRepairedDamage	dateCreated	nrOfPictures	postalCode	lastSeen
0	2016-03-26 17:47:46	Peugeot_807_160_NAVTECH_ON_BOARD	privat	Angebot	$5,000	control	bus	2004	manuell	158	andere	150,000km	3	lpg	peugeot	nein	2016-03-26 00:00:00	0	79588	2016-04-06 06:45:54
1	2016-04-04 13:38:56	BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik	privat	Angebot	$8,500	control	limousine	1997	automatik	286	7er	150,000km	6	benzin	bmw	nein	2016-04-04 00:00:00	0	71034	2016-04-06 14:45:08
2	2016-03-26 18:57:24	Volkswagen_Golf_1.6_United	privat	Angebot	$8,990	test	limousine	2009	manuell	102	golf	70,000km	7	benzin	volkswagen	nein	2016-03-26 00:00:00	0	35394	2016-04-06 20:15:37
3	2016-03-12 16:58:10	Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...	privat	Angebot	$4,350	control	kleinwagen	2007	automatik	71	fortwo	70,000km	6	benzin	smart	nein	2016-03-12 00:00:00	0	33729	2016-03-15 03:16:28
4	2016-04-01 14:38:50	Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...	privat	Angebot	$1,350	test	kombi	2003	manuell	0	focus	150,000km	7	benzin	ford	nein	2016-04-01 00:00:00	0	39218	2016-04-01 14:38:50
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
49995	2016-03-27 14:38:19	Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon	privat	Angebot	$24,900	control	limousine	2011	automatik	239	q5	100,000km	1	diesel	audi	nein	2016-03-27 00:00:00	0	82131	2016-04-01 13:47:40
49996	2016-03-28 10:50:25	Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+...	privat	Angebot	$1,980	control	cabrio	1996	manuell	75	astra	150,000km	5	benzin	opel	nein	2016-03-28 00:00:00	0	44807	2016-04-02 14:18:02
49997	2016-04-02 14:44:48	Fiat_500_C_1.2_Dualogic_Lounge	privat	Angebot	$13,200	test	cabrio	2014	automatik	69	500	5,000km	11	benzin	fiat	nein	2016-04-02 00:00:00	0	73430	2016-04-04 11:47:27
49998	2016-03-08 19:25:42	Audi_A3_2.0_TDI_Sportback_Ambition	privat	Angebot	$22,900	control	kombi	2013	manuell	150	a3	40,000km	11	diesel	audi	nein	2016-03-08 00:00:00	0	35683	2016-04-05 16:45:07
49999	2016-03-14 00:42:12	Opel_Vectra_1.6_16V	privat	Angebot	$1,250	control	limousine	1996	manuell	101	vectra	150,000km	1	benzin	opel	nein	2016-03-13 00:00:00	0	45897	2016-04-06 21:18:48

50000 rows × 20 columns

In [3]:

autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offerType            50000 non-null  object
 4   price                50000 non-null  object
 5   abtest               50000 non-null  object
 6   vehicleType          44905 non-null  object
 7   yearOfRegistration   50000 non-null  int64 
 8   gearbox              47320 non-null  object
 9   powerPS              50000 non-null  int64 
 10  model                47242 non-null  object
 11  odometer             50000 non-null  object
 12  monthOfRegistration  50000 non-null  int64 
 13  fuelType             45518 non-null  object
 14  brand                50000 non-null  object
 15  notRepairedDamage    40171 non-null  object
 16  dateCreated          50000 non-null  object
 17  nrOfPictures         50000 non-null  int64 
 18  postalCode           50000 non-null  int64 
 19  lastSeen             50000 non-null  object
dtypes: int64(5), object(15)
memory usage: 7.6+ MB

In [4]:

autos.head()

Out[4]:

	dateCrawled	name	seller	offerType	price	abtest	vehicleType	yearOfRegistration	gearbox	powerPS	model	odometer	monthOfRegistration	fuelType	brand	notRepairedDamage	dateCreated	postalCode	lastSeen
0	2016-03-26 17:47:46	Peugeot_807_160_NAVTECH_ON_BOARD	privat	Angebot	$5,000	control	bus	2004	manuell	158	andere	150,000km	3	lpg	peugeot	nein	2016-03-26 00:00:00	79588	2016-04-06 06:45:54
1	2016-04-04 13:38:56	BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik	privat	Angebot	$8,500	control	limousine	1997	automatik	286	7er	150,000km	6	benzin	bmw	nein	2016-04-04 00:00:00	71034	2016-04-06 14:45:08
2	2016-03-26 18:57:24	Volkswagen_Golf_1.6_United	privat	Angebot	$8,990	test	limousine	2009	manuell	102	golf	70,000km	7	benzin	volkswagen	nein	2016-03-26 00:00:00	35394	2016-04-06 20:15:37
3	2016-03-12 16:58:10	Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...	privat	Angebot	$4,350	control	kleinwagen	2007	automatik	71	fortwo	70,000km	6	benzin	smart	nein	2016-03-12 00:00:00	33729	2016-03-15 03:16:28
4	2016-04-01 14:38:50	Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...	privat	Angebot	$1,350	test	kombi	2003	manuell	0	focus	150,000km	7	benzin	ford	nein	2016-04-01 00:00:00	39218	2016-04-01 14:38:50

We can see that in this dataset there are some columns which does not have info in their respective cells(NaN) We can see also that we are dealing with Two types of data:

1- string (Shown as Object) 2- integer(Int64) 3- The Dataset contains 20 columns(most of the are Object=string)

Next we are going to change and reword the columns to better reflect the data contains in their cells.

In [5]:

autos.columns

Out[5]:

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')

In [6]:

co={'dateCrawled':'date_crawled', 'offerType':'offer_type', 'vehicleType':'vehicle_type', 'yearOfRegistration':'registration_year', 'powerPS':'power_ps',
       'monthOfRegistration':'registration_month', 'fuelType':'fuel_type',
       'notRepairedDamage':'unrepaired_damage', 'dateCreated':'ad_created', 'nrOfPictures':'no_of_pictures', 'postalCode':'postal_code',
       'lastSeen':'last_seen','name':'name','seller':'seller','price':'price','abtest':'abtest','gearbox':'gearbox','model':'model','odometer':'odometer','brand':'brand'}

In [7]:

autos=autos.rename(columns=co)

In [8]:

autos.columns

Out[8]:

Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'no_of_pictures', 'postal_code',
       'last_seen'],
      dtype='object')

In [9]:

autos.head()

Out[9]:

	date_crawled	name	seller	offer_type	price	abtest	vehicle_type	registration_year	gearbox	power_ps	model	odometer	registration_month	fuel_type	brand	unrepaired_damage	ad_created	postal_code	last_seen
0	2016-03-26 17:47:46	Peugeot_807_160_NAVTECH_ON_BOARD	privat	Angebot	$5,000	control	bus	2004	manuell	158	andere	150,000km	3	lpg	peugeot	nein	2016-03-26 00:00:00	79588	2016-04-06 06:45:54
1	2016-04-04 13:38:56	BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik	privat	Angebot	$8,500	control	limousine	1997	automatik	286	7er	150,000km	6	benzin	bmw	nein	2016-04-04 00:00:00	71034	2016-04-06 14:45:08
2	2016-03-26 18:57:24	Volkswagen_Golf_1.6_United	privat	Angebot	$8,990	test	limousine	2009	manuell	102	golf	70,000km	7	benzin	volkswagen	nein	2016-03-26 00:00:00	35394	2016-04-06 20:15:37
3	2016-03-12 16:58:10	Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...	privat	Angebot	$4,350	control	kleinwagen	2007	automatik	71	fortwo	70,000km	6	benzin	smart	nein	2016-03-12 00:00:00	33729	2016-03-15 03:16:28
4	2016-04-01 14:38:50	Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...	privat	Angebot	$1,350	test	kombi	2003	manuell	0	focus	150,000km	7	benzin	ford	nein	2016-04-01 00:00:00	39218	2016-04-01 14:38:50

I have change the columns names for better understanding for my self and whoever interested working with data.

The original data columns name was written based on camelCase and I have changed that to snake_case. and in the end I have re assigned the new columns names back to the original dataframe

NEXT I am going to explore my data to determine what kind of cleanin tasks left to be done. for starters, we will take a look at Text columns and chekc whether they are providing good and accountable data for our analysis. Then, we go ahead and try to clean our numeric data and convert them if we need to. The methods that we implement for our data exploration are:

1- Describe() (for both categorical and numeric columns) 2- Value_counts and head()

In [10]:

autos.describe(include='all')

Out[10]:

	date_crawled	name	seller	offer_type	price	abtest	vehicle_type	registration_year	gearbox	power_ps	model	odometer	registration_month	fuel_type	brand	unrepaired_damage	ad_created	no_of_pictures	postal_code	last_seen
count	50000	50000	50000	50000	50000	50000	44905	50000.000000	47320	50000.000000	47242	50000	50000.000000	45518	50000	40171	50000	50000.0	50000.000000	50000
unique	48213	38754	2	2	2357	2	8	NaN	2	NaN	245	13	NaN	7	40	2	76	NaN	NaN	39481
top	2016-03-25 19:57:10	Ford_Fiesta	privat	Angebot	$0	test	limousine	NaN	manuell	NaN	golf	150,000km	NaN	benzin	volkswagen	nein	2016-04-03 00:00:00	NaN	NaN	2016-04-07 06:17:27
freq	3	78	49999	49999	1421	25756	12859	NaN	36993	NaN	4024	32424	NaN	30107	10687	35232	1946	NaN	NaN	8
mean	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2005.073280	NaN	116.355920	NaN	NaN	5.723360	NaN	NaN	NaN	NaN	0.0	50813.627300	NaN
std	NaN	NaN	NaN	NaN	NaN	NaN	NaN	105.712813	NaN	209.216627	NaN	NaN	3.711984	NaN	NaN	NaN	NaN	0.0	25779.747957	NaN
min	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1000.000000	NaN	0.000000	NaN	NaN	0.000000	NaN	NaN	NaN	NaN	0.0	1067.000000	NaN
25%	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1999.000000	NaN	70.000000	NaN	NaN	3.000000	NaN	NaN	NaN	NaN	0.0	30451.000000	NaN
50%	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2003.000000	NaN	105.000000	NaN	NaN	6.000000	NaN	NaN	NaN	NaN	0.0	49577.000000	NaN
75%	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2008.000000	NaN	150.000000	NaN	NaN	9.000000	NaN	NaN	NaN	NaN	0.0	71540.000000	NaN
max	NaN	NaN	NaN	NaN	NaN	NaN	NaN	9999.000000	NaN	17700.000000	NaN	NaN	12.000000	NaN	NaN	NaN	NaN	0.0	99998.000000	NaN

based on the quick look: I have relized there are number of columns which are not doing any effect in our analysis and there are other columns which need more time to spend on.

the following columns are not play very importantly: 1- Offer_type 2- Seller

How ever the following columns need more attemtion and cleaning as well: 1- odometer 2- price

In [11]:

autos['price'].value_counts().unique

Out[11]:

<bound method Series.unique of $0         1421
$500        781
$1,500      734
$2,500      643
$1,000      639
           ... 
$8,419        1
$3,410        1
$6,889        1
$10,090       1
$19,911       1
Name: price, Length: 2357, dtype: int64>

In [12]:

autos['price']=autos['price'].str.replace('$','').str.replace(',','').str.replace('.','').astype(int)

In [13]:

autos['price']

Out[13]:

0         5000
1         8500
2         8990
3         4350
4         1350
         ...  
49995    24900
49996     1980
49997    13200
49998    22900
49999     1250
Name: price, Length: 50000, dtype: int32

In [14]:

autos=autos.rename(columns={'odometer':'odometer_km'})

In [15]:

autos['odometer_km']=autos['odometer_km'].str.replace(',','').str.replace('km','').astype(int)

In [16]:

autos['odometer_km']=autos['odometer_km'].astype(int)

In [17]:

autos['odometer_km'].unique().shape

Out[17]:

(13,)

In [18]:

autos['odometer_km'].describe()

Out[18]:

count     50000.000000
mean     125732.700000
std       40042.211706
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64

In [19]:

autos['odometer_km'].value_counts().sort_index(ascending=False).head()

Out[19]:

150000    32424
125000     5170
100000     2169
90000      1757
80000      1436
Name: odometer_km, dtype: int64

In [20]:

autos['odometer_km'].value_counts().sort_index(ascending=True).head()

Out[20]:

5000     967
10000    264
20000    784
30000    789
40000    819
Name: odometer_km, dtype: int64

In [21]:

autos['price']=autos['price'].astype(int)

In [22]:

autos['price'].unique().shape

Out[22]:

(2357,)

In [23]:

autos['price'].describe()

Out[23]:

count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64

In [24]:

autos['price'].value_counts().sort_index(ascending=True).head(50)

Out[24]:

0      1421
1       156
2         3
3         1
5         2
8         1
9         1
10        7
11        2
12        3
13        2
14        1
15        2
17        3
18        1
20        4
25        5
29        1
30        7
35        1
40        6
45        4
47        1
49        4
50       49
55        2
59        1
60        9
65        5
66        1
70       10
75        5
79        1
80       15
89        1
90        5
99       19
100     134
110       3
111       2
115       2
117       1
120      39
122       1
125       8
129       1
130      15
135       1
139       1
140       9
Name: price, dtype: int64

In [25]:

autos['price'].value_counts().sort_index(ascending=False).head(50)

Out[25]:

99999999    1
27322222    1
12345678    3
11111111    2
10000000    1
3890000     1
1300000     1
1234566     1
999999      2
999990      1
350000      1
345000      1
299000      1
295000      1
265000      1
259000      1
250000      1
220000      1
198000      1
197000      1
194000      1
190000      1
180000      1
175000      1
169999      1
169000      1
163991      1
163500      1
155000      1
151990      1
145000      1
139997      1
137999      1
135000      1
130000      1
129000      1
128000      1
120000      2
119900      1
119500      1
116000      1
115991      1
115000      1
114400      1
109999      1
105000      2
104900      1
99900       2
99000       2
98500       1
Name: price, dtype: int64

Ok, So far So good. We have realized that there are some inconsistancies in the prices which we need to get rid of those as they are not correctly reflected into our data.

I am going to remove anything above 350000 Dollars and anything below 100$

autos=autos[autos['price'].between(100,350000)]

In [26]:

autos['price'].value_counts().sort_index().head(20)

Out[26]:

0     1421
1      156
2        3
3        1
5        2
8        1
9        1
10       7
11       2
12       3
13       2
14       1
15       2
17       3
18       1
20       4
25       5
29       1
30       7
35       1
Name: price, dtype: int64

In [27]:

autos['price'].value_counts().sort_index(ascending=False)

Out[27]:

99999999       1
27322222       1
12345678       3
11111111       2
10000000       1
            ... 
5              2
3              1
2              3
1            156
0           1421
Name: price, Length: 2357, dtype: int64

I have successfully removed the outliers anything above or below the considered values

In [28]:

autos.head(100)

Out[28]:

	date_crawled	name	seller	offer_type	price	abtest	vehicle_type	registration_year	gearbox	power_ps	model	odometer_km	registration_month	fuel_type	brand	unrepaired_damage	ad_created	no_of_pictures	postal_code	last_seen
0	2016-03-26 17:47:46	Peugeot_807_160_NAVTECH_ON_BOARD	privat	Angebot	5000	control	bus	2004	manuell	158	andere	150000	3	lpg	peugeot	nein	2016-03-26 00:00:00	0	79588	2016-04-06 06:45:54
1	2016-04-04 13:38:56	BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik	privat	Angebot	8500	control	limousine	1997	automatik	286	7er	150000	6	benzin	bmw	nein	2016-04-04 00:00:00	0	71034	2016-04-06 14:45:08
2	2016-03-26 18:57:24	Volkswagen_Golf_1.6_United	privat	Angebot	8990	test	limousine	2009	manuell	102	golf	70000	7	benzin	volkswagen	nein	2016-03-26 00:00:00	0	35394	2016-04-06 20:15:37
3	2016-03-12 16:58:10	Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...	privat	Angebot	4350	control	kleinwagen	2007	automatik	71	fortwo	70000	6	benzin	smart	nein	2016-03-12 00:00:00	0	33729	2016-03-15 03:16:28
4	2016-04-01 14:38:50	Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...	privat	Angebot	1350	test	kombi	2003	manuell	0	focus	150000	7	benzin	ford	nein	2016-04-01 00:00:00	0	39218	2016-04-01 14:38:50
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
95	2016-03-30 15:59:17	Jaguar_XJ_5.0_V8_Portfolio	privat	Angebot	36000	test	limousine	2012	automatik	385	andere	40000	8	benzin	jaguar	nein	2016-03-30 00:00:00	0	21720	2016-04-07 06:44:54
96	2016-03-07 20:56:44	Volkswagen_Golf_Variant_1.9_TDI_Joker	privat	Angebot	1000	test	kombi	1997	manuell	110	golf	150000	6	diesel	volkswagen	nein	2016-03-07 00:00:00	0	96106	2016-04-05 23:45:36
97	2016-04-01 16:54:40	Treuer_Begleiter_abzugeben	privat	Angebot	800	control	bus	2000	manuell	63	andere	125000	8	benzin	mazda	ja	2016-04-01 00:00:00	0	87648	2016-04-05 12:17:40
98	2016-04-04 15:47:36	Audi_A4_Cabriolet_2.0_TFSI_S_Line	privat	Angebot	13500	test	cabrio	2008	automatik	200	a4	125000	5	benzin	audi	NaN	2016-04-04 00:00:00	0	66459	2016-04-06 17:17:56
99	2016-04-05 09:48:54	Peugeot_207_CC___Cabrio_Bj_2011	privat	Angebot	0	control	cabrio	2011	manuell	0	2_reihe	60000	7	diesel	peugeot	nein	2016-04-05 00:00:00	0	99735	2016-04-07 12:17:34

100 rows × 20 columns

In [29]:

autos['abtest'].value_counts().sort_index().head()

Out[29]:

control    24244
test       25756
Name: abtest, dtype: int64

In [30]:

autos['abtest'].unique()

Out[30]:

array(['control', 'test'], dtype=object)

In [31]:

autos['abtest'].describe()

Out[31]:

count     50000
unique        2
top        test
freq      25756
Name: abtest, dtype: object

In [32]:

autos['date_crawled'].head()

Out[32]:

0    2016-03-26 17:47:46
1    2016-04-04 13:38:56
2    2016-03-26 18:57:24
3    2016-03-12 16:58:10
4    2016-04-01 14:38:50
Name: date_crawled, dtype: object

In [33]:

autos['last_seen'].head()

Out[33]:

0    2016-04-06 06:45:54
1    2016-04-06 14:45:08
2    2016-04-06 20:15:37
3    2016-03-15 03:16:28
4    2016-04-01 14:38:50
Name: last_seen, dtype: object

In [34]:

autos['ad_created'].head()

Out[34]:

0    2016-03-26 00:00:00
1    2016-04-04 00:00:00
2    2016-03-26 00:00:00
3    2016-03-12 00:00:00
4    2016-04-01 00:00:00
Name: ad_created, dtype: object

In [35]:

autos['registration_month'].head()

Out[35]:

0    3
1    6
2    7
3    6
4    7
Name: registration_month, dtype: int64

In [36]:

autos['registration_year'].head()

Out[36]:

0    2004
1    1997
2    2009
3    2007
4    2003
Name: registration_year, dtype: int64

In [37]:

autos['registration_month'].describe()

Out[37]:

count    50000.000000
mean         5.723360
std          3.711984
min          0.000000
25%          3.000000
50%          6.000000
75%          9.000000
max         12.000000
Name: registration_month, dtype: float64

In [38]:

autos['registration_month'].value_counts().sort_index()

Out[38]:

0     5075
1     3282
2     3008
3     5071
4     4102
5     4107
6     4368
7     3949
8     3191
9     3389
10    3651
11    3360
12    3447
Name: registration_month, dtype: int64

In [39]:

crawled=autos['date_crawled'].str[:10]
crawled.value_counts(normalize=True,dropna=False).sort_index()

Out[39]:

2016-03-05    0.02538
2016-03-06    0.01394
2016-03-07    0.03596
2016-03-08    0.03330
2016-03-09    0.03322
2016-03-10    0.03212
2016-03-11    0.03248
2016-03-12    0.03678
2016-03-13    0.01556
2016-03-14    0.03662
2016-03-15    0.03398
2016-03-16    0.02950
2016-03-17    0.03152
2016-03-18    0.01306
2016-03-19    0.03490
2016-03-20    0.03782
2016-03-21    0.03752
2016-03-22    0.03294
2016-03-23    0.03238
2016-03-24    0.02910
2016-03-25    0.03174
2016-03-26    0.03248
2016-03-27    0.03104
2016-03-28    0.03484
2016-03-29    0.03418
2016-03-30    0.03362
2016-03-31    0.03192
2016-04-01    0.03380
2016-04-02    0.03540
2016-04-03    0.03868
2016-04-04    0.03652
2016-04-05    0.01310
2016-04-06    0.00318
2016-04-07    0.00142
Name: date_crawled, dtype: float64

Here I have used str method on pandas series 'date_crawled' to focus on the date and day of the crawled data.

Then, I have used value_counts with normalized argument to change the number of occurence of specific date to Percentage. and then I used sort_index function to sort them out based on ascending order.

I am going to do the same for 'ad_created' and 'last_seen' columns since they are both recognized as Oject(string) in Pandas.

In [40]:

autos['ad_created'].str[:10].value_counts(normalize=True, dropna=False).sort_index()

Out[40]:

2015-06-11    0.00002
2015-08-10    0.00002
2015-09-09    0.00002
2015-11-10    0.00002
2015-12-05    0.00002
               ...   
2016-04-03    0.03892
2016-04-04    0.03688
2016-04-05    0.01184
2016-04-06    0.00326
2016-04-07    0.00128
Name: ad_created, Length: 76, dtype: float64

In [41]:

autos['last_seen'].str[:10].value_counts(normalize=True, dropna=False).sort_index()

Out[41]:

2016-03-05    0.00108
2016-03-06    0.00442
2016-03-07    0.00536
2016-03-08    0.00760
2016-03-09    0.00986
2016-03-10    0.01076
2016-03-11    0.01252
2016-03-12    0.02382
2016-03-13    0.00898
2016-03-14    0.01280
2016-03-15    0.01588
2016-03-16    0.01644
2016-03-17    0.02792
2016-03-18    0.00742
2016-03-19    0.01574
2016-03-20    0.02070
2016-03-21    0.02074
2016-03-22    0.02158
2016-03-23    0.01858
2016-03-24    0.01956
2016-03-25    0.01920
2016-03-26    0.01696
2016-03-27    0.01602
2016-03-28    0.02086
2016-03-29    0.02234
2016-03-30    0.02484
2016-03-31    0.02384
2016-04-01    0.02310
2016-04-02    0.02490
2016-04-03    0.02536
2016-04-04    0.02462
2016-04-05    0.12428
2016-04-06    0.22100
2016-04-07    0.13092
Name: last_seen, dtype: float64

In [42]:

autos['last_seen'].describe()

Out[42]:

count                   50000
unique                  39481
top       2016-04-07 06:17:27
freq                        8
Name: last_seen, dtype: object

In [43]:

autos['registration_year'].describe()

Out[43]:

count    50000.000000
mean      2005.073280
std        105.712813
min       1000.000000
25%       1999.000000
50%       2003.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

As you can see from the result above there are Two particular number which do not make any sense. min value is =1000 max value is=9999

I am going to limit the domain and consider values between 1900 up to 2016

In [44]:

autos=autos[autos['registration_year'].between(1960,2017)]

In [45]:

autos['registration_year'].value_counts().sort_index()

Out[45]:

1960      34
1961       6
1962       4
1963       9
1964      12
1965      17
1966      22
1967      27
1968      26
1969      19
1970      45
1971      27
1972      35
1973      26
1974      24
1975      19
1976      27
1977      22
1978      47
1979      35
1980      97
1981      31
1982      43
1983      53
1984      53
1985     105
1986      76
1987      75
1988     142
1989     181
1990     395
1991     356
1992     391
1993     445
1994     660
1995    1313
1996    1444
1997    2028
1998    2453
1999    3000
2000    3354
2001    2703
2002    2533
2003    2727
2004    2737
2005    3015
2006    2708
2007    2304
2008    2231
2009    2098
2010    1597
2011    1634
2012    1323
2013     806
2014     666
2015     399
2016    1316
2017    1453
Name: registration_year, dtype: int64

In [46]:

autos['brand'].unique()

Out[46]:

array(['peugeot', 'bmw', 'volkswagen', 'smart', 'ford', 'chrysler',
       'seat', 'renault', 'mercedes_benz', 'audi', 'sonstige_autos',
       'opel', 'mazda', 'porsche', 'mini', 'toyota', 'dacia', 'nissan',
       'jeep', 'saab', 'volvo', 'mitsubishi', 'jaguar', 'fiat', 'skoda',
       'subaru', 'kia', 'citroen', 'chevrolet', 'hyundai', 'honda',
       'daewoo', 'suzuki', 'trabant', 'land_rover', 'alfa_romeo', 'lada',
       'rover', 'daihatsu', 'lancia'], dtype=object)

In [47]:

brands=autos['brand'].value_counts()
brands

Out[47]:

volkswagen        10564
opel               5387
bmw                5382
mercedes_benz      4682
audi               4254
ford               3445
renault            2367
peugeot            1449
fiat               1288
seat                914
skoda               781
mazda               751
nissan              749
smart               695
citroen             692
toyota              615
sonstige_autos      513
hyundai             485
volvo               453
mini                424
mitsubishi          399
honda               391
kia                 353
alfa_romeo          326
porsche             292
suzuki              289
chevrolet           277
chrysler            181
dacia               128
daihatsu            126
jeep                109
subaru              107
land_rover           98
saab                 78
daewoo               78
trabant              76
jaguar               76
rover                69
lancia               56
lada                 29
Name: brand, dtype: int64

In [48]:

brands.index[0:20]

Out[48]:

Index(['volkswagen', 'opel', 'bmw', 'mercedes_benz', 'audi', 'ford', 'renault',
       'peugeot', 'fiat', 'seat', 'skoda', 'mazda', 'nissan', 'smart',
       'citroen', 'toyota', 'sonstige_autos', 'hyundai', 'volvo', 'mini'],
      dtype='object')

I have decided to create aggregate data of top 20 brands and price average and put it in the dictionary variable.

In [49]:

my_dict={}
for i in brands.index[0:6]:
    a=autos[autos['brand']==i]
    b=a['price'].mean()
    my_dict[i]=int(b)
print(my_dict)
    
    

{'volkswagen': 6413, 'opel': 5144, 'bmw': 8278, 'mercedes_benz': 29679, 'audi': 8989, 'ford': 7111}

Audi, BMW and Mercedes Benz are more expensive Ford and Opel are less expensive Volkswagen is in between

In [50]:

avg_kilometers={}
for i in brands.index[0:6]:
    a=autos[autos['brand']==i]
    b=a['odometer_km'].mean()
    
    avg_kilometers[i]=int(b)

print(avg_kilometers)
    

{'volkswagen': 128977, 'opel': 129356, 'bmw': 132544, 'mercedes_benz': 131168, 'audi': 129561, 'ford': 124184}

Here we calculated the average milage based on top 6 brands

In [51]:

autos.columns

Out[51]:

Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
       'odometer_km', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'no_of_pictures', 'postal_code',
       'last_seen'],
      dtype='object')

Next I am going to converts those dictionaries created to Pandas. Series

In [52]:

avg_price_brand=pd.Series(my_dict)
avg_milage_brand=pd.Series(avg_kilometers)

No I am going to converts these series into DataFram

In [53]:

Avg_Price_Mile=pd.DataFrame(avg_price_brand)

In [54]:

Avg_Price_Mile['avg_milage_brand']=avg_milage_brand

In [55]:

Avg_Price_Mile=Avg_Price_Mile.rename(columns={0:'avg_price_brand'})

In [56]:

Avg_Price_Mile.sort_index(ascending=False)

Out[56]:

	avg_price_brand	avg_milage_brand
volkswagen	6413	128977
opel	5144	129356
mercedes_benz	29679	131168
ford	7111	124184
bmw	8278	132544
audi	8989	129561

The table above shows the Average Price/Milage The minimum average mileage belong to Fold and maximum mileage belong to BMW

The minimum average price belong to Opel and maximum average price accross all top 6 brands is belong to Mercedes_Benz

In [ ]: