Exploring Ebay Car Sales Data¶

by David VanHeeswijk¶

In this project, we'll work with a dataset of used cars from eBay Kleinanzeigen, a classifieds section of the German eBay website.

In [111]:

import pandas as pd

autos = pd.read_csv('autos.csv', encoding ='Latin-1')

In [112]:

autos.info()
autos.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
dateCrawled            50000 non-null object
name                   50000 non-null object
seller                 50000 non-null object
offerType              50000 non-null object
price                  50000 non-null object
abtest                 50000 non-null object
vehicleType            44905 non-null object
yearOfRegistration     50000 non-null int64
gearbox                47320 non-null object
powerPS                50000 non-null int64
model                  47242 non-null object
odometer               50000 non-null object
monthOfRegistration    50000 non-null int64
fuelType               45518 non-null object
brand                  50000 non-null object
notRepairedDamage      40171 non-null object
dateCreated            50000 non-null object
nrOfPictures           50000 non-null int64
postalCode             50000 non-null int64
lastSeen               50000 non-null object
dtypes: int64(5), object(15)
memory usage: 7.6+ MB

Out[112]:

	dateCrawled	name	seller	offerType	price	abtest	vehicleType	yearOfRegistration	gearbox	powerPS	model	odometer	monthOfRegistration	fuelType	brand	notRepairedDamage	dateCreated	postalCode	lastSeen
0	2016-03-26 17:47:46	Peugeot_807_160_NAVTECH_ON_BOARD	privat	Angebot	$5,000	control	bus	2004	manuell	158	andere	150,000km	3	lpg	peugeot	nein	2016-03-26 00:00:00	79588	2016-04-06 06:45:54
1	2016-04-04 13:38:56	BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik	privat	Angebot	$8,500	control	limousine	1997	automatik	286	7er	150,000km	6	benzin	bmw	nein	2016-04-04 00:00:00	71034	2016-04-06 14:45:08
2	2016-03-26 18:57:24	Volkswagen_Golf_1.6_United	privat	Angebot	$8,990	test	limousine	2009	manuell	102	golf	70,000km	7	benzin	volkswagen	nein	2016-03-26 00:00:00	35394	2016-04-06 20:15:37
3	2016-03-12 16:58:10	Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...	privat	Angebot	$4,350	control	kleinwagen	2007	automatik	71	fortwo	70,000km	6	benzin	smart	nein	2016-03-12 00:00:00	33729	2016-03-15 03:16:28
4	2016-04-01 14:38:50	Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...	privat	Angebot	$1,350	test	kombi	2003	manuell	0	focus	150,000km	7	benzin	ford	nein	2016-04-01 00:00:00	39218	2016-04-01 14:38:50

In [113]:

autos.columns

Out[113]:

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')

We see that the column names are all in camelcase, which capitalizes every word after the first one. We would like to replace these column names with the snakecase version of each name, which places an underscore between each word.

In [114]:

autos = autos.rename(columns={'yearOfRegistration':'registration_year'}).rename(columns={'monthOfRegistration':'registration_month'}).rename(columns={'notRepairedDamage':'unrepaired_damage'}).rename(columns={'dateCreated':'ad_created'})
autos.columns

Out[114]:

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'registration_year', 'gearbox', 'powerPS', 'model',
       'odometer', 'registration_month', 'fuelType', 'brand',
       'unrepaired_damage', 'ad_created', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')

In [115]:

autos = autos.rename(columns={'dateCrawled':'date_crawled'}).rename(columns = {'offerType':'offer_type'}).rename(columns = {'vehicleType':'vehicle_type'}).rename(columns = {'powerPS':'power_ps'}).rename(columns={'fuelType':'fuel_type'}).rename(columns={'nrOfPictures':'nr_of_pictures'}).rename(columns={'postalCode':'postal_code'}).rename(columns={'lastSeen':'last_seen'})
autos.columns

Out[115]:

Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code',
       'last_seen'],
      dtype='object')

In [116]:

autos.head()

Out[116]:

	date_crawled	name	seller	offer_type	price	abtest	vehicle_type	registration_year	gearbox	power_ps	model	odometer	registration_month	fuel_type	brand	unrepaired_damage	ad_created	postal_code	last_seen
0	2016-03-26 17:47:46	Peugeot_807_160_NAVTECH_ON_BOARD	privat	Angebot	$5,000	control	bus	2004	manuell	158	andere	150,000km	3	lpg	peugeot	nein	2016-03-26 00:00:00	79588	2016-04-06 06:45:54
1	2016-04-04 13:38:56	BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik	privat	Angebot	$8,500	control	limousine	1997	automatik	286	7er	150,000km	6	benzin	bmw	nein	2016-04-04 00:00:00	71034	2016-04-06 14:45:08
2	2016-03-26 18:57:24	Volkswagen_Golf_1.6_United	privat	Angebot	$8,990	test	limousine	2009	manuell	102	golf	70,000km	7	benzin	volkswagen	nein	2016-03-26 00:00:00	35394	2016-04-06 20:15:37
3	2016-03-12 16:58:10	Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...	privat	Angebot	$4,350	control	kleinwagen	2007	automatik	71	fortwo	70,000km	6	benzin	smart	nein	2016-03-12 00:00:00	33729	2016-03-15 03:16:28
4	2016-04-01 14:38:50	Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...	privat	Angebot	$1,350	test	kombi	2003	manuell	0	focus	150,000km	7	benzin	ford	nein	2016-04-01 00:00:00	39218	2016-04-01 14:38:50

The column names have now all been changed to reflect the style of snakecase.

In [117]:

autos.describe()

Out[117]:

	registration_year	power_ps	registration_month	nr_of_pictures	postal_code
count	50000.000000	50000.000000	50000.000000	50000.0	50000.000000
mean	2005.073280	116.355920	5.723360	0.0	50813.627300
std	105.712813	209.216627	3.711984	0.0	25779.747957
min	1000.000000	0.000000	0.000000	0.0	1067.000000
25%	1999.000000	70.000000	3.000000	0.0	30451.000000
50%	2003.000000	105.000000	6.000000	0.0	49577.000000
75%	2008.000000	150.000000	9.000000	0.0	71540.000000
max	9999.000000	17700.000000	12.000000	0.0	99998.000000

As we can see, the column for "nr_of_pictures" doesn't give any significant amount of data, as all values are equivalent, so we will need to explore this column to see if we can drop it from our analysis.

We also see that the registration year doesn't match what would be acceptable in terms of the minimum value of '1000' and the maximum value of '9999'. We will have to look into why these values come up in the column and adjust any info that is invalid.

Lastly, a few columns are listed as string values when they could be converted to integers. Namely, the prices and the odometer readings can be converted to numeric values and added to our analysis.

In [118]:

autos['odometer'] = autos['odometer'].str.replace('km','').str.replace(',','').astype(int)
autos['odometer'].head()

Out[118]:

0    150000
1    150000
2     70000
3     70000
4    150000
Name: odometer, dtype: int64

In [119]:

autos['price'] = autos['price'].str.replace('$','').str.replace(',','').astype(int)
autos['price'].head()

Out[119]:

0    5000
1    8500
2    8990
3    4350
4    1350
Name: price, dtype: int64

In [120]:

autos['unrepaired_damage'] = autos['unrepaired_damage'].str.replace('nein','0').str.replace('ja','1').str.replace('nan','0').astype(bool)

In [121]:

autos['unrepaired_damage'].head()

Out[121]:

0    True
1    True
2    True
3    True
4    True
Name: unrepaired_damage, dtype: bool

In [122]:

autos.describe()

Out[122]:

	price	registration_year	power_ps	odometer	registration_month	nr_of_pictures	postal_code
count	5.000000e+04	50000.000000	50000.000000	50000.000000	50000.000000	50000.0	50000.000000
mean	9.840044e+03	2005.073280	116.355920	125732.700000	5.723360	0.0	50813.627300
std	4.811044e+05	105.712813	209.216627	40042.211706	3.711984	0.0	25779.747957
min	0.000000e+00	1000.000000	0.000000	5000.000000	0.000000	0.0	1067.000000
25%	1.100000e+03	1999.000000	70.000000	125000.000000	3.000000	0.0	30451.000000
50%	2.950000e+03	2003.000000	105.000000	150000.000000	6.000000	0.0	49577.000000
75%	7.200000e+03	2008.000000	150.000000	150000.000000	9.000000	0.0	71540.000000
max	1.000000e+08	9999.000000	17700.000000	150000.000000	12.000000	0.0	99998.000000

In [123]:

autos['registration_year'].unique()

Out[123]:

array([2004, 1997, 2009, 2007, 2003, 2006, 1995, 1998, 2000, 2017, 2010,
       1999, 1982, 1990, 2015, 2014, 1996, 1992, 2005, 2002, 2012, 2011,
       2008, 1985, 2016, 1994, 1986, 2001, 2018, 2013, 1972, 1993, 1988,
       1989, 1967, 1973, 1956, 1976, 4500, 1987, 1991, 1983, 1960, 1969,
       1950, 1978, 1980, 1984, 1963, 1977, 1961, 1968, 1934, 1965, 1971,
       1966, 1979, 1981, 1970, 1974, 1910, 1975, 5000, 4100, 2019, 1959,
       9996, 9999, 6200, 1964, 1958, 1800, 1948, 1931, 1943, 9000, 1941,
       1962, 1927, 1937, 1929, 1000, 1957, 1952, 1111, 1955, 1939, 8888,
       1954, 1938, 2800, 5911, 1500, 1953, 1951, 4800, 1001])

In [124]:

autos.rename(columns={'odometer':'odometer_km'},inplace=True)

In [125]:

autos.head()

Out[125]:

	date_crawled	name	seller	offer_type	price	abtest	vehicle_type	registration_year	gearbox	power_ps	model	odometer_km	registration_month	fuel_type	brand	unrepaired_damage	ad_created	postal_code	last_seen
0	2016-03-26 17:47:46	Peugeot_807_160_NAVTECH_ON_BOARD	privat	Angebot	5000	control	bus	2004	manuell	158	andere	150000	3	lpg	peugeot	True	2016-03-26 00:00:00	79588	2016-04-06 06:45:54
1	2016-04-04 13:38:56	BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik	privat	Angebot	8500	control	limousine	1997	automatik	286	7er	150000	6	benzin	bmw	True	2016-04-04 00:00:00	71034	2016-04-06 14:45:08
2	2016-03-26 18:57:24	Volkswagen_Golf_1.6_United	privat	Angebot	8990	test	limousine	2009	manuell	102	golf	70000	7	benzin	volkswagen	True	2016-03-26 00:00:00	35394	2016-04-06 20:15:37
3	2016-03-12 16:58:10	Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...	privat	Angebot	4350	control	kleinwagen	2007	automatik	71	fortwo	70000	6	benzin	smart	True	2016-03-12 00:00:00	33729	2016-03-15 03:16:28
4	2016-04-01 14:38:50	Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...	privat	Angebot	1350	test	kombi	2003	manuell	0	focus	150000	7	benzin	ford	True	2016-04-01 00:00:00	39218	2016-04-01 14:38:50

In [126]:

autos['seller'].unique()

Out[126]:

array(['privat', 'gewerblich'], dtype=object)

In [127]:

autos['offer_type'].unique()

Out[127]:

array(['Angebot', 'Gesuch'], dtype=object)

In [128]:

autos['odometer_km'].unique().shape

Out[128]:

(13,)

In [129]:

autos['odometer_km'].describe()

Out[129]:

count     50000.000000
mean     125732.700000
std       40042.211706
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64

In [130]:

autos['odometer_km'].value_counts()

Out[130]:

150000    32424
125000     5170
100000     2169
90000      1757
80000      1436
70000      1230
60000      1164
50000      1027
5000        967
40000       819
30000       789
20000       784
10000       264
Name: odometer_km, dtype: int64

In [131]:

autos['price'].unique().shape

Out[131]:

(2357,)

In [132]:

autos['price'].describe()

Out[132]:

count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64

In [133]:

autos['price'].value_counts().head()

Out[133]:

0       1421
500      781
1500     734
2500     643
1000     639
Name: price, dtype: int64

In [134]:

autos['price'].value_counts().sort_index().head()

Out[134]:

0    1421
1     156
2       3
3       1
5       2
Name: price, dtype: int64

Now, we will look to remove any outliers in our data that may affect our analysis. We will start by removing any cars who have a listed price of 0, as this is not likely to be accurate. Similarly, we will eliminate any values that seem to be too high.

Then, we will look into the 'odometer_km' column for any values that are either too high or too low, then remove the rows that correspond with these outliers.

In [135]:

autos['price'].describe()

Out[135]:

count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64

In [136]:

autos = autos[autos['price'].between(10,500000)]

In [137]:

autos['price'].sort_values()

Out[137]:

27658        10
35009        10
17079        10
33893        10
44597        10
29246        10
42186        10
35555        11
9255         11
15747        12
18443        12
3416         12
5132         13
4598         13
19632        14
40856        15
45157        15
30587        17
38552        17
27813        17
8570         18
37141        20
26772        20
46835        20
33200        20
18592        25
33838        25
43572        25
36026        25
14322        25
          ...  
43282    119900
44406    120000
2751     120000
8232     128000
1878     129000
49815    130000
14268    135000
2454     137999
32185    139997
49668    145000
33638    151990
10500    155000
45387    163500
22673    163991
18509    169000
32840    169999
11433    175000
17140    180000
20351    190000
28090    194000
43668    197000
40918    198000
37840    220000
38299    250000
47337    259000
12682    265000
35923    295000
34723    299000
14715    345000
36818    350000
Name: price, Length: 48401, dtype: int64

In [138]:

autos['odometer_km'].describe()

Out[138]:

count     48401.000000
mean     125799.570257
std       39722.502478
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64

In [139]:

autos['odometer_km'].sort_values()

Out[139]:

15428      5000
29518      5000
38385      5000
6750       5000
2040       5000
12971      5000
5031       5000
844        5000
18003      5000
2029       5000
33120      5000
22849      5000
18014      5000
20074      5000
26188      5000
43514      5000
2006       5000
31828      5000
37022      5000
41672      5000
20060      5000
38336      5000
26139      5000
3416       5000
26138      5000
30656      5000
10119      5000
6675       5000
12855      5000
33845      5000
          ...  
20234    150000
20235    150000
20237    150000
20238    150000
20239    150000
20240    150000
20241    150000
20242    150000
20243    150000
20223    150000
20370    150000
20222    150000
20220    150000
20193    150000
20195    150000
20196    150000
20198    150000
20199    150000
20200    150000
20201    150000
20202    150000
20203    150000
20206    150000
20207    150000
20208    150000
20214    150000
20216    150000
20217    150000
20221    150000
49999    150000
Name: odometer_km, Length: 48401, dtype: int64

We removed all prices under $10 and over $500,000 since these were both outliers in regards to our expectations. Additionally, we also analyzed the odometer readings after removing the outliers from the prices and found that there were no remaining outliers in this column, so we left the data set as it was.

Next, we will look to convert the 'date_crawled', 'last_seen', 'ad_created', 'registration_month', and 'registration_year' columns into data that we can understand and analyze quantitatively.

In [140]:

autos.describe()

Out[140]:

	price	registration_year	power_ps	odometer_km	registration_month	nr_of_pictures	postal_code
count	48401.000000	48401.000000	48401.000000	48401.000000	48401.000000	48401.0	48401.000000
mean	5908.885457	2004.774715	117.362079	125799.570257	5.790665	0.0	50993.775418
std	9068.695231	88.790201	200.923769	39722.502478	3.682315	0.0	25744.823947
min	10.000000	1000.000000	0.000000	5000.000000	0.000000	0.0	1067.000000
25%	1250.000000	1999.000000	72.000000	125000.000000	3.000000	0.0	30823.000000
50%	3000.000000	2004.000000	107.000000	150000.000000	6.000000	0.0	49733.000000
75%	7498.000000	2008.000000	150.000000	150000.000000	9.000000	0.0	71679.000000
max	350000.000000	9999.000000	17700.000000	150000.000000	12.000000	0.0	99998.000000

In [141]:

autos['registration_year'].head()

Out[141]:

0    2004
1    1997
2    2009
3    2007
4    2003
Name: registration_year, dtype: int64

In [142]:

autos['date_crawled'].head()

Out[142]:

0    2016-03-26 17:47:46
1    2016-04-04 13:38:56
2    2016-03-26 18:57:24
3    2016-03-12 16:58:10
4    2016-04-01 14:38:50
Name: date_crawled, dtype: object

In [143]:

autos['date_crawled'].str[:10].value_counts(normalize=True, dropna=False)

Out[143]:

2016-04-03    0.038574
2016-03-20    0.037809
2016-03-21    0.037293
2016-03-12    0.036962
2016-03-14    0.036631
2016-04-04    0.036487
2016-03-07    0.036053
2016-04-02    0.035495
2016-03-28    0.034855
2016-03-19    0.034751
2016-03-15    0.034276
2016-03-29    0.034111
2016-04-01    0.033739
2016-03-30    0.033739
2016-03-08    0.033264
2016-03-09    0.033037
2016-03-22    0.032933
2016-03-11    0.032603
2016-03-23    0.032272
2016-03-26    0.032251
2016-03-10    0.032210
2016-03-31    0.031818
2016-03-17    0.031611
2016-03-25    0.031570
2016-03-27    0.031115
2016-03-16    0.029504
2016-03-24    0.029400
2016-03-05    0.025371
2016-03-13    0.015661
2016-03-06    0.014070
2016-04-05    0.013078
2016-03-18    0.012913
2016-04-06    0.003161
2016-04-07    0.001384
Name: date_crawled, dtype: float64

The analysis performed above shows that the crawler created all of its entries between March 1, 2016 and April 30, 2016, with a greater percentage happening in late March/ early April.

In [144]:

autos['date_crawled'].str[:10].sort_values()

Out[144]:

33784    2016-03-05
34224    2016-03-05
20939    2016-03-05
28933    2016-03-05
20958    2016-03-05
9850     2016-03-05
13612    2016-03-05
47733    2016-03-05
20965    2016-03-05
13620    2016-03-05
13602    2016-03-05
34272    2016-03-05
20991    2016-03-05
21011    2016-03-05
21016    2016-03-05
47746    2016-03-05
42671    2016-03-05
28960    2016-03-05
3456     2016-03-05
20977    2016-03-05
21039    2016-03-05
42785    2016-03-05
42849    2016-03-05
3304     2016-03-05
26925    2016-03-05
42954    2016-03-05
3311     2016-03-05
34139    2016-03-05
26924    2016-03-05
9853     2016-03-05
            ...    
8576     2016-04-07
41121    2016-04-07
12119    2016-04-07
33996    2016-04-07
44551    2016-04-07
31608    2016-04-07
44797    2016-04-07
49569    2016-04-07
6183     2016-04-07
42072    2016-04-07
2213     2016-04-07
9363     2016-04-07
29938    2016-04-07
29853    2016-04-07
4719     2016-04-07
44894    2016-04-07
24853    2016-04-07
26209    2016-04-07
19809    2016-04-07
21527    2016-04-07
4752     2016-04-07
7982     2016-04-07
21510    2016-04-07
37112    2016-04-07
23409    2016-04-07
4378     2016-04-07
16358    2016-04-07
39445    2016-04-07
3654     2016-04-07
38743    2016-04-07
Name: date_crawled, Length: 48401, dtype: object

Upon further inspection, it seems that every entry was entered between March 5 and April 7 of 2016.

In [145]:

autos['ad_created'].str[:10].value_counts(normalize=True, dropna=False)

Out[145]:

2016-04-03    0.038822
2016-03-20    0.037871
2016-03-21    0.037520
2016-04-04    0.036859
2016-03-12    0.036776
2016-03-14    0.035268
2016-04-02    0.035185
2016-03-28    0.034979
2016-03-07    0.034793
2016-03-29    0.034049
2016-03-15    0.034008
2016-04-01    0.033718
2016-03-19    0.033636
2016-03-30    0.033553
2016-03-08    0.033264
2016-03-09    0.033119
2016-03-11    0.032933
2016-03-22    0.032747
2016-03-26    0.032313
2016-03-23    0.032107
2016-03-10    0.031921
2016-03-31    0.031859
2016-03-25    0.031694
2016-03-17    0.031280
2016-03-27    0.031032
2016-03-16    0.029999
2016-03-24    0.029338
2016-03-05    0.022933
2016-03-13    0.017024
2016-03-06    0.015330
                ...   
2016-02-14    0.000041
2016-02-12    0.000041
2016-02-24    0.000041
2016-02-02    0.000041
2016-02-20    0.000041
2016-02-05    0.000041
2016-02-18    0.000041
2016-01-10    0.000041
2016-02-26    0.000041
2015-09-09    0.000021
2016-01-16    0.000021
2015-06-11    0.000021
2016-02-16    0.000021
2016-01-07    0.000021
2016-02-08    0.000021
2016-02-07    0.000021
2016-02-01    0.000021
2016-01-22    0.000021
2016-01-13    0.000021
2016-02-22    0.000021
2016-01-29    0.000021
2016-02-09    0.000021
2015-12-05    0.000021
2016-01-14    0.000021
2015-12-30    0.000021
2015-08-10    0.000021
2016-01-03    0.000021
2016-02-17    0.000021
2016-02-11    0.000021
2015-11-10    0.000021
Name: ad_created, Length: 76, dtype: float64

Similar to our observation before, a great deal of the ads in our data set were created in mid to late March and early April of 2016.

In [146]:

autos['ad_created'].str[:10].sort_values()

Out[146]:

22781    2015-06-11
20649    2015-08-10
34883    2015-09-09
2243     2015-11-10
27986    2015-12-05
36993    2015-12-30
2232     2016-01-03
39270    2016-01-07
24851    2016-01-10
49743    2016-01-10
34224    2016-01-13
23708    2016-01-14
36429    2016-01-16
12518    2016-01-22
10942    2016-01-27
4696     2016-01-27
41508    2016-01-27
10735    2016-01-29
13756    2016-02-01
34058    2016-02-02
15081    2016-02-02
46108    2016-02-05
11432    2016-02-05
38294    2016-02-07
25855    2016-02-08
48887    2016-02-09
23602    2016-02-11
9706     2016-02-12
19356    2016-02-12
9467     2016-02-14
            ...    
39045    2016-04-07
1138     2016-04-07
29853    2016-04-07
30532    2016-04-07
23957    2016-04-07
7536     2016-04-07
8036     2016-04-07
20941    2016-04-07
21527    2016-04-07
49569    2016-04-07
12119    2016-04-07
10765    2016-04-07
4378     2016-04-07
44797    2016-04-07
37268    2016-04-07
42822    2016-04-07
45786    2016-04-07
24929    2016-04-07
47709    2016-04-07
22899    2016-04-07
47588    2016-04-07
48922    2016-04-07
11538    2016-04-07
13976    2016-04-07
37112    2016-04-07
41121    2016-04-07
16304    2016-04-07
29938    2016-04-07
39445    2016-04-07
48576    2016-04-07
Name: ad_created, Length: 48401, dtype: object

When we look at the values in the 'ad_created' column sorted by date, we notice that the earliest dates any ads were created is June 11, 2015 while the most recent ads recorded are from April 7, 2016.

In [147]:

autos['last_seen'].str[:10].value_counts(normalize=True, dropna=False)

Out[147]:

2016-04-06    0.221607
2016-04-07    0.132043
2016-04-05    0.124956
2016-03-17    0.028057
2016-04-03    0.025185
2016-04-02    0.024855
2016-03-30    0.024752
2016-04-04    0.024504
2016-03-31    0.023842
2016-03-12    0.023781
2016-04-01    0.022871
2016-03-29    0.022334
2016-03-22    0.021384
2016-03-28    0.020867
2016-03-20    0.020640
2016-03-21    0.020599
2016-03-24    0.019752
2016-03-25    0.019194
2016-03-23    0.018595
2016-03-26    0.016818
2016-03-16    0.016446
2016-03-15    0.015867
2016-03-19    0.015826
2016-03-27    0.015599
2016-03-14    0.012624
2016-03-11    0.012376
2016-03-10    0.010620
2016-03-09    0.009628
2016-03-13    0.008863
2016-03-08    0.007355
2016-03-18    0.007335
2016-03-07    0.005413
2016-03-06    0.004339
2016-03-05    0.001074
Name: last_seen, dtype: float64

When we observe the percentages of ads based on the 'last_seen' column, we see that over half of the ads have been checked on between April 5-7, which corresponds to the latest dates that the data was collected.

In [148]:

autos['last_seen'].str[:10].sort_values()

Out[148]:

24801    2016-03-05
27294    2016-03-05
34030    2016-03-05
13133    2016-03-05
39178    2016-03-05
1736     2016-03-05
22493    2016-03-05
48946    2016-03-05
9628     2016-03-05
30389    2016-03-05
11329    2016-03-05
4488     2016-03-05
40848    2016-03-05
33684    2016-03-05
3580     2016-03-05
1027     2016-03-05
36848    2016-03-05
48668    2016-03-05
3983     2016-03-05
18611    2016-03-05
33712    2016-03-05
37880    2016-03-05
34139    2016-03-05
2346     2016-03-05
10942    2016-03-05
19738    2016-03-05
669      2016-03-05
26529    2016-03-05
10142    2016-03-05
31117    2016-03-05
            ...    
18978    2016-04-07
18983    2016-04-07
18998    2016-04-07
19007    2016-04-07
42300    2016-04-07
19010    2016-04-07
19014    2016-04-07
19022    2016-04-07
19024    2016-04-07
19039    2016-04-07
19040    2016-04-07
19042    2016-04-07
19043    2016-04-07
19046    2016-04-07
19111    2016-04-07
19110    2016-04-07
42265    2016-04-07
19091    2016-04-07
19089    2016-04-07
42272    2016-04-07
18593    2016-04-07
19083    2016-04-07
42275    2016-04-07
19063    2016-04-07
19058    2016-04-07
42282    2016-04-07
42283    2016-04-07
19047    2016-04-07
19080    2016-04-07
17702    2016-04-07
Name: last_seen, Length: 48401, dtype: object

Upon further notice, every ad in our database has a last seen date between March 5 and April 7 of 2016, precisely equivalent to the date_crawled column in regards to the range.

In [149]:

autos['registration_year'].describe()

Out[149]:

count    48401.000000
mean      2004.774715
std         88.790201
min       1000.000000
25%       1999.000000
50%       2004.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

We notice that there are cars whose registration year is marked as '1000' and some are much higher than '2016', which cannot be the case. This means that we will have to remove some data, as the registration years column has outliers that do not make sense.

In [150]:

autos['registration_year'].sort_values(ascending=False)

Out[150]:

8012     9999
33950    9999
38076    9999
49910    9000
25003    8888
8360     6200
27618    5911
24519    5000
22799    5000
4164     5000
49153    5000
42079    4800
453      4500
4549     4100
27578    2800
5763     2019
49185    2019
34800    2018
17069    2018
31810    2018
24334    2018
5950     2018
6321     2018
22339    2018
9575     2018
41840    2018
38978    2018
27672    2018
6597     2018
21379    2018
         ... 
25556    1954
35453    1953
23372    1952
44406    1951
35921    1951
14020    1950
11047    1948
11585    1943
25792    1941
13963    1941
24855    1939
26103    1938
21421    1937
23804    1937
39725    1937
26607    1937
2573     1934
2221     1934
11246    1931
22101    1929
21416    1927
28693    1910
45157    1910
22659    1910
30781    1910
10556    1800
32585    1800
24511    1111
49283    1001
22316    1000
Name: registration_year, Length: 48401, dtype: int64

In [151]:

autos = autos[autos['registration_year'].between(1899,2016)]

In [152]:

autos['registration_year'].describe()

Out[152]:

count    46524.000000
mean      2002.926468
std          7.155528
min       1910.000000
25%       1999.000000
50%       2003.000000
75%       2008.000000
max       2016.000000
Name: registration_year, dtype: float64

Upon removing the outliers, now our data, in regards to registration year, makes more sense.

In [153]:

autos['brand'].value_counts()

Out[153]:

volkswagen        9833
bmw               5116
opel              4994
mercedes_benz     4488
audi              4030
ford              3256
renault           2193
peugeot           1390
fiat              1195
seat               849
skoda              764
nissan             712
mazda              709
smart              661
citroen            653
toyota             593
hyundai            467
sonstige_autos     448
volvo              425
mini               408
mitsubishi         381
honda              365
kia                330
alfa_romeo         309
porsche            283
suzuki             276
chevrolet          265
chrysler           164
dacia              123
daihatsu           117
jeep               106
subaru              99
land_rover          98
saab                77
jaguar              73
daewoo              70
trabant             65
rover               62
lancia              50
lada                27
Name: brand, dtype: int64

In [154]:

top_auto_brands = autos['brand'].value_counts().sort_values(ascending=False).head(20).index
top_autos_price = {}
top_auto_brands

Out[154]:

Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford', 'renault',
       'peugeot', 'fiat', 'seat', 'skoda', 'nissan', 'mazda', 'smart',
       'citroen', 'toyota', 'hyundai', 'sonstige_autos', 'volvo', 'mini'],
      dtype='object')

We have decided to aggregate over the top 20 brands that are in our data set. This should give us relevant info for each brand that shows most frequently.

In [156]:

for car in top_auto_brands:
    brand_row = autos[autos['brand'] == car]
    mean_price = brand_row['price'].mean()
    mean_mileage = brand_row['odometer_km'].mean()
    top_autos_price[car] = mean_price
    
sorted_top_autos_price = sorted(top_autos_price.items(), key=lambda x: x[1], reverse=True)
print(sorted_top_autos_price)

[('sonstige_autos', 12613.939732142857), ('mini', 10639.450980392157), ('audi', 9362.169478908188), ('mercedes_benz', 8657.285427807486), ('bmw', 8367.020719311962), ('skoda', 6384.667539267016), ('volkswagen', 5418.33997762636), ('hyundai', 5376.740899357602), ('toyota', 5167.091062394604), ('volvo', 4969.774117647059), ('nissan', 4750.063202247191), ('seat', 4417.942285041225), ('mazda', 4112.596614950635), ('citroen', 3784.924961715161), ('ford', 3757.5282555282556), ('smart', 3580.2239031770046), ('peugeot', 3100.6928057553955), ('opel', 2991.9164997997595), ('fiat', 2818.456066945607), ('renault', 2483.889192886457)]

In the code above, we created a dictionary that refers to each car brand from our top 20, along with the mean price for each brand. Looking at our results, we see that Sonstige Autos has our highest mean price, followed by Mini and Audi. The lowest mean price is shown to be Renault, with Fiat and Opel rounding out the other two lowest mean price.

Additionally, the mean price for all the car brands in the top 20 falls between $2,483 and $12,614.

In [157]:

top_autos_mileage = {}

for car in top_auto_brands:
    brand_row = autos[autos['brand'] == car]
    mean_price = brand_row['price'].mean()
    mean_mileage = brand_row['odometer_km'].mean()
    top_autos_mileage[car] = mean_mileage

print(top_autos_mileage)

{'audi': 129208.43672456576, 'toyota': 115944.35075885328, 'skoda': 110746.07329842931, 'peugeot': 127122.30215827338, 'fiat': 117066.94560669456, 'opel': 129314.1770124149, 'nissan': 118370.78651685393, 'volkswagen': 128706.39682701108, 'seat': 121166.0777385159, 'ford': 124210.68796068797, 'mercedes_benz': 130838.9037433155, 'citroen': 119647.77947932619, 'hyundai': 106541.75588865097, 'mazda': 124464.03385049365, 'volvo': 138294.11764705883, 'renault': 128144.09484724123, 'bmw': 132553.75293197812, 'mini': 88308.82352941176, 'sonstige_autos': 90446.42857142857, 'smart': 99326.77760968229}

In [158]:

mean_price_series = pd.Series(top_autos)
mean_mileage_series = pd.Series(top_autos_mileage)

In [159]:

print(mean_price_series)

audi               9362.169479
bmw                8367.020719
citroen            3784.924962
fiat               2818.456067
ford               3757.528256
hyundai            5376.740899
mazda              4112.596615
mercedes_benz      8657.285428
mini              10639.450980
nissan             4750.063202
opel               2991.916500
peugeot            3100.692806
renault            2483.889193
seat               4417.942285
skoda              6384.667539
smart              3580.223903
sonstige_autos    12613.939732
toyota             5167.091062
volkswagen         5418.339978
volvo              4969.774118
dtype: float64

In [161]:

print(mean_mileage_series)

audi              129208.436725
bmw               132553.752932
citroen           119647.779479
fiat              117066.945607
ford              124210.687961
hyundai           106541.755889
mazda             124464.033850
mercedes_benz     130838.903743
mini               88308.823529
nissan            118370.786517
opel              129314.177012
peugeot           127122.302158
renault           128144.094847
seat              121166.077739
skoda             110746.073298
smart              99326.777610
sonstige_autos     90446.428571
toyota            115944.350759
volkswagen        128706.396827
volvo             138294.117647
dtype: float64

In [165]:

brand_price_mileage_df = pd.DataFrame({'mean_price': mean_price_series, 'mean_mileage': mean_mileage_series})
brand_price_mileage_df

Out[165]:

	mean_mileage	mean_price
audi	129208.436725	9362.169479
bmw	132553.752932	8367.020719
citroen	119647.779479	3784.924962
fiat	117066.945607	2818.456067
ford	124210.687961	3757.528256
hyundai	106541.755889	5376.740899
mazda	124464.033850	4112.596615
mercedes_benz	130838.903743	8657.285428
mini	88308.823529	10639.450980
nissan	118370.786517	4750.063202
opel	129314.177012	2991.916500
peugeot	127122.302158	3100.692806
renault	128144.094847	2483.889193
seat	121166.077739	4417.942285
skoda	110746.073298	6384.667539
smart	99326.777610	3580.223903
sonstige_autos	90446.428571	12613.939732
toyota	115944.350759	5167.091062
volkswagen	128706.396827	5418.339978
volvo	138294.117647	4969.774118

We have now added average mileage and average price to a data frame for the top 20 brands from our data set. We have created two columns to show the values as such, with the index being the brand names themselves.

Our analysis of the data shows a few observations. The brands Sonstige Autos and Mini, which were both in the top 3 for the highest average price, also happen to be the lowest two in average mileage.

Audi, which was in the top 3 in regards to average price, falls on the top 4 in regards to average mileage. This means that Audis have the highest cost per miles driven in comparison to the other cars in the list.

Opel and Renault both have very high average mileage, which makes the lower average price seem appropriate. In contrast, Fiat falls somewhere in the middle in regards to mileage, which means that the value with regards to price per mile is likely lower than many of the other cars in the list.

In [ ]: