In this guided project, we'll work with a dataset of used cars from eBay Kleinanzeigen, a classifieds section of the German eBay website. The data is 50,000 sampled data points from the full dataset.
# Importing packages and reading in the csv data
import pandas as pd
autos = pd.read_csv("autos.csv", encoding = "Latin-1")
autos
date_crawled | name | seller | offer_type | price_usd | abtest | vehicle_type | registration_year | gearbox | power_PS | model | kilometers | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | 5000 | control | bus | 2004 | manuell | 158 | andere | 150000 | 3 | lpg | peugeot | nein | 2016-03-26 | 0 | 79588 | 2016-04-06 |
1 | 2016-04-04 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | 8500 | control | limousine | 1997 | automatik | 286 | 7er | 150000 | 6 | benzin | bmw | nein | 2016-04-04 | 0 | 71034 | 2016-04-06 |
2 | 2016-03-26 | Volkswagen_Golf_1.6_United | privat | Angebot | 8990 | test | limousine | 2009 | manuell | 102 | golf | 70000 | 7 | benzin | volkswagen | nein | 2016-03-26 | 0 | 35394 | 2016-04-06 |
3 | 2016-03-12 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | 4350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70000 | 6 | benzin | smart | nein | 2016-03-12 | 0 | 33729 | 2016-03-15 |
4 | 2016-04-01 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | 1350 | test | kombi | 2003 | manuell | 0 | focus | 150000 | 7 | benzin | ford | nein | 2016-04-01 | 0 | 39218 | 2016-04-01 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
49995 | 2016-03-27 | Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon | privat | Angebot | 24900 | control | limousine | 2011 | automatik | 239 | q5 | 100000 | 1 | diesel | audi | nein | 2016-03-27 | 0 | 82131 | 2016-04-01 |
49996 | 2016-03-28 | Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... | privat | Angebot | 1980 | control | cabrio | 1996 | manuell | 75 | astra | 150000 | 5 | benzin | opel | nein | 2016-03-28 | 0 | 44807 | 2016-04-02 |
49997 | 2016-04-02 | Fiat_500_C_1.2_Dualogic_Lounge | privat | Angebot | 13200 | test | cabrio | 2014 | automatik | 69 | 500 | 5000 | 11 | benzin | fiat | nein | 2016-04-02 | 0 | 73430 | 2016-04-04 |
49998 | 2016-03-08 | Audi_A3_2.0_TDI_Sportback_Ambition | privat | Angebot | 22900 | control | kombi | 2013 | manuell | 150 | a3 | 40000 | 11 | diesel | audi | nein | 2016-03-08 | 0 | 35683 | 2016-04-05 |
49999 | 2016-03-14 | Opel_Vectra_1.6_16V | privat | Angebot | 1250 | control | limousine | 1996 | manuell | 101 | vectra | 150000 | 1 | benzin | opel | nein | 2016-03-13 | 0 | 45897 | 2016-04-06 |
50000 rows × 20 columns
The dataset contains 50,000 rows and 20 columns.
autos.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 dateCrawled 50000 non-null object 1 name 50000 non-null object 2 seller 50000 non-null object 3 offerType 50000 non-null object 4 price 50000 non-null object 5 abtest 50000 non-null object 6 vehicleType 44905 non-null object 7 yearOfRegistration 50000 non-null int64 8 gearbox 47320 non-null object 9 powerPS 50000 non-null int64 10 model 47242 non-null object 11 odometer 50000 non-null object 12 monthOfRegistration 50000 non-null int64 13 fuelType 45518 non-null object 14 brand 50000 non-null object 15 notRepairedDamage 40171 non-null object 16 dateCreated 50000 non-null object 17 nrOfPictures 50000 non-null int64 18 postalCode 50000 non-null int64 19 lastSeen 50000 non-null object dtypes: int64(5), object(15) memory usage: 7.6+ MB
for i in autos.columns:
print(i)
dateCrawled name seller offerType price abtest vehicleType yearOfRegistration gearbox powerPS model odometer monthOfRegistration fuelType brand notRepairedDamage dateCreated nrOfPictures postalCode lastSeen
old_columns = autos.columns.copy()
new_columns = [
'date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
'vehicle_type', 'registration_year', 'gearbox', 'power_PS', 'model',
'kilometers', 'registration_month', 'fuel_type', 'brand',
'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code',
'last_seen'
]
autos.columns = new_columns
autos.head()
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_PS | model | kilometers | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
These are the new column names that will be used from now on in the dataframe. The names were changed from the original camelCase to snake_case, and a few were renamed for clarity (e.g. `odometer` to `kilometers`, `notRepairedDamage` to `unrepaired_damage`).
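Instead of overwriting the whole `columns` attribute, the rename can also be done with `DataFrame.rename`, which only touches the columns you name. A minimal sketch on a toy frame (the three camelCase names are taken from the `autos.info()` output above):

```python
import pandas as pd

# Toy frame with a subset of the original camelCase column names
autos_demo = pd.DataFrame(columns=["dateCrawled", "offerType", "yearOfRegistration"])

# rename maps old -> new; any column not in the dict is left untouched
autos_demo = autos_demo.rename(columns={
    "dateCrawled": "date_crawled",
    "offerType": "offer_type",
    "yearOfRegistration": "registration_year",
})
print(list(autos_demo.columns))
# ['date_crawled', 'offer_type', 'registration_year']
```

This is safer than assigning a full list, because a list assignment silently misaligns names if the column order ever changes.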
autos.describe(include='all')
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-03-27 22:55:05 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
The seller and offer_type columns appear to contain almost a single value each (49,999 of 50,000 rows share one value), so they are investigated further below. The abtest column, by contrast, splits roughly evenly between its two values.
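Near-constant columns like these can also be found programmatically rather than by eyeballing `describe()`. A sketch (not from the original notebook) that flags any column whose most frequent value covers more than 95% of rows:

```python
import pandas as pd

# Toy frame: 'seller' is near-constant, 'price' is not
df = pd.DataFrame({
    "seller": ["privat"] * 99 + ["gewerblich"],
    "price": list(range(100)),
})

# Share of the single most common value per column; a share near 1.0
# means the column carries almost no information.
top_share = df.apply(lambda col: col.value_counts(normalize=True).iloc[0])
print(top_share[top_share > 0.95].index.tolist())
# ['seller']
```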
autos["seller"].unique()
array(['privat', 'gewerblich'], dtype=object)
autos.loc[autos['seller'] == 'privat', 'seller'].value_counts()
privat 49999 Name: seller, dtype: int64
autos["offer_type"].unique()
array(['Angebot', 'Gesuch'], dtype=object)
autos.loc[autos['offer_type'] == 'Angebot', 'offer_type'].value_counts()
Angebot 49999 Name: offer_type, dtype: int64
The seller and offer_type columns could be dropped, since all but one row in each contains the same value.
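Although the notebook keeps these columns for now, dropping them would be a one-liner with `DataFrame.drop`. A sketch on a toy frame:

```python
import pandas as pd

autos_demo = pd.DataFrame({
    "seller": ["privat", "privat"],
    "offer_type": ["Angebot", "Angebot"],
    "price": [5000, 8500],
})

# drop(columns=...) returns a new frame without the named columns
autos_demo = autos_demo.drop(columns=["seller", "offer_type"])
print(list(autos_demo.columns))
# ['price']
```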
autos.dtypes
date_crawled object name object seller object offer_type object price object abtest object vehicle_type object registration_year int64 gearbox object power_PS int64 model object kilometers object registration_month int64 fuel_type object brand object unrepaired_damage object ad_created object nr_of_pictures int64 postal_code int64 last_seen object dtype: object
The information in the price and kilometers columns should be integers, while ad_created and last_seen are dates stored as strings.
autos["price"].tail()
49995 $24,900 49996 $1,980 49997 $13,200 49998 $22,900 49999 $1,250 Name: price, dtype: object
For price, the dollar sign and commas are removed, the type is converted to integer, and the column is renamed to price_usd.
# regex=False treats '$' as a literal character rather than a regex anchor
autos['price'] = autos['price'].str.replace('$', "", regex=False)\
                               .str.replace(",", "", regex=False)\
                               .astype(int)
autos.rename(columns={'price': 'price_usd'}, inplace=True)
autos["price_usd"].head()
0 5000 1 8500 2 8990 3 4350 4 1350 Name: price_usd, dtype: int64
autos["kilometers"].head()
0 150,000km 1 150,000km 2 70,000km 3 70,000km 4 150,000km Name: kilometers, dtype: object
For kilometers, the "km" suffix and commas are removed and the type is converted to integer.
autos['kilometers'] = autos['kilometers'].str.replace('km', "", regex=False)\
                                         .str.replace(",", "", regex=False)\
                                         .astype(int)
autos["kilometers"].head()
0 150000 1 150000 2 70000 3 70000 4 150000 Name: kilometers, dtype: int64
print(autos["kilometers"].unique().shape)
autos["kilometers"].describe()
(13,)
count 50000.000000 mean 125732.700000 std 40042.211706 min 5000.000000 25% 125000.000000 50% 150000.000000 75% 150000.000000 max 150000.000000 Name: kilometers, dtype: float64
autos["kilometers"].value_counts()
150000 32424 125000 5170 100000 2169 90000 1757 80000 1436 70000 1230 60000 1164 50000 1027 5000 967 40000 819 30000 789 20000 784 10000 264 Name: kilometers, dtype: int64
print(autos["price_usd"].unique().shape)
autos['price_usd'].describe().apply(lambda x: format(x, 'f'))
(2357,)
count 50000.000000 mean 9840.043760 std 481104.380500 min 0.000000 25% 1100.000000 50% 2950.000000 75% 7200.000000 max 99999999.000000 Name: price_usd, dtype: object
autos["price_usd"].value_counts().head(10)
0 1421 500 781 1500 734 2500 643 1000 639 1200 639 600 531 800 498 3500 498 2000 460 Name: price_usd, dtype: int64
autos["price_usd"].value_counts().sort_index(ascending=False).head(50)
99999999 1 27322222 1 12345678 3 11111111 2 10000000 1 3890000 1 1300000 1 1234566 1 999999 2 999990 1 350000 1 345000 1 299000 1 295000 1 265000 1 259000 1 250000 1 220000 1 198000 1 197000 1 194000 1 190000 1 180000 1 175000 1 169999 1 169000 1 163991 1 163500 1 155000 1 151990 1 145000 1 139997 1 137999 1 135000 1 130000 1 129000 1 128000 1 120000 2 119900 1 119500 1 116000 1 115991 1 115000 1 114400 1 109999 1 105000 2 104900 1 99900 2 99000 2 98500 1 Name: price_usd, dtype: int64
autos["price_usd"].value_counts().sort_index(ascending=True)[100:150]
330 8 333 17 340 11 349 15 350 335 356 1 359 1 360 8 369 1 370 21 375 7 378 1 379 1 380 29 385 1 388 1 390 26 395 2 399 71 400 321 410 1 414 1 420 9 425 4 430 13 435 1 440 1 444 8 449 12 450 265 459 1 460 5 470 7 475 4 480 25 485 2 490 34 495 4 499 92 500 781 501 1 510 2 517 1 520 8 525 4 530 8 540 2 549 13 550 356 554 1 Name: price_usd, dtype: int64
For price, we keep only values between $100 and $900,000. The handful of listings above that (up to $99,999,999) look like placeholder values rather than real prices, and listings under $100 are unlikely to be genuine sales.
autos = autos[autos["price_usd"].between(100,900000)]
autos.shape
(48224, 20)
autos["price_usd"].describe().apply(lambda x: format(x, "f"))
count 48224.000000 mean 5930.371433 std 9078.372762 min 100.000000 25% 1250.000000 50% 3000.000000 75% 7499.000000 max 350000.000000 Name: price_usd, dtype: object
autos["price_usd"].value_counts().sort_index(ascending=False).head()
350000 1 345000 1 299000 1 295000 1 265000 1 Name: price_usd, dtype: int64
The remaining values look much more plausible now that the outliers have been removed.
autos[["date_crawled", "ad_created", "last_seen"]].head()
date_crawled | ad_created | last_seen | |
---|---|---|---|
0 | 2016-03-26 17:47:46 | 2016-03-26 00:00:00 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | 2016-04-04 00:00:00 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | 2016-03-26 00:00:00 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | 2016-03-12 00:00:00 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | 2016-04-01 00:00:00 | 2016-04-01 14:38:50 |
autos.loc[:, "date_crawled"] = autos.loc[:, "date_crawled"].str[:10]
autos["date_crawled"].value_counts(normalize=True, dropna=False).sort_index(ascending=True)
2016-03-05 0.025361 2016-03-06 0.014039 2016-03-07 0.036061 2016-03-08 0.033179 2016-03-09 0.033013 2016-03-10 0.032287 2016-03-11 0.032598 2016-03-12 0.036911 2016-03-13 0.015677 2016-03-14 0.036662 2016-03-15 0.034319 2016-03-16 0.029467 2016-03-17 0.031499 2016-03-18 0.012898 2016-03-19 0.034734 2016-03-20 0.037803 2016-03-21 0.037201 2016-03-22 0.032888 2016-03-23 0.032287 2016-03-24 0.029446 2016-03-25 0.031499 2016-03-26 0.032308 2016-03-27 0.031126 2016-03-28 0.034962 2016-03-29 0.034112 2016-03-30 0.033738 2016-03-31 0.031851 2016-04-01 0.033697 2016-04-02 0.035605 2016-04-03 0.038611 2016-04-04 0.036538 2016-04-05 0.013064 2016-04-06 0.003173 2016-04-07 0.001389 Name: date_crawled, dtype: float64
The data was crawled daily between 5 March and 7 April 2016. Most days each account for roughly 1 to 4 percent of the listings, except 6 and 7 April, which each account for less than 1 percent.
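Slicing the first 10 characters keeps these columns as strings. An alternative (not used in this notebook) is to parse them with `pd.to_datetime`, which makes date arithmetic possible. A sketch on two timestamps from the table above:

```python
import pandas as pd

s = pd.Series(["2016-03-26 17:47:46", "2016-04-04 13:38:56"])

dates = pd.to_datetime(s)     # full timestamps as datetime64
days = dates.dt.normalize()   # truncated to midnight, i.e. just the date part

# Timedelta between the latest and earliest timestamp
print((dates.max() - dates.min()).days)
# 8
```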
pd.options.display.max_rows = 100
autos.loc[:, "ad_created"] = autos.loc[:, "ad_created"].str[:10]
autos["ad_created"].value_counts(normalize=True, dropna=False).sort_index(ascending=True)
2015-06-11 0.000021 2015-08-10 0.000021 2015-09-09 0.000021 2015-11-10 0.000021 2015-12-05 0.000021 2015-12-30 0.000021 2016-01-03 0.000021 2016-01-07 0.000021 2016-01-10 0.000041 2016-01-13 0.000021 2016-01-14 0.000021 2016-01-16 0.000021 2016-01-22 0.000021 2016-01-27 0.000062 2016-01-29 0.000021 2016-02-01 0.000021 2016-02-02 0.000041 2016-02-05 0.000041 2016-02-07 0.000021 2016-02-08 0.000021 2016-02-09 0.000021 2016-02-11 0.000021 2016-02-12 0.000041 2016-02-14 0.000041 2016-02-16 0.000021 2016-02-17 0.000021 2016-02-18 0.000041 2016-02-19 0.000062 2016-02-20 0.000041 2016-02-21 0.000062 2016-02-22 0.000021 2016-02-23 0.000083 2016-02-24 0.000041 2016-02-25 0.000062 2016-02-26 0.000041 2016-02-27 0.000124 2016-02-28 0.000207 2016-02-29 0.000166 2016-03-01 0.000104 2016-03-02 0.000104 2016-03-03 0.000871 2016-03-04 0.001493 2016-03-05 0.022914 2016-03-06 0.015304 2016-03-07 0.034796 2016-03-08 0.033179 2016-03-09 0.033096 2016-03-10 0.031997 2016-03-11 0.032909 2016-03-12 0.036745 2016-03-13 0.017045 2016-03-14 0.035294 2016-03-15 0.034049 2016-03-16 0.029964 2016-03-17 0.031167 2016-03-18 0.013582 2016-03-19 0.033614 2016-03-20 0.037865 2016-03-21 0.037429 2016-03-22 0.032702 2016-03-23 0.032121 2016-03-24 0.029384 2016-03-25 0.031623 2016-03-26 0.032370 2016-03-27 0.031043 2016-03-28 0.035066 2016-03-29 0.034070 2016-03-30 0.033552 2016-03-31 0.031893 2016-04-01 0.033676 2016-04-02 0.035294 2016-04-03 0.038860 2016-04-04 0.036890 2016-04-05 0.011799 2016-04-06 0.003256 2016-04-07 0.001244 Name: ad_created, dtype: float64
The ads were created between 11 June 2015 and 7 April 2016. Ad creation starts slowly from June 2015, ramps up sharply in early March 2016, remains roughly stable for most of the crawl period, and tails off again at the beginning of April 2016.
autos.loc[:, "last_seen"] = autos.loc[:, "last_seen"].str[:10]
autos["last_seen"].value_counts(normalize=True, dropna=False).sort_index(ascending=True)
2016-03-05 0.001078 2016-03-06 0.004313 2016-03-07 0.005433 2016-03-08 0.007320 2016-03-09 0.009580 2016-03-10 0.010638 2016-03-11 0.012400 2016-03-12 0.023785 2016-03-13 0.008875 2016-03-14 0.012629 2016-03-15 0.015863 2016-03-16 0.016444 2016-03-17 0.028098 2016-03-18 0.007320 2016-03-19 0.015760 2016-03-20 0.020654 2016-03-21 0.020550 2016-03-22 0.021359 2016-03-23 0.018580 2016-03-24 0.019762 2016-03-25 0.019098 2016-03-26 0.016672 2016-03-27 0.015552 2016-03-28 0.020840 2016-03-29 0.022292 2016-03-30 0.024697 2016-03-31 0.023826 2016-04-01 0.022852 2016-04-02 0.024884 2016-04-03 0.025133 2016-04-04 0.024531 2016-04-05 0.125062 2016-04-06 0.221964 2016-04-07 0.132154 Name: last_seen, dtype: float64
The last_seen dates also fall between 5 March and 7 April 2016, but they are far from uniform: the final three days (5 to 7 April) account for nearly half of all values. This spike is almost certainly an artifact of the crawl ending on 7 April, not a surge of cars being sold on those days.
autos["registration_year"].head()
0 2004 1 1997 2 2009 3 2007 4 2003 Name: registration_year, dtype: int64
autos["registration_year"].describe()
count 48224.000000 mean 2004.730964 std 87.897388 min 1000.000000 25% 1999.000000 50% 2004.000000 75% 2008.000000 max 9999.000000 Name: registration_year, dtype: float64
The minimum registration year of 1000 and the maximum of 9999 are clearly invalid: cars did not exist in the year 1000, and 9999 is in the future. These rows should be removed.
# Determining number of listings with cars that fall outside
# the 1900 - 2016 interval
print(autos[(autos["registration_year"] < 1900) |
(autos["registration_year"] > 2016)].shape)
autos.shape
(1872, 20)
(48224, 20)
1,872 of the 48,224 rows (approximately 4%) fall outside the 1900-2016 interval, so they can safely be removed.
autos = autos[autos["registration_year"].between(1900,2016)]
autos["registration_year"].value_counts(normalize=True).sort_index(ascending=True)
1910 0.000043 1927 0.000022 1929 0.000022 1931 0.000022 1934 0.000043 1937 0.000086 1938 0.000022 1939 0.000022 1941 0.000043 1943 0.000022 1948 0.000022 1950 0.000022 1951 0.000043 1952 0.000022 1953 0.000022 1954 0.000043 1955 0.000043 1956 0.000086 1957 0.000043 1958 0.000086 1959 0.000129 1960 0.000475 1961 0.000129 1962 0.000086 1963 0.000173 1964 0.000259 1965 0.000367 1966 0.000475 1967 0.000561 1968 0.000561 1969 0.000410 1970 0.000798 1971 0.000561 1972 0.000712 1973 0.000496 1974 0.000518 1975 0.000388 1976 0.000453 1977 0.000475 1978 0.000906 1979 0.000734 1980 0.001747 1981 0.000582 1982 0.000885 1983 0.001100 1984 0.001100 1985 0.002006 1986 0.001532 1987 0.001553 1988 0.002869 1989 0.003689 1990 0.007163 1991 0.007292 1992 0.007918 1993 0.009061 1994 0.013505 1995 0.025738 1996 0.029233 1997 0.041530 1998 0.050483 1999 0.062112 2000 0.066966 2001 0.056718 2002 0.053439 2003 0.058099 2004 0.058228 2005 0.062802 2006 0.057560 2007 0.049038 2008 0.047679 2009 0.044874 2010 0.034238 2011 0.034907 2012 0.028219 2013 0.017281 2014 0.014282 2015 0.008198 2016 0.025932 Name: registration_year, dtype: float64
Registrations increase steadily from 1910, peak roughly between 1996 and 2009, and drop off from then on.
autos["brand"].unique()
array(['peugeot', 'bmw', 'volkswagen', 'smart', 'ford', 'chrysler', 'seat', 'renault', 'mercedes_benz', 'audi', 'sonstige_autos', 'opel', 'mazda', 'porsche', 'mini', 'toyota', 'dacia', 'nissan', 'jeep', 'saab', 'volvo', 'mitsubishi', 'jaguar', 'fiat', 'skoda', 'subaru', 'kia', 'citroen', 'chevrolet', 'hyundai', 'honda', 'daewoo', 'suzuki', 'trabant', 'land_rover', 'alfa_romeo', 'lada', 'rover', 'daihatsu', 'lancia'], dtype=object)
I chose to aggregate only over the brands that account for at least 5% of listings: volkswagen, bmw, mercedes_benz, opel, audi, and ford.
brands = autos["brand"].value_counts(normalize=True, sort=True)
selected_brands = brands[brands>=0.05]
selected_brands
volkswagen 0.211404 bmw 0.110179 opel 0.107245 mercedes_benz 0.096652 audi 0.086771 ford 0.069835 Name: brand, dtype: float64
# Create an empty dictionary to store the aggregated data
brand_data = {}
# Iterate over the selected brands and compute the mean price for each
for brand in selected_brands.index:
    selected_rows = autos[autos["brand"] == brand]
    # Store the mean price with the brand name as key
    brand_data[brand] = selected_rows['price_usd'].mean()
# Round each mean price to 3 decimal places
res = {key: round(value, 3) for key, value in brand_data.items()}
res
{'volkswagen': 5436.95, 'bmw': 8381.677, 'opel': 3005.496, 'mercedes_benz': 8672.654, 'audi': 9380.719, 'ford': 3779.269}
Among the brands with at least 5% of listings, Audi is the most expensive, followed by Mercedes-Benz and BMW. Opel and Ford are the cheapest, and Volkswagen sits in the middle.
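The loop-and-round pattern above can be collapsed into a single `groupby`, which computes the mean price per brand in one pass. A sketch with toy data:

```python
import pandas as pd

autos_demo = pd.DataFrame({
    "brand": ["bmw", "bmw", "ford", "ford", "opel"],
    "price_usd": [8000, 9000, 3500, 4100, 3000],
})

# One mean per brand, rounded to 3 decimal places
mean_prices = autos_demo.groupby("brand")["price_usd"].mean().round(3)
print(mean_prices.to_dict())
# {'bmw': 8500.0, 'ford': 3800.0, 'opel': 3000.0}
```

Filtering to the selected brands first (e.g. with `autos[autos["brand"].isin(selected_brands.index)]`) would reproduce the dictionary built by the loop.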
autos["kilometers"].head()
0 150000 1 150000 2 70000 3 70000 4 150000 Name: kilometers, dtype: int64
brands = autos["brand"].value_counts(normalize=True, sort=True)
brand_labels = brands[brands >= 0.05].index
brand_labels
Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford'], dtype='object')
brand_mean_prices = {}
for brand in brand_labels:
price = autos[autos["brand"] == brand]['price_usd']
brand_mean_prices[brand] = int(price.mean())
brand_mean_prices
{'volkswagen': 5436, 'bmw': 8381, 'opel': 3005, 'mercedes_benz': 8672, 'audi': 9380, 'ford': 3779}
brand_mean_kilos = {}
for brand in brand_labels:
kilos = autos[autos["brand"] == brand]['kilometers']
brand_mean_kilos[brand] = int(kilos.mean())
brand_mean_kilos
{'volkswagen': 128799, 'bmw': 132695, 'opel': 129384, 'mercedes_benz': 131025, 'audi': 129245, 'ford': 124277}
bmp_series = pd.Series(brand_mean_prices)
bmk_series = pd.Series(brand_mean_kilos)
df = pd.DataFrame(bmp_series,columns=['mean_price'])
df['mean_km'] = bmk_series
df.sort_values(by=['mean_price', 'mean_km'], ascending=False)
mean_price | mean_km | |
---|---|---|
audi | 9380 | 129245 |
mercedes_benz | 8672 | 131025 |
bmw | 8381 | 132695 |
volkswagen | 5436 | 128799 |
ford | 3779 | 124277 |
opel | 3005 | 129384 |
The three most expensive brands, Audi, Mercedes-Benz and BMW, also have the highest mean mileage in this analysis, so their higher prices are not explained by lower mileage.
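Both dictionaries and the resulting dataframe can also be built in a single step with named aggregation in `groupby(...).agg(...)`. A sketch with toy numbers:

```python
import pandas as pd

autos_demo = pd.DataFrame({
    "brand": ["bmw", "bmw", "ford"],
    "price_usd": [8000, 9000, 3800],
    "kilometers": [150000, 125000, 100000],
})

# Named aggregation: each keyword becomes an output column
summary = (autos_demo.groupby("brand")
                     .agg(mean_price=("price_usd", "mean"),
                          mean_km=("kilometers", "mean"))
                     .sort_values("mean_price", ascending=False))
print(summary.loc["bmw", "mean_price"])
# 8500.0
```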
autos.columns
Index(['date_crawled', 'name', 'seller', 'offer_type', 'price_usd', 'abtest', 'vehicle_type', 'registration_year', 'gearbox', 'power_PS', 'model', 'kilometers', 'registration_month', 'fuel_type', 'brand', 'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code', 'last_seen'], dtype='object')
autos["seller"].unique()
array(['privat', 'gewerblich'], dtype=object)
seller = {"privat":"private", "gewerblich":"commercial"}
autos["seller"] = autos["seller"].map(seller)
autos["seller"].unique()
array(['private', 'commercial'], dtype=object)
autos["vehicle_type"].unique()
array(['bus', 'limousine', 'kleinwagen', 'kombi', nan, 'coupe', 'suv', 'cabrio', 'andere'], dtype=object)
vehicle = {"bus":"bus", "limousine":"limousine", "kleinwagen": "compact",
"kombi": "station wagon", "coupe": "coupe", "suv": "suv",
"cabrio": "convertible", "andere": "other"}
autos["vehicle_type"] = autos["vehicle_type"].map(vehicle)
autos["vehicle_type"].unique()
array(['bus', 'limousine', 'compact', 'station wagon', nan, 'coupe', 'suv', 'convertible', 'other'], dtype=object)
autos["unrepaired_damage"].unique()
array(['nein', nan, 'ja'], dtype=object)
damage = {"nein":"no", "ja": "yes"}
autos["unrepaired_damage"] = autos["unrepaired_damage"].map(damage)
autos["unrepaired_damage"].unique()
array(['no', nan, 'yes'], dtype=object)
autos["kilometers"].head()
0 150000 1 150000 2 70000 3 70000 4 150000 Name: kilometers, dtype: int64
gp = autos.groupby('brand')['model'].value_counts().\
sort_values(ascending=False)[0:10]
gp
brand model volkswagen golf 3684 bmw 3er 2602 volkswagen polo 1592 opel corsa 1568 volkswagen passat 1345 opel astra 1337 audi a4 1226 mercedes_benz c_klasse 1135 bmw 5er 1123 mercedes_benz e_klasse 953 Name: model, dtype: int64
VW, BMW and Opel supply the most common brand/model combinations, with the Volkswagen Golf the single most listed model.
autos.groupby('kilometers')['price_usd'].mean()
kilometers 5000 9794.102672 10000 21255.304721 20000 18523.200271 30000 16674.564069 40000 15518.952261 50000 13853.911111 60000 12406.989343 70000 10954.826014 80000 9743.110058 90000 8470.087321 100000 8188.223581 125000 6238.277181 150000 3792.320241 Name: price_usd, dtype: float64
The mean price generally decreases as the mileage increases.
pd.options.display.max_rows = 200
top_brands = df.index.to_list()
top_autos = autos[autos['brand'].isin(top_brands)]
damage_cost = top_autos.groupby(['brand', 'unrepaired_damage'])['price_usd'].mean()
damage_cost
brand unrepaired_damage audi no 10929.202288 yes 3324.684685 bmw no 9454.446587 yes 3538.860000 ford no 4666.671089 yes 1408.277228 mercedes_benz no 9815.369760 yes 3933.283626 opel no 3672.347530 yes 1396.977509 volkswagen no 6483.153377 yes 2194.585511 Name: price_usd, dtype: float64
damage_df = pd.DataFrame(damage_cost).reset_index()
damage_df
brand | unrepaired_damage | price_usd | |
---|---|---|---|
0 | audi | no | 10929.202288 |
1 | audi | yes | 3324.684685 |
2 | bmw | no | 9454.446587 |
3 | bmw | yes | 3538.860000 |
4 | ford | no | 4666.671089 |
5 | ford | yes | 1408.277228 |
6 | mercedes_benz | no | 9815.369760 |
7 | mercedes_benz | yes | 3933.283626 |
8 | opel | no | 3672.347530 |
9 | opel | yes | 1396.977509 |
10 | volkswagen | no | 6483.153377 |
11 | volkswagen | yes | 2194.585511 |
Looking only at the top brands, cars with unrepaired damage sell for roughly one-third to one-half of the price of undamaged cars of the same brand.
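That damaged-to-undamaged ratio can be computed directly by pivoting the grouped means into wide form. A sketch using two of the brand figures from the table above:

```python
import pandas as pd

damage_df = pd.DataFrame({
    "brand": ["audi", "audi", "ford", "ford"],
    "unrepaired_damage": ["no", "yes", "no", "yes"],
    "price_usd": [10929.2, 3324.7, 4666.7, 1408.3],
})

# Wide form: one row per brand, one column per damage status
wide = damage_df.pivot(index="brand", columns="unrepaired_damage",
                       values="price_usd")

# Damaged price as a fraction of the undamaged price
ratio = (wide["yes"] / wide["no"]).round(2)
print(ratio.to_dict())
# {'audi': 0.3, 'ford': 0.3}
```

Both brands here come out at about 0.3, consistent with the "one-third to one-half" observation.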