In this guided project, I'll work with a dataset of used cars from eBay Kleinanzeigen, a classifieds section of the German eBay website.
This is my first data cleaning and data analysis project. The main goal is to practice the pandas and NumPy libraries I learned on the Dataquest platform.
The data dictionary provided with the data is as follows:
Let's start with importing the libraries and the dataset.
import pandas as pd
import numpy as np
autos = pd.read_csv('autos.csv', encoding = 'Latin-1')
autos
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | kilometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-24 11:52:17 | Golf_3_1.6 | privat | Angebot | 480 | test | NaN | 1993 | manuell | 0 | golf | 150000 | 0 | benzin | volkswagen | NaN | 2016-03-24 00:00:00 | 0 | 70435 | 2016-04-07 03:16:57 |
1 | 2016-03-24 10:58:45 | A5_Sportback_2.7_Tdi | privat | Angebot | 18300 | test | coupe | 2011 | manuell | 190 | NaN | 125000 | 5 | diesel | audi | ja | 2016-03-24 00:00:00 | 0 | 66954 | 2016-04-07 01:46:50 |
2 | 2016-03-14 12:52:21 | Jeep_Grand_Cherokee_"Overland" | privat | Angebot | 9800 | test | suv | 2004 | automatik | 163 | grand | 125000 | 8 | diesel | jeep | NaN | 2016-03-14 00:00:00 | 0 | 90480 | 2016-04-05 12:47:46 |
3 | 2016-03-17 16:54:04 | GOLF_4_1_4__3TÜRER | privat | Angebot | 1500 | test | kleinwagen | 2001 | manuell | 75 | golf | 150000 | 6 | benzin | volkswagen | nein | 2016-03-17 00:00:00 | 0 | 91074 | 2016-03-17 17:40:17 |
4 | 2016-03-31 17:25:20 | Skoda_Fabia_1.4_TDI_PD_Classic | privat | Angebot | 3600 | test | kleinwagen | 2008 | manuell | 69 | fabia | 90000 | 7 | diesel | skoda | nein | 2016-03-31 00:00:00 | 0 | 60437 | 2016-04-06 10:17:21 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
371523 | 2016-03-14 17:48:27 | Suche_t4___vito_ab_6_sitze | privat | Angebot | 2200 | test | NaN | 2005 | NaN | 0 | NaN | 20000 | 1 | NaN | sonstige_autos | NaN | 2016-03-14 00:00:00 | 0 | 39576 | 2016-04-06 00:46:52 |
371524 | 2016-03-05 19:56:21 | Smart_smart_leistungssteigerung_100ps | privat | Angebot | 1199 | test | cabrio | 2000 | automatik | 101 | fortwo | 125000 | 3 | benzin | smart | nein | 2016-03-05 00:00:00 | 0 | 26135 | 2016-03-11 18:17:12 |
371525 | 2016-03-19 18:57:12 | Volkswagen_Multivan_T4_TDI_7DC_UY2 | privat | Angebot | 9200 | test | bus | 1996 | manuell | 102 | transporter | 150000 | 3 | diesel | volkswagen | nein | 2016-03-19 00:00:00 | 0 | 87439 | 2016-04-07 07:15:26 |
371526 | 2016-03-20 19:41:08 | VW_Golf_Kombi_1_9l_TDI | privat | Angebot | 3400 | test | kombi | 2002 | manuell | 100 | golf | 150000 | 6 | diesel | volkswagen | NaN | 2016-03-20 00:00:00 | 0 | 40764 | 2016-03-24 12:45:21 |
371527 | 2016-03-07 19:39:19 | BMW_M135i_vollausgestattet_NP_52.720____Euro | privat | Angebot | 28990 | control | limousine | 2013 | manuell | 320 | m_reihe | 50000 | 8 | benzin | bmw | nein | 2016-03-07 00:00:00 | 0 | 73326 | 2016-03-22 03:17:10 |
371528 rows × 20 columns
The default encoding (UTF-8) couldn't decode this dataset, so I used Latin-1 instead. Jupyter displays the first and last five rows of a large DataFrame when a cell ends with the bare variable name. Now, let's get some basic info about the data.
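As an aside, if the right encoding weren't known in advance, a small check like the sketch below could find one by trial (my own addition; the candidate list is just a guess at common encodings):
# sketch: try a few common encodings on a small sample of the file
for enc in ('utf-8', 'latin-1', 'windows-1252'):
    try:
        pd.read_csv('autos.csv', encoding=enc, nrows=5)
        print(enc, 'works')
        break
    except UnicodeDecodeError:
        print(enc, 'failed to decode the file')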
print("first 5 rows:", autos.head(), sep='\n')
print(autos.info())
print("shape:", autos.shape)
first 5 rows:
          dateCrawled                            name  seller offerType  \
0 2016-03-24 11:52:17                      Golf_3_1.6  privat   Angebot
1 2016-03-24 10:58:45            A5_Sportback_2.7_Tdi  privat   Angebot
2 2016-03-14 12:52:21  Jeep_Grand_Cherokee_"Overland"  privat   Angebot
3 2016-03-17 16:54:04              GOLF_4_1_4__3TÜRER  privat   Angebot
4 2016-03-31 17:25:20  Skoda_Fabia_1.4_TDI_PD_Classic  privat   Angebot
   price abtest vehicleType  yearOfRegistration    gearbox  powerPS  model  \
0    480   test         NaN                1993    manuell        0   golf
1  18300   test       coupe                2011    manuell      190    NaN
2   9800   test         suv                2004  automatik      163  grand
3   1500   test  kleinwagen                2001    manuell       75   golf
4   3600   test  kleinwagen                2008    manuell       69  fabia
   kilometer  monthOfRegistration fuelType       brand notRepairedDamage  \
0     150000                    0   benzin  volkswagen               NaN
1     125000                    5   diesel        audi                ja
2     125000                    8   diesel        jeep               NaN
3     150000                    6   benzin  volkswagen              nein
4      90000                    7   diesel       skoda              nein
           dateCreated  nrOfPictures  postalCode             lastSeen
0  2016-03-24 00:00:00             0       70435  2016-04-07 03:16:57
1  2016-03-24 00:00:00             0       66954  2016-04-07 01:46:50
2  2016-03-14 00:00:00             0       90480  2016-04-05 12:47:46
3  2016-03-17 00:00:00             0       91074  2016-03-17 17:40:17
4  2016-03-31 00:00:00             0       60437  2016-04-06 10:17:21
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 371528 entries, 0 to 371527
Data columns (total 20 columns):
 #   Column               Non-Null Count   Dtype
---  ------               --------------   -----
 0   dateCrawled          371528 non-null  object
 1   name                 371528 non-null  object
 2   seller               371528 non-null  object
 3   offerType            371528 non-null  object
 4   price                371528 non-null  int64
 5   abtest               371528 non-null  object
 6   vehicleType          333659 non-null  object
 7   yearOfRegistration   371528 non-null  int64
 8   gearbox              351319 non-null  object
 9   powerPS              371528 non-null  int64
 10  model                351044 non-null  object
 11  kilometer            371528 non-null  int64
 12  monthOfRegistration  371528 non-null  int64
 13  fuelType             338142 non-null  object
 14  brand                371528 non-null  object
 15  notRepairedDamage    299468 non-null  object
 16  dateCreated          371528 non-null  object
 17  nrOfPictures         371528 non-null  int64
 18  postalCode           371528 non-null  int64
 19  lastSeen             371528 non-null  object
dtypes: int64(7), object(13)
memory usage: 56.7+ MB
None
shape: (371528, 20)
Several observations can be made from this preliminary info: the column names use camelCase instead of Python's preferred snake_case, some of them are not very descriptive, several columns (vehicleType, gearbox, model, fuelType, notRepairedDamage) contain null values, and the date columns are stored as strings rather than datetimes.
Let's rename the columns from camelCase to snake_case and make them more descriptive.
#quick look at columns' names
columns = autos.columns
# Method 1: create a dictionary mapping old column names to new ones
new_col_names= {'dateCrawled':'crawled_date',
'offerType':'offer_type',
'abtest':'a/b_test',
'vehicleType':'vehicle_type',
'yearOfRegistration':'registration_year',
'powerPS':'power_PS',
'monthOfRegistration':'registration_month',
'fuelType':'fuel_type',
'notRepairedDamage':'unrepaired_damage',
'dateCreated':'ad_created',
'nrOfPictures':'pictures_number',
'postalCode':'postal_code',
'lastSeen':'last_seen'}
autos.rename(new_col_names, axis = 1, inplace = True)
# Method 2: rename each column using str.replace
#creating a copy of columns' series
#copy = columns.copy()
# copy = copy.str.replace('dateCrawled', 'crawled_date')
# copy = copy.str.replace('T','_t')
# copy = copy.str.replace('abtest','a/b_test')
# copy = copy.str.replace('yearOfRegistration','registration_year')
#etc..
# #assigning changed columns' names to the original dataset
#autos.columns = copy
# autos.columns
# Method 3: initially I thought about using Series.map for the renaming, but since each column name
# occurs only once, I decided against it. Mapping is better suited to rows, where values repeat many times.
# mapping_dict = {
#     'dateCrawled': 'crawled_date',
#     'name': 'name',
#     'seller': 'seller',
#     'offerType': 'offer_type',
# }
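A generic camelCase-to-snake_case conversion could also be done with a regular expression. This is just a sketch of my own (not part of the original notebook); it would produce names like date_crawled and power_ps rather than the custom names chosen in Method 1, so it is an alternative rather than a continuation:
# Method 4 (sketch, not executed): generic camelCase -> snake_case with a regex
# import re
# autos.columns = [re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', col).lower() for col in autos.columns]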
Now let's do some basic data exploration to determine what other cleaning tasks need to be done. Initially, I will look for text columns where all or almost all values are the same (these can often be dropped, since they carry no useful information for analysis) and for numeric columns with unrealistic minimum or maximum values.
# include='all' to include categorical data as well
autos.describe(include = 'all')
crawled_date | name | seller | offer_type | price | a/b_test | vehicle_type | registration_year | gearbox | power_PS | model | kilometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | pictures_number | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 371528 | 371528 | 371528 | 371528 | 3.715280e+05 | 371528 | 333659 | 371528.000000 | 351319 | 371528.000000 | 351044 | 371528.000000 | 371528.000000 | 338142 | 371528 | 299468 | 371528 | 371528.0 | 371528.00000 | 371528 |
unique | 280500 | 233531 | 2 | 2 | NaN | 2 | 8 | NaN | 2 | NaN | 251 | NaN | NaN | 7 | 40 | 2 | 114 | NaN | NaN | 182806 |
top | 2016-03-24 14:49:47 | Ford_Fiesta | privat | Angebot | NaN | test | limousine | NaN | manuell | NaN | golf | NaN | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:45:59 |
freq | 7 | 657 | 371525 | 371516 | NaN | 192585 | 95894 | NaN | 274214 | NaN | 30070 | NaN | NaN | 223857 | 79640 | 263182 | 14450 | NaN | NaN | 17 |
mean | NaN | NaN | NaN | NaN | 1.729514e+04 | NaN | NaN | 2004.577997 | NaN | 115.549477 | NaN | 125618.688228 | 5.734445 | NaN | NaN | NaN | NaN | 0.0 | 50820.66764 | NaN |
std | NaN | NaN | NaN | NaN | 3.587954e+06 | NaN | NaN | 92.866598 | NaN | 192.139578 | NaN | 40112.337051 | 3.712412 | NaN | NaN | NaN | NaN | 0.0 | 25799.08247 | NaN |
min | NaN | NaN | NaN | NaN | 0.000000e+00 | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | 5000.000000 | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.00000 | NaN |
25% | NaN | NaN | NaN | NaN | 1.150000e+03 | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | 125000.000000 | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30459.00000 | NaN |
50% | NaN | NaN | NaN | NaN | 2.950000e+03 | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | 150000.000000 | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49610.00000 | NaN |
75% | NaN | NaN | NaN | NaN | 7.200000e+03 | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | 150000.000000 | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71546.00000 | NaN |
max | NaN | NaN | NaN | NaN | 2.147484e+09 | NaN | NaN | 9999.000000 | NaN | 20000.000000 | NaN | 150000.000000 | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.00000 | NaN |
The following observations can be made: the seller and offer_type columns contain essentially a single value each (371,525 and 371,516 of 371,528 rows, respectively), pictures_number appears to be 0 everywhere (its mean and standard deviation are both 0), and price and registration_year have implausible extremes (a maximum price of about 2.1 billion and registration years from 1000 to 9999). Let's check registration_year and pictures_number more closely.
print('registration year info:', autos['registration_year'].value_counts(), sep = '\n')
print('number of pictures info:', autos['pictures_number'].value_counts(), sep = '\n')
registration year info:
2000    24551
1999    22767
2005    22316
2006    20230
2001    20218
        ...
1919        1
1915        1
7800        1
1400        1
8455        1
Name: registration_year, Length: 155, dtype: int64
number of pictures info:
0    371528
Name: pictures_number, dtype: int64
The pictures_number column can be deleted, since its value is 0 in every row.
The registration_year column contains values (e.g. 1400, 7800, 8455) that cannot be real registration years.
The main goals at this stage will be to drop the uninformative columns (seller, offer_type, pictures_number), to examine the price and kilometer columns for outliers, and to fix the unrealistic registration_year values.
Let's start with dropping the seller, offer_type and pictures_number columns.
autos = autos.drop(columns = ['seller', 'offer_type', 'pictures_number'])
The following code would be needed if the price column were stored as text (e.g. "$5,000"); in this dataset price is already numeric (int64), so I keep it commented out for reference only:
# autos["price"] = autos["price"].str.replace('$', '', regex=False)
# autos["price"] = autos["price"].str.replace(',', '', regex=False).astype(int)
Let's take a closer look at the "price" and "kilometer" columns.
#start with price
autos.price.unique().shape
(5597,)
#descriptive stats
autos.price.describe()
count    3.715280e+05
mean     1.729514e+04
std      3.587954e+06
min      0.000000e+00
25%      1.150000e+03
50%      2.950000e+03
75%      7.200000e+03
max      2.147484e+09
Name: price, dtype: float64
#5 most popular values
autos.price.value_counts().head()
0 10778 500 5670 1500 5394 1000 4649 1200 4594 Name: price, dtype: int64
#frequency of prices from lowest to highest
autos.price.value_counts().sort_index(inplace = False)
0 10778 1 1189 2 12 3 8 4 1 ... 32545461 1 74185296 1 99000000 1 99999999 15 2147483647 1 Name: price, Length: 5597, dtype: int64
#frequency of the 20 lowest price values
autos.price.value_counts().sort_index(inplace = False).head(20)
0 10778 1 1189 2 12 3 8 4 1 5 26 7 3 8 9 9 8 10 84 11 5 12 8 13 7 14 5 15 27 16 2 17 5 18 3 19 3 20 51 Name: price, dtype: int64
#frequency of the 20 highest price values
autos.price.value_counts().sort_index(inplace = False).tail(20)
1234566 1 1250000 2 1300000 1 1600000 2 2795000 1 2995000 1 3890000 1 3895000 1 9999999 3 10000000 8 10010011 1 11111111 10 12345678 9 14000500 1 27322222 1 32545461 1 74185296 1 99000000 1 99999999 15 2147483647 1 Name: price, dtype: int64
Some prices look too high; let's examine a few of them.
autos.loc[autos['price'] == 1250000]
crawled_date | name | price | a/b_test | vehicle_type | registration_year | gearbox | power_PS | model | kilometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
20143 | 2016-03-18 00:55:53 | Porsche_911R_Einer_von_911 | 1250000 | test | coupe | 2016 | manuell | 500 | 911 | 5000 | 3 | benzin | porsche | nein | 2016-03-17 00:00:00 | 70435 | 2016-03-20 04:48:27 |
78480 | 2016-03-10 20:43:12 | Porsche_991_R | 1250000 | test | coupe | 2016 | manuell | 500 | 911 | 5000 | 3 | benzin | porsche | nein | 2016-03-10 00:00:00 | 10719 | 2016-03-15 15:19:48 |
These listings look reasonable for a Porsche 911 R.
autos.loc[autos['price'] == 11111111]
crawled_date | name | price | a/b_test | vehicle_type | registration_year | gearbox | power_PS | model | kilometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
14663 | 2016-04-03 06:03:56 | Tausch_moeglich_Polo_vw_n9 | 11111111 | control | coupe | 2003 | manuell | 64 | polo | 150000 | 2 | benzin | volkswagen | NaN | 2016-04-03 00:00:00 | 46535 | 2016-04-07 10:16:51 |
54507 | 2016-03-09 11:40:15 | Tausch_DKW_F8_gegen_Lanz_Bulldog | 11111111 | test | andere | 1953 | manuell | 20 | NaN | 150000 | 5 | NaN | audi | nein | 2016-03-09 00:00:00 | 7774 | 2016-04-05 14:46:32 |
90853 | 2016-04-01 05:36:16 | Verkaufe_oder_Tausche | 11111111 | control | NaN | 2018 | NaN | 0 | NaN | 150000 | 0 | NaN | bmw | NaN | 2016-04-01 00:00:00 | 6682 | 2016-04-05 02:46:37 |
127769 | 2016-03-28 12:51:51 | Golf_4_laesst_sich_nicht_oeffnen | 11111111 | test | limousine | 2003 | manuell | 75 | golf | 125000 | 9 | benzin | volkswagen | NaN | 2016-03-28 00:00:00 | 38122 | 2016-03-30 03:18:00 |
177744 | 2016-03-31 16:37:00 | Suche_Trabant_auf_Ratenzahlung | 11111111 | control | NaN | 1960 | NaN | 0 | NaN | 150000 | 0 | NaN | trabant | NaN | 2016-03-31 00:00:00 | 6449 | 2016-04-06 10:46:11 |
185227 | 2016-03-14 16:52:51 | Suche_fuer_mein_golf | 11111111 | test | NaN | 2000 | NaN | 0 | NaN | 5000 | 0 | NaN | volkswagen | NaN | 2016-03-14 00:00:00 | 19288 | 2016-04-05 23:16:23 |
210657 | 2016-03-12 21:50:57 | Escort_MK_1_Hundeknochen_zum_umbauen_auf_RS_2000 | 11111111 | test | limousine | 1973 | manuell | 48 | escort | 50000 | 3 | benzin | ford | nein | 2016-03-12 00:00:00 | 94469 | 2016-03-12 22:45:27 |
318143 | 2016-03-21 13:57:51 | Schlachte_Golf_3_gt_tdi | 11111111 | test | NaN | 1995 | NaN | 0 | NaN | 150000 | 0 | NaN | volkswagen | NaN | 2016-03-21 00:00:00 | 18519 | 2016-03-21 14:40:18 |
332028 | 2016-03-31 18:39:53 | Golf_3_Gti_"16v_Turbo_vr6_bbs" | 11111111 | control | NaN | 1995 | NaN | 116 | golf | 150000 | 6 | benzin | volkswagen | NaN | 2016-03-31 00:00:00 | 72127 | 2016-04-02 13:47:59 |
338415 | 2016-03-26 22:49:39 | Polo_6n_gti__Zustand_egal_ | 11111111 | test | NaN | 2000 | NaN | 120 | polo | 125000 | 0 | benzin | volkswagen | NaN | 2016-03-26 00:00:00 | 94577 | 2016-04-06 15:45:52 |
These prices don't make any sense; 11111111 is clearly not a real asking price.
autos.loc[autos['price'] == 10000000]
crawled_date | name | price | a/b_test | vehicle_type | registration_year | gearbox | power_PS | model | kilometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
54295 | 2016-03-08 11:51:40 | Mazda_5_2.0_CD_DPF_Exclusive | 10000000 | control | bus | 2007 | manuell | 143 | 5_reihe | 150000 | 2 | diesel | mazda | nein | 2016-03-08 00:00:00 | 83301 | 2016-03-10 14:17:16 |
121793 | 2016-03-05 20:59:54 | Alter_VW_Kaefer_Scheunenfund | 10000000 | control | NaN | 1910 | NaN | 0 | kaefer | 100000 | 0 | NaN | volkswagen | NaN | 2016-03-05 00:00:00 | 94436 | 2016-03-05 21:47:23 |
171824 | 2016-03-29 23:52:57 | suche_maserati_3200_gt_Zustand_unwichtig_laufe... | 10000000 | control | coupe | 1960 | manuell | 368 | NaN | 100000 | 1 | benzin | sonstige_autos | nein | 2016-03-29 00:00:00 | 73033 | 2016-04-06 21:18:11 |
176070 | 2016-04-02 14:43:10 | Suche_BMW_Fun | 10000000 | test | NaN | 1985 | automatik | 300 | NaN | 150000 | 0 | NaN | bmw | NaN | 2016-04-02 00:00:00 | 67480 | 2016-04-02 14:43:10 |
196306 | 2016-04-01 01:59:20 | Suche_Motorcross_.! | 10000000 | control | NaN | 2017 | manuell | 0 | golf | 5000 | 0 | benzin | volkswagen | NaN | 2016-04-01 00:00:00 | 1454 | 2016-04-07 05:15:54 |
281833 | 2016-03-19 08:50:50 | Coole_Autowerkstatt_in_Esens_gesucht! | 10000000 | test | NaN | 2019 | manuell | 10 | NaN | 5000 | 0 | NaN | sonstige_autos | ja | 2016-03-19 00:00:00 | 26487 | 2016-04-06 07:44:29 |
358868 | 2016-03-16 16:49:53 | Tausche_BMW_E46_Touring_Diesel | 10000000 | control | kombi | 2002 | manuell | 0 | NaN | 150000 | 7 | diesel | bmw | nein | 2016-03-16 00:00:00 | 25821 | 2016-03-17 15:15:40 |
363434 | 2016-03-23 06:36:20 | Vw_Passat_35i_1.8_schlachtfest | 10000000 | test | kombi | 1996 | manuell | 90 | NaN | 150000 | 0 | benzin | volkswagen | ja | 2016-03-23 00:00:00 | 36251 | 2016-03-26 01:16:15 |
Also suspicious. I will treat prices below 1,000 and above 1,250,000 as outliers and remove them.
autos = autos[(autos["price"] > 999) & (autos["price"] <= 1250000)]
Let's check the shape of the dataset to make sure the changes have been applied.
autos.shape
(288134, 17)
Now, let's take a quick look at the "kilometer" column.
autos.kilometer.describe()
count    288134.000000
mean     122703.742009
std       40825.263039
min        5000.000000
25%      100000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: kilometer, dtype: float64
autos.kilometer.value_counts()
150000    173994
125000     31973
100000     13691
90000      11294
80000      10194
70000       9193
60000       8256
50000       7340
40000       6184
30000       5640
20000       5027
5000        3627
10000       1721
Name: kilometer, dtype: int64
These values look reasonable to me and I can conclude that there is no need for further cleaning here.
autos.fuel_type.value_counts()
benzin     163613
diesel     101187
lpg          4566
cng           511
hybrid        259
andere         99
elektro        92
Name: fuel_type, dtype: int64
autos.brand.value_counts()
volkswagen 61679 bmw 35601 mercedes_benz 31811 audi 28711 opel 25474 ford 16722 renault 10890 peugeot 8260 fiat 6123 skoda 5223 seat 5219 smart 4754 toyota 4227 mazda 4189 citroen 3957 nissan 3474 mini 3328 hyundai 3107 sonstige_autos 2879 volvo 2476 kia 2164 porsche 2137 honda 2084 mitsubishi 1908 suzuki 1767 alfa_romeo 1747 chevrolet 1685 chrysler 1063 dacia 872 jeep 769 land_rover 753 jaguar 588 subaru 499 daihatsu 398 saab 398 trabant 275 lancia 264 daewoo 258 rover 222 lada 179 Name: brand, dtype: int64
These two columns also look fine.
Right now, the crawled_date, last_seen, and ad_created columns are all stored as strings by pandas. To understand these columns quantitatively, the data would need to be converted into a numerical (datetime) representation.
Let's see how these columns are formatted now.
autos[['crawled_date','ad_created','last_seen']].head()
crawled_date | ad_created | last_seen | |
---|---|---|---|
1 | 2016-03-24 10:58:45 | 2016-03-24 00:00:00 | 2016-04-07 01:46:50 |
2 | 2016-03-14 12:52:21 | 2016-03-14 00:00:00 | 2016-04-05 12:47:46 |
3 | 2016-03-17 16:54:04 | 2016-03-17 00:00:00 | 2016-03-17 17:40:17 |
4 | 2016-03-31 17:25:20 | 2016-03-31 00:00:00 | 2016-04-06 10:17:21 |
6 | 2016-04-01 20:48:51 | 2016-04-01 00:00:00 | 2016-04-05 18:18:39 |
Let's calculate the distribution of values in the crawled_date, ad_created, and last_seen columns (all string columns) as percentages.
# normalize=True to show relative frequencies, dropna=False to include null values
autos.crawled_date.value_counts(normalize = True, dropna = False)
2016-03-05 14:25:23 0.000017 2016-03-20 16:50:22 0.000017 2016-04-01 15:56:46 0.000017 2016-04-01 19:25:23 0.000017 2016-04-02 22:54:55 0.000017 ... 2016-03-19 12:37:58 0.000003 2016-04-02 22:38:54 0.000003 2016-03-26 22:38:35 0.000003 2016-03-08 17:49:51 0.000003 2016-03-15 11:38:11 0.000003 Name: crawled_date, Length: 231993, dtype: float64
autos.crawled_date.value_counts().sort_index(ascending = False)
2016-04-07 14:36:58 1 2016-04-07 14:36:57 1 2016-04-07 14:36:56 2 2016-04-07 14:36:55 1 2016-04-07 14:36:53 1 .. 2016-03-05 14:06:27 2 2016-03-05 14:06:26 1 2016-03-05 14:06:25 2 2016-03-05 14:06:24 2 2016-03-05 14:06:23 1 Name: crawled_date, Length: 231993, dtype: int64
Observations regarding crawled_date: the listings were crawled continuously between 2016-03-05 and 2016-04-07, and the crawl timestamps are spread fairly evenly over that period (no single timestamp accounts for more than a handful of listings).
autos.ad_created.value_counts(normalize = True, dropna = False).sort_index(ascending = False)
2016-04-07 00:00:00 0.001555 2016-04-06 00:00:00 0.003210 2016-04-05 00:00:00 0.011509 2016-04-04 00:00:00 0.038048 2016-04-03 00:00:00 0.039999 ... 2015-08-07 00:00:00 0.000003 2015-06-18 00:00:00 0.000003 2015-06-11 00:00:00 0.000003 2015-03-20 00:00:00 0.000003 2014-03-10 00:00:00 0.000003 Name: ad_created, Length: 112, dtype: float64
autos.ad_created.value_counts()
2016-04-03 00:00:00 11525 2016-04-04 00:00:00 10963 2016-03-20 00:00:00 10580 2016-03-12 00:00:00 10492 2016-04-02 00:00:00 10334 ... 2015-03-20 00:00:00 1 2015-11-17 00:00:00 1 2014-03-10 00:00:00 1 2015-11-13 00:00:00 1 2015-12-27 00:00:00 1 Name: ad_created, Length: 112, dtype: int64
Observations regarding ad_created: most ads were created within the month or two before they were crawled (the bulk of the creation dates fall in March and April 2016), but a few listings are much older; the earliest was created on 2014-03-10.
autos.last_seen.value_counts(dropna = False)
2016-04-07 06:45:59 17 2016-04-06 04:17:20 16 2016-04-07 07:17:35 16 2016-04-06 12:15:45 16 2016-04-06 08:46:37 15 .. 2016-03-24 17:46:40 1 2016-04-03 12:52:09 1 2016-03-24 11:48:44 1 2016-03-12 15:45:13 1 2016-03-23 02:15:34 1 Name: last_seen, Length: 145071, dtype: int64
autos.last_seen.value_counts().sort_index(ascending = False)
2016-04-07 14:58:50 10 2016-04-07 14:58:49 8 2016-04-07 14:58:48 8 2016-04-07 14:58:47 9 2016-04-07 14:58:46 8 .. 2016-03-05 14:44:41 1 2016-03-05 14:35:28 1 2016-03-05 14:18:30 1 2016-03-05 14:15:39 1 2016-03-05 14:15:16 1 Name: last_seen, Length: 145071, dtype: int64
Observations regarding last_seen: the values fall in the same 2016-03-05 to 2016-04-07 window as the crawl dates. This column records when the crawler last saw a listing online, so it gives a rough indication of when a car was sold or its ad removed.
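The conversion mentioned earlier was never actually performed. A minimal sketch of how it could be done (my own addition, not run above): keep only the 'YYYY-MM-DD' part of each string for a daily distribution, or convert the columns to proper datetimes with pd.to_datetime for date arithmetic.
# sketch: daily distribution of crawl dates using only the first 10 characters ('YYYY-MM-DD')
print(autos['crawled_date'].str[:10].value_counts(normalize = True, dropna = False).sort_index())
# sketch: convert the three columns to datetime64, then e.g. measure how long ads stayed online
for col in ['crawled_date', 'ad_created', 'last_seen']:
    autos[col] = pd.to_datetime(autos[col])
print((autos['last_seen'] - autos['ad_created']).describe())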
Now let's fix the registration_year column. First, a quick reminder of what was wrong with it.
autos.registration_year.describe()
count    288134.000000
mean       2005.148965
std          65.538133
min        1000.000000
25%        2001.000000
50%        2005.000000
75%        2009.000000
max        9999.000000
Name: registration_year, dtype: float64
The min and max values make no sense: the ads were collected up to 2016, and even the oldest cars on the site can realistically date back only to the early-to-mid 20th century. Let's limit registration_year to the range from 1930 to 2016.
autos[(autos['registration_year'] <=2016) & (autos['registration_year'] >=1930)].shape
(277316, 17)
autos.shape
(288134, 17)
So only around 11,000 rows (roughly 4%) will be removed. Let's drop these unrealistic years from the dataset.
autos = autos[(autos['registration_year'] <=2016) & (autos['registration_year'] >=1930)]
#alt version:
#autos = autos[autos['registration_year'].between(1930,2016)]
Quick look to verify that everything worked well.
autos.shape
(277316, 17)
In total, around 94,000 rows and 3 columns have been removed from the dataset during the cleaning process.
And a quick analysis of the modified registration_year column.
autos.registration_year.describe()
count    277316.000000
mean       2003.941702
std           7.009812
min        1930.000000
25%        2001.000000
50%        2005.000000
75%        2008.000000
max        2016.000000
Name: registration_year, dtype: float64
autos.registration_year.value_counts(normalize = True)
2005 0.074381 2006 0.071366 2004 0.067818 2003 0.065121 2007 0.062636 ... 1949 0.000011 1940 0.000007 1942 0.000007 1944 0.000007 1946 0.000007 Name: registration_year, Length: 87, dtype: float64
It works perfectly. This column has been cleaned.
As a final data cleaning step, let's translate one of the columns (gearbox) from German to English using the .map() method.
autos.gearbox.unique()
array(['manuell', 'automatik', nan], dtype=object)
mapping_dict = {
'manuell': 'manual',
'automatik': 'automatic',
'nan': 'no',
}
autos['gearbox'] = autos['gearbox'].map(mapping_dict)
autos['gearbox'].unique()
array(['manual', 'automatic', nan], dtype=object)
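As the output confirms, the missing values stayed NaN: the 'nan' key in the dictionary never matches anything, because missing entries are np.nan (a float), not the string 'nan'. If the goal were to label them explicitly, a fillna call would do it; a sketch, with 'unknown' as my own placeholder label:
# sketch: give missing gearbox values an explicit label instead of NaN
# autos['gearbox'] = autos['gearbox'].fillna('unknown')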
Let's perform some data aggregation and explore some properties of the "brand" column.
autos.brand.unique()
array(['audi', 'jeep', 'volkswagen', 'skoda', 'peugeot', 'ford', 'mazda', 'nissan', 'renault', 'mercedes_benz', 'bmw', 'honda', 'opel', 'mini', 'smart', 'hyundai', 'subaru', 'volvo', 'mitsubishi', 'alfa_romeo', 'kia', 'seat', 'suzuki', 'lancia', 'porsche', 'citroen', 'fiat', 'toyota', 'chevrolet', 'sonstige_autos', 'dacia', 'daihatsu', 'chrysler', 'jaguar', 'rover', 'saab', 'daewoo', 'land_rover', 'trabant', 'lada'], dtype=object)
autos.brand.value_counts()
volkswagen 58798 bmw 34650 mercedes_benz 30879 audi 27886 opel 24272 ford 16034 renault 10291 peugeot 7993 fiat 5809 skoda 5097 seat 4938 smart 4571 toyota 4104 mazda 4030 citroen 3783 nissan 3342 mini 3224 hyundai 2994 sonstige_autos 2781 volvo 2426 porsche 2119 kia 2081 honda 1986 mitsubishi 1839 suzuki 1714 alfa_romeo 1694 chevrolet 1633 chrysler 1026 dacia 847 land_rover 746 jeep 743 jaguar 581 subaru 484 saab 391 daihatsu 385 trabant 266 lancia 254 daewoo 243 rover 209 lada 173 Name: brand, dtype: int64
As expected, German brands are the most popular in Germany. Let's aggregate over the top 15 most popular brands.
Pop_brands = autos.brand.value_counts().index[:15]
Pop_brands
Index(['volkswagen', 'bmw', 'mercedes_benz', 'audi', 'opel', 'ford', 'renault', 'peugeot', 'fiat', 'skoda', 'seat', 'smart', 'toyota', 'mazda', 'citroen'], dtype='object')
Let's calculate the mean price for each of these popular brands.
mean_pop_brands = {}
for brand in Pop_brands:
selected_rows = autos[autos['brand'] == brand]
mean = selected_rows['price'].mean()
mean_pop_brands[brand] = mean
mean_pop_brands
{'volkswagen': 6659.831320793224, 'bmw': 9366.378528138528, 'mercedes_benz': 9302.004404287703, 'audi': 10135.101090152764, 'opel': 4310.82325313118, 'ford': 5359.863976549832, 'renault': 3559.205908075017, 'peugeot': 4090.3778306017766, 'fiat': 4136.7336891031155, 'skoda': 6927.274475181479, 'seat': 5767.324827865533, 'smart': 3887.7481951432947, 'toyota': 5769.293372319688, 'mazda': 5205.706451612903, 'citroen': 4607.255881575469}
dict(sorted(mean_pop_brands.items(), key=lambda item: item[1])) #sorting the dictionary by value
{'renault': 3559.205908075017, 'smart': 3887.7481951432947, 'peugeot': 4090.3778306017766, 'fiat': 4136.7336891031155, 'opel': 4310.82325313118, 'citroen': 4607.255881575469, 'mazda': 5205.706451612903, 'ford': 5359.863976549832, 'seat': 5767.324827865533, 'toyota': 5769.293372319688, 'volkswagen': 6659.831320793224, 'skoda': 6927.274475181479, 'mercedes_benz': 9302.004404287703, 'bmw': 9366.378528138528, 'audi': 10135.101090152764}
Audi, BMW, and Mercedes-Benz are the most expensive brands; Renault and Smart are the cheapest.
Let's use the pandas Series constructor to convert the dictionary into a pandas Series.
brands_price = pd.Series(mean_pop_brands)
brands_price
volkswagen 6659.831321 bmw 9366.378528 mercedes_benz 9302.004404 audi 10135.101090 opel 4310.823253 ford 5359.863977 renault 3559.205908 peugeot 4090.377831 fiat 4136.733689 skoda 6927.274475 seat 5767.324828 smart 3887.748195 toyota 5769.293372 mazda 5205.706452 citroen 4607.255882 dtype: float64
The dictionary keys became the index of the Series. Now it's much easier to sort.
#top 5 most expensive brands
brands_price.sort_values(ascending = False)[:5]
audi 10135.101090 bmw 9366.378528 mercedes_benz 9302.004404 skoda 6927.274475 volkswagen 6659.831321 dtype: float64
Another option is to construct a new DataFrame; in this case, we need to specify a column name.
brands_price_kilometer_df = pd.DataFrame(brands_price, columns=['mean_price'])
brands_price_kilometer_df
mean_price | |
---|---|
volkswagen | 6659.831321 |
bmw | 9366.378528 |
mercedes_benz | 9302.004404 |
audi | 10135.101090 |
opel | 4310.823253 |
ford | 5359.863977 |
renault | 3559.205908 |
peugeot | 4090.377831 |
fiat | 4136.733689 |
skoda | 6927.274475 |
seat | 5767.324828 |
smart | 3887.748195 |
toyota | 5769.293372 |
mazda | 5205.706452 |
citroen | 4607.255882 |
Let's extend this DataFrame so it contains both the mean price and the mean mileage (kilometers driven) for the top 15 popular brands.
mean_kils = {}
for brand in Pop_brands:
selected_rows = autos[autos['brand'] == brand]
mean = selected_rows['kilometer'].mean()
mean_kils[brand] = mean
mean_kils
{'volkswagen': 125505.8845538964, 'bmw': 132179.07647907647, 'mercedes_benz': 129764.24106998283, 'audi': 128042.20755934878, 'opel': 123018.70468029004, 'ford': 119245.97729824124, 'renault': 120898.35778835876, 'peugeot': 119753.53434254973, 'fiat': 107404.02823205371, 'skoda': 112429.86070237395, 'seat': 114721.54718509519, 'smart': 97216.14526361847, 'toyota': 115063.35282651072, 'mazda': 120976.42679900744, 'citroen': 115701.82394924662}
brands_price_kilometer_df['mean kilometer'] = mean_kils.values()
brands_price_kilometer_df
mean_price | mean kilometer | |
---|---|---|
volkswagen | 6659.831321 | 125505.884554 |
bmw | 9366.378528 | 132179.076479 |
mercedes_benz | 9302.004404 | 129764.241070 |
audi | 10135.101090 | 128042.207559 |
opel | 4310.823253 | 123018.704680 |
ford | 5359.863977 | 119245.977298 |
renault | 3559.205908 | 120898.357788 |
peugeot | 4090.377831 | 119753.534343 |
fiat | 4136.733689 | 107404.028232 |
skoda | 6927.274475 | 112429.860702 |
seat | 5767.324828 | 114721.547185 |
smart | 3887.748195 | 97216.145264 |
toyota | 5769.293372 | 115063.352827 |
mazda | 5205.706452 | 120976.426799 |
citroen | 4607.255882 | 115701.823949 |
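A side note on the assignment above: passing mean_kils.values() works only because the dictionary was built in the same brand order as the DataFrame index. A safer sketch is to wrap the dictionary in a Series, so that pandas aligns on the index labels regardless of order:
# equivalent result, but aligned by brand name rather than by position
brands_price_kilometer_df['mean kilometer'] = pd.Series(mean_kils)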
I can conclude that there is no clear correlation between a brand's mean mileage and its mean price; the brand itself affects the price much more.
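To back this up with a number, the correlation between the two columns could be computed; a quick sketch (not run above):
# sketch: correlation between mean price and mean mileage across the top 15 brands
print(brands_price_kilometer_df['mean_price'].corr(brands_price_kilometer_df['mean kilometer']))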
Now, let's discover what BMW models are the most popular.
bmw = autos.loc[autos['brand'] == 'bmw', 'model']
bmw.value_counts()
3er        17064
5er         7705
1er         3687
x_reihe     2245
7er         1039
z_reihe      871
m_reihe      339
andere       318
6er          242
i3             7
Name: model, dtype: int64
The BMW 3 Series and 5 Series are the most popular.
As the last task, let's print the most popular model for each of the top brands.
top_model = {}
for brand in Pop_brands:
selected_rows = autos[autos['brand'] == brand]
first_model = selected_rows['model'].value_counts().index[0]
top_model[brand] = first_model
top_model_by_brand = pd.Series(top_model)
top_model_by_brand
volkswagen           golf
bmw                   3er
mercedes_benz    c_klasse
audi                   a4
opel                astra
ford                focus
renault              clio
peugeot           2_reihe
fiat                punto
skoda             octavia
seat                ibiza
smart              fortwo
toyota              yaris
mazda             6_reihe
citroen            andere
dtype: object
In this data analysis project, I cleaned and analyzed used car listings from German eBay. In particular, the following tasks were completed:
Data exploration: inspecting the dataset with head(), info(), and describe() to find columns that needed cleaning.
Data cleaning: renaming the columns, dropping uninformative columns (seller, offer_type, pictures_number), removing unrealistic price and registration_year values, and translating the gearbox column from German to English. In total, around 94,000 rows and 3 columns were removed from the dataset during the cleaning process.
Data analysis: aggregating mean price and mean mileage for the 15 most popular brands and finding the most popular model for each of them.
The main achievement is that I became familiar with the basics of the pandas library. I believe data aggregation should not be done with for loops; pandas has built-in methods for that, and I hope to explore them further in my data science journey.
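As a pointer for that next step, here is a minimal sketch (my own addition) of how the per-brand aggregations above could be written with groupby instead of explicit loops:
# sketch: mean price and mileage per popular brand, without a for loop
agg = (autos[autos['brand'].isin(Pop_brands)]
       .groupby('brand')[['price', 'kilometer']]
       .mean()
       .sort_values('price', ascending = False))
print(agg)
# sketch: most popular model per brand, also without a loop
top_models = (autos[autos['brand'].isin(Pop_brands)]
              .groupby('brand')['model']
              .agg(lambda s: s.value_counts().index[0]))
print(top_models)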