In this guided project, I'll work with a dataset of used cars from eBay Kleinanzeigen, a classifieds section of the German eBay website.
This is my first data cleaning and data analysis project. The main goal is to practice the pandas and NumPy libraries I learned on the Dataquest platform.
The data dictionary provided with the data is as follows:
Let's start with importing the libraries and the dataset.
import pandas as pd
import numpy as np
autos = pd.read_csv('autos.csv', encoding = 'Latin-1')
autos
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | kilometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-24 11:52:17 | Golf_3_1.6 | privat | Angebot | 480 | test | NaN | 1993 | manuell | 0 | golf | 150000 | 0 | benzin | volkswagen | NaN | 2016-03-24 00:00:00 | 0 | 70435 | 2016-04-07 03:16:57 |
1 | 2016-03-24 10:58:45 | A5_Sportback_2.7_Tdi | privat | Angebot | 18300 | test | coupe | 2011 | manuell | 190 | NaN | 125000 | 5 | diesel | audi | ja | 2016-03-24 00:00:00 | 0 | 66954 | 2016-04-07 01:46:50 |
2 | 2016-03-14 12:52:21 | Jeep_Grand_Cherokee_"Overland" | privat | Angebot | 9800 | test | suv | 2004 | automatik | 163 | grand | 125000 | 8 | diesel | jeep | NaN | 2016-03-14 00:00:00 | 0 | 90480 | 2016-04-05 12:47:46 |
3 | 2016-03-17 16:54:04 | GOLF_4_1_4__3TÜRER | privat | Angebot | 1500 | test | kleinwagen | 2001 | manuell | 75 | golf | 150000 | 6 | benzin | volkswagen | nein | 2016-03-17 00:00:00 | 0 | 91074 | 2016-03-17 17:40:17 |
4 | 2016-03-31 17:25:20 | Skoda_Fabia_1.4_TDI_PD_Classic | privat | Angebot | 3600 | test | kleinwagen | 2008 | manuell | 69 | fabia | 90000 | 7 | diesel | skoda | nein | 2016-03-31 00:00:00 | 0 | 60437 | 2016-04-06 10:17:21 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
371523 | 2016-03-14 17:48:27 | Suche_t4___vito_ab_6_sitze | privat | Angebot | 2200 | test | NaN | 2005 | NaN | 0 | NaN | 20000 | 1 | NaN | sonstige_autos | NaN | 2016-03-14 00:00:00 | 0 | 39576 | 2016-04-06 00:46:52 |
371524 | 2016-03-05 19:56:21 | Smart_smart_leistungssteigerung_100ps | privat | Angebot | 1199 | test | cabrio | 2000 | automatik | 101 | fortwo | 125000 | 3 | benzin | smart | nein | 2016-03-05 00:00:00 | 0 | 26135 | 2016-03-11 18:17:12 |
371525 | 2016-03-19 18:57:12 | Volkswagen_Multivan_T4_TDI_7DC_UY2 | privat | Angebot | 9200 | test | bus | 1996 | manuell | 102 | transporter | 150000 | 3 | diesel | volkswagen | nein | 2016-03-19 00:00:00 | 0 | 87439 | 2016-04-07 07:15:26 |
371526 | 2016-03-20 19:41:08 | VW_Golf_Kombi_1_9l_TDI | privat | Angebot | 3400 | test | kombi | 2002 | manuell | 100 | golf | 150000 | 6 | diesel | volkswagen | NaN | 2016-03-20 00:00:00 | 0 | 40764 | 2016-03-24 12:45:21 |
371527 | 2016-03-07 19:39:19 | BMW_M135i_vollausgestattet_NP_52.720____Euro | privat | Angebot | 28990 | control | limousine | 2013 | manuell | 320 | m_reihe | 50000 | 8 | benzin | bmw | nein | 2016-03-07 00:00:00 | 0 | 73326 | 2016-03-22 03:17:10 |
371528 rows × 20 columns
The default encoding (UTF-8) couldn't decode this dataset, so I used Latin-1 instead. Jupyter displays the first and last five rows of a large DataFrame when a cell ends with the bare variable name. Now, let's get some basic info about the data.
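As an aside, if the right encoding weren't known in advance, a small check like the sketch below could find one by trial (my own addition; the candidate list is just a guess at common encodings):
# sketch: try a few common encodings on a small sample of the file
for enc in ('utf-8', 'latin-1', 'windows-1252'):
    try:
        pd.read_csv('autos.csv', encoding=enc, nrows=5)
        print(enc, 'works')
        break
    except UnicodeDecodeError:
        print(enc, 'failed to decode the file')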
print("first 5 rows:", autos.head(), sep='\n')
print(autos.info())
print("shape:", autos.shape)
first 5 rows:
          dateCrawled                            name  seller offerType  \
0 2016-03-24 11:52:17                      Golf_3_1.6  privat   Angebot
1 2016-03-24 10:58:45            A5_Sportback_2.7_Tdi  privat   Angebot
2 2016-03-14 12:52:21  Jeep_Grand_Cherokee_"Overland"  privat   Angebot
3 2016-03-17 16:54:04              GOLF_4_1_4__3TÜRER  privat   Angebot
4 2016-03-31 17:25:20  Skoda_Fabia_1.4_TDI_PD_Classic  privat   Angebot
   price abtest vehicleType  yearOfRegistration    gearbox  powerPS  model  \
0    480   test         NaN                1993    manuell        0   golf
1  18300   test       coupe                2011    manuell      190    NaN
2   9800   test         suv                2004  automatik      163  grand
3   1500   test  kleinwagen                2001    manuell       75   golf
4   3600   test  kleinwagen                2008    manuell       69  fabia
   kilometer  monthOfRegistration fuelType       brand notRepairedDamage  \
0     150000                    0   benzin  volkswagen               NaN
1     125000                    5   diesel        audi                ja
2     125000                    8   diesel        jeep               NaN
3     150000                    6   benzin  volkswagen              nein
4      90000                    7   diesel       skoda              nein
           dateCreated  nrOfPictures  postalCode             lastSeen
0  2016-03-24 00:00:00             0       70435  2016-04-07 03:16:57
1  2016-03-24 00:00:00             0       66954  2016-04-07 01:46:50
2  2016-03-14 00:00:00             0       90480  2016-04-05 12:47:46
3  2016-03-17 00:00:00             0       91074  2016-03-17 17:40:17
4  2016-03-31 00:00:00             0       60437  2016-04-06 10:17:21
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 371528 entries, 0 to 371527
Data columns (total 20 columns):
 #   Column               Non-Null Count   Dtype
---  ------               --------------   -----
 0   dateCrawled          371528 non-null  object
 1   name                 371528 non-null  object
 2   seller               371528 non-null  object
 3   offerType            371528 non-null  object
 4   price                371528 non-null  int64
 5   abtest               371528 non-null  object
 6   vehicleType          333659 non-null  object
 7   yearOfRegistration   371528 non-null  int64
 8   gearbox              351319 non-null  object
 9   powerPS              371528 non-null  int64
 10  model                351044 non-null  object
 11  kilometer            371528 non-null  int64
 12  monthOfRegistration  371528 non-null  int64
 13  fuelType             338142 non-null  object
 14  brand                371528 non-null  object
 15  notRepairedDamage    299468 non-null  object
 16  dateCreated          371528 non-null  object
 17  nrOfPictures         371528 non-null  int64
 18  postalCode           371528 non-null  int64
 19  lastSeen             371528 non-null  object
dtypes: int64(7), object(13)
memory usage: 56.7+ MB
None
shape: (371528, 20)
Several observations can be made from this preliminary info: the column names use camelCase instead of Python's preferred snake_case, some of them are not very descriptive, several columns (vehicleType, gearbox, model, fuelType, notRepairedDamage) contain null values, and the date columns are stored as strings rather than datetimes.
Let's rename the columns from camelCase to snake_case and make them more descriptive.
#quick look at columns' names
columns = autos.columns
# Method 1: create a dictionary mapping old column names to new ones
new_col_names= {'dateCrawled':'crawled_date',
'offerType':'offer_type',
'abtest':'a/b_test',
'vehicleType':'vehicle_type',
'yearOfRegistration':'registration_year',
'powerPS':'power_PS',
'monthOfRegistration':'registration_month',
'fuelType':'fuel_type',
'notRepairedDamage':'unrepaired_damage',
'dateCreated':'ad_created',
'nrOfPictures':'pictures_number',
'postalCode':'postal_code',
'lastSeen':'last_seen'}
autos.rename(new_col_names, axis = 1, inplace = True)
# Method 2: rename each column using str.replace
#creating a copy of columns' series
#copy = columns.copy()
# copy = copy.str.replace('dateCrawled', 'crawled_date')
# copy = copy.str.replace('T','_t')
# copy = copy.str.replace('abtest','a/b_test')
# copy = copy.str.replace('yearOfRegistration','registration_year')
#etc..
# #assigning changed columns' names to the original dataset
#autos.columns = copy
# autos.columns
# Method 3: initially I thought about using Series.map for the renaming, but since each column name
# occurs only once, I decided against it. Mapping is better suited to rows, where values repeat many times.
# mapping_dict = {
#     'dateCrawled': 'crawled_date',
#     'name': 'name',
#     'seller': 'seller',
#     'offerType': 'offer_type',
# }
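A generic camelCase-to-snake_case conversion could also be done with a regular expression. This is just a sketch of my own (not part of the original notebook); it would produce names like date_crawled and power_ps rather than the custom names chosen in Method 1, so it is an alternative rather than a continuation:
# Method 4 (sketch, not executed): generic camelCase -> snake_case with a regex
# import re
# autos.columns = [re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', col).lower() for col in autos.columns]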
Now let's do some basic data exploration to determine what other cleaning tasks need to be done. Initially, I will look for text columns where all or almost all values are the same (these can often be dropped, since they carry no useful information for analysis) and for numeric columns with unrealistic minimum or maximum values.
# include='all' to include categorical data as well
autos.describe(include = 'all')
crawled_date | name | seller | offer_type | price | a/b_test | vehicle_type | registration_year | gearbox | power_PS | model | kilometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | pictures_number | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 371528 | 371528 | 371528 | 371528 | 3.715280e+05 | 371528 | 333659 | 371528.000000 | 351319 | 371528.000000 | 351044 | 371528.000000 | 371528.000000 | 338142 | 371528 | 299468 | 371528 | 371528.0 | 371528.00000 | 371528 |
unique | 280500 | 233531 | 2 | 2 | NaN | 2 | 8 | NaN | 2 | NaN | 251 | NaN | NaN | 7 | 40 | 2 | 114 | NaN | NaN | 182806 |
top | 2016-03-24 14:49:47 | Ford_Fiesta | privat | Angebot | NaN | test | limousine | NaN | manuell | NaN | golf | NaN | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:45:59 |
freq | 7 | 657 | 371525 | 371516 | NaN | 192585 | 95894 | NaN | 274214 | NaN | 30070 | NaN | NaN | 223857 | 79640 | 263182 | 14450 | NaN | NaN | 17 |
mean | NaN | NaN | NaN | NaN | 1.729514e+04 | NaN | NaN | 2004.577997 | NaN | 115.549477 | NaN | 125618.688228 | 5.734445 | NaN | NaN | NaN | NaN | 0.0 | 50820.66764 | NaN |
std | NaN | NaN | NaN | NaN | 3.587954e+06 | NaN | NaN | 92.866598 | NaN | 192.139578 | NaN | 40112.337051 | 3.712412 | NaN | NaN | NaN | NaN | 0.0 | 25799.08247 | NaN |
min | NaN | NaN | NaN | NaN | 0.000000e+00 | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | 5000.000000 | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.00000 | NaN |
25% | NaN | NaN | NaN | NaN | 1.150000e+03 | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | 125000.000000 | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30459.00000 | NaN |
50% | NaN | NaN | NaN | NaN | 2.950000e+03 | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | 150000.000000 | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49610.00000 | NaN |
75% | NaN | NaN | NaN | NaN | 7.200000e+03 | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | 150000.000000 | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71546.00000 | NaN |
max | NaN | NaN | NaN | NaN | 2.147484e+09 | NaN | NaN | 9999.000000 | NaN | 20000.000000 | NaN | 150000.000000 | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.00000 | NaN |
The following observations can be made: the seller and offer_type columns contain essentially a single value each (371,525 and 371,516 of 371,528 rows, respectively), pictures_number appears to be 0 everywhere (its mean and standard deviation are both 0), and price and registration_year have implausible extremes (a maximum price of about 2.1 billion and registration years from 1000 to 9999). Let's check registration_year and pictures_number more closely.
print('registration year info:', autos['registration_year'].value_counts(), sep = '\n')
print('number of pictures info:', autos['pictures_number'].value_counts(), sep = '\n')
registration year info:
2000    24551
1999    22767
2005    22316
2006    20230
2001    20218
        ...
1919        1
1915        1
7800        1
1400        1
8455        1
Name: registration_year, Length: 155, dtype: int64
number of pictures info:
0    371528
Name: pictures_number, dtype: int64
The pictures_number column can be deleted, since its value is 0 in every row.
The registration_year column contains values (e.g. 1400, 7800, 8455) that cannot be real registration years.
The main goals at this stage will be to drop the uninformative columns (seller, offer_type, pictures_number), to examine the price and kilometer columns for outliers, and to fix the unrealistic registration_year values.
Let's start with dropping the seller, offer_type and pictures_number columns.
autos = autos.drop(columns = ['seller', 'offer_type', 'pictures_number'])
The following code would be needed if the price column were stored as text (e.g. "$5,000"); in this dataset price is already numeric (int64), so I keep it commented out for reference only:
# autos["price"] = autos["price"].str.replace('$', '', regex=False)
# autos["price"] = autos["price"].str.replace(',', '', regex=False).astype(int)
Let's take a closer look at the "price" and "kilometer" columns.
#start with price
autos.price.unique().shape
(5597,)
#descriptive stats
autos.price.describe()
count    3.715280e+05
mean     1.729514e+04
std      3.587954e+06
min      0.000000e+00
25%      1.150000e+03
50%      2.950000e+03
75%      7.200000e+03
max      2.147484e+09
Name: price, dtype: float64
#5 most popular values
autos.price.value_counts().head()
0 10778 500 5670 1500 5394 1000 4649 1200 4594 Name: price, dtype: int64
#frequency of prices from lowest to highest
autos.price.value_counts().sort_index(inplace = False)
0 10778 1 1189 2 12 3 8 4 1 ... 32545461 1 74185296 1 99000000 1 99999999 15 2147483647 1 Name: price, Length: 5597, dtype: int64
#frequency of the 20 lowest price values
autos.price.value_counts().sort_index(inplace = False).head(20)
0 10778 1 1189 2 12 3 8 4 1 5 26 7 3 8 9 9 8 10 84 11 5 12 8 13 7 14 5 15 27 16 2 17 5 18 3 19 3 20 51 Name: price, dtype: int64
#frequency of the 20 highest price values
autos.price.value_counts().sort_index(inplace = False).tail(20)
1234566 1 1250000 2 1300000 1 1600000 2 2795000 1 2995000 1 3890000 1 3895000 1 9999999 3 10000000 8 10010011 1 11111111 10 12345678 9 14000500 1 27322222 1 32545461 1 74185296 1 99000000 1 99999999 15 2147483647 1 Name: price, dtype: int64
Some prices look too high; let's examine a few of them.
autos.loc[autos['price'] == 1250000]
crawled_date | name | price | a/b_test | vehicle_type | registration_year | gearbox | power_PS | model | kilometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
20143 | 2016-03-18 00:55:53 | Porsche_911R_Einer_von_911 | 1250000 | test | coupe | 2016 | manuell | 500 | 911 | 5000 | 3 | benzin | porsche | nein | 2016-03-17 00:00:00 | 70435 | 2016-03-20 04:48:27 |
78480 | 2016-03-10 20:43:12 | Porsche_991_R | 1250000 | test | coupe | 2016 | manuell | 500 | 911 | 5000 | 3 | benzin | porsche | nein | 2016-03-10 00:00:00 | 10719 | 2016-03-15 15:19:48 |
These listings look reasonable for a Porsche 911 R.
autos.loc[autos['price'] == 11111111]
crawled_date | name | price | a/b_test | vehicle_type | registration_year | gearbox | power_PS | model | kilometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
14663 | 2016-04-03 06:03:56 | Tausch_moeglich_Polo_vw_n9 | 11111111 | control | coupe | 2003 | manuell | 64 | polo | 150000 | 2 | benzin | volkswagen | NaN | 2016-04-03 00:00:00 | 46535 | 2016-04-07 10:16:51 |
54507 | 2016-03-09 11:40:15 | Tausch_DKW_F8_gegen_Lanz_Bulldog | 11111111 | test | andere | 1953 | manuell | 20 | NaN | 150000 | 5 | NaN | audi | nein | 2016-03-09 00:00:00 | 7774 | 2016-04-05 14:46:32 |
90853 | 2016-04-01 05:36:16 | Verkaufe_oder_Tausche | 11111111 | control | NaN | 2018 | NaN | 0 | NaN | 150000 | 0 | NaN | bmw | NaN | 2016-04-01 00:00:00 | 6682 | 2016-04-05 02:46:37 |
127769 | 2016-03-28 12:51:51 | Golf_4_laesst_sich_nicht_oeffnen | 11111111 | test | limousine | 2003 | manuell | 75 | golf | 125000 | 9 | benzin | volkswagen | NaN | 2016-03-28 00:00:00 | 38122 | 2016-03-30 03:18:00 |
177744 | 2016-03-31 16:37:00 | Suche_Trabant_auf_Ratenzahlung | 11111111 | control | NaN | 1960 | NaN | 0 | NaN | 150000 | 0 | NaN | trabant | NaN | 2016-03-31 00:00:00 | 6449 | 2016-04-06 10:46:11 |
185227 | 2016-03-14 16:52:51 | Suche_fuer_mein_golf | 11111111 | test | NaN | 2000 | NaN | 0 | NaN | 5000 | 0 | NaN | volkswagen | NaN | 2016-03-14 00:00:00 | 19288 | 2016-04-05 23:16:23 |
210657 | 2016-03-12 21:50:57 | Escort_MK_1_Hundeknochen_zum_umbauen_auf_RS_2000 | 11111111 | test | limousine | 1973 | manuell | 48 | escort | 50000 | 3 | benzin | ford | nein | 2016-03-12 00:00:00 | 94469 | 2016-03-12 22:45:27 |
318143 | 2016-03-21 13:57:51 | Schlachte_Golf_3_gt_tdi | 11111111 | test | NaN | 1995 | NaN | 0 | NaN | 150000 | 0 | NaN | volkswagen | NaN | 2016-03-21 00:00:00 | 18519 | 2016-03-21 14:40:18 |
332028 | 2016-03-31 18:39:53 | Golf_3_Gti_"16v_Turbo_vr6_bbs" | 11111111 | control | NaN | 1995 | NaN | 116 | golf | 150000 | 6 | benzin | volkswagen | NaN | 2016-03-31 00:00:00 | 72127 | 2016-04-02 13:47:59 |
338415 | 2016-03-26 22:49:39 | Polo_6n_gti__Zustand_egal_ | 11111111 | test | NaN | 2000 | NaN | 120 | polo | 125000 | 0 | benzin | volkswagen | NaN | 2016-03-26 00:00:00 | 94577 | 2016-04-06 15:45:52 |
These prices don't make any sense; 11111111 is clearly not a real asking price.
autos.loc[autos['price'] == 10000000]
crawled_date | name | price | a/b_test | vehicle_type | registration_year | gearbox | power_PS | model | kilometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
54295 | 2016-03-08 11:51:40 | Mazda_5_2.0_CD_DPF_Exclusive | 10000000 | control | bus | 2007 | manuell | 143 | 5_reihe | 150000 | 2 | diesel | mazda | nein | 2016-03-08 00:00:00 | 83301 | 2016-03-10 14:17:16 |
121793 | 2016-03-05 20:59:54 | Alter_VW_Kaefer_Scheunenfund | 10000000 | control | NaN | 1910 | NaN | 0 | kaefer | 100000 | 0 | NaN | volkswagen | NaN | 2016-03-05 00:00:00 | 94436 | 2016-03-05 21:47:23 |
171824 | 2016-03-29 23:52:57 | suche_maserati_3200_gt_Zustand_unwichtig_laufe... | 10000000 | control | coupe | 1960 | manuell | 368 | NaN | 100000 | 1 | benzin | sonstige_autos | nein | 2016-03-29 00:00:00 | 73033 | 2016-04-06 21:18:11 |
176070 | 2016-04-02 14:43:10 | Suche_BMW_Fun | 10000000 | test | NaN | 1985 | automatik | 300 | NaN | 150000 | 0 | NaN | bmw | NaN | 2016-04-02 00:00:00 | 67480 | 2016-04-02 14:43:10 |
196306 | 2016-04-01 01:59:20 | Suche_Motorcross_.! | 10000000 | control | NaN | 2017 | manuell | 0 | golf | 5000 | 0 | benzin | volkswagen | NaN | 2016-04-01 00:00:00 | 1454 | 2016-04-07 05:15:54 |
281833 | 2016-03-19 08:50:50 | Coole_Autowerkstatt_in_Esens_gesucht! | 10000000 | test | NaN | 2019 | manuell | 10 | NaN | 5000 | 0 | NaN | sonstige_autos | ja | 2016-03-19 00:00:00 | 26487 | 2016-04-06 07:44:29 |
358868 | 2016-03-16 16:49:53 | Tausche_BMW_E46_Touring_Diesel | 10000000 | control | kombi | 2002 | manuell | 0 | NaN | 150000 | 7 | diesel | bmw | nein | 2016-03-16 00:00:00 | 25821 | 2016-03-17 15:15:40 |
363434 | 2016-03-23 06:36:20 | Vw_Passat_35i_1.8_schlachtfest | 10000000 | test | kombi | 1996 | manuell | 90 | NaN | 150000 | 0 | benzin | volkswagen | ja | 2016-03-23 00:00:00 | 36251 | 2016-03-26 01:16:15 |
Also suspicious. I will treat prices below 1,000 and above 1,250,000 as outliers and remove them.
autos = autos[(autos["price"] > 999) & (autos["price"] <= 1250000)]
Let's check the shape of the dataset to make sure the changes have been applied.
autos.shape
(288134, 17)
Now, let's take a quick look at the "kilometer" column.
autos.kilometer.describe()
count    288134.000000
mean     122703.742009
std       40825.263039
min        5000.000000
25%      100000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: kilometer, dtype: float64
autos.kilometer.value_counts()
150000    173994
125000     31973
100000     13691
90000      11294
80000      10194
70000       9193
60000       8256
50000       7340
40000       6184
30000       5640
20000       5027
5000        3627
10000       1721
Name: kilometer, dtype: int64
These values look reasonable to me and I can conclude that there is no need for further cleaning here.
autos.fuel_type.value_counts()
benzin     163613
diesel     101187
lpg          4566
cng           511
hybrid        259
andere         99
elektro        92
Name: fuel_type, dtype: int64
autos.brand.value_counts()
volkswagen 61679 bmw 35601 mercedes_benz 31811 audi 28711 opel 25474 ford 16722 renault 10890 peugeot 8260 fiat 6123 skoda 5223 seat 5219 smart 4754 toyota 4227 mazda 4189 citroen 3957 nissan 3474 mini 3328 hyundai 3107 sonstige_autos 2879 volvo 2476 kia 2164 porsche 2137 honda 2084 mitsubishi 1908 suzuki 1767 alfa_romeo 1747 chevrolet 1685 chrysler 1063 dacia 872 jeep 769 land_rover 753 jaguar 588 subaru 499 daihatsu 398 saab 398 trabant 275 lancia 264 daewoo 258 rover 222 lada 179 Name: brand, dtype: int64
These two columns also look fine.
Right now, the crawled_date, last_seen, and ad_created columns are all stored as strings by pandas. To understand these columns quantitatively, the data would need to be converted into a numerical (datetime) representation.
Let's see how these columns are formatted now.
autos[['crawled_date','ad_created','last_seen']].head()
crawled_date | ad_created | last_seen | |
---|---|---|---|
1 | 2016-03-24 10:58:45 | 2016-03-24 00:00:00 | 2016-04-07 01:46:50 |
2 | 2016-03-14 12:52:21 | 2016-03-14 00:00:00 | 2016-04-05 12:47:46 |
3 | 2016-03-17 16:54:04 | 2016-03-17 00:00:00 | 2016-03-17 17:40:17 |
4 | 2016-03-31 17:25:20 | 2016-03-31 00:00:00 | 2016-04-06 10:17:21 |
6 | 2016-04-01 20:48:51 | 2016-04-01 00:00:00 | 2016-04-05 18:18:39 |
Let's calculate the distribution of values in the crawled_date, ad_created, and last_seen columns (all string columns) as percentages.
# normalize=True to show relative frequencies, dropna=False to include null values
autos.crawled_date.value_counts(normalize = True, dropna = False)
2016-03-05 14:25:23 0.000017 2016-03-20 16:50:22 0.000017 2016-04-01 15:56:46 0.000017 2016-04-01 19:25:23 0.000017 2016-04-02 22:54:55 0.000017 ... 2016-03-19 12:37:58 0.000003 2016-04-02 22:38:54 0.000003 2016-03-26 22:38:35 0.000003 2016-03-08 17:49:51 0.000003 2016-03-15 11:38:11 0.000003 Name: crawled_date, Length: 231993, dtype: float64
autos.crawled_date.value_counts().sort_index(ascending = False)
2016-04-07 14:36:58 1 2016-04-07 14:36:57 1 2016-04-07 14:36:56 2 2016-04-07 14:36:55 1 2016-04-07 14:36:53 1 .. 2016-03-05 14:06:27 2 2016-03-05 14:06:26 1 2016-03-05 14:06:25 2 2016-03-05 14:06:24 2 2016-03-05 14:06:23 1 Name: crawled_date, Length: 231993, dtype: int64
Observations regarding crawled_date: the listings were crawled continuously between 2016-03-05 and 2016-04-07, and the crawl timestamps are spread fairly evenly over that period (no single timestamp accounts for more than a handful of listings).
autos.ad_created.value_counts(normalize = True, dropna = False).sort_index(ascending = False)
2016-04-07 00:00:00 0.001555 2016-04-06 00:00:00 0.003210 2016-04-05 00:00:00 0.011509 2016-04-04 00:00:00 0.038048 2016-04-03 00:00:00 0.039999 ... 2015-08-07 00:00:00 0.000003 2015-06-18 00:00:00 0.000003 2015-06-11 00:00:00 0.000003 2015-03-20 00:00:00 0.000003 2014-03-10 00:00:00 0.000003 Name: ad_created, Length: 112, dtype: float64
autos.ad_created.value_counts()
2016-04-03 00:00:00 11525 2016-04-04 00:00:00 10963 2016-03-20 00:00:00 10580 2016-03-12 00:00:00 10492 2016-04-02 00:00:00 10334 ... 2015-03-20 00:00:00 1 2015-11-17 00:00:00 1 2014-03-10 00:00:00 1 2015-11-13 00:00:00 1 2015-12-27 00:00:00 1 Name: ad_created, Length: 112, dtype: int64
Observations regarding ad_created: most ads were created within the month or two before they were crawled (the bulk of the creation dates fall in March and April 2016), but a few listings are much older; the earliest was created on 2014-03-10.
autos.last_seen.value_counts(dropna = False)
2016-04-07 06:45:59 17 2016-04-06 04:17:20 16 2016-04-07 07:17:35 16 2016-04-06 12:15:45 16 2016-04-06 08:46:37 15 .. 2016-03-24 17:46:40 1 2016-04-03 12:52:09 1 2016-03-24 11:48:44 1 2016-03-12 15:45:13 1 2016-03-23 02:15:34 1 Name: last_seen, Length: 145071, dtype: int64
autos.last_seen.value_counts().sort_index(ascending = False)
2016-04-07 14:58:50 10 2016-04-07 14:58:49 8 2016-04-07 14:58:48 8 2016-04-07 14:58:47 9 2016-04-07 14:58:46 8 .. 2016-03-05 14:44:41 1 2016-03-05 14:35:28 1 2016-03-05 14:18:30 1 2016-03-05 14:15:39 1 2016-03-05 14:15:16 1 Name: last_seen, Length: 145071, dtype: int64
Observations regarding last_seen: the values fall in the same 2016-03-05 to 2016-04-07 window as the crawl dates. This column records when the crawler last saw a listing online, so it gives a rough indication of when a car was sold or its ad removed.
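The conversion mentioned earlier was never actually performed. A minimal sketch of how it could be done (my own addition, not run above): keep only the 'YYYY-MM-DD' part of each string for a daily distribution, or convert the columns to proper datetimes with pd.to_datetime for date arithmetic.
# sketch: daily distribution of crawl dates using only the first 10 characters ('YYYY-MM-DD')
print(autos['crawled_date'].str[:10].value_counts(normalize = True, dropna = False).sort_index())
# sketch: convert the three columns to datetime64, then e.g. measure how long ads stayed online
for col in ['crawled_date', 'ad_created', 'last_seen']:
    autos[col] = pd.to_datetime(autos[col])
print((autos['last_seen'] - autos['ad_created']).describe())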
Now let's fix the registration_year column. First, a quick reminder of what was wrong with it.
autos.registration_year.describe()
count    288134.000000
mean       2005.148965
std          65.538133
min        1000.000000
25%        2001.000000
50%        2005.000000
75%        2009.000000
max        9999.000000
Name: registration_year, dtype: float64
The min and max values make no sense: the ads were collected up to 2016, and even the oldest cars on the site can realistically date back only to the early-to-mid 20th century. Let's limit registration_year to the range from 1930 to 2016.
autos[(autos['registration_year'] <=2016) & (autos['registration_year'] >=1930)].shape
(277316, 17)
autos.shape
(288134, 17)
So only around 11,000 rows (roughly 4%) will be removed. Let's drop these unrealistic years from the dataset.
autos = autos[(autos['registration_year'] <=2016) & (autos['registration_year'] >=1930)]
#alt version:
#autos = autos[autos['registration_year'].between(1930,2016)]
Quick look to verify that everything worked well.
autos.shape
(277316, 17)
In total, around 94,000 rows and 3 columns have been removed from the dataset during the cleaning process.
And a quick analysis of the modified registration_year column.
autos.registration_year.describe()
count    277316.000000
mean       2003.941702
std           7.009812
min        1930.000000
25%        2001.000000
50%        2005.000000
75%        2008.000000
max        2016.000000
Name: registration_year, dtype: float64
autos.registration_year.value_counts(normalize = True)
2005 0.074381 2006 0.071366 2004 0.067818 2003 0.065121 2007 0.062636 ... 1949 0.000011 1940 0.000007 1942 0.000007 1944 0.000007 1946 0.000007 Name: registration_year, Length: 87, dtype: float64
It works perfectly. This column has been cleaned.
As a final data cleaning step, let's translate one of the columns (gearbox) from German to English using the .map() method.
autos.gearbox.unique()
array(['manuell', 'automatik', nan], dtype=object)
mapping_dict = {
'manuell': 'manual',
'automatik': 'automatic',
'nan': 'no',
}
autos['gearbox'] = autos['gearbox'].map(mapping_dict)
autos['gearbox'].unique()
array(['manual', 'automatic', nan], dtype=object)
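As the output confirms, the missing values stayed NaN: the 'nan' key in the dictionary never matches anything, because missing entries are np.nan (a float), not the string 'nan'. If the goal were to label them explicitly, a fillna call would do it; a sketch, with 'unknown' as my own placeholder label:
# sketch: give missing gearbox values an explicit label instead of NaN
# autos['gearbox'] = autos['gearbox'].fillna('unknown')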
Let's perform some data aggregation and explore some properties of the "brand" column.
autos.brand.unique()
array(['audi', 'jeep', 'volkswagen', 'skoda', 'peugeot', 'ford', 'mazda', 'nissan', 'renault', 'mercedes_benz', 'bmw', 'honda', 'opel', 'mini', 'smart', 'hyundai', 'subaru', 'volvo', 'mitsubishi', 'alfa_romeo', 'kia', 'seat', 'suzuki', 'lancia', 'porsche', 'citroen', 'fiat', 'toyota', 'chevrolet', 'sonstige_autos', 'dacia', 'daihatsu', 'chrysler', 'jaguar', 'rover', 'saab', 'daewoo', 'land_rover', 'trabant', 'lada'], dtype=object)
autos.brand.value_counts()
volkswagen 58798 bmw 34650 mercedes_benz 30879 audi 27886 opel 24272 ford 16034 renault 10291 peugeot 7993 fiat 5809 skoda 5097 seat 4938 smart 4571 toyota 4104 mazda 4030 citroen 3783 nissan 3342 mini 3224 hyundai 2994 sonstige_autos 2781 volvo 2426 porsche 2119 kia 2081 honda 1986 mitsubishi 1839 suzuki 1714 alfa_romeo 1694 chevrolet 1633 chrysler 1026 dacia 847 land_rover 746 jeep 743 jaguar 581 subaru 484 saab 391 daihatsu 385 trabant 266 lancia 254 daewoo 243 rover 209 lada 173 Name: brand, dtype: int64
As expected, German brands are the most popular in Germany. Let's aggregate over the top 15 most popular brands.
Pop_brands = autos.brand.value_counts().index[:15]
Pop_brands
Index(['volkswagen', 'bmw', 'mercedes_benz', 'audi', 'opel', 'ford', 'renault', 'peugeot', 'fiat', 'skoda', 'seat', 'smart', 'toyota', 'mazda', 'citroen'], dtype='object')
Let's calculate the mean price for each of these popular brands.
mean_pop_brands = {}
for brand in Pop_brands:
selected_rows = autos[autos['brand'] == brand]
mean = selected_rows['price'].mean()
mean_pop_brands[brand] = mean
mean_pop_brands
{'volkswagen': 6659.831320793224, 'bmw': 9366.378528138528, 'mercedes_benz': 9302.004404287703, 'audi': 10135.101090152764, 'opel': 4310.82325313118, 'ford': 5359.863976549832, 'renault': 3559.205908075017, 'peugeot': 4090.3778306017766, 'fiat': 4136.7336891031155, 'skoda': 6927.274475181479, 'seat': 5767.324827865533, 'smart': 3887.7481951432947, 'toyota': 5769.293372319688, 'mazda': 5205.706451612903, 'citroen': 4607.255881575469}
dict(sorted(mean_pop_brands.items(), key=lambda item: item[1])) #sorting the dictionary by value
{'renault': 3559.205908075017, 'smart': 3887.7481951432947, 'peugeot': 4090.3778306017766, 'fiat': 4136.7336891031155, 'opel': 4310.82325313118, 'citroen': 4607.255881575469, 'mazda': 5205.706451612903, 'ford': 5359.863976549832, 'seat': 5767.324827865533, 'toyota': 5769.293372319688, 'volkswagen': 6659.831320793224, 'skoda': 6927.274475181479, 'mercedes_benz': 9302.004404287703, 'bmw': 9366.378528138528, 'audi': 10135.101090152764}
Audi, BMW, and Mercedes-Benz are the most expensive brands; Renault and Smart are the cheapest.
Let's use the pandas Series constructor to convert the dictionary into a pandas Series.
brands_price = pd.Series(mean_pop_brands)
brands_price
volkswagen 6659.831321 bmw 9366.378528 mercedes_benz 9302.004404 audi 10135.101090 opel 4310.823253 ford 5359.863977 renault 3559.205908 peugeot 4090.377831 fiat 4136.733689 skoda 6927.274475 seat 5767.324828 smart 3887.748195 toyota 5769.293372 mazda 5205.706452 citroen 4607.255882 dtype: float64
The dictionary keys became the index of the Series. Now it's much easier to sort.
#top 5 most expensive brands
brands_price.sort_values(ascending = False)[:5]
audi 10135.101090 bmw 9366.378528 mercedes_benz 9302.004404 skoda 6927.274475 volkswagen 6659.831321 dtype: float64
Another option is to construct a new DataFrame; in this case, we need to specify a column name.
brands_price_kilometer_df = pd.DataFrame(brands_price, columns=['mean_price'])
brands_price_kilometer_df
mean_price | |
---|---|
volkswagen | 6659.831321 |
bmw | 9366.378528 |
mercedes_benz | 9302.004404 |
audi | 10135.101090 |
opel | 4310.823253 |
ford | 5359.863977 |
renault | 3559.205908 |
peugeot | 4090.377831 |
fiat | 4136.733689 |
skoda | 6927.274475 |
seat | 5767.324828 |
smart | 3887.748195 |
toyota | 5769.293372 |
mazda | 5205.706452 |
citroen | 4607.255882 |
Let's extend this DataFrame so it contains both the mean price and the mean mileage (kilometers driven) for the top 15 popular brands.
mean_kils = {}
for brand in Pop_brands:
selected_rows = autos[autos['brand'] == brand]
mean = selected_rows['kilometer'].mean()
mean_kils[brand] = mean
mean_kils
{'volkswagen': 125505.8845538964, 'bmw': 132179.07647907647, 'mercedes_benz': 129764.24106998283, 'audi': 128042.20755934878, 'opel': 123018.70468029004, 'ford': 119245.97729824124, 'renault': 120898.35778835876, 'peugeot': 119753.53434254973, 'fiat': 107404.02823205371, 'skoda': 112429.86070237395, 'seat': 114721.54718509519, 'smart': 97216.14526361847, 'toyota': 115063.35282651072, 'mazda': 120976.42679900744, 'citroen': 115701.82394924662}
brands_price_kilometer_df['mean kilometer'] = mean_kils.values()
brands_price_kilometer_df
mean_price | mean kilometer | |
---|---|---|
volkswagen | 6659.831321 | 125505.884554 |
bmw | 9366.378528 | 132179.076479 |
mercedes_benz | 9302.004404 | 129764.241070 |
audi | 10135.101090 | 128042.207559 |
opel | 4310.823253 | 123018.704680 |
ford | 5359.863977 | 119245.977298 |
renault | 3559.205908 | 120898.357788 |
peugeot | 4090.377831 | 119753.534343 |
fiat | 4136.733689 | 107404.028232 |
skoda | 6927.274475 | 112429.860702 |
seat | 5767.324828 | 114721.547185 |
smart | 3887.748195 | 97216.145264 |
toyota | 5769.293372 | 115063.352827 |
mazda | 5205.706452 | 120976.426799 |
citroen | 4607.255882 | 115701.823949 |
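A side note on the assignment above: passing mean_kils.values() works only because the dictionary was built in the same brand order as the DataFrame index. A safer sketch is to wrap the dictionary in a Series, so that pandas aligns on the index labels regardless of order:
# equivalent result, but aligned by brand name rather than by position
brands_price_kilometer_df['mean kilometer'] = pd.Series(mean_kils)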
I can conclude that there is no clear correlation between a brand's mean mileage and its mean price; the brand itself affects the price much more.
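To back this up with a number, the correlation between the two columns could be computed; a quick sketch (not run above):
# sketch: correlation between mean price and mean mileage across the top 15 brands
print(brands_price_kilometer_df['mean_price'].corr(brands_price_kilometer_df['mean kilometer']))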
Now, let's discover what BMW models are the most popular.
bmw = autos.loc[autos['brand'] == 'bmw', 'model']
bmw.value_counts()
3er        17064
5er         7705
1er         3687
x_reihe     2245
7er         1039
z_reihe      871
m_reihe      339
andere       318
6er          242
i3             7
Name: model, dtype: int64
The BMW 3 Series and 5 Series are the most popular.
As the last task, let's print the most popular model for each of the top brands.
top_model = {}
for brand in Pop_brands:
selected_rows = autos[autos['brand'] == brand]
first_model = selected_rows['model'].value_counts().index[0]
top_model[brand] = first_model
top_model_by_brand = pd.Series(top_model)
top_model_by_brand
volkswagen           golf
bmw                   3er
mercedes_benz    c_klasse
audi                   a4
opel                astra
ford                focus
renault              clio
peugeot           2_reihe
fiat                punto
skoda             octavia
seat                ibiza
smart              fortwo
toyota              yaris
mazda             6_reihe
citroen            andere
dtype: object
In this data analysis project, I cleaned and analyzed used car listings from German eBay. In particular, the following tasks were completed:
Data exploration: inspecting the dataset with head(), info(), and describe() to find columns that needed cleaning.
Data cleaning: renaming the columns, dropping uninformative columns (seller, offer_type, pictures_number), removing unrealistic price and registration_year values, and translating the gearbox column from German to English. In total, around 94,000 rows and 3 columns were removed from the dataset during the cleaning process.
Data analysis: aggregating mean price and mean mileage for the 15 most popular brands and finding the most popular model for each of them.
The main achievement is that I became familiar with the basics of the pandas library. I believe data aggregation should not be done with for loops; pandas has built-in methods for that, and I hope to explore them further in my data science journey.
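As a pointer for that next step, here is a minimal sketch (my own addition) of how the per-brand aggregations above could be written with groupby instead of explicit loops:
# sketch: mean price and mileage per popular brand, without a for loop
agg = (autos[autos['brand'].isin(Pop_brands)]
       .groupby('brand')[['price', 'kilometer']]
       .mean()
       .sort_values('price', ascending = False))
print(agg)
# sketch: most popular model per brand, also without a loop
top_models = (autos[autos['brand'].isin(Pop_brands)]
              .groupby('brand')['model']
              .agg(lambda s: s.value_counts().index[0]))
print(top_models)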