Here in this Project I will try to clean the 50000 sample used cars as it was crawled from Ebay-German Website.
We will do the following to our sample: 1- Clean the Data and analyze the included used car listing.
2-And we will familiarize our self with the benefits of jupyter notebooks brings when working with Pandas library.
import numpy as np
import pandas as pd
autos=pd.read_csv('autos.csv', encoding='Latin-1')
autos
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
49995 | 2016-03-27 14:38:19 | Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon | privat | Angebot | $24,900 | control | limousine | 2011 | automatik | 239 | q5 | 100,000km | 1 | diesel | audi | nein | 2016-03-27 00:00:00 | 0 | 82131 | 2016-04-01 13:47:40 |
49996 | 2016-03-28 10:50:25 | Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... | privat | Angebot | $1,980 | control | cabrio | 1996 | manuell | 75 | astra | 150,000km | 5 | benzin | opel | nein | 2016-03-28 00:00:00 | 0 | 44807 | 2016-04-02 14:18:02 |
49997 | 2016-04-02 14:44:48 | Fiat_500_C_1.2_Dualogic_Lounge | privat | Angebot | $13,200 | test | cabrio | 2014 | automatik | 69 | 500 | 5,000km | 11 | benzin | fiat | nein | 2016-04-02 00:00:00 | 0 | 73430 | 2016-04-04 11:47:27 |
49998 | 2016-03-08 19:25:42 | Audi_A3_2.0_TDI_Sportback_Ambition | privat | Angebot | $22,900 | control | kombi | 2013 | manuell | 150 | a3 | 40,000km | 11 | diesel | audi | nein | 2016-03-08 00:00:00 | 0 | 35683 | 2016-04-05 16:45:07 |
49999 | 2016-03-14 00:42:12 | Opel_Vectra_1.6_16V | privat | Angebot | $1,250 | control | limousine | 1996 | manuell | 101 | vectra | 150,000km | 1 | benzin | opel | nein | 2016-03-13 00:00:00 | 0 | 45897 | 2016-04-06 21:18:48 |
50000 rows × 20 columns
autos.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 dateCrawled 50000 non-null object 1 name 50000 non-null object 2 seller 50000 non-null object 3 offerType 50000 non-null object 4 price 50000 non-null object 5 abtest 50000 non-null object 6 vehicleType 44905 non-null object 7 yearOfRegistration 50000 non-null int64 8 gearbox 47320 non-null object 9 powerPS 50000 non-null int64 10 model 47242 non-null object 11 odometer 50000 non-null object 12 monthOfRegistration 50000 non-null int64 13 fuelType 45518 non-null object 14 brand 50000 non-null object 15 notRepairedDamage 40171 non-null object 16 dateCreated 50000 non-null object 17 nrOfPictures 50000 non-null int64 18 postalCode 50000 non-null int64 19 lastSeen 50000 non-null object dtypes: int64(5), object(15) memory usage: 7.6+ MB
autos.head()
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
We can see that in this dataset there are some columns which does not have info in their respective cells(NaN) We can see also that we are dealing with Two types of data:
1- string (Shown as Object) 2- integer(Int64) 3- The Dataset contains 20 columns(most of the are Object=string)
Next we are going to change and reword the columns to better reflect the data contains in their cells.
autos.columns
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model', 'odometer', 'monthOfRegistration', 'fuelType', 'brand', 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object')
co={'dateCrawled':'date_crawled', 'offerType':'offer_type', 'vehicleType':'vehicle_type', 'yearOfRegistration':'registration_year', 'powerPS':'power_ps',
'monthOfRegistration':'registration_month', 'fuelType':'fuel_type',
'notRepairedDamage':'unrepaired_damage', 'dateCreated':'ad_created', 'nrOfPictures':'no_of_pictures', 'postalCode':'postal_code',
'lastSeen':'last_seen','name':'name','seller':'seller','price':'price','abtest':'abtest','gearbox':'gearbox','model':'model','odometer':'odometer','brand':'brand'}
autos=autos.rename(columns=co)
autos.columns
Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest', 'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model', 'odometer', 'registration_month', 'fuel_type', 'brand', 'unrepaired_damage', 'ad_created', 'no_of_pictures', 'postal_code', 'last_seen'], dtype='object')
autos.head()
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | no_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
I have change the columns names for better understanding for my self and whoever interested working with data.
The original data columns name was written based on camelCase and I have changed that to snake_case. and in the end I have re assigned the new columns names back to the original dataframe
NEXT I am going to explore my data to determine what kind of cleanin tasks left to be done. for starters, we will take a look at Text columns and chekc whether they are providing good and accountable data for our analysis. Then, we go ahead and try to clean our numeric data and convert them if we need to. The methods that we implement for our data exploration are:
1- Describe() (for both categorical and numeric columns) 2- Value_counts and head()
autos.describe(include='all')
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | no_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-03-25 19:57:10 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
based on the quick look: I have relized there are number of columns which are not doing any effect in our analysis and there are other columns which need more time to spend on.
the following columns are not play very importantly: 1- Offer_type 2- Seller
How ever the following columns need more attemtion and cleaning as well: 1- odometer 2- price
autos['price'].value_counts().unique
<bound method Series.unique of $0 1421 $500 781 $1,500 734 $2,500 643 $1,000 639 ... $8,419 1 $3,410 1 $6,889 1 $10,090 1 $19,911 1 Name: price, Length: 2357, dtype: int64>
autos['price']=autos['price'].str.replace('$','').str.replace(',','').str.replace('.','').astype(int)
autos['price']
0 5000 1 8500 2 8990 3 4350 4 1350 ... 49995 24900 49996 1980 49997 13200 49998 22900 49999 1250 Name: price, Length: 50000, dtype: int32
autos=autos.rename(columns={'odometer':'odometer_km'})
autos['odometer_km']=autos['odometer_km'].str.replace(',','').str.replace('km','').astype(int)
autos['odometer_km']=autos['odometer_km'].astype(int)
autos['odometer_km'].unique().shape
(13,)
autos['odometer_km'].describe()
count 50000.000000 mean 125732.700000 std 40042.211706 min 5000.000000 25% 125000.000000 50% 150000.000000 75% 150000.000000 max 150000.000000 Name: odometer_km, dtype: float64
autos['odometer_km'].value_counts().sort_index(ascending=False).head()
150000 32424 125000 5170 100000 2169 90000 1757 80000 1436 Name: odometer_km, dtype: int64
autos['odometer_km'].value_counts().sort_index(ascending=True).head()
5000 967 10000 264 20000 784 30000 789 40000 819 Name: odometer_km, dtype: int64
autos['price']=autos['price'].astype(int)
autos['price'].unique().shape
(2357,)
autos['price'].describe()
count 5.000000e+04 mean 9.840044e+03 std 4.811044e+05 min 0.000000e+00 25% 1.100000e+03 50% 2.950000e+03 75% 7.200000e+03 max 1.000000e+08 Name: price, dtype: float64
autos['price'].value_counts().sort_index(ascending=True).head(50)
0 1421 1 156 2 3 3 1 5 2 8 1 9 1 10 7 11 2 12 3 13 2 14 1 15 2 17 3 18 1 20 4 25 5 29 1 30 7 35 1 40 6 45 4 47 1 49 4 50 49 55 2 59 1 60 9 65 5 66 1 70 10 75 5 79 1 80 15 89 1 90 5 99 19 100 134 110 3 111 2 115 2 117 1 120 39 122 1 125 8 129 1 130 15 135 1 139 1 140 9 Name: price, dtype: int64
autos['price'].value_counts().sort_index(ascending=False).head(50)
99999999 1 27322222 1 12345678 3 11111111 2 10000000 1 3890000 1 1300000 1 1234566 1 999999 2 999990 1 350000 1 345000 1 299000 1 295000 1 265000 1 259000 1 250000 1 220000 1 198000 1 197000 1 194000 1 190000 1 180000 1 175000 1 169999 1 169000 1 163991 1 163500 1 155000 1 151990 1 145000 1 139997 1 137999 1 135000 1 130000 1 129000 1 128000 1 120000 2 119900 1 119500 1 116000 1 115991 1 115000 1 114400 1 109999 1 105000 2 104900 1 99900 2 99000 2 98500 1 Name: price, dtype: int64
Ok, So far So good. We have realized that there are some inconsistancies in the prices which we need to get rid of those as they are not correctly reflected into our data.
I am going to remove anything above 350000 Dollars and anything below 100$
autos=autos[autos['price'].between(100,350000)]
autos['price'].value_counts().sort_index().head(20)
0 1421 1 156 2 3 3 1 5 2 8 1 9 1 10 7 11 2 12 3 13 2 14 1 15 2 17 3 18 1 20 4 25 5 29 1 30 7 35 1 Name: price, dtype: int64
autos['price'].value_counts().sort_index(ascending=False)
99999999 1 27322222 1 12345678 3 11111111 2 10000000 1 ... 5 2 3 1 2 3 1 156 0 1421 Name: price, Length: 2357, dtype: int64
I have successfully removed the outliers anything above or below the considered values
autos.head(100)
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | no_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | 5000 | control | bus | 2004 | manuell | 158 | andere | 150000 | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | 8500 | control | limousine | 1997 | automatik | 286 | 7er | 150000 | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | 8990 | test | limousine | 2009 | manuell | 102 | golf | 70000 | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | 4350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70000 | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | 1350 | test | kombi | 2003 | manuell | 0 | focus | 150000 | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
95 | 2016-03-30 15:59:17 | Jaguar_XJ_5.0_V8_Portfolio | privat | Angebot | 36000 | test | limousine | 2012 | automatik | 385 | andere | 40000 | 8 | benzin | jaguar | nein | 2016-03-30 00:00:00 | 0 | 21720 | 2016-04-07 06:44:54 |
96 | 2016-03-07 20:56:44 | Volkswagen_Golf_Variant_1.9_TDI_Joker | privat | Angebot | 1000 | test | kombi | 1997 | manuell | 110 | golf | 150000 | 6 | diesel | volkswagen | nein | 2016-03-07 00:00:00 | 0 | 96106 | 2016-04-05 23:45:36 |
97 | 2016-04-01 16:54:40 | Treuer_Begleiter_abzugeben | privat | Angebot | 800 | control | bus | 2000 | manuell | 63 | andere | 125000 | 8 | benzin | mazda | ja | 2016-04-01 00:00:00 | 0 | 87648 | 2016-04-05 12:17:40 |
98 | 2016-04-04 15:47:36 | Audi_A4_Cabriolet_2.0_TFSI_S_Line | privat | Angebot | 13500 | test | cabrio | 2008 | automatik | 200 | a4 | 125000 | 5 | benzin | audi | NaN | 2016-04-04 00:00:00 | 0 | 66459 | 2016-04-06 17:17:56 |
99 | 2016-04-05 09:48:54 | Peugeot_207_CC___Cabrio_Bj_2011 | privat | Angebot | 0 | control | cabrio | 2011 | manuell | 0 | 2_reihe | 60000 | 7 | diesel | peugeot | nein | 2016-04-05 00:00:00 | 0 | 99735 | 2016-04-07 12:17:34 |
100 rows × 20 columns
autos['abtest'].value_counts().sort_index().head()
control 24244 test 25756 Name: abtest, dtype: int64
autos['abtest'].unique()
array(['control', 'test'], dtype=object)
autos['abtest'].describe()
count 50000 unique 2 top test freq 25756 Name: abtest, dtype: object
autos['date_crawled'].head()
0 2016-03-26 17:47:46 1 2016-04-04 13:38:56 2 2016-03-26 18:57:24 3 2016-03-12 16:58:10 4 2016-04-01 14:38:50 Name: date_crawled, dtype: object
autos['last_seen'].head()
0 2016-04-06 06:45:54 1 2016-04-06 14:45:08 2 2016-04-06 20:15:37 3 2016-03-15 03:16:28 4 2016-04-01 14:38:50 Name: last_seen, dtype: object
autos['ad_created'].head()
0 2016-03-26 00:00:00 1 2016-04-04 00:00:00 2 2016-03-26 00:00:00 3 2016-03-12 00:00:00 4 2016-04-01 00:00:00 Name: ad_created, dtype: object
autos['registration_month'].head()
0 3 1 6 2 7 3 6 4 7 Name: registration_month, dtype: int64
autos['registration_year'].head()
0 2004 1 1997 2 2009 3 2007 4 2003 Name: registration_year, dtype: int64
autos['registration_month'].describe()
count 50000.000000 mean 5.723360 std 3.711984 min 0.000000 25% 3.000000 50% 6.000000 75% 9.000000 max 12.000000 Name: registration_month, dtype: float64
autos['registration_month'].value_counts().sort_index()
0 5075 1 3282 2 3008 3 5071 4 4102 5 4107 6 4368 7 3949 8 3191 9 3389 10 3651 11 3360 12 3447 Name: registration_month, dtype: int64
crawled=autos['date_crawled'].str[:10]
crawled.value_counts(normalize=True,dropna=False).sort_index()
2016-03-05 0.02538 2016-03-06 0.01394 2016-03-07 0.03596 2016-03-08 0.03330 2016-03-09 0.03322 2016-03-10 0.03212 2016-03-11 0.03248 2016-03-12 0.03678 2016-03-13 0.01556 2016-03-14 0.03662 2016-03-15 0.03398 2016-03-16 0.02950 2016-03-17 0.03152 2016-03-18 0.01306 2016-03-19 0.03490 2016-03-20 0.03782 2016-03-21 0.03752 2016-03-22 0.03294 2016-03-23 0.03238 2016-03-24 0.02910 2016-03-25 0.03174 2016-03-26 0.03248 2016-03-27 0.03104 2016-03-28 0.03484 2016-03-29 0.03418 2016-03-30 0.03362 2016-03-31 0.03192 2016-04-01 0.03380 2016-04-02 0.03540 2016-04-03 0.03868 2016-04-04 0.03652 2016-04-05 0.01310 2016-04-06 0.00318 2016-04-07 0.00142 Name: date_crawled, dtype: float64
Here I have used str method on pandas series 'date_crawled' to focus on the date and day of the crawled data.
Then, I have used value_counts with normalized argument to change the number of occurence of specific date to Percentage. and then I used sort_index function to sort them out based on ascending order.
I am going to do the same for 'ad_created' and 'last_seen' columns since they are both recognized as Oject(string) in Pandas.
autos['ad_created'].str[:10].value_counts(normalize=True, dropna=False).sort_index()
2015-06-11 0.00002 2015-08-10 0.00002 2015-09-09 0.00002 2015-11-10 0.00002 2015-12-05 0.00002 ... 2016-04-03 0.03892 2016-04-04 0.03688 2016-04-05 0.01184 2016-04-06 0.00326 2016-04-07 0.00128 Name: ad_created, Length: 76, dtype: float64
autos['last_seen'].str[:10].value_counts(normalize=True, dropna=False).sort_index()
2016-03-05 0.00108 2016-03-06 0.00442 2016-03-07 0.00536 2016-03-08 0.00760 2016-03-09 0.00986 2016-03-10 0.01076 2016-03-11 0.01252 2016-03-12 0.02382 2016-03-13 0.00898 2016-03-14 0.01280 2016-03-15 0.01588 2016-03-16 0.01644 2016-03-17 0.02792 2016-03-18 0.00742 2016-03-19 0.01574 2016-03-20 0.02070 2016-03-21 0.02074 2016-03-22 0.02158 2016-03-23 0.01858 2016-03-24 0.01956 2016-03-25 0.01920 2016-03-26 0.01696 2016-03-27 0.01602 2016-03-28 0.02086 2016-03-29 0.02234 2016-03-30 0.02484 2016-03-31 0.02384 2016-04-01 0.02310 2016-04-02 0.02490 2016-04-03 0.02536 2016-04-04 0.02462 2016-04-05 0.12428 2016-04-06 0.22100 2016-04-07 0.13092 Name: last_seen, dtype: float64
autos['last_seen'].describe()
count 50000 unique 39481 top 2016-04-07 06:17:27 freq 8 Name: last_seen, dtype: object
autos['registration_year'].describe()
count 50000.000000 mean 2005.073280 std 105.712813 min 1000.000000 25% 1999.000000 50% 2003.000000 75% 2008.000000 max 9999.000000 Name: registration_year, dtype: float64
As you can see from the result above there are Two particular number which do not make any sense. min value is =1000 max value is=9999
I am going to limit the domain and consider values between 1900 up to 2016
autos=autos[autos['registration_year'].between(1960,2017)]
autos['registration_year'].value_counts().sort_index()
1960 34 1961 6 1962 4 1963 9 1964 12 1965 17 1966 22 1967 27 1968 26 1969 19 1970 45 1971 27 1972 35 1973 26 1974 24 1975 19 1976 27 1977 22 1978 47 1979 35 1980 97 1981 31 1982 43 1983 53 1984 53 1985 105 1986 76 1987 75 1988 142 1989 181 1990 395 1991 356 1992 391 1993 445 1994 660 1995 1313 1996 1444 1997 2028 1998 2453 1999 3000 2000 3354 2001 2703 2002 2533 2003 2727 2004 2737 2005 3015 2006 2708 2007 2304 2008 2231 2009 2098 2010 1597 2011 1634 2012 1323 2013 806 2014 666 2015 399 2016 1316 2017 1453 Name: registration_year, dtype: int64
autos['brand'].unique()
array(['peugeot', 'bmw', 'volkswagen', 'smart', 'ford', 'chrysler', 'seat', 'renault', 'mercedes_benz', 'audi', 'sonstige_autos', 'opel', 'mazda', 'porsche', 'mini', 'toyota', 'dacia', 'nissan', 'jeep', 'saab', 'volvo', 'mitsubishi', 'jaguar', 'fiat', 'skoda', 'subaru', 'kia', 'citroen', 'chevrolet', 'hyundai', 'honda', 'daewoo', 'suzuki', 'trabant', 'land_rover', 'alfa_romeo', 'lada', 'rover', 'daihatsu', 'lancia'], dtype=object)
brands=autos['brand'].value_counts()
brands
volkswagen 10564 opel 5387 bmw 5382 mercedes_benz 4682 audi 4254 ford 3445 renault 2367 peugeot 1449 fiat 1288 seat 914 skoda 781 mazda 751 nissan 749 smart 695 citroen 692 toyota 615 sonstige_autos 513 hyundai 485 volvo 453 mini 424 mitsubishi 399 honda 391 kia 353 alfa_romeo 326 porsche 292 suzuki 289 chevrolet 277 chrysler 181 dacia 128 daihatsu 126 jeep 109 subaru 107 land_rover 98 saab 78 daewoo 78 trabant 76 jaguar 76 rover 69 lancia 56 lada 29 Name: brand, dtype: int64
brands.index[0:20]
Index(['volkswagen', 'opel', 'bmw', 'mercedes_benz', 'audi', 'ford', 'renault', 'peugeot', 'fiat', 'seat', 'skoda', 'mazda', 'nissan', 'smart', 'citroen', 'toyota', 'sonstige_autos', 'hyundai', 'volvo', 'mini'], dtype='object')
I have decided to create aggregate data of top 20 brands and price average and put it in the dictionary variable.
my_dict={}
for i in brands.index[0:6]:
a=autos[autos['brand']==i]
b=a['price'].mean()
my_dict[i]=int(b)
print(my_dict)
{'volkswagen': 6413, 'opel': 5144, 'bmw': 8278, 'mercedes_benz': 29679, 'audi': 8989, 'ford': 7111}
Audi, BMW and Mercedes Benz are more expensive Ford and Opel are less expensive Volkswagen is in between
avg_kilometers={}
for i in brands.index[0:6]:
a=autos[autos['brand']==i]
b=a['odometer_km'].mean()
avg_kilometers[i]=int(b)
print(avg_kilometers)
{'volkswagen': 128977, 'opel': 129356, 'bmw': 132544, 'mercedes_benz': 131168, 'audi': 129561, 'ford': 124184}
Here we calculated the average milage based on top 6 brands
autos.columns
Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest', 'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model', 'odometer_km', 'registration_month', 'fuel_type', 'brand', 'unrepaired_damage', 'ad_created', 'no_of_pictures', 'postal_code', 'last_seen'], dtype='object')
Next I am going to converts those dictionaries created to Pandas. Series
avg_price_brand=pd.Series(my_dict)
avg_milage_brand=pd.Series(avg_kilometers)
No I am going to converts these series into DataFram
Avg_Price_Mile=pd.DataFrame(avg_price_brand)
Avg_Price_Mile['avg_milage_brand']=avg_milage_brand
Avg_Price_Mile=Avg_Price_Mile.rename(columns={0:'avg_price_brand'})
Avg_Price_Mile.sort_index(ascending=False)
avg_price_brand | avg_milage_brand | |
---|---|---|
volkswagen | 6413 | 128977 |
opel | 5144 | 129356 |
mercedes_benz | 29679 | 131168 |
ford | 7111 | 124184 |
bmw | 8278 | 132544 |
audi | 8989 | 129561 |
The table above shows the Average Price/Milage The minimum average mileage belong to Fold and maximum mileage belong to BMW
The minimum average price belong to Opel and maximum average price accross all top 6 brands is belong to Mercedes_Benz