This dataset contains information about used cars. This dataset came from a section of German eBay (eBay Kleinanzeigen). The original is here. We will work with a modified version with 50 000 data points and 20 columns, scrawled in 2016 by one Kaggle user.
The aim of this project is to clean the data and analyze the included used car listings.
Let us start to import the library that we probably will need:
import pandas as pd
import numpy as np
autos=pd.read_csv("autos.csv")
autos
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_T�V_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
49995 | 2016-03-27 14:38:19 | Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon | privat | Angebot | $24,900 | control | limousine | 2011 | automatik | 239 | q5 | 100,000km | 1 | diesel | audi | nein | 2016-03-27 00:00:00 | 0 | 82131 | 2016-04-01 13:47:40 |
49996 | 2016-03-28 10:50:25 | Opel_Astra_F_Cabrio_Bertone_Edition___T�V_neu+... | privat | Angebot | $1,980 | control | cabrio | 1996 | manuell | 75 | astra | 150,000km | 5 | benzin | opel | nein | 2016-03-28 00:00:00 | 0 | 44807 | 2016-04-02 14:18:02 |
49997 | 2016-04-02 14:44:48 | Fiat_500_C_1.2_Dualogic_Lounge | privat | Angebot | $13,200 | test | cabrio | 2014 | automatik | 69 | 500 | 5,000km | 11 | benzin | fiat | nein | 2016-04-02 00:00:00 | 0 | 73430 | 2016-04-04 11:47:27 |
49998 | 2016-03-08 19:25:42 | Audi_A3_2.0_TDI_Sportback_Ambition | privat | Angebot | $22,900 | control | kombi | 2013 | manuell | 150 | a3 | 40,000km | 11 | diesel | audi | nein | 2016-03-08 00:00:00 | 0 | 35683 | 2016-04-05 16:45:07 |
49999 | 2016-03-14 00:42:12 | Opel_Vectra_1.6_16V | privat | Angebot | $1,250 | control | limousine | 1996 | manuell | 101 | vectra | 150,000km | 1 | benzin | opel | nein | 2016-03-13 00:00:00 | 0 | 45897 | 2016-04-06 21:18:48 |
50000 rows × 20 columns
autos.head()
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_T�V_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
autos.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 dateCrawled 50000 non-null object 1 name 50000 non-null object 2 seller 50000 non-null object 3 offerType 50000 non-null object 4 price 50000 non-null object 5 abtest 50000 non-null object 6 vehicleType 44905 non-null object 7 yearOfRegistration 50000 non-null int64 8 gearbox 47320 non-null object 9 powerPS 50000 non-null int64 10 model 47242 non-null object 11 odometer 50000 non-null object 12 monthOfRegistration 50000 non-null int64 13 fuelType 45518 non-null object 14 brand 50000 non-null object 15 notRepairedDamage 40171 non-null object 16 dateCreated 50000 non-null object 17 nrOfPictures 50000 non-null int64 18 postalCode 50000 non-null int64 19 lastSeen 50000 non-null object dtypes: int64(5), object(15) memory usage: 7.6+ MB
Checking the above information, we can take the following notes:
we have 20 columns and 50 000 rows.
columns labels:
columns data type vs information:
int64
, the remaining 16 columns has data type object
.price
and odometer
even contains numeric information are object because contains "$" and "km". We can transformed them in numeric type, removing this special characters (please check following point 2.3)dateCrawled
, dateCreated
and lastSee
, which type is object. Probably it is better format those information as date time. (please check following point 2.4)columns with null values:
vehicleType
, gearbox
, model
, fuelType
, notRepairedDamage
. We will need to inspect and decide what we can do, if it can be revised, or if we will need to delete those rows.text information is in German. We may need to translate it to English (using.map
, to understand easily the meaning).
column name
contains a lot information, which we can split and insert in different new columns. For example before first underscore we have the brand name - as we have already one column with this information, we don't need to duplicate this information. Between next two underscore we have the car model, however we also have one column with car model. Anyway, if we identify any interesting info, we can isolate it in one new column, and do some specific analyse
#list of columns names:
autos.columns
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model', 'odometer', 'monthOfRegistration', 'fuelType', 'brand', 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object')
#reanme following columns:
autos.rename({
"dateCrawled":"date_crawled",
"offerType":"offer_type",
"vehicleType":"vehicle_type",
"yearOfRegistration":"registration_year",
"monthOfRegistration":"registration_month",
"fuelType":"fuel_type",
"notRepairedDamage":"unrepaired_damage",
"dateCreated":"ad_created",
"nrOfPictures":"pic_num",
"postalCode":"postal_code",
"lastSeen":"last_seen"},
axis=1, inplace=True)
autos.columns
Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest', 'vehicle_type', 'registration_year', 'gearbox', 'powerPS', 'model', 'odometer', 'registration_month', 'fuel_type', 'brand', 'unrepaired_damage', 'ad_created', 'pic_num', 'postal_code', 'last_seen'], dtype='object')
autos.head()
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | powerPS | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | pic_num | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_T�V_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
price
¶As we detected before, we can conver price
and odemeter
to numeric type, just removing "$" and "km", respectively.
autos["price"]=autos["price"].str.replace("$", "")
autos["price"]=autos["price"].str.replace(",", "").astype(float)
<ipython-input-9-48ac3c5553ea>:1: FutureWarning: The default value of regex will change from True to False in a future version. In addition, single character regular expressions will*not* be treated as literal strings when regex=True. autos["price"]=autos["price"].str.replace("$", "")
#print(autos["price"])
print(autos["price"].dtype)
float64
#autos["price"]="{:,}".format(autos["price"])
#print(autos["price"])
autos.rename({"price":"price_dollar"}, axis=1, inplace=True)
print(autos["price_dollar"])
0 5000.0 1 8500.0 2 8990.0 3 4350.0 4 1350.0 ... 49995 24900.0 49996 1980.0 49997 13200.0 49998 22900.0 49999 1250.0 Name: price_dollar, Length: 50000, dtype: float64
print(autos['price_dollar'].describe())
count 5.000000e+04 mean 9.840044e+03 std 4.811044e+05 min 0.000000e+00 25% 1.100000e+03 50% 2.950000e+03 75% 7.200000e+03 max 1.000000e+08 Name: price_dollar, dtype: float64
When we tried to inspect the price column with series.descibe() we found the result came in scientific notation. To solve this question we will use format the float numbers with comma on thousand position and with 3 decimal digits:
pd.options.display.float_format = '{:,.3f}'.format
#rechecking price column:
print(autos['price_dollar'].describe())
count 50,000.000 mean 9,840.044 std 481,104.381 min 0.000 25% 1,100.000 50% 2,950.000 75% 7,200.000 max 99,999,999.000 Name: price_dollar, dtype: float64
odometer
¶autos["odometer"]=autos["odometer"].str.replace("km", "")
autos["odometer"]=autos["odometer"].str.replace(",", "").astype(float)
#print(autos["odometer"])
print(autos["odometer"].dtype)
float64
autos.rename({"odometer":"odometer_km"}, axis=1, inplace=True)
print(autos["odometer_km"])
0 150,000.000 1 150,000.000 2 70,000.000 3 70,000.000 4 150,000.000 ... 49995 100,000.000 49996 150,000.000 49997 5,000.000 49998 40,000.000 49999 150,000.000 Name: odometer_km, Length: 50000, dtype: float64
import datetime
autos[['date_crawled','ad_created','last_seen','registration_year','registration_month']][0:5]
date_crawled | ad_created | last_seen | registration_year | registration_month | |
---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | 2016-03-26 00:00:00 | 2016-04-06 06:45:54 | 2004 | 3 |
1 | 2016-04-04 13:38:56 | 2016-04-04 00:00:00 | 2016-04-06 14:45:08 | 1997 | 6 |
2 | 2016-03-26 18:57:24 | 2016-03-26 00:00:00 | 2016-04-06 20:15:37 | 2009 | 7 |
3 | 2016-03-12 16:58:10 | 2016-03-12 00:00:00 | 2016-03-15 03:16:28 | 2007 | 6 |
4 | 2016-04-01 14:38:50 | 2016-04-01 00:00:00 | 2016-04-01 14:38:50 | 2003 | 7 |
#(autos['date_crawled'].str[:10]).value_counts(dropna=False).sort_index()
autos['date_crawled']=pd.to_datetime(autos['date_crawled'], format='%Y-%m-%d %H:%M:%S')
#(autos['ad_created'].str[:10]).value_counts(dropna=False).sort_index()
autos['ad_created']=pd.to_datetime(autos['ad_created'], format='%Y-%m-%d %H:%M:%S')
#(autos['last_seen'].str[:10]).value_counts(dropna=False).sort_index()
autos['last_seen']=pd.to_datetime(autos['last_seen'], format='%Y-%m-%d %H:%M:%S')
autos['registration_year']=pd.to_datetime(autos['registration_year'], format='%Y',errors='ignore')
#we cannot converter to datetime format while we have wrong information on this column
#as we have some rows without month information, we need to replace the zero by nan:
#we can do it with the following:
#autos.loc[autos['registration_month']==0, "registration_month"]=np.nan
#or using error"coerce"
autos['registration_month']=pd.to_datetime(autos['registration_month'], format='%m',errors='coerce').dt.month
#we will not leave this column in datetime format because we want only see the month information, so it will be int84
autos.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 date_crawled 50000 non-null datetime64[ns] 1 name 50000 non-null object 2 seller 50000 non-null object 3 offer_type 50000 non-null object 4 price_dollar 50000 non-null float64 5 abtest 50000 non-null object 6 vehicle_type 44905 non-null object 7 registration_year 50000 non-null int64 8 gearbox 47320 non-null object 9 powerPS 50000 non-null int64 10 model 47242 non-null object 11 odometer_km 50000 non-null float64 12 registration_month 44925 non-null float64 13 fuel_type 45518 non-null object 14 brand 50000 non-null object 15 unrepaired_damage 40171 non-null object 16 ad_created 50000 non-null datetime64[ns] 17 pic_num 50000 non-null int64 18 postal_code 50000 non-null int64 19 last_seen 50000 non-null datetime64[ns] dtypes: datetime64[ns](3), float64(3), int64(4), object(10) memory usage: 7.6+ MB
Checking statistics of datetime type:
autos.describe(include="datetime64", datetime_is_numeric=True)
date_crawled | ad_created | last_seen | |
---|---|---|---|
count | 50000 | 50000 | 50000 |
mean | 2016-03-21 13:43:25.697280 | 2016-03-20 19:41:34.656000 | 2016-03-30 04:04:36.309079808 |
min | 2016-03-05 14:06:30 | 2015-06-11 00:00:00 | 2016-03-05 14:45:46 |
25% | 2016-03-13 14:51:25.750000128 | 2016-03-13 00:00:00 | 2016-03-23 11:11:33.500000 |
50% | 2016-03-21 17:53:56 | 2016-03-21 00:00:00 | 2016-04-04 01:17:24 |
75% | 2016-03-29 14:36:47.750000128 | 2016-03-29 00:00:00 | 2016-04-06 10:45:28.249999872 |
max | 2016-04-07 14:36:56 | 2016-04-07 00:00:00 | 2016-04-07 14:58:50 |
Checking above information, we don't see anything strange. So, we will not remove anything fromt these columns
autos.describe()
price_dollar | registration_year | powerPS | odometer_km | registration_month | pic_num | postal_code | |
---|---|---|---|---|---|---|---|
count | 50,000.000 | 50,000.000 | 50,000.000 | 50,000.000 | 44,925.000 | 50,000.000 | 50,000.000 |
mean | 9,840.044 | 2,005.073 | 116.356 | 125,732.700 | 6.370 | 0.000 | 50,813.627 |
std | 481,104.381 | 105.713 | 209.217 | 40,042.212 | 3.349 | 0.000 | 25,779.748 |
min | 0.000 | 1,000.000 | 0.000 | 5,000.000 | 1.000 | 0.000 | 1,067.000 |
25% | 1,100.000 | 1,999.000 | 70.000 | 125,000.000 | 3.000 | 0.000 | 30,451.000 |
50% | 2,950.000 | 2,003.000 | 105.000 | 150,000.000 | 6.000 | 0.000 | 49,577.000 |
75% | 7,200.000 | 2,008.000 | 150.000 | 150,000.000 | 9.000 | 0.000 | 71,540.000 |
max | 99,999,999.000 | 9,999.000 | 17,700.000 | 150,000.000 | 12.000 | 0.000 | 99,998.000 |
Using the function dataframe.describe
we can see some relevant information, which we will need to solve.
price_dollar
, we can see that the minimum price is 0 (zero), which should be a mistake.registration_year
, we can see that the minimum year is 1000 and maximum ist 9999, which are obviously incorrect.registration_month
, we can see that the minimum month is 0 (zero), which is not correct.postal_code
information, we can see that the minimum postal code has only 4 digits, and other statistics has 5 digits, which could mean some mistake.autos["price_dollar"].value_counts().sort_index()
0.000 1421 1.000 156 2.000 3 3.000 1 5.000 2 ... 10,000,000.000 1 11,111,111.000 2 12,345,678.000 3 27,322,222.000 1 99,999,999.000 1 Name: price_dollar, Length: 2357, dtype: int64
autos["price_dollar"].value_counts().sort_index().head(25)
0.000 1421 1.000 156 2.000 3 3.000 1 5.000 2 8.000 1 9.000 1 10.000 7 11.000 2 12.000 3 13.000 2 14.000 1 15.000 2 17.000 3 18.000 1 20.000 4 25.000 5 29.000 1 30.000 7 35.000 1 40.000 6 45.000 4 47.000 1 49.000 4 50.000 49 Name: price_dollar, dtype: int64
autos["price_dollar"].value_counts().sort_index(ascending=False).head(12)
99,999,999.000 1 27,322,222.000 1 12,345,678.000 3 11,111,111.000 2 10,000,000.000 1 3,890,000.000 1 1,300,000.000 1 1,234,566.000 1 999,999.000 2 999,990.000 1 350,000.000 1 345,000.000 1 Name: price_dollar, dtype: int64
autos.loc[autos["price_dollar"]<100,("price_dollar","registration_year")]
price_dollar | registration_year | |
---|---|---|
25 | 90.000 | 1996 |
27 | 0.000 | 2005 |
30 | 80.000 | 2002 |
55 | 1.000 | 2017 |
64 | 40.000 | 2011 |
... | ... | ... |
49884 | 0.000 | 1999 |
49943 | 0.000 | 2016 |
49960 | 0.000 | 1999 |
49974 | 0.000 | 1983 |
49984 | 0.000 | 2000 |
1762 rows × 2 columns
--> We have 1421 rows without price information, some rows with very low price like 1usd, 2usd, 5usd. otherwise we have columns with very high price like 99,999,999usd which should not make sense. We must remove at least the values lower than 50 and higher or equal than 999990
We tried to understand if prices equals to zero have some correlation with registration_year (if cars are too old, but we didn't find any correlation.
To take a decision which values we will remove we will try to check price distribution for intervals.
We we will create following price ranges:
Let's see how this categories behave.
# Creating the empty dictionary
price_dist = {'p_0':0,'p_until_50':0,'p_100_1000':0, 'p_1000_10000':0, 'p10000_350000':0,
'p_more_350000':0}
# Populating it
for row in autos['price_dollar']:
if row==0:
price_dist['p_0'] += 1
elif row <= 50:
price_dist['p_until_50'] += 1
elif 100 < row <= 1000:
price_dist['p_100_1000'] += 1
elif 1000 < row <= 10000:
price_dist['p_1000_10000'] += 1
elif 10000 < row <= 350000:
price_dist['p10000_350000'] += 1
else:
price_dist['p_more_350000'] += 1
print(price_dist)
{'p_0': 1421, 'p_until_50': 267, 'p_100_1000': 10103, 'p_1000_10000': 29928, 'p10000_350000': 8059, 'p_more_350000': 222}
--> Based on this results we will remove pricpointes lower than 50 and higher than 350k dollars. (check point 4)
print("percentage of price lower than 50 is :{:.2f}%".format(((price_dist["p_0"]+price_dist["p_until_50"])/50000)*100))
print("percentage of price higher than 350000 is :{:.2f}%".format(((price_dist["p_more_350000"])/50000)*100))
percentage of price lower than 50 is :3.38% percentage of price higher than 350000 is :0.44%
registration_year
column details¶autos["registration_year"].value_counts(dropna=False).sort_index().head(10)
1000 1 1001 1 1111 1 1500 1 1800 2 1910 9 1927 1 1929 1 1931 1 1934 2 Name: registration_year, dtype: int64
-->We must remove year lower than 1910
autos["registration_year"].value_counts(dropna=False).sort_index(ascending=False).head(20)
9999 4 9996 1 9000 2 8888 1 6200 1 5911 1 5000 4 4800 1 4500 1 4100 1 2800 1 2019 3 2018 492 2017 1453 2016 1316 2015 399 2014 666 2013 806 2012 1323 2011 1634 Name: registration_year, dtype: int64
-->We must remove year higher than 2016 (date of this dataset)
registration_month
column details¶#autos["registration_month"].dt.month.value_counts()
autos["registration_month"].value_counts(dropna=False)
NaN 5075 3.000 5071 6.000 4368 5.000 4107 4.000 4102 7.000 3949 10.000 3651 12.000 3447 9.000 3389 11.000 3360 1.000 3282 8.000 3191 2.000 3008 Name: registration_month, dtype: int64
--> We have 5075 rows, without month infortmation! However it represents 10% of values. We should not just remove it... Let us proceed to try to understand if this will have impact on our study.
postal_code
column details¶autos["postal_code"].value_counts().sort_index().head(10)
1067 11 1069 7 1097 4 1099 10 1108 2 1109 9 1127 3 1129 8 1139 15 1156 4 Name: postal_code, dtype: int64
--> We can conclude that it is usual to have zip code only with 4 digits, so it should not represent one problem. We will not remove anything here
odometer_km
column details¶autos["odometer_km"].value_counts().sort_index()
5,000.000 967 10,000.000 264 20,000.000 784 30,000.000 789 40,000.000 819 50,000.000 1027 60,000.000 1164 70,000.000 1230 80,000.000 1436 90,000.000 1757 100,000.000 2169 125,000.000 5170 150,000.000 32424 Name: odometer_km, dtype: int64
--> Following what we can see, there is nothing strange. so we will not remove anything here.
#autos.describe(include="all")
autos.describe(include="object")
name | seller | offer_type | abtest | vehicle_type | gearbox | model | fuel_type | brand | unrepaired_damage | |
---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 44905 | 47320 | 47242 | 45518 | 50000 | 40171 |
unique | 38754 | 2 | 2 | 2 | 8 | 2 | 245 | 7 | 40 | 2 |
top | Ford_Fiesta | privat | Angebot | test | limousine | manuell | golf | benzin | volkswagen | nein |
freq | 78 | 49999 | 49999 | 25756 | 12859 | 36993 | 4024 | 30107 | 10687 | 35232 |
Using the function dataframe.describe
with include=object
, we can see statistcs for non-numerical columns, and we can see some relevant information, which we will need to solve.
Checking the results of seller
and offer_type
, we can see that almost all values in these columns has the same value (only one seller is not privat and only one offer is not Angebot type).
Checking the results of abtest
,gearbox
and unrepaired_damage
, we can see that the results are split between 2 values, which is correct, if there is not nulls in these columns:
Checking the results of vehicle_type
and fuel_type
, we can see that we have 8 and 7 different types of each. We should inspect this uniques values and confirm if it is correct.
seller
and offer tyoe
columns details¶print(autos["offer_type"].value_counts(dropna=False))
Angebot 49999 Gesuch 1 Name: offer_type, dtype: int64
print(autos["seller"].value_counts(dropna=False))
privat 49999 gewerblich 1 Name: seller, dtype: int64
abtest
, gearbox
and offer tyoe
columns details¶print(autos["abtest"].value_counts(dropna=False))
test 25756 control 24244 Name: abtest, dtype: int64
print(autos["gearbox"].value_counts(dropna=False))
manuell 36993 automatik 10327 NaN 2680 Name: gearbox, dtype: int64
print(autos["unrepaired_damage"].value_counts(dropna=False))
nein 35232 NaN 9829 ja 4939 Name: unrepaired_damage, dtype: int64
--> According to our results, we can see that:
gearbox
and unrepaired_damaged
, so we cannot remove it directly. Let us see if it has impact on our analyze. If don't have we will keep this NaN values here.print(autos["vehicle_type"].value_counts(dropna=False))
limousine 12859 kleinwagen 10822 kombi 9127 NaN 5095 bus 4093 cabrio 3061 coupe 2537 suv 1986 andere 420 Name: vehicle_type, dtype: int64
print(autos["fuel_type"].value_counts(dropna=False))
benzin 30107 diesel 14567 NaN 4482 lpg 691 cng 75 hybrid 37 andere 22 elektro 19 Name: fuel_type, dtype: int64
--> According to our following inspection,except NaN values, other values are regarding real different types of vehicles or fuel.
dic_gearbox={"manuell":"manual", "automatik":"automatic"}
autos["gearbox"]=autos["gearbox"].map(dic_gearbox)
print(autos["gearbox"].value_counts(dropna=False))
manual 36993 automatic 10327 NaN 2680 Name: gearbox, dtype: int64
dic_unrepaired={"nein":"no", "ja":"yes"}
autos["unrepaired_damage"]=autos["unrepaired_damage"].map(dic_unrepaired)
print(autos["unrepaired_damage"].value_counts(dropna=False))
no 35232 NaN 9829 yes 4939 Name: unrepaired_damage, dtype: int64
dic_vehicle_type=({
"limousine":"sedan",
"kleinwagen":"small car",
"kombi":"station wagon",
"NaN":"NaN",
"bus":"bus",
"cabrio":"convertible",
"coupe":"coupe",
"suv":"suv",
"andere":"other"})
autos["vehicle_type"]=autos["vehicle_type"].map(dic_vehicle_type)
print(autos["vehicle_type"].value_counts(dropna=False))
sedan 12859 small car 10822 station wagon 9127 NaN 5095 bus 4093 convertible 3061 coupe 2537 suv 1986 other 420 Name: vehicle_type, dtype: int64
dic_fuel_type=({
"benzin":"petrol",
"diesel":"diesel",
"NaN":"NaN",
"lpg":"lpg",
"cng":"cng",
"hybrid":"hybrid",
"andere":"other",
"elektro":"electro"})
autos["fuel_type"]=autos["fuel_type"].map(dic_fuel_type)
print(autos["fuel_type"].value_counts(dropna=False))
petrol 30107 diesel 14567 NaN 4482 lpg 691 cng 75 hybrid 37 other 22 electro 19 Name: fuel_type, dtype: int64
#autos["model"]=autos["name"].str.split("_").str[1]
#autos["model"]
print("Dataset original shape:")
print(autos.shape)
Dataset original shape: (50000, 20)
#from price_dollar- as we conclude we will remove some rows
#autos_select=autos[autos["price_dollar"]==0].remove()
autos=autos.loc[autos["price_dollar"].between(50,350000)]
print("Shape after delete outliers of price:")
print(autos.shape)
Shape after delete outliers of price: (48347, 20)
#from odometer_km - based on results that we got, we will not remove any rows
#from registration_year - basde on our previous conclusions we will remove some rows:
autos=autos.loc[autos["registration_year"].between(1910,2016)]
print("Shape after delete outliers of registration_year:")
print(autos.shape)
Shape after delete outliers of registration_year: (46473, 20)
#autos_select["price_dollar"].value_counts(normalize=True).sort_index()
autos.describe(include="all", datetime_is_numeric=True)
date_crawled | name | seller | offer_type | price_dollar | abtest | vehicle_type | registration_year | gearbox | powerPS | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | pic_num | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 46473 | 46473 | 46473 | 46473 | 46,473.000 | 46473 | 43872 | 46,473.000 | 44453 | 46,473.000 | 44321 | 46,473.000 | 42,554.000 | 43252 | 46473 | 38311 | 46473 | 46,473.000 | 46,473.000 | 46473 |
unique | NaN | 35613 | 2 | 1 | NaN | 2 | 8 | NaN | 2 | NaN | 244 | NaN | NaN | 7 | 40 | 2 | NaN | NaN | NaN | NaN |
top | NaN | BMW_316i | privat | Angebot | NaN | test | sedan | NaN | manual | NaN | golf | NaN | NaN | petrol | volkswagen | no | NaN | NaN | NaN | NaN |
freq | NaN | 75 | 46472 | 46473 | NaN | 23948 | 12567 | NaN | 34619 | NaN | 3695 | NaN | NaN | 28449 | 9821 | 33794 | NaN | NaN | NaN | NaN |
mean | 2016-03-21 13:21:44.805177344 | NaN | NaN | NaN | 6,004.439 | NaN | NaN | 2,002.931 | NaN | 118.168 | NaN | 125,688.787 | 6.378 | NaN | NaN | NaN | 2016-03-20 19:19:03.141178880 | 0.000 | 51,118.913 | 2016-03-30 05:52:13.654874880 |
min | 2016-03-05 14:06:30 | NaN | NaN | NaN | 50.000 | NaN | NaN | 1,910.000 | NaN | 0.000 | NaN | 5,000.000 | 1.000 | NaN | NaN | NaN | 2015-06-11 00:00:00 | 0.000 | 1,067.000 | 2016-03-05 14:45:46 |
25% | 2016-03-13 13:52:31 | NaN | NaN | NaN | 1,250.000 | NaN | NaN | 1,999.000 | NaN | 75.000 | NaN | 100,000.000 | 3.000 | NaN | NaN | NaN | 2016-03-13 00:00:00 | 0.000 | 30,853.000 | 2016-03-23 13:48:53 |
50% | 2016-03-21 17:46:58 | NaN | NaN | NaN | 3,149.000 | NaN | NaN | 2,003.000 | NaN | 109.000 | NaN | 150,000.000 | 6.000 | NaN | NaN | NaN | 2016-03-21 00:00:00 | 0.000 | 49,832.000 | 2016-04-04 07:47:19 |
75% | 2016-03-29 14:46:30 | NaN | NaN | NaN | 7,500.000 | NaN | NaN | 2,008.000 | NaN | 150.000 | NaN | 150,000.000 | 9.000 | NaN | NaN | NaN | 2016-03-29 00:00:00 | 0.000 | 72,070.000 | 2016-04-06 11:15:36 |
max | 2016-04-07 14:36:56 | NaN | NaN | NaN | 350,000.000 | NaN | NaN | 2,016.000 | NaN | 17,700.000 | NaN | 150,000.000 | 12.000 | NaN | NaN | NaN | 2016-04-07 00:00:00 | 0.000 | 99,998.000 | 2016-04-07 14:58:50 |
std | NaN | NaN | NaN | NaN | 9,189.710 | NaN | NaN | 7.129 | NaN | 185.230 | NaN | 39,687.056 | 3.351 | NaN | NaN | NaN | NaN | 0.000 | 25,751.370 | NaN |
First, we will aggregate the items by car brands
# Checking the unique brand values
print("What are the uniques brandes in our dataset: \n", autos["brand"].unique(),"\n")
print("How many brands do we have in our dataset? \n", autos["brand"].unique().shape[0])
What are the uniques brandes in our dataset: ['peugeot' 'bmw' 'volkswagen' 'smart' 'ford' 'chrysler' 'seat' 'renault' 'mercedes_benz' 'audi' 'sonstige_autos' 'opel' 'mazda' 'porsche' 'mini' 'toyota' 'dacia' 'nissan' 'jeep' 'saab' 'volvo' 'mitsubishi' 'jaguar' 'fiat' 'skoda' 'subaru' 'kia' 'citroen' 'chevrolet' 'hyundai' 'honda' 'daewoo' 'suzuki' 'trabant' 'land_rover' 'alfa_romeo' 'lada' 'rover' 'daihatsu' 'lancia'] How many brands do we have in our dataset? 40
print("TOP 15 of quantity of cars per brand:")
print(autos["brand"].value_counts().head(15))
print("\n")
print("TOP 15 percentage of cars per brand:")
print((((autos["brand"].value_counts())/autos["brand"].shape)*100).head(15))
TOP 15 of quantity of cars per brand: volkswagen 9821 bmw 5110 opel 4987 mercedes_benz 4485 audi 4025 ford 3252 renault 2193 peugeot 1390 fiat 1195 seat 848 skoda 763 nissan 712 mazda 709 smart 659 citroen 653 Name: brand, dtype: int64 TOP 15 percentage of cars per brand: volkswagen 21.133 bmw 10.996 opel 10.731 mercedes_benz 9.651 audi 8.661 ford 6.998 renault 4.719 peugeot 2.991 fiat 2.571 seat 1.825 skoda 1.642 nissan 1.532 mazda 1.526 smart 1.418 citroen 1.405 Name: brand, dtype: float64
Based on this values, we can easily conclude that the brands with more cars in this dataset are volkswagen
,bmw
andopel
<\span>
freq_perc=((autos["brand"].value_counts()/autos["brand"].shape)*100)
print(freq_perc["jaguar"])
0.15708045531814172
#creating empty dictionary to strore our aggregate data
brand_avg_price={}
#populating our dictionary with average price per brand (because we think it can be useful, we will also add the brand freq)
for each in autos["brand"].unique():
avg_price=autos.loc[autos["brand"]==each, "price_dollar"].mean()
brand_avg_price[each]=[avg_price,freq_perc[each]]
print(brand_avg_price)
{'peugeot': [3100.6928057553955, 2.9909840122221505], 'bmw': [8376.805088062622, 10.99563187226992], 'volkswagen': [5424.9185419000105, 21.132700707937943], 'smart': [3591.0212443095597, 1.4180276719815805], 'ford': [3762.1165436654364, 6.997611516364341], 'chrysler': [3465.743902439024, 0.35289307770103073], 'seat': [4423.116745283019, 1.824715426161427], 'renault': [2483.889192886457, 4.718869020721709], 'mercedes_benz': [8663.059308807135, 9.65076496029953], 'audi': [9373.771428571428, 8.66094291308932], 'sonstige_autos': [12642.134228187919, 0.9618488154412239], 'opel': [2996.0850210547424, 10.730962063994147], 'mazda': [4112.596614950635, 1.525617024939212], 'porsche': [46455.89323843416, 0.604652163621888], 'mini': [10639.450980392157, 0.8779291201342715], 'toyota': [5167.091062394604, 1.2760097260775074], 'dacia': [5915.528455284553, 0.264669808275773], 'nissan': [4750.063202247191, 1.5320723861166698], 'jeep': [11650.5, 0.22808942827017836], 'saab': [3211.6493506493507, 0.16568760355475223], 'volvo': [4993.208037825059, 0.9102059260215609], 'mitsubishi': [3430.25, 0.817679082477998], 'jaguar': [11635.493150684932, 0.15708045531814172], 'fiat': [2818.456066945607, 2.5713855356873885], 'skoda': [6393.018348623853, 1.6418135261334537], 'subaru': [4033.7551020408164, 0.21087513179695738], 'kia': [5982.330303030303, 0.7100897295203666], 'citroen': [3784.924961715161, 1.405116949626665], 'chevrolet': [6709.358490566037, 0.5702235706754459], 'hyundai': [5376.740899357602, 1.0048845566242766], 'honda': [4119.109589041096, 0.7854022765907086], 'daewoo': [1064.0579710144928, 0.14847330708153123], 'suzuki': [4111.572463768116, 0.5938932283261249], 'trabant': [1818.609375, 0.1377143717857681], 'land_rover': [19108.091836734693, 0.21087513179695738], 'alfa_romeo': [4100.915857605178, 0.6649022012781616], 'lada': [2688.296296296296, 0.05809825059712091], 'rover': [1602.2903225806451, 0.13341079766746283], 'daihatsu': [1636.1965811965813, 0.2517590859208573], 'lancia': [3444.877551020408, 0.10543756589847869]}
However, it is not easy read the values print in a dictionary format.
Let us see, how will be the dictionary if we use pretty print from json:
import json
print(json.dumps(brand_avg_price, indent=4))
{ "peugeot": [ 3100.6928057553955, 2.9909840122221505 ], "bmw": [ 8376.805088062622, 10.99563187226992 ], "volkswagen": [ 5424.9185419000105, 21.132700707937943 ], "smart": [ 3591.0212443095597, 1.4180276719815805 ], "ford": [ 3762.1165436654364, 6.997611516364341 ], "chrysler": [ 3465.743902439024, 0.35289307770103073 ], "seat": [ 4423.116745283019, 1.824715426161427 ], "renault": [ 2483.889192886457, 4.718869020721709 ], "mercedes_benz": [ 8663.059308807135, 9.65076496029953 ], "audi": [ 9373.771428571428, 8.66094291308932 ], "sonstige_autos": [ 12642.134228187919, 0.9618488154412239 ], "opel": [ 2996.0850210547424, 10.730962063994147 ], "mazda": [ 4112.596614950635, 1.525617024939212 ], "porsche": [ 46455.89323843416, 0.604652163621888 ], "mini": [ 10639.450980392157, 0.8779291201342715 ], "toyota": [ 5167.091062394604, 1.2760097260775074 ], "dacia": [ 5915.528455284553, 0.264669808275773 ], "nissan": [ 4750.063202247191, 1.5320723861166698 ], "jeep": [ 11650.5, 0.22808942827017836 ], "saab": [ 3211.6493506493507, 0.16568760355475223 ], "volvo": [ 4993.208037825059, 0.9102059260215609 ], "mitsubishi": [ 3430.25, 0.817679082477998 ], "jaguar": [ 11635.493150684932, 0.15708045531814172 ], "fiat": [ 2818.456066945607, 2.5713855356873885 ], "skoda": [ 6393.018348623853, 1.6418135261334537 ], "subaru": [ 4033.7551020408164, 0.21087513179695738 ], "kia": [ 5982.330303030303, 0.7100897295203666 ], "citroen": [ 3784.924961715161, 1.405116949626665 ], "chevrolet": [ 6709.358490566037, 0.5702235706754459 ], "hyundai": [ 5376.740899357602, 1.0048845566242766 ], "honda": [ 4119.109589041096, 0.7854022765907086 ], "daewoo": [ 1064.0579710144928, 0.14847330708153123 ], "suzuki": [ 4111.572463768116, 0.5938932283261249 ], "trabant": [ 1818.609375, 0.1377143717857681 ], "land_rover": [ 19108.091836734693, 0.21087513179695738 ], "alfa_romeo": [ 4100.915857605178, 0.6649022012781616 ], "lada": [ 2688.296296296296, 0.05809825059712091 ], "rover": [ 1602.2903225806451, 0.13341079766746283 ], "daihatsu": [ 1636.1965811965813, 0.2517590859208573 ], "lancia": [ 3444.877551020408, 0.10543756589847869 ] }
It is much more readable. However, it is still not easy to compare the values. So, we decide to pass this values to a data frame
#Transform the dictionary in the dataframe:
df_brand_price=pd.DataFrame.from_dict(brand_avg_price,orient="index", columns=["mean_price_dol", "freq_percentage"])
#if we don't include the columns argument in DataFrame constructor, we can put the label in the columns in the following way:
#df_brand_price.rename({0:"mean_price_dol", 1:"freq_percentage"},axis=1, inplace=True)
print("TOP 5 cheaper brands:")
df_brand_price.sort_values(by=["mean_price_dol"]).head(5)
TOP 5 cheaper brands:
mean_price_dol | freq_percentage | |
---|---|---|
daewoo | 1,064.058 | 0.148 |
rover | 1,602.290 | 0.133 |
daihatsu | 1,636.197 | 0.252 |
trabant | 1,818.609 | 0.138 |
renault | 2,483.889 | 4.719 |
print("TOP 5 more expensive brands:")
df_brand_price.sort_values(by=["mean_price_dol"], ascending=False).head(5)
TOP 5 more expensive brands:
mean_price_dol | freq_percentage | |
---|---|---|
porsche | 46,455.893 | 0.605 |
land_rover | 19,108.092 | 0.211 |
sonstige_autos | 12,642.134 | 0.962 |
jeep | 11,650.500 | 0.228 |
jaguar | 11,635.493 | 0.157 |
If we check price by brand, from total cars in this dataset, we can see that the cheaper cars are: daewoo
,rover
,daihatsu
,trabant
and renault
.
And the more expensive brands are porsche
,land_rover
,sonstige_autos
,jeep
and jaguar
.
However if at same time we check, how many cars from this brands we have in our dataset, we can see we have very few cars(almost all have less than 1% in our dataset (the only exception is renault which has 4.7%).
So it could be interesting to check what happens in the brands, for what we have more cars in this dataset:
df_brand_price[freq_perc>5].sort_values(by=["mean_price_dol"], ascending=False)
<ipython-input-64-204e483a8890>:1: UserWarning: Boolean Series key will be reindexed to match DataFrame index. df_brand_price[freq_perc>5].sort_values(by=["mean_price_dol"], ascending=False)
mean_price_dol | freq_percentage | |
---|---|---|
audi | 9,373.771 | 8.661 |
mercedes_benz | 8,663.059 | 9.651 |
bmw | 8,376.805 | 10.996 |
volkswagen | 5,424.919 | 21.133 |
ford | 3,762.117 | 6.998 |
opel | 2,996.085 | 10.731 |
print("How can we read the values in this table?: \n")
print("The {:.2f}% of cars in this dataset are from volkswagen and its price in average is {:.2f}$.".format(df_brand_price.loc['volkswagen','freq_percentage'],df_brand_price.loc['volkswagen','mean_price_dol'] ))
print("The {:.2f}% of cars in this dataset are from opel and its price in average is {:.2f}$.".format(df_brand_price.loc['opel','freq_percentage'],df_brand_price.loc['opel','mean_price_dol'] ))
How can we read the values in this table?: The 21.13% of cars in this dataset are from volkswagen and its price in average is 5424.92$. The 10.73% of cars in this dataset are from opel and its price in average is 2996.09$.
In the above table, we can see the price of most common brands in our dataset, i.e., the price of brands with more than 5% of frequence in the dataset.
From this data selection we can also see that:
audi
, mercedes_benz
and `bmw´ford
and opel
#creating empty dictionary to strore our aggregate data
brand_avg_odom={}
#populating our dictionary with average odemeter per brand
for each in autos["brand"].unique():
avg_odom=autos.loc[autos["brand"]==each, "odometer_km"].mean()
brand_avg_odom[each]=avg_odom
#print(brand_avg_odom)
#converting dictionary in series and dataframe using serie and dataframe constructor:
s_brand_odom=pd.Series(brand_avg_odom)
df_brand_odom=pd.DataFrame(s_brand_odom, columns=["mean_odometer_km"])
#print(df_brand_odom)
print("TOP 5 brands with km:")
df_brand_odom.sort_values(by=["mean_odometer_km"]).head(5)
TOP 5 brands with km:
mean_odometer_km | |
---|---|
trabant | 55,312.500 |
lada | 83,518.519 |
dacia | 84,268.293 |
mini | 88,308.824 |
sonstige_autos | 90,313.199 |
#As we think it can be useful check odometer information comparing with price and brand frequence, we will agregate the new values to our previous dataframe
df_brand_price_odom=df_brand_price.copy()
df_brand_price_odom["mean_odometer_km"]=df_brand_odom
df_brand_price_odom.head(5)
mean_price_dol | freq_percentage | mean_odometer_km | |
---|---|---|---|
peugeot | 3,100.693 | 2.991 | 127,122.302 |
bmw | 8,376.805 | 10.996 | 132,651.663 |
volkswagen | 5,424.919 | 21.133 | 128,771.001 |
smart | 3,591.021 | 1.418 | 99,590.288 |
ford | 3,762.117 | 6.998 | 124,295.818 |
print("TOP 5 brands with less km:")
df_brand_price_odom.sort_values(by=["mean_odometer_km"], ascending=False).head(5)
TOP 5 brands with less km:
mean_price_dol | freq_percentage | mean_odometer_km | |
---|---|---|---|
saab | 3,211.649 | 0.166 | 144,415.584 |
volvo | 4,993.208 | 0.910 | 138,581.560 |
rover | 1,602.290 | 0.133 | 137,661.290 |
bmw | 8,376.805 | 10.996 | 132,651.663 |
chrysler | 3,465.744 | 0.353 | 132,378.049 |
print("Brands with more than 5% of frequence:")
df_brand_price_odom[freq_perc>5].sort_values(by=["freq_percentage"], ascending=False)
Brands with more than 5% of frequence:
<ipython-input-69-0ff77f17394f>:2: UserWarning: Boolean Series key will be reindexed to match DataFrame index. df_brand_price_odom[freq_perc>5].sort_values(by=["freq_percentage"], ascending=False)
mean_price_dol | freq_percentage | mean_odometer_km | |
---|---|---|---|
volkswagen | 5,424.919 | 21.133 | 128,771.001 |
bmw | 8,376.805 | 10.996 | 132,651.663 |
opel | 2,996.085 | 10.731 | 129,440.545 |
mercedes_benz | 8,663.059 | 9.651 | 130,923.077 |
audi | 9,373.771 | 8.661 | 129,254.658 |
ford | 3,762.117 | 6.998 | 124,295.818 |
Checking above values we can conclude that the most common brands in our dataset has in average as more or less same mileage, in average it is more than 124k km per brand.
Let us to see what happen if we split the odometer into groups and use the aggregation to see if average prices follows any pattern based on the mileage
print(autos['odometer_km'].describe())
count 46,473.000 mean 125,688.787 std 39,687.056 min 5,000.000 25% 100,000.000 50% 150,000.000 75% 150,000.000 max 150,000.000 Name: odometer_km, dtype: float64
Our odometer groups will be:
price_by_odometer = {'odom_until_10k':0,'odom_10k_30k':0, 'odom_30k_50k':0, 'odom_50k_75k':0,'odom_75k_100k':0,'odom_100k_140k':0,
'odom_more_140k':0}
# Populating it
for each in autos['odometer_km']:
if each<=10000:
price_by_odometer['odom_until_10k'] += 1
elif 10000 < each <= 30000:
price_by_odometer['odom_10k_30k'] += 1
elif 30000 < each <= 50000:
price_by_odometer['odom_30k_50k'] += 1
elif 30000 < each <= 75000:
price_by_odometer['odom_50k_75k'] += 1
elif 10000 < each <= 100000:
price_by_odometer['odom_75k_100k'] += 1
elif 10000 < each <= 140000:
price_by_odometer['odom_100k_140k'] += 1
else:
price_by_odometer['odom_more_140k'] += 1
price_by_odometer
df_price_by_odometer=pd.DataFrame.from_dict(price_by_odometer, orient="index", columns=["freq_odom_group"])
df_price_by_odometer
freq_odom_group | |
---|---|
odom_until_10k | 970 |
odom_10k_30k | 1499 |
odom_30k_50k | 1789 |
odom_50k_75k | 2311 |
odom_75k_100k | 5094 |
odom_100k_140k | 4846 |
odom_more_140k | 29964 |
df_price_by_odometer["mean_price"]=0
df_price_by_odometer
freq_odom_group | mean_price | |
---|---|---|
odom_until_10k | 970 | 0 |
odom_10k_30k | 1499 | 0 |
odom_30k_50k | 1789 | 0 |
odom_50k_75k | 2311 | 0 |
odom_75k_100k | 5094 | 0 |
odom_100k_140k | 4846 | 0 |
odom_more_140k | 29964 | 0 |
df_price_by_odometer.iloc[0,1]=autos.loc[autos["odometer_km"] <=10000, "price_dollar"].mean()
df_price_by_odometer.iloc[1,1]=autos.loc[autos["odometer_km"] <=30000, "price_dollar"].mean()
df_price_by_odometer.iloc[2,1]=autos.loc[autos["odometer_km"] <=50000, "price_dollar"].mean()
df_price_by_odometer.iloc[3,1]=autos.loc[autos["odometer_km"] <=75000, "price_dollar"].mean()
df_price_by_odometer.iloc[4,1]=autos.loc[autos["odometer_km"] <=100000, "price_dollar"].mean()
df_price_by_odometer.iloc[5,1]=autos.loc[autos["odometer_km"] <=140000, "price_dollar"].mean()
df_price_by_odometer.iloc[6,1]=autos.loc[autos["odometer_km"] >140000, "price_dollar"].mean()
df_price_by_odometer
freq_odom_group | mean_price | |
---|---|---|
odom_until_10k | 970 | 12,286.290 |
odom_10k_30k | 1499 | 15,483.654 |
odom_30k_50k | 1789 | 15,100.466 |
odom_50k_75k | 2311 | 13,889.271 |
odom_75k_100k | 5094 | 11,618.389 |
odom_100k_140k | 4846 | 10,036.145 |
odom_more_140k | 29964 | 3,783.125 |
The results are interesting, as more milleage the car has, lower is its price. The only exception for this rule are the cares with 10k km or less, which has lower price than others. This can be explain by some typing mistake on odometer or on price, or car has other kind of problems.
#creating empty dictionary to strore our aggregate data
brand_model={}
#populating our dictionary with average odemeter per brand
for each in autos["brand"].unique():
brand_filter=autos.loc[autos["brand"]==each,"model"]
model_freq=brand_filter.value_counts().head(3) #for each brand we will have the most common 3 models
model_dic= model_freq.to_dict()
brand_model[each]=model_dic
s_brand_model=pd.Series(brand_model)
df_brand_model=pd.DataFrame(s_brand_model, columns=["model"])
df_brand_model["freq_percentage"]=freq_perc #we will add the frequence of each brand
df_brand_model[freq_perc>5].sort_values(by=["freq_percentage"], ascending=False)
<ipython-input-76-98d7d4861509>:16: UserWarning: Boolean Series key will be reindexed to match DataFrame index. df_brand_model[freq_perc>5].sort_values(by=["freq_percentage"], ascending=False)
model | freq_percentage | |
---|---|---|
volkswagen | {'golf': 3695, 'polo': 1600, 'passat': 1345} | 21.133 |
bmw | {'3er': 2604, '5er': 1123, '1er': 521} | 10.996 |
opel | {'corsa': 1582, 'astra': 1339, 'vectra': 538} | 10.731 |
mercedes_benz | {'c_klasse': 1136, 'e_klasse': 954, 'a_klasse'... | 9.651 |
audi | {'a4': 1226, 'a3': 825, 'a6': 795} | 8.661 |
ford | {'focus': 757, 'fiesta': 722, 'mondeo': 479} | 6.998 |
"The cars with deffect not repaired cost in average {:,.2f}$, while the other cost in average {:,.2f}$ ".format(autos.loc[autos["unrepaired_damage"]=="yes", "price_dollar"].mean(),autos.loc[autos["unrepaired_damage"]=="no", "price_dollar"].mean())
This mean that, as expected the cars with unrepaired damages are in average much more cheaper than the cars that don't have damage to repaire.
We can see that in average cars with unrepaired damages cost less 3 times the price.
"The cars with automatic gearbox cost in average {:,.2f}$, while the manual gearbox cost in average {:,.2f}$ ".format(autos.loc[autos["gearbox"]=="automatic", "price_dollar"].mean(),autos.loc[autos["gearbox"]=="manual", "price_dollar"].mean())
As expected the cars with automatic gearbox are in average much more expensive than the cars with manual gearbox.
The automatic gearbox cost in average more than double price than the manual gearbox cars.
#creating empty dictionary to strore our aggregate data
vehicle_avg_price={}
#populating our dictionary with average price per brand (because we think it can be useful, we will also add the brand freq)
for each in autos["vehicle_type"].unique():
avg_price=autos.loc[autos["vehicle_type"]==each, "price_dollar"].mean()
vehicle_avg_price[each]=float(avg_price)
#print(vehicle_avg_price)
print(json.dumps(vehicle_avg_price, indent=4))
The results we got are according to our expectation:
coupe
and suv
small cars
and convertible
After a brief overview of each column data of our dataset, we identified what could be clean, and what could be improve in order to take the analyse more easier and eficient.
Then in point 4, we made some analyzes, and we found very interesting results.
We could conclude how vehicle details have impact on price.
If we want to buy a car, we must understand if price makes sense attending average price for the same details: