mlcourse.ai – Open Machine Learning Course

Author: Александр Кацалап (ODS Slack nick: Alexkats)

Prediction of real estate prices at Melbourne housing market

In this project I will analyze the Melbourne Housing Market data collected by Tony Pino and posted on Kaggle.

The data was scraped from publicly available results posted every week on the real estate resource Domain.com.au. The dataset includes the address, type of real estate, suburb, method of selling, number of rooms, price, real estate agent, date of sale, and distance from the C.B.D. (the centre of Melbourne).

The purpose of this project is to build a model that predicts the price of a property on the Melbourne housing market based on its characteristics. So, our task is a regression task.

Knowing the actual property price may be useful in the following cases:

  • You are a property seller and want to sell it as soon as possible. You don't want to sell it at a low price and lose money, but you also don't want to spend a long time looking for a buyer because the price is set too high.
  • You are a property buyer and want to get a good house at a fair price without overpaying.
  • You are a real estate agency like Domain.com.au and want to remove advertisements for suspicious objects from your website. For example, if an advertisement's price is very low compared with objects of approximately the same characteristics, it may be fraud. Predicting actual prices helps detect and remove such advertisements so that you don't lose your customers.

Part 1. Dataset and features description

The dataset contains information about property sales in Melbourne from January 2016 to October 2018.

Let's load the dataset and describe the given features:

In [1]:
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
%matplotlib inline
In [2]:
import numpy as np
In [3]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
In [4]:
warnings.simplefilter("ignore")
In [5]:
full_data = pd.read_csv('Melbourne_housing_FULL.csv', parse_dates=['Date'])
full_data.head()
Out[5]:
Suburb Address Rooms Type Price Method SellerG Date Distance Postcode ... Bathroom Car Landsize BuildingArea YearBuilt CouncilArea Lattitude Longtitude Regionname Propertycount
0 Abbotsford 68 Studley St 2 h NaN SS Jellis 2016-03-09 2.5 3067.0 ... 1.0 1.0 126.0 NaN NaN Yarra City Council -37.8014 144.9958 Northern Metropolitan 4019.0
1 Abbotsford 85 Turner St 2 h 1480000.0 S Biggin 2016-03-12 2.5 3067.0 ... 1.0 1.0 202.0 NaN NaN Yarra City Council -37.7996 144.9984 Northern Metropolitan 4019.0
2 Abbotsford 25 Bloomburg St 2 h 1035000.0 S Biggin 2016-04-02 2.5 3067.0 ... 1.0 0.0 156.0 79.0 1900.0 Yarra City Council -37.8079 144.9934 Northern Metropolitan 4019.0
3 Abbotsford 18/659 Victoria St 3 u NaN VB Rounds 2016-04-02 2.5 3067.0 ... 2.0 1.0 0.0 NaN NaN Yarra City Council -37.8114 145.0116 Northern Metropolitan 4019.0
4 Abbotsford 5 Charles St 3 h 1465000.0 SP Biggin 2017-04-03 2.5 3067.0 ... 2.0 0.0 134.0 150.0 1900.0 Yarra City Council -37.8093 144.9944 Northern Metropolitan 4019.0

5 rows × 21 columns

In [6]:
full_data = full_data[full_data['Date'] <= '2018-04-01']

Above, we also restricted the data to sales dated no later than 2018-04-01. Every object in the dataset (every row) is a property sale with its own characteristics and additional information such as the seller, the sale method, and the sale date.

Let's get info about types of columns and skipped values in dataset:

In [7]:
full_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 33561 entries, 0 to 34856
Data columns (total 21 columns):
Suburb           33561 non-null object
Address          33561 non-null object
Rooms            33561 non-null int64
Type             33561 non-null object
Price            26218 non-null float64
Method           33561 non-null object
SellerG          33561 non-null object
Date             33561 non-null datetime64[ns]
Distance         33561 non-null float64
Postcode         33561 non-null float64
Bedroom2         25752 non-null float64
Bathroom         25743 non-null float64
Car              25286 non-null float64
Landsize         22461 non-null float64
BuildingArea     13316 non-null float64
YearBuilt        15079 non-null float64
CouncilArea      33559 non-null object
Lattitude        25986 non-null float64
Longtitude       25986 non-null float64
Regionname       33559 non-null object
Propertycount    33559 non-null float64
dtypes: datetime64[ns](1), float64(12), int64(1), object(7)
memory usage: 5.6+ MB

As we see, the data contains 21 columns, many of which have missing values.

1.1 Features description

Let's give a more detailed description of the columns' meanings:

Suburb: Suburb name in Melbourne

Address: Address

Rooms: Number of rooms

Bedroom2: Number of bedrooms (scraped from a different source)

Price: Price in Australian dollars. This is the target value in our task

Method - type of sales method:

  • S - property sold;
  • SP - property sold prior;
  • PI - property passed in;
  • PN - sold prior not disclosed;
  • SN - sold not disclosed;
  • VB - vendor bid;
  • W - withdrawn prior to auction;
  • SA - sold after auction;
  • SS - sold after auction price not disclosed.

Type - type of property:

  • h - house, cottage, villa, semi, terrace;
  • u - unit, duplex;
  • t - townhouse;

SellerG: Real Estate Agent

Date: Date sold

Distance: Distance from the C.B.D. (Melbourne centre) in kilometres

Regionname: General Region (West, North West, North, North east ...etc.)

Propertycount: Number of properties that exist in the suburb.

Bathroom: Number of Bathrooms

Car: Number of carspots

Landsize: Land size in square metres

BuildingArea: Building size in square metres

YearBuilt: Year the house was built

CouncilArea: Governing council for the area

Lattitude: Latitude of the property (the column name is misspelled in the source data)

Longtitude: Longitude of the property (likewise misspelled in the source data)


From this description it's possible to classify most of these features into categorical and numeric (continuous):

Categorical features:

  • Suburb, Type, Method, SellerG, Postcode, CouncilArea, Regionname

Numeric features:

  • Rooms, Date, Distance, Bedroom2, Bathroom, Car, Landsize, BuildingArea, YearBuilt, Lattitude, Longtitude, Propertycount

Target value:

  • Price

There are also some complex features that can't be definitively classified by type and must be transformed before being used in modeling (see the sketch below):

  • Address, Postcode
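For example, Address mixes a house number (sometimes a unit/number pair like 18/659) with a street name, while Postcode is stored as a number but is really a categorical code. A minimal sketch of decomposing an address (we do this properly in Part 2):

address = '18/659 Victoria St'
house_number, street = address.split(' ', 1)  # split off the first token only
print(house_number, street)                   # -> 18/659 Victoria St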

Part 2. Exploratory data analysis

As we can see, there are missing values in our target. The reason is that the Method of selling these properties was PN or SS, and these methods don't imply disclosing the sale price. We will have to remove such objects from the dataset when we build our model, but until then we keep them for calculating statistics and for imputing missing values.

For now, we keep a working copy of the full data together with the raw target; the actual separation into objects with and without a known target will be done after preprocessing:

In [8]:
X_full = full_data.copy()
y_full = full_data['Price']

2.1 Feature interactions and their influence on the target

Let's reason about how each feature may influence the target variable, Price.

  1. The features Rooms, Bedroom2, Bathroom, Car, Landsize, and BuildingArea should grow together with the house price, so we expect them to have a high correlation with the target (a quick check is sketched after this list).

  2. The sale Date will probably have a seasonal influence on price: for example, a low season in summer and a high season in winter.

  3. Distance from the CBD may have a complex, non-linear influence on price. On the one hand, prices should be highest in the centre and decrease towards the outskirts. On the other hand, the centre of a big city is noisy and polluted.

  4. Similar reasoning applies to YearBuilt. On the one hand, prices should be highest for new buildings and houses. On the other hand, very old buildings may be architectural monuments with historical value, so they may command very high prices.

  5. The features Suburb, Postcode, and Regionname characterize a house's location in the city and, as a consequence, the crime situation and transport accessibility. So these features and their various combinations should influence the house price.

  6. CouncilArea may characterize the quality of the local government's work. The well-being of an area depends on that work, and so do house prices across different areas.

  7. The Type of property certainly matters: a detached cottage or villa is more expensive than a duplex shared with neighbors.
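As a quick, rough check of points 1 and 2, we can compute rank correlations with Price and the median price by month of sale. This is only a sketch; it uses the raw full_data loaded above, restricted to rows with a known Price:

priced = full_data[full_data.Price.notnull()]

# Point 1: rank correlation of the count/size features with Price.
num_cols = ['Rooms', 'Bedroom2', 'Bathroom', 'Car', 'Landsize', 'BuildingArea']
print(priced[num_cols + ['Price']].corr(method='spearman')['Price'])

# Point 2: median price by month of sale, to eyeball seasonality.
print(priced.groupby(priced.Date.dt.month)['Price'].median())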


2.2 Target value analysis

Save the target value Price without NaNs to the variable $y$ for analysis:

In [9]:
y = full_data[full_data.Price.notnull()]['Price']

Let's plot the distribution of the target value:

In [10]:
plt.figure(figsize=(14, 7))
sns.distplot(y)
plt.grid()
plt.title('Price distribution');

This is not a normal distribution, so it is not a good idea to predict this value directly. Let's instead take the logarithm of the target and plot the distribution of the transformed value:

$$ \widehat{y} = \ln(y + 1) $$
In [11]:
plt.figure(figsize=(14, 7))
sns.distplot(np.log( 1.0 + y ))
plt.grid()
plt.title('Logarithm of Price distribution');

Now let's run statistical tests for the normality and skewness of the distribution of $\widehat{y}$.

Use the Shapiro-Wilk and Kolmogorov-Smirnov normality tests:

In [12]:
from scipy.stats import shapiro, kstest, probplot, skew
test_stat, p_value = shapiro(np.log(y))
test_stat, p_value
Out[12]:
(0.993354856967926, 2.062829485851938e-32)
In [13]:
test_stat, p_value = kstest(np.log(y), cdf='norm')
test_stat, p_value
Out[13]:
(1.0, 0.0)

QQ-plot of $\widehat{y}$.

For an ideal normal distribution, all blue dots lie on the red line.

In [14]:
plt.figure(figsize=(7,7))
probplot(np.log(y), dist='norm', plot=plt);
plt.grid()

Skewness test: for a symmetrical distribution, the skewness statistic is equal to zero.

In [15]:
skew(np.log(y))
Out[15]:
0.32664952595735286

Strictly speaking, both tests reject exact normality: the Shapiro-Wilk p-value is tiny, and the Kolmogorov-Smirnov statistic of 1.0 merely reflects that we compared the unstandardized log prices (centred around 13-14) against a standard normal. Nevertheless, the distribution of $\widehat{y}$ is only slightly asymmetrical, and the QQ-plot shows it is close enough to normal to work with $\widehat{y}$ as an approximately normally distributed value.
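For reference, a sketch of the KS test applied to standardized values, which yields a meaningful statistic (here y is the Price series defined above):

log_y = np.log(y)
z = (log_y - log_y.mean()) / log_y.std()  # standardize before comparing with N(0, 1)
print(kstest(z, cdf='norm'))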

So, we shall work with the target value $y$ as follows (a toy end-to-end sketch follows the list):

  1. Train models on the transformed target $y_* = \ln(y+1)$
  2. Make predictions $\widehat{y}_*$ in the transformed space
  3. Apply the inverse transform back to the original target: $\widehat{y} = e^{\widehat{y}_*} - 1$
  4. Assess the quality with some metric $f(y, \widehat{y})$, to be chosen later.
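A minimal end-to-end sketch of this procedure on toy data (LinearRegression and the synthetic matrix are placeholders, not the model we will actually use):

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(17)
X_toy = rng.rand(200, 3)                                # stand-in feature matrix
y_toy = np.exp(13 + X_toy @ np.array([0.5, 0.3, 0.2]))  # price-like positive target

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=17)

model = LinearRegression()
model.fit(X_tr, np.log1p(y_tr))           # 1. train on y* = ln(y + 1)
y_pred = np.expm1(model.predict(X_te))    # 2-3. predict, then invert: exp(y*) - 1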

2.3 Missing values processing

Since the data contains a lot of missing values, let's first of all try to fill them.

In [16]:
X_full.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 33561 entries, 0 to 34856
Data columns (total 21 columns):
Suburb           33561 non-null object
Address          33561 non-null object
Rooms            33561 non-null int64
Type             33561 non-null object
Price            26218 non-null float64
Method           33561 non-null object
SellerG          33561 non-null object
Date             33561 non-null datetime64[ns]
Distance         33561 non-null float64
Postcode         33561 non-null float64
Bedroom2         25752 non-null float64
Bathroom         25743 non-null float64
Car              25286 non-null float64
Landsize         22461 non-null float64
BuildingArea     13316 non-null float64
YearBuilt        15079 non-null float64
CouncilArea      33559 non-null object
Lattitude        25986 non-null float64
Longtitude       25986 non-null float64
Regionname       33559 non-null object
Propertycount    33559 non-null float64
dtypes: datetime64[ns](1), float64(12), int64(1), object(7)
memory usage: 5.6+ MB

Start with the fields Regionname, Propertycount, CouncilArea, and Postcode.

Let's look at objects with a missing Postcode:

In [17]:
X_full[X_full.Postcode.isnull()]
Out[17]:
Suburb Address Rooms Type Price Method SellerG Date Distance Postcode ... Bathroom Car Landsize BuildingArea YearBuilt CouncilArea Lattitude Longtitude Regionname Propertycount

0 rows × 21 columns

After the date filter applied above, no such objects remain; we still drop any rows with a missing Postcode and then cast the column to int:

In [18]:
X_full = X_full[~X_full.Postcode.isnull()]
In [19]:
X_full.Postcode = X_full.Postcode.astype(int)

For the objects with missing Regionname, Propertycount, and CouncilArea we have the same case: most of the fields are missing, so we drop them too:

In [20]:
X_full[X_full.Regionname.isnull()]
Out[20]:
Suburb Address Rooms Type Price Method SellerG Date Distance Postcode ... Bathroom Car Landsize BuildingArea YearBuilt CouncilArea Lattitude Longtitude Regionname Propertycount
18523 Footscray 2/16 Stafford St 2 u 710000.0 S Jas 2017-07-15 5.1 3011 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
26888 Camberwell 2/3 Kingsley St 2 h 825000.0 VB Jellis 2017-11-11 7.7 3124 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

2 rows × 21 columns

In [21]:
X_full = X_full[~X_full.Regionname.isnull()]

Check the missing value counts in the data again:

In [22]:
X_full.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 33559 entries, 0 to 34856
Data columns (total 21 columns):
Suburb           33559 non-null object
Address          33559 non-null object
Rooms            33559 non-null int64
Type             33559 non-null object
Price            26216 non-null float64
Method           33559 non-null object
SellerG          33559 non-null object
Date             33559 non-null datetime64[ns]
Distance         33559 non-null float64
Postcode         33559 non-null int64
Bedroom2         25752 non-null float64
Bathroom         25743 non-null float64
Car              25286 non-null float64
Landsize         22461 non-null float64
BuildingArea     13316 non-null float64
YearBuilt        15079 non-null float64
CouncilArea      33559 non-null object
Lattitude        25986 non-null float64
Longtitude       25986 non-null float64
Regionname       33559 non-null object
Propertycount    33559 non-null float64
dtypes: datetime64[ns](1), float64(11), int64(2), object(7)
memory usage: 5.6+ MB

Check the YearBuilt values by plotting a histogram:

In [23]:
plt.figure(figsize=(10,5))
X_full.YearBuilt.hist(bins=100)

plt.text(x=1200, y = 1200, s='Min built year = {}'.format(X_full.YearBuilt.min()))
plt.text(x=1200, y = 1100, s='Max built year = {}'.format(X_full.YearBuilt.max()))
plt.title('Year built');

There are wrong values, including years from the future :)

Let's set them to NaN and fill them later together with the other missing values.

In [24]:
X_full[(X_full.YearBuilt < 1800) | (X_full.YearBuilt > 2018)]
Out[24]:
Suburb Address Rooms Type Price Method SellerG Date Distance Postcode ... Bathroom Car Landsize BuildingArea YearBuilt CouncilArea Lattitude Longtitude Regionname Propertycount
2453 Bulleen 3 Maringa St 4 h NaN SP Ray 2016-07-11 11.8 3105 ... 2.0 2.0 729.0 255.0 2106.0 Manningham City Council -37.76370 145.08810 Eastern Metropolitan 4480.0
16424 Mount Waverley 5 Armstrong St 3 h 1200000.0 VB McGrath 2017-06-24 14.2 3149 ... 1.0 4.0 807.0 117.0 1196.0 Monash City Council -37.86788 145.12121 Eastern Metropolitan 13366.0
33033 Bentleigh 1 Wyuna Ct 3 h 1100000.0 VB Woodards 2018-03-17 11.4 3204 ... 1.0 4.0 635.0 242.0 2019.0 Glen Eira City Council -37.92963 145.03666 Southern Metropolitan 6795.0

3 rows × 21 columns

In [25]:
X_full.loc[(X_full.YearBuilt < 1800) | (X_full.YearBuilt > 2018), 'YearBuilt'] = np.nan

Note that there are a lot of missing Lattitude and Longtitude values. We can restore them from the Address (using the street), Postcode, Regionname, Suburb, and CouncilArea values.

Select the objects with filled Lattitude and Longtitude and compute their mean coordinates grouped by these values:

In [26]:
coords_features = ['Address', 'Postcode', 'Regionname', 'Suburb', 'CouncilArea', 'Lattitude', 'Longtitude']
In [27]:
coords_data = X_full[~((X_full.Lattitude.isnull()) & (X_full.Longtitude.isnull()))][coords_features]
In [28]:
coords_data.head()
Out[28]:
Address Postcode Regionname Suburb CouncilArea Lattitude Longtitude
0 68 Studley St 3067 Northern Metropolitan Abbotsford Yarra City Council -37.8014 144.9958
1 85 Turner St 3067 Northern Metropolitan Abbotsford Yarra City Council -37.7996 144.9984
2 25 Bloomburg St 3067 Northern Metropolitan Abbotsford Yarra City Council -37.8079 144.9934
3 18/659 Victoria St 3067 Northern Metropolitan Abbotsford Yarra City Council -37.8114 145.0116
4 5 Charles St 3067 Northern Metropolitan Abbotsford Yarra City Council -37.8093 144.9944

Extract the street name from the address:

In [29]:
coords_data['Address_splitted'] = coords_data.Address.str.split(' ')
coords_data['Street'] = coords_data.Address_splitted.apply(lambda s: s[1])
In [30]:
group_features = ['Regionname','Suburb','CouncilArea']
In [31]:
coords_data_mean = coords_data\
                    .groupby(group_features)[['Lattitude', 'Longtitude']]\
                    .mean()\
                    .reset_index()\
                    .rename(columns={'Lattitude': 'Lat_new', 'Longtitude': 'Lon_new'})
In [32]:
coords_data_mean.head()
Out[32]:
Regionname Suburb CouncilArea Lat_new Lon_new
0 Eastern Metropolitan Bayswater Knox City Council -37.848626 145.260982
1 Eastern Metropolitan Bayswater North Maroondah City Council -37.825929 145.284352
2 Eastern Metropolitan Bellfield Banyule City Council -37.752281 145.039601
3 Eastern Metropolitan Blackburn Whitehorse City Council -37.821410 145.150006
4 Eastern Metropolitan Blackburn North Whitehorse City Council -37.805467 145.153947

Now add the Street name (and HouseNumber) to our dataset and merge it with coords_data_mean:

In [33]:
X_full['Address_splitted'] = X_full.Address.str.split(' ')
X_full['Street'] = X_full.Address_splitted.apply(lambda s: s[1])
X_full['HouseNumber'] = X_full.Address_splitted.apply(lambda s: s[0])
X_full.drop('Address_splitted', axis=1, inplace=True)
In [34]:
X_full_2 = pd.merge(X_full, coords_data_mean, on=group_features, how='left')
In [35]:
X_full_2.head()
Out[35]:
Suburb Address Rooms Type Price Method SellerG Date Distance Postcode ... YearBuilt CouncilArea Lattitude Longtitude Regionname Propertycount Street HouseNumber Lat_new Lon_new
0 Abbotsford 68 Studley St 2 h NaN SS Jellis 2016-03-09 2.5 3067 ... NaN Yarra City Council -37.8014 144.9958 Northern Metropolitan 4019.0 Studley 68 -37.804113 144.997427
1 Abbotsford 85 Turner St 2 h 1480000.0 S Biggin 2016-03-12 2.5 3067 ... NaN Yarra City Council -37.7996 144.9984 Northern Metropolitan 4019.0 Turner 85 -37.804113 144.997427
2 Abbotsford 25 Bloomburg St 2 h 1035000.0 S Biggin 2016-04-02 2.5 3067 ... 1900.0 Yarra City Council -37.8079 144.9934 Northern Metropolitan 4019.0 Bloomburg 25 -37.804113 144.997427
3 Abbotsford 18/659 Victoria St 3 u NaN VB Rounds 2016-04-02 2.5 3067 ... NaN Yarra City Council -37.8114 145.0116 Northern Metropolitan 4019.0 Victoria 18/659 -37.804113 144.997427
4 Abbotsford 5 Charles St 3 h 1465000.0 SP Biggin 2017-04-03 2.5 3067 ... 1900.0 Yarra City Council -37.8093 144.9944 Northern Metropolitan 4019.0 Charles 5 -37.804113 144.997427

5 rows × 25 columns

Now replace the NaNs in Lattitude and Longtitude with the new values Lat_new and Lon_new:

In [36]:
X_full_2.loc[X_full_2.Lattitude.isnull(), 'Lattitude'] = X_full_2['Lat_new']
X_full_2.loc[X_full_2.Longtitude.isnull(), 'Longtitude'] = X_full_2['Lon_new']

And check whether all the NaNs in Lattitude and Longtitude are filled:

In [37]:
X_full_2[X_full_2.Lattitude.isnull()].shape[0]
Out[37]:
89

No, 89 values remain unfilled, because for some combinations of Regionname, Suburb, and CouncilArea the Lattitude and Longtitude values were missing entirely. Since the number of such objects is very small compared to the dataset size, we can simply fill them with the overall mean:

In [38]:
X_full_2.Lattitude = X_full_2.Lattitude.fillna(X_full_2.Lattitude.mean())
X_full_2.Longtitude = X_full_2.Longtitude.fillna(X_full_2.Longtitude.mean())

X_full_2.drop(['Lat_new','Lon_new'], axis=1, inplace=True)

Check missing values counts:

In [39]:
X_full_2.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 33559 entries, 0 to 33558
Data columns (total 23 columns):
Suburb           33559 non-null object
Address          33559 non-null object
Rooms            33559 non-null int64
Type             33559 non-null object
Price            26216 non-null float64
Method           33559 non-null object
SellerG          33559 non-null object
Date             33559 non-null datetime64[ns]
Distance         33559 non-null float64
Postcode         33559 non-null int64
Bedroom2         25752 non-null float64
Bathroom         25743 non-null float64
Car              25286 non-null float64
Landsize         22461 non-null float64
BuildingArea     13316 non-null float64
YearBuilt        15076 non-null float64
CouncilArea      33559 non-null object
Lattitude        33559 non-null float64
Longtitude       33559 non-null float64
Regionname       33559 non-null object
Propertycount    33559 non-null float64
Street           33559 non-null object
HouseNumber      33559 non-null object
dtypes: datetime64[ns](1), float64(11), int64(2), object(9)
memory usage: 7.4+ MB

So, we still have missing values in the features Bedroom2, Bathroom, Car, Landsize, BuildingArea, and YearBuilt.


Before filling the missing values, let's check the distributions of these features. They may contain outliers that would harm both the imputation quality and, later on, the model quality.

Let's plot the distributions of these features:

In [40]:
# Bathroom
X_full_2[~X_full_2.Bathroom.isnull()].Bathroom.hist(bins=11)
X_full_2[~X_full_2.Bathroom.isnull()].Bathroom.value_counts()
Out[40]:
1.0     12626
2.0     10611
3.0      2105
4.0       259
5.0        73
0.0        46
6.0        14
7.0         4
8.0         3
9.0         1
12.0        1
Name: Bathroom, dtype: int64

There are objects with 7 or more bathrooms! Let's look at those with 8 or more:

In [41]:
X_full[X_full.Bathroom >= 8][['Suburb', 'Address', 'Rooms', 'Type', 'Method',  'Date',
       'Distance', 'Bedroom2', 'Bathroom', 'Car', 'Landsize', 'BuildingArea', 'YearBuilt', 'Price']]
Out[41]:
Suburb Address Rooms Type Method Date Distance Bedroom2 Bathroom Car Landsize BuildingArea YearBuilt Price
2536 Burwood 36 Cumming St 3 h PI 2016-08-10 11.7 30.0 12.0 0.0 946.0 NaN 1950.0 NaN
8251 Preston 421 Murray Rd 4 h PI 2016-05-22 8.8 9.0 8.0 7.0 1254.0 280.0 1928.0 760000.0
17472 Camberwell 1088 Toorak Rd 8 h PI 2017-08-07 7.7 8.0 8.0 4.0 650.0 NaN NaN 2200000.0
25717 Burwood 79 Elgar Rd 10 h SP 2017-10-28 10.4 10.0 9.0 2.0 612.0 466.0 2015.0 2315000.0
26210 Mitcham 20 Harrison St 16 h S 2017-10-28 17.2 16.0 8.0 8.0 NaN NaN NaN 5000000.0
In [42]:
# Bedroom
X_full_2[~X_full_2.Bedroom2.isnull()].Bedroom2.hist(bins=15)
X_full_2[~X_full_2.Bedroom2.isnull()].Bedroom2.value_counts()
Out[42]:
3.0     11448
4.0      6063
2.0      5685
5.0      1364
1.0       964
6.0       156
7.0        29
0.0        17
8.0        13
9.0         5
10.0        4
30.0        1
20.0        1
12.0        1
16.0        1
Name: Bedroom2, dtype: int64

This looks very suspicious, especially the objects where the number of bedrooms equals the number of bathrooms :) The same goes for a zero bedroom count.

So the better solution is to drop the objects with more than 6 bathrooms and with more than 8 (or zero) bedrooms:

In [43]:
X_full_3 = X_full_2[(X_full_2.Bathroom.isnull()) | (X_full_2.Bathroom <= 6)]
X_full_3 = X_full_3[(X_full_3.Bedroom2.isnull()) | ((X_full_3.Bedroom2 <= 8) & (X_full_3.Bedroom2 > 0))]

Check Car feature:

In [44]:
# Car
X_full_3[~X_full_3.Car.isnull()].Car.hist(bins=9)
X_full_3[~X_full_3.Car.isnull()].Car.value_counts()
Out[44]:
2.0     11715
1.0      8960
0.0      1598
3.0      1543
4.0      1104
5.0       143
6.0       133
7.0        22
8.0        21
10.0        5
9.0         3
11.0        2
18.0        1
26.0        1
12.0        1
Name: Car, dtype: int64
In [45]:
X_full[X_full.Car > 9 ][['Suburb', 'Address', 'Rooms', 'Type', 'Method',  'Date',
       'Distance', 'Bedroom2', 'Bathroom', 'Car', 'Landsize', 'BuildingArea', 'YearBuilt', 'Price']].head()
Out[45]:
Suburb Address Rooms Type Method Date Distance Bedroom2 Bathroom Car Landsize BuildingArea YearBuilt Price
4324 Fitzroy North 416 Queens Pde 5 h S 2016-08-28 3.5 4.0 2.0 11.0 1010.0 440.0 NaN NaN
6017 Kew 25 Duke St 2 h S 2016-09-24 5.6 2.0 2.0 26.0 267.0 500.0 2011.0 NaN
15449 Bayswater 95 Orange Gr 4 h SP 2017-06-17 23.2 4.0 1.0 10.0 993.0 128.0 1966.0 925000.0
18923 Dandenong 1462 Heatherton Rd 3 h S 2017-07-22 24.7 3.0 2.0 10.0 734.0 NaN NaN 880000.0
23395 Reservoir 1 Don St 4 h S 2017-08-26 12.0 4.0 2.0 10.0 1002.0 170.0 1985.0 1112000.0

So we see that the objects with 9 or more carspots tend to have very large Landsize values and suspiciously low prices. These objects don't look like most of the others, so we will drop them too:

In [46]:
X_full_3 = X_full_3[(X_full_3.Car.isnull()) | (X_full_3.Car < 9)]

Check Landsize feature:

In [47]:
# Landsize
X_full_3[~X_full_3.Landsize.isnull()].Landsize.hist();

The distribution is very skewed because of outliers with huge values. Let's find the mean value and the 99.9% quantile of the Landsize feature, and also look at the top-10 Landsize objects in our dataset:

In [48]:
# Find mean value of Landsize:
X_full_3.Landsize.mean()
Out[48]:
589.8170666428763
In [49]:
# Find 99.9 % quantile of Landsize:
q_99 = X_full_3[~X_full_3.Landsize.isnull()].Landsize.quantile(0.999)
q_99
Out[49]:
17200.0
In [50]:
# Find top-10 objects with biggest Landsize:
top_10_landsizes = sorted(X_full_3[~X_full_3.Landsize.isnull()].Landsize, reverse=True)[:10]
In [51]:
top_10_landsizes
Out[51]:
[433014.0,
 146699.0,
 89030.0,
 80000.0,
 76000.0,
 75100.0,
 44500.0,
 42800.0,
 41400.0,
 40500.0]
In [52]:
X_full_3[X_full_3.Landsize.isin(top_10_landsizes)][['Suburb', 'Address', 'Rooms', 'Type', 'Method',  'Date',
       'Distance', 'Bedroom2', 'Bathroom', 'Car', 'Landsize', 'BuildingArea', 'YearBuilt', 'Price']]
Out[52]:
Suburb Address Rooms Type Method Date Distance Bedroom2 Bathroom Car Landsize BuildingArea YearBuilt Price
1198 Balwyn North 9 Gildan St 3 h VB 2016-08-28 9.2 3.0 1.0 2.0 75100.0 NaN NaN 2000000.0
8599 Reservoir 14 Beenak St 3 h S 2016-04-23 11.2 3.0 1.0 2.0 41400.0 NaN NaN 572000.0
17293 Silvan 16 Eleanor Dr 3 h S 2017-05-27 34.6 3.0 2.0 2.0 76000.0 NaN NaN 1085000.0
18036 Fitzroy 389 Gore St 3 h VB 2017-12-08 2.1 3.0 3.0 1.0 433014.0 NaN NaN 2700000.0
22631 New Gisborne 71 Hamilton Rd 5 h S 2017-09-23 48.1 5.0 3.0 5.0 44500.0 44515.0 NaN 1355000.0
24799 Wildwood 430 Wildwood Rd 5 h S 2017-10-14 31.7 5.0 2.0 2.0 89030.0 NaN NaN 1030000.0
26867 Bullengarook 65 Webb Rd 4 h PI 2017-11-11 45.9 4.0 2.0 1.0 146699.0 NaN NaN 1050000.0
31090 Wandin North 237 Warburton Hwy 4 h PI 2018-03-03 35.2 4.0 2.0 2.0 40500.0 NaN NaN 1150000.0
31279 Diggers Rest 55 Morefield Ct 4 h PI 2018-02-17 27.7 4.0 2.0 NaN 80000.0 NaN NaN NaN
32757 Bullengarook 11 Dohoney Rd 4 h S 2018-02-24 45.9 4.0 2.0 0.0 42800.0 180.0 2000.0 865000.0

So, almost all of them are located far from the CBD (the mean distance is about 29 km). As with the other outliers, we drop the objects with Landsize above the 99.9% quantile, because they would harm the model quality:

In [53]:
X_full_3 = X_full_3[(X_full_3.Landsize.isnull()) | (X_full_3.Landsize < q_99)]

Plot the distribution after removing the outliers:

In [54]:
X_full_3[~X_full_3.Landsize.isnull()].Landsize.hist(bins=50);

Check BuildingArea feature:

In [55]:
# BuildingArea
X_full_3[~X_full_3.BuildingArea.isnull()].BuildingArea.hist(bins=50);

We have the same situation as with Landsize, so let's repeat the outlier-dropping procedure for the BuildingArea feature, this time with the 99% quantile:

In [56]:
# Find the 99% quantile of BuildingArea:
q_99 = X_full_3[~X_full_3.BuildingArea.isnull()].BuildingArea.quantile(0.99)
q_99
Out[56]:
467.41999999999825
In [57]:
X_full_3 = X_full_3[(X_full_3.BuildingArea.isnull()) | (X_full_3.BuildingArea < q_99)]

Plot the distribution after removing the outliers:

In [58]:
X_full_3[~X_full_3.BuildingArea.isnull()].BuildingArea.hist(bins=50);

Note that BuildingArea can't be equal to zero (unlike Landsize), yet we have several zero values in the BuildingArea feature. They seem to be mistakes, so let's change these zeros to NaNs:

In [59]:
X_full_3.loc[X_full_3.BuildingArea == 0, 'BuildingArea'] = np.nan

So, we have dropped the objects with abnormal values in several features.

In a real business task we would probably build separate models for each group of such objects, but in this task our goal is to build one model for the majority of objects in our dataset.

In [60]:
X_full_3.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 33352 entries, 0 to 33558
Data columns (total 23 columns):
Suburb           33352 non-null object
Address          33352 non-null object
Rooms            33352 non-null int64
Type             33352 non-null object
Price            26055 non-null float64
Method           33352 non-null object
SellerG          33352 non-null object
Date             33352 non-null datetime64[ns]
Distance         33352 non-null float64
Postcode         33352 non-null int64
Bedroom2         25545 non-null float64
Bathroom         25536 non-null float64
Car              25081 non-null float64
Landsize         22275 non-null float64
BuildingArea     13080 non-null float64
YearBuilt        14951 non-null float64
CouncilArea      33352 non-null object
Lattitude        33352 non-null float64
Longtitude       33352 non-null float64
Regionname       33352 non-null object
Propertycount    33352 non-null float64
Street           33352 non-null object
HouseNumber      33352 non-null object
dtypes: datetime64[ns](1), float64(11), int64(2), object(9)
memory usage: 6.1+ MB

It would be more correct to handle the NaNs in these features the way we did with Lattitude and Longtitude: calculate mean (or median) values within groups of similar objects without NaNs and use those values for filling.
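For completeness, here is a sketch of that group-wise approach; the grouping keys below (Suburb and Type) are an assumption, and we don't use this function further:

def groupwise_fill(df, col, keys=('Suburb', 'Type')):
    """Fill NaNs in `col` with the group median, falling back to the global median."""
    group_median = df.groupby(list(keys))[col].transform('median')
    return df[col].fillna(group_median).fillna(df[col].median())

# hypothetical usage: X_full_3['Car'] = groupwise_fill(X_full_3, 'Car')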

But let's use SimpleImputer from sklearn to save time and for variety :)

In [61]:
import sklearn
from sklearn.impute import SimpleImputer
In [62]:
imputer_median = SimpleImputer(missing_values=np.nan, strategy='median')

Select features for imputing:

In [63]:
features_with_nans = X_full_3.columns[X_full_3.isnull().any()].tolist()
features_with_nans.remove('Price')
In [64]:
X_full_3.reset_index(drop=True, inplace=True)
X_to_impute = X_full_3[features_with_nans].copy()
In [65]:
X_imputed_array = imputer_median.fit_transform(X_to_impute)
X_imputed = pd.DataFrame(data=X_imputed_array, columns=features_with_nans)

Create a new dataset with the imputed values:

In [66]:
X_full_4 = pd.concat([X_full_3.drop(features_with_nans, axis=1), X_imputed], axis=1)
In [67]:
X_full_4.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33352 entries, 0 to 33351
Data columns (total 23 columns):
Suburb           33352 non-null object
Address          33352 non-null object
Rooms            33352 non-null int64
Type             33352 non-null object
Price            26055 non-null float64
Method           33352 non-null object
SellerG          33352 non-null object
Date             33352 non-null datetime64[ns]
Distance         33352 non-null float64
Postcode         33352 non-null int64
CouncilArea      33352 non-null object
Lattitude        33352 non-null float64
Longtitude       33352 non-null float64
Regionname       33352 non-null object
Propertycount    33352 non-null float64
Street           33352 non-null object
HouseNumber      33352 non-null object
Bedroom2         33352 non-null float64
Bathroom         33352 non-null float64
Car              33352 non-null float64
Landsize         33352 non-null float64
BuildingArea     33352 non-null float64
YearBuilt        33352 non-null float64
dtypes: datetime64[ns](1), float64(11), int64(2), object(9)
memory usage: 5.9+ MB

Now there are no NaNs among the feature values, but there are still NaNs in the target, Price. Before building the model we will drop the objects with a missing target.

Now let's split our data into a features dataframe and a target vector:

In [68]:
data_total = X_full_4.copy()

Re-derive Street and HouseNumber from Address and drop the raw address column:

In [69]:
data_total['Address_splitted'] = data_total.Address.str.split(' ')
data_total['Street'] = data_total.Address_splitted.apply(lambda s: s[1])
data_total['HouseNumber'] = data_total.Address_splitted.apply(lambda s: s[0])
data_total.drop(['Address_splitted', 'Address'], axis=1, inplace=True)
In [70]:
data_total.YearBuilt = data_total.YearBuilt.astype(int)
data_total.Bedroom2 = data_total.Bedroom2.astype(int)
data_total.Bathroom = data_total.Bathroom.astype(int)
data_total.Car = data_total.Car.astype(int)
In [71]:
data = data_total[~data_total.Price.isnull()]
In [72]:
X_total = data_total.drop('Price', axis=1)
In [73]:
# Features for objects with price only
X = data_total[~data_total.Price.isnull()].drop('Price', axis=1)
In [74]:
# Target vector
y = data_total[~data_total.Price.isnull()]['Price']

Check the shapes of the dataframes:

In [75]:
X.shape, y.shape
Out[75]:
((26055, 21), (26055,))

Part 3. Visual analysis of the features

Let's split our features into categorical and numerical:

In [76]:
numerical_features = ['Rooms', 'Distance', 'Propertycount',
                      'Bedroom2', 'Bathroom', 'Car', 'Landsize',
                      'BuildingArea', 'YearBuilt', 'HouseNumber']
# note: HouseNumber is still a string column, so pairplot below will silently skip it
In [77]:
cat_features = ['Suburb', 'Address','Type', 'Method', 'SellerG','CouncilArea','Regionname']

3.1 Numerical feature relationships

Let's use seaborn pairplot to visualize relationships between numerical features:

In [78]:
sns.pairplot(data=data[numerical_features + ['Price']]);