import graphlab
sales = graphlab.SFrame('home_data.gl/')
[INFO] graphlab.cython.cy_server: GraphLab Create v2.1 started. Logging: C:\WINDOWS\TEMP\graphlab_server_1538540879.log.0
sales
id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront |
---|---|---|---|---|---|---|---|---|
7129300520 | 2014-10-13 00:00:00+00:00 | 221900 | 3 | 1 | 1180 | 5650 | 1 | 0 |
6414100192 | 2014-12-09 00:00:00+00:00 | 538000 | 3 | 2.25 | 2570 | 7242 | 2 | 0 |
5631500400 | 2015-02-25 00:00:00+00:00 | 180000 | 2 | 1 | 770 | 10000 | 1 | 0 |
2487200875 | 2014-12-09 00:00:00+00:00 | 604000 | 4 | 3 | 1960 | 5000 | 1 | 0 |
1954400510 | 2015-02-18 00:00:00+00:00 | 510000 | 3 | 2 | 1680 | 8080 | 1 | 0 |
7237550310 | 2014-05-12 00:00:00+00:00 | 1225000 | 4 | 4.5 | 5420 | 101930 | 1 | 0 |
1321400060 | 2014-06-27 00:00:00+00:00 | 257500 | 3 | 2.25 | 1715 | 6819 | 2 | 0 |
2008000270 | 2015-01-15 00:00:00+00:00 | 291850 | 3 | 1.5 | 1060 | 9711 | 1 | 0 |
2414600126 | 2015-04-15 00:00:00+00:00 | 229500 | 3 | 1 | 1780 | 7470 | 1 | 0 |
3793500160 | 2015-03-12 00:00:00+00:00 | 323000 | 3 | 2.5 | 1890 | 6560 | 2 | 0 |
view | condition | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat |
---|---|---|---|---|---|---|---|---|
0 | 3 | 7 | 1180 | 0 | 1955 | 0 | 98178 | 47.51123398 |
0 | 3 | 7 | 2170 | 400 | 1951 | 1991 | 98125 | 47.72102274 |
0 | 3 | 6 | 770 | 0 | 1933 | 0 | 98028 | 47.73792661 |
0 | 5 | 7 | 1050 | 910 | 1965 | 0 | 98136 | 47.52082 |
0 | 3 | 8 | 1680 | 0 | 1987 | 0 | 98074 | 47.61681228 |
0 | 3 | 11 | 3890 | 1530 | 2001 | 0 | 98053 | 47.65611835 |
0 | 3 | 7 | 1715 | 0 | 1995 | 0 | 98003 | 47.30972002 |
0 | 3 | 7 | 1060 | 0 | 1963 | 0 | 98198 | 47.40949984 |
0 | 3 | 7 | 1050 | 730 | 1960 | 0 | 98146 | 47.51229381 |
0 | 3 | 7 | 1890 | 0 | 2003 | 0 | 98038 | 47.36840673 |
long | sqft_living15 | sqft_lot15 |
---|---|---|
-122.25677536 | 1340.0 | 5650.0 |
-122.3188624 | 1690.0 | 7639.0 |
-122.23319601 | 2720.0 | 8062.0 |
-122.39318505 | 1360.0 | 5000.0 |
-122.04490059 | 1800.0 | 7503.0 |
-122.00528655 | 4760.0 | 101930.0 |
-122.32704857 | 2238.0 | 6819.0 |
-122.31457273 | 1650.0 | 9711.0 |
-122.33659507 | 1780.0 | 8113.0 |
-122.0308176 | 2390.0 | 7570.0 |
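graphlab.canvas.set_target('ipynb')  # render plots inline in the Jupyter notebook
sales.show(view="Scatter Plot", x="sqft_living", y="price")  # scatter plot of price against living area
train_data, test_data = sales.random_split(0.8, seed=0)  # randomly assign about 80% of the rows to the training set and the rest to the test set; seed=0 makes the split reproducible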
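The seeded split above can be pictured with a minimal pure-Python sketch. This is an illustration only, not GraphLab's implementation: `split_rows` is a hypothetical helper, and the list of integers stands in for the rows of `sales`.

```python
import random

def split_rows(rows, fraction, seed):
    """Hypothetical helper: send each row to the first part with probability `fraction`."""
    rng = random.Random(seed)  # a fixed seed makes the split repeatable
    first, second = [], []
    for row in rows:
        if rng.random() < fraction:
            first.append(row)
        else:
            second.append(row)
    return first, second

rows = list(range(1000))  # stand-in data, not the housing table
train, test = split_rows(rows, 0.8, seed=0)
train_again, test_again = split_rows(rows, 0.8, seed=0)

print("train=%d test=%d" % (len(train), len(test)))
print("same split with same seed: %s" % (train == train_again))
```

Because each row is assigned independently, the training share is only approximately 80%; reusing the seed reproduces the same partition.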
sqft_model = graphlab.linear_regression.create(train_data, target="price", features=["sqft_living"])
PROGRESS: Creating a validation set from 5 percent of training data. This may take a while. You can set ``validation_set=None`` to disable validation tracking.
Linear regression:
--------------------------------------------------------
Number of examples : 16542
Number of features : 1
Number of unpacked features : 1
Number of coefficients : 2
Starting Newton Method
--------------------------------------------------------
+-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
| Iteration | Passes | Elapsed Time | Training-max_error | Validation-max_error | Training-rmse | Validation-rmse |
+-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
| 1 | 2 | 1.102511 | 4347373.632024 | 2281153.308508 | 262643.709810 | 268770.190492 |
+-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
SUCCESS: Optimal solution found.
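The log above shows GraphLab fitting the model with Newton's method. For a single feature, the least-squares line also has a closed-form solution, sketched below as a swapped-in technique rather than what the library runs. `fit_line` is a hypothetical helper and the four data points are invented for illustration.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y = intercept + slope * x."""
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (unnormalised)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Toy data, invented for illustration: exactly price = 50000 + 200 * sqft
sqft = [1000, 1500, 2000, 2500]
price = [250000, 350000, 450000, 550000]

intercept, slope = fit_line(sqft, price)
print("intercept=%.1f slope=%.1f" % (intercept, slope))
```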
print test_data['price'].mean()  # mean sale price in the test set
543054.042563
print sqft_model.evaluate(test_data)
{'max_error': 4141899.108724547, 'rmse': 255198.5173749575}
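The two numbers returned by `evaluate` are the root-mean-square error and the largest absolute error over the test rows. A minimal sketch of both metrics, using three made-up prices and predictions rather than the housing data:

```python
import math

def rmse(actual, predicted):
    """Square root of the mean squared residual."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / float(len(squared)))

def max_error(actual, predicted):
    """Largest absolute residual."""
    return max(abs(a - p) for a, p in zip(actual, predicted))

# Invented values for illustration only
actual = [300000, 450000, 500000]
predicted = [310000, 430000, 540000]

print("rmse=%.2f" % rmse(actual, predicted))
print("max_error=%d" % max_error(actual, predicted))
```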
import matplotlib.pyplot as plt
%matplotlib inline
# Plot actual test prices as dots and the model's predictions as a line
plt.plot(test_data["sqft_living"], test_data["price"], '.',
         test_data["sqft_living"], sqft_model.predict(test_data), '-')
[<matplotlib.lines.Line2D at 0x14f07f60>, <matplotlib.lines.Line2D at 0x1f52e128>]
sqft_model.get('coefficients')  # show the intercept and slope
name | index | value | stderr |
---|---|---|---|
(intercept) | None | -47735.6663956 | 5049.53441257 |
sqft_living | None | 282.187720695 | 2.22337184786 |
my_features = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode']  # bedrooms, bathrooms, living area (sq ft), lot size (sq ft), floors, zip code
sales[my_features].show()
sales.show(view="BoxWhisker Plot", x="zipcode", y="price")  # box plot of price grouped by zip code
my_features_model = graphlab.linear_regression.create(train_data, target='price', features=my_features)
PROGRESS: Creating a validation set from 5 percent of training data. This may take a while. You can set ``validation_set=None`` to disable validation tracking.
Linear regression:
--------------------------------------------------------
Number of examples : 16550
Number of features : 6
Number of unpacked features : 6
Number of coefficients : 115
Starting Newton Method
--------------------------------------------------------
+-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
| Iteration | Passes | Elapsed Time | Training-max_error | Validation-max_error | Training-rmse | Validation-rmse |
+-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
| 1 | 2 | 0.071961 | 3754333.116638 | 1692754.123587 | 182897.194603 | 161578.380955 |
+-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
SUCCESS: Optimal solution found.
print sqft_model.evaluate(test_data)
{'max_error': 4141899.108724547, 'rmse': 255198.5173749575}
print my_features_model.evaluate(test_data)
{'max_error': 3469329.7072889456, 'rmse': 179470.55500766632}
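To put the two test RMSE values in context, the sketch below compares them with each other and with the mean test price printed earlier. All three numbers are copied from the outputs above; the variable names are illustrative.

```python
# Values copied from the notebook output above
mean_price = 543054.042563
rmse_sqft = 255198.5173749575   # single-feature model
rmse_multi = 179470.55500766632  # six-feature model

# Fraction by which the six-feature model lowers the test RMSE
improvement = (rmse_sqft - rmse_multi) / rmse_sqft

# RMSE expressed as a share of the average sale price
share_sqft = rmse_sqft / mean_price
share_multi = rmse_multi / mean_price

print("RMSE reduction: %.1f%%" % (100 * improvement))
print("RMSE / mean price: %.2f (sqft only), %.2f (six features)" % (share_sqft, share_multi))
```

Even for the better model, the typical error is still a large fraction of the average price, so both are rough estimators.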
house1 = sales[sales['id']=='5309101200']  # select the row for this house id
print house1['price']
[620000L, ... ]
print sqft_model.predict(house1)  # single-feature model
[629514.8632717952]
print my_features_model.predict(house1)  # multi-feature model
[726242.3863360658]
house2 = sales[sales['id']=='1925069082']
house2
id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront |
---|---|---|---|---|---|---|---|---|
1925069082 | 2015-05-11 00:00:00+00:00 | 2200000 | 5 | 4.25 | 4640 | 22703 | 2 | 1 |
view | condition | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat |
---|---|---|---|---|---|---|---|---|
4 | 5 | 8 | 2860 | 1780 | 1952 | 0 | 98052 | 47.63925783 |
long | sqft_living15 | sqft_lot15 |
---|---|---|
-122.09722322 | 3140.0 | 14200.0 |
house2['price']
dtype: int
Rows: ?
[2200000L, ... ]
print sqft_model.predict(house2)
[1261615.3576280293]
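The single-feature prediction can be reproduced by hand from the coefficients table shown earlier: price = intercept + slope × sqft_living. The sketch below plugs in the printed (rounded) coefficients and the 4640 sq ft living area of this house, so it should agree with the model's output only up to that rounding.

```python
# Coefficients as printed by sqft_model.get('coefficients')
intercept = -47735.6663956
slope = 282.187720695

sqft_living = 4640  # living area of house 1925069082, from the table above

predicted_price = intercept + slope * sqft_living
print("predicted price: %.2f" % predicted_price)
```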
print my_features_model.predict(house2)
[1455703.1602792891]
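A quick way to read the four predictions above is to compare each with the actual sale price. The sketch below does that with the values copied from the outputs; the dictionary layout and model labels are just for illustration.

```python
# Actual prices and predictions copied from the notebook output above
houses = {
    "5309101200": {"actual": 620000,
                   "sqft_living only": 629514.8632717952,
                   "six features": 726242.3863360658},
    "1925069082": {"actual": 2200000,
                   "sqft_living only": 1261615.3576280293,
                   "six features": 1455703.1602792891},
}

closest = {}
for house_id in sorted(houses):
    row = houses[house_id]
    # Absolute error of each model for this house
    errors = {name: abs(row[name] - row["actual"])
              for name in ("sqft_living only", "six features")}
    closest[house_id] = min(errors, key=errors.get)
    print("%s: closest model is '%s' (error %.2f)"
          % (house_id, closest[house_id], errors[closest[house_id]]))
```

The six-feature model has the lower RMSE on the whole test set, yet for an individual house the simpler model can still land closer, as it does for the first house here.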