Problem link: https://www.kaggle.com/c/bike-sharing-demand
Bike Sharing Demand is a competition in which we predict how many bikes will be rented based on the given data.
If you have any questions or spot anything to correct while studying, please leave a comment on the blog.
I still have a lot to learn, but I will do my best to answer. Thank you, and happy studying!
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
plt.style.use('bmh')
%matplotlib inline
train = pd.read_csv("train.csv", parse_dates=["datetime"])
print(train.shape)
train.head(3)
(10886, 12)
| | datetime | season | holiday | workingday | weather | temp | atemp | humidity | windspeed | casual | registered | count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011-01-01 00:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 81 | 0.0 | 3 | 13 | 16 |
| 1 | 2011-01-01 01:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 8 | 32 | 40 |
| 2 | 2011-01-01 02:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 5 | 27 | 32 |
test = pd.read_csv("test.csv", parse_dates=["datetime"])
print(test.shape)
test.head(3)
(6493, 9)
| | datetime | season | holiday | workingday | weather | temp | atemp | humidity | windspeed |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011-01-20 00:00:00 | 1 | 0 | 1 | 1 | 10.66 | 11.365 | 56 | 26.0027 |
| 1 | 2011-01-20 01:00:00 | 1 | 0 | 1 | 1 | 10.66 | 13.635 | 56 | 0.0000 |
| 2 | 2011-01-20 02:00:00 | 1 | 0 | 1 | 1 | 10.66 | 13.635 | 56 | 0.0000 |
print("train data info")
print("-----------------------------------------------------------------------------------")
train.info()
train data info
-----------------------------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 12 columns):
datetime      10886 non-null datetime64[ns]
season        10886 non-null int64
holiday       10886 non-null int64
workingday    10886 non-null int64
weather       10886 non-null int64
temp          10886 non-null float64
atemp         10886 non-null float64
humidity      10886 non-null int64
windspeed     10886 non-null float64
casual        10886 non-null int64
registered    10886 non-null int64
count         10886 non-null int64
dtypes: datetime64[ns](1), float64(3), int64(8)
memory usage: 1020.6 KB
print("test data info")
print("-----------------------------------------------------------------------------------")
test.info()
test data info
-----------------------------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6493 entries, 0 to 6492
Data columns (total 9 columns):
datetime      6493 non-null datetime64[ns]
season        6493 non-null int64
holiday       6493 non-null int64
workingday    6493 non-null int64
weather       6493 non-null int64
temp          6493 non-null float64
atemp         6493 non-null float64
humidity      6493 non-null int64
windspeed     6493 non-null float64
dtypes: datetime64[ns](1), float64(3), int64(5)
memory usage: 456.6 KB
casual, registered, and count are the three columns present in the training data but absent from the test data. count is the target we need to predict, but it is simply the sum of rentals by non-members (casual) and members (registered).
Therefore, to reflect the characteristics of each group accurately, it seems better to predict casual and registered separately and then add the two predictions.
f, ax = plt.subplots(nrows=1, ncols=3, figsize=(18, 5))
sns.distplot(train["casual"], ax=ax[0])
ax[0].set_title("casual distribution")
sns.distplot(train["registered"], ax=ax[1])
ax[1].set_title("registered distribution")
sns.distplot(train["count"], ax=ax[2])
ax[2].set_title("count distribution")
plt.show()
f, ax = plt.subplots(nrows=1, ncols=3, figsize=(18, 5))
train["casual_log"] = np.log(train["casual"] + 1)
sns.distplot(train["casual_log"], ax=ax[0])
ax[0].set_title("casual_log distribution")
train["registered_log"] = np.log(train["registered"] + 1)
sns.distplot(train["registered_log"], ax=ax[1])
ax[1].set_title("registered_log distribution")
train["count_log"] = np.log(train["count"] + 1)
sns.distplot(train["count_log"], ax=ax[2])
ax[2].set_title("count_log distribution")
plt.show()
print("Number of rows where casual == 0")
print(train[train["casual"] == 0].shape)
Number of rows where casual == 0
(986, 15)
ㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡ
The upper plots show the distributions of casual, registered, and count. In all three, hours with zero or very few rentals are most common and values below 100 dominate, yet some hours approach 1,000 rentals, so the distributions are heavily right-skewed.
The lower plots show the same distributions after applying a log transform; they look much closer to normal. The competition is evaluated with RMSLE:

$$\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\log(p_i+1)-\log(a_i+1)\right)^2}$$

From this we can infer that when using casual and registered as labels, training on the log-transformed values should help.
ㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡ
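The metric above can be written as a small helper (the function name `rmsle` and the sample values below are mine, not from the notebook). Because the labels are stored as `np.log(x + 1)` (i.e. `np.log1p`), minimizing plain RMSE on that scale is the same as minimizing RMSLE on the original counts:

```python
import numpy as np

def rmsle(predicted, actual):
    """Root Mean Squared Logarithmic Error: RMSE of log1p-transformed values."""
    predicted, actual = np.asarray(predicted), np.asarray(actual)
    return np.sqrt(np.mean((np.log1p(predicted) - np.log1p(actual)) ** 2))

# A perfect prediction scores 0; errors on large counts are damped by the log
print(rmsle([10, 100], [10, 100]))  # 0.0
```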
train["datetime-year"] = train["datetime"].dt.year
train["datetime-month"] = train["datetime"].dt.month
train["datetime-day"] = train["datetime"].dt.day
train["datetime-hour"] = train["datetime"].dt.hour
train["datetime-minute"] = train["datetime"].dt.minute
train["datetime-second"] = train["datetime"].dt.second
train["datetime-dayofweek"] = train["datetime"].dt.dayofweek
train[["datetime", "datetime-year", "datetime-month", "datetime-day", "datetime-hour", "datetime-minute", "datetime-second", "datetime-dayofweek"]].head(3)
| | datetime | datetime-year | datetime-month | datetime-day | datetime-hour | datetime-minute | datetime-second | datetime-dayofweek |
|---|---|---|---|---|---|---|---|---|
| 0 | 2011-01-01 00:00:00 | 2011 | 1 | 1 | 0 | 0 | 0 | 5 |
| 1 | 2011-01-01 01:00:00 | 2011 | 1 | 1 | 1 | 0 | 0 | 5 |
| 2 | 2011-01-01 02:00:00 | 2011 | 1 | 1 | 2 | 0 | 0 | 5 |
test["datetime-year"] = test["datetime"].dt.year
test["datetime-month"] = test["datetime"].dt.month
test["datetime-day"] = test["datetime"].dt.day
test["datetime-hour"] = test["datetime"].dt.hour
test["datetime-minute"] = test["datetime"].dt.minute
test["datetime-second"] = test["datetime"].dt.second
test["datetime-dayofweek"] = test["datetime"].dt.dayofweek
test[["datetime", "datetime-year", "datetime-month", "datetime-day", "datetime-hour", "datetime-minute", "datetime-second", "datetime-dayofweek"]].head(3)
| | datetime | datetime-year | datetime-month | datetime-day | datetime-hour | datetime-minute | datetime-second | datetime-dayofweek |
|---|---|---|---|---|---|---|---|---|
| 0 | 2011-01-20 00:00:00 | 2011 | 1 | 20 | 0 | 0 | 0 | 3 |
| 1 | 2011-01-20 01:00:00 | 2011 | 1 | 20 | 1 | 0 | 0 | 3 |
| 2 | 2011-01-20 02:00:00 | 2011 | 1 | 20 | 2 | 0 | 0 | 3 |
train.loc[train["datetime-dayofweek"] == 0, "datetime-dayofweek(str)"] = "Mon"
train.loc[train["datetime-dayofweek"] == 1, "datetime-dayofweek(str)"] = "Tue"
train.loc[train["datetime-dayofweek"] == 2, "datetime-dayofweek(str)"] = "Wed"
train.loc[train["datetime-dayofweek"] == 3, "datetime-dayofweek(str)"] = "Thu"
train.loc[train["datetime-dayofweek"] == 4, "datetime-dayofweek(str)"] = "Fri"
train.loc[train["datetime-dayofweek"] == 5, "datetime-dayofweek(str)"] = "Sat"
train.loc[train["datetime-dayofweek"] == 6, "datetime-dayofweek(str)"] = "Sun"
train[["datetime", "datetime-dayofweek", "datetime-dayofweek(str)"]].head(3)
| | datetime | datetime-dayofweek | datetime-dayofweek(str) |
|---|---|---|---|
| 0 | 2011-01-01 00:00:00 | 5 | Sat |
| 1 | 2011-01-01 01:00:00 | 5 | Sat |
| 2 | 2011-01-01 02:00:00 | 5 | Sat |
test.loc[test["datetime-dayofweek"] == 0, "datetime-dayofweek(str)"] = "Mon"
test.loc[test["datetime-dayofweek"] == 1, "datetime-dayofweek(str)"] = "Tue"
test.loc[test["datetime-dayofweek"] == 2, "datetime-dayofweek(str)"] = "Wed"
test.loc[test["datetime-dayofweek"] == 3, "datetime-dayofweek(str)"] = "Thu"
test.loc[test["datetime-dayofweek"] == 4, "datetime-dayofweek(str)"] = "Fri"
test.loc[test["datetime-dayofweek"] == 5, "datetime-dayofweek(str)"] = "Sat"
test.loc[test["datetime-dayofweek"] == 6, "datetime-dayofweek(str)"] = "Sun"
test[["datetime", "datetime-dayofweek", "datetime-dayofweek(str)"]].head(3)
| | datetime | datetime-dayofweek | datetime-dayofweek(str) |
|---|---|---|---|
| 0 | 2011-01-20 00:00:00 | 3 | Thu |
| 1 | 2011-01-20 01:00:00 | 3 | Thu |
| 2 | 2011-01-20 02:00:00 | 3 | Thu |
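The fourteen `.loc` assignments above can be collapsed into one `Series.map` call per frame. A sketch (the `dayofweek_map` name and the small stand-in frame are mine; in the notebook this would be applied to `train` and `test` directly):

```python
import pandas as pd

# Lookup table for pandas' Monday=0 ... Sunday=6 dayofweek convention
dayofweek_map = {0: "Mon", 1: "Tue", 2: "Wed", 3: "Thu", 4: "Fri", 5: "Sat", 6: "Sun"}

# Stand-in for train/test; equivalent to the seven .loc assignments above
df = pd.DataFrame({"datetime-dayofweek": [5, 5, 3]})
df["datetime-dayofweek(str)"] = df["datetime-dayofweek"].map(dayofweek_map)
print(df["datetime-dayofweek(str)"].tolist())  # ['Sat', 'Sat', 'Thu']
```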
f, ax = plt.subplots(nrows=2, ncols=3, figsize=(18,10))
sns.barplot(data=train, x="datetime-year", y="count", ax=ax[0][0])
ax[0][0].set_title("Annual rental count")
sns.barplot(data=train, x="datetime-month", y="count", ax=ax[0][1])
ax[0][1].set_title("Monthly rent count")
sns.barplot(data=train, x="datetime-day", y="count", ax=ax[0][2])
ax[0][2].set_title("Daily rent count")
sns.barplot(data=train, x="datetime-hour", y="count", ax=ax[1][0])
ax[1][0].set_title("Rent count per hour")
sns.barplot(data=train, x="datetime-minute", y="count", ax=ax[1][1])
ax[1][1].set_title("Rent count per minute")
sns.barplot(data=train, x="datetime-second", y="count", ax=ax[1][2])
ax[1][2].set_title("Rent count per second")
plt.show()
ㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡ
The plots show count by year, month, day, hour, minute, and second. The useful columns will be examined in more detail below; here the goal is to spot columns we can discard. Minute and second are always 0 in this hourly data, and day carries no usable signal because of how the competition splits the data (train covers days 1-19 of each month, test day 20 onward). So day, minute, and second are best excluded from the features.
ㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡ
f, ax = plt.subplots(nrows=1, ncols=3, figsize=(18, 5))
sns.barplot(data=train, x="datetime-year", y="casual", ax=ax[0])
ax[0].set_title("Annual rental count: Non-member")
sns.barplot(data=train, x="datetime-year", y="registered", ax=ax[1])
ax[1].set_title("Annual rental count: Members")
sns.barplot(data=train, x="datetime-year", y="count", ax=ax[2])
ax[2].set_title("Annual rental count")
plt.show()
ㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡ
These plots show casual, registered, and count by year. Both non-member and member rentals increased from 2011 to 2012.
From this we can infer that the bike-sharing service grew between 2011 and 2012.
ㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡ
f, ax = plt.subplots(nrows=1, ncols=3, figsize=(18, 5))
sns.barplot(data=train, x="datetime-month", y="casual", ax=ax[0])
ax[0].set_title("Monthly rent count: Non-member")
sns.barplot(data=train, x="datetime-month", y="registered", ax=ax[1])
ax[1].set_title("Monthly rent count: Member")
sns.barplot(data=train, x="datetime-month", y="count", ax=ax[2])
ax[2].set_title("Monthly rent count")
f, ax = plt.subplots(nrows=1, ncols=2, figsize=(18, 5))
sns.barplot(data=train, x="datetime-month", y="temp", ax=ax[0])
ax[0].set_title("Monthly temperature")
sns.barplot(data=train, x="datetime-month", y="atemp", ax=ax[1])
ax[1].set_title("Monthly sensible temparature")
plt.show()
ㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡ
The top three plots show casual, registered, and count by month. Rentals peak from June through September and bottom out in January and February, which suggests a link with temperature.
The bottom two plots show monthly temperature and apparent temperature. Following up on that guess, the temperature curves mostly match the rental curves, with one oddity: the temperature difference between January and December is small, yet December has far more rentals.
Monthly rentals are clearly related to temperature, but why December is so high needs a closer look.
ㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡ
train["year-month"] = train["datetime-year"].astype('str') + "-" + train["datetime-month"].astype('str')
train[["datetime", "year-month"]].head(3)
| | datetime | year-month |
|---|---|---|
| 0 | 2011-01-01 00:00:00 | 2011-1 |
| 1 | 2011-01-01 01:00:00 | 2011-1 |
| 2 | 2011-01-01 02:00:00 | 2011-1 |
plt.figure(figsize=(18, 5))
sns.barplot(data=train, x="year-month", y="count").set_title("Year-Month rent count")
plt.show()
ㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡ
Combining the year and month columns gives the rental count per year-month. In the monthly plot above it seemed strange that January and December, both cold winter months with similar temperatures, differed so much in rentals.
Here we see that December 2011 and January 2012 differ little; the large gap is between January 2011 and December 2012, which, as the year plot already suggested, reflects the growth of the service and its user base.
So December does not inherently outrent January; the monthly gap is an artifact of the company's growth.
ㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡ
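One caveat with the string concatenation used for year-month: it produces labels like "2011-1" and "2011-10" which, if ever sorted as strings, put "2011-10" before "2011-2". A sketch of a zero-padded alternative using `dt.strftime` (the Series `s` is illustrative, not from the notebook):

```python
import pandas as pd

# Zero-padded "%Y-%m" labels sort in true chronological order
s = pd.Series(pd.to_datetime(["2011-01-01", "2011-10-01", "2011-02-01"]))
labels = s.dt.strftime("%Y-%m")
print(sorted(labels))  # ['2011-01', '2011-02', '2011-10']
```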
f, ax = plt.subplots(nrows=5, ncols=1, figsize=(18, 25))
plt.subplots_adjust(hspace = 0.3)
sns.pointplot(data=train, x="datetime-hour", y="casual", ax=ax[0])
ax[0].set_title("Rent count per hour: Non-member")
sns.pointplot(data=train, x="datetime-hour", y="registered", ax=ax[1])
ax[1].set_title("Rent count per hour: Member")
sns.pointplot(data=train, x="datetime-hour", y="count", ax=ax[2])
ax[2].set_title("Rent count per hour")
sns.pointplot(data=train, x="datetime-hour", y="count", hue="workingday", ax=ax[3])
ax[3].set_title("Rent count per hour by workingday")
sns.pointplot(data=train, x="datetime-hour", y="count", hue="datetime-dayofweek(str)", ax=ax[4])
ax[4].set_title("Rent count per hour by dayofweek")
plt.show()
ㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡ
The top three plots show casual, registered, and count by hour. The member curve (registered) closely resembles count, while the non-member curve (casual) has a completely different shape.
Since casual and registered sum to count, and count follows registered, members must account for far more rentals than non-members.
The fourth plot splits the hourly count by working day. On working days the curve resembles the member pattern; on non-working days it resembles the non-member pattern. So members rent mainly on working days and non-members on days off.
The last plot splits the hourly count by day of week. Monday through Friday, the usual working days, resemble the member and working-day curves, while Saturday and Sunday resemble the non-member and non-working-day curves.
In short, hourly rentals differ between members and non-members, and this is closely tied to working days and weekdays: members rent mostly on working days (Mon-Fri) and non-members on non-working days (Sat-Sun).
ㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡ
f, ax = plt.subplots(nrows=1, ncols=3, figsize=(18, 5))
sns.barplot(data=train, x="season", y="count", ax=ax[0])
ax[0].set_title("Seasonal rent count")
sns.barplot(data=train, x="season", y="temp", ax=ax[1])
ax[1].set_title("Seasonal temperature")
sns.barplot(data=train, x="season", y="atemp", ax=ax[2])
ax[2].set_title("Seasonal sensible temperature")
f, ax = plt.subplots(nrows=1, ncols=1, figsize=(18, 5))
sns.pointplot(data=train, x="datetime-month", y="count", hue="season", ax=ax)
ax.set_title("Monthly rent count: season")
plt.show()
ㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡ
The top plots show rentals, temperature, and apparent temperature by season. Rentals rank fall > summer > winter > spring. Intuitively spring should beat winter, but the seasonal temperature plot shows spring as colder than winter in this data.
The seasonal rental plot closely mirrors the seasonal temperature and apparent-temperature plots, so seasonal changes in rentals are tightly linked to both.
The bottom plot breaks monthly rentals down by season, revealing that the seasons are labeled January-March as spring (1), April-June as summer (2), July-September as fall (3), and October-December as winter (4).
Spring being colder than winter and fall hotter than summer is a bit odd, but rentals clearly differ by season, so the column should still be a meaningful feature.
Season, temperature, and apparent temperature are closely related, and all of them influence rentals.
ㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡ
f, ax = plt.subplots(nrows=2, ncols=1, figsize=(18, 10))
sns.pointplot(data=train, x="datetime-hour", y="count", hue="holiday", ax=ax[0])
ax[0].set_title("Rent count per hour by holiday")
sns.pointplot(data=train, x="datetime-hour", y="count", hue="workingday", ax=ax[1])
ax[1].set_title("Rent count per hour by workingday")
plt.show()
ㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡ
These plots show hourly rentals split by holiday and by working day.
Holiday and working day are not exact opposites, but the plots show that holidays behave like non-working days and non-holidays like working days.
So holiday, too, is a factor that affects rentals.
ㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡ
f, ax = plt.subplots(nrows=1, ncols=2, figsize=(18, 5))
sns.distplot(train["humidity"], ax=ax[0])
ax[0].set_title("humidity distribution")
sns.distplot(train["windspeed"], ax=ax[1])
ax[1].set_title("windspeed distribution")
plt.show()
ㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡ
The left plot shows the distribution of humidity: it has relatively few conspicuous values and is spread evenly in a roughly normal shape.
The right plot shows the distribution of windspeed: apart from the spike at 0 (days with no wind), the distribution is fairly even.
For both columns, the important steps will be to check for conspicuous outliers and to compare model performance with each column included and excluded.
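The include/exclude comparison can be done by cross-validating the same model on different feature subsets. A minimal sketch on synthetic stand-in data (in the notebook this would use `x_train`, the log labels, and the `rmse_score` scorer defined later; every name and value here is illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; the real comparison would use the actual frames
rng = np.random.RandomState(0)
X = pd.DataFrame({
    "temp": rng.uniform(0, 40, 300),
    "humidity": rng.uniform(0, 100, 300),
    "windspeed": rng.uniform(0, 50, 300),
})
y = 0.5 * X["temp"] + rng.normal(0, 1, 300)

model = RandomForestRegressor(n_estimators=20, random_state=37)
for features in (["temp", "humidity", "windspeed"], ["temp", "humidity"]):
    # Lower RMSE is better; compare the two subsets
    mse = -cross_val_score(model, X[features], y, cv=3,
                           scoring="neg_mean_squared_error").mean()
    print(features, round(float(np.sqrt(mse)), 4))
```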
f, ax = plt.subplots(nrows=1, ncols=2, figsize=(18, 5))
sns.countplot(data=train, x="weather", ax=ax[0])
ax[0].set_title("Amount of data by weather")
sns.barplot(data=train, x="weather", y="count", ax=ax[1])
ax[1].set_title("Rent count by weather")
plt.show()
print("Rows in train with weather == 1:", train[train["weather"] == 1].shape[0])
print("Rows in train with weather == 2:", train[train["weather"] == 2].shape[0])
print("Rows in train with weather == 3:", train[train["weather"] == 3].shape[0])
print("Rows in train with weather == 4:", train[train["weather"] == 4].shape[0])
Rows in train with weather == 1: 7192
Rows in train with weather == 2: 2834
Rows in train with weather == 3: 859
Rows in train with weather == 4: 1
ㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡ
The left plot shows the amount of data per weather category. Weather closer to 1 means better weather and closer to 4 means worse; category 1 is overwhelmingly common and the rest are much rarer.
Still, categories 2 and 3 are not tiny relative to the size of the dataset. The real issue is weather 4, which appears exactly once. But with a single row out of more than ten thousand, it is unlikely to have much effect.
The right plot shows rentals by weather. Better weather generally means more rentals, but the worst weather also shows a high rental count; as the left plot showed, that bar is an average over the single weather-4 row.
Weather 4 is a conspicuous outlier, but since it occurs only once in the whole dataset, it can safely be ignored.
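The four prints above can be replaced by a single `value_counts` call. A sketch, with the printed counts reproduced in a stand-in Series (in the notebook this would be `train["weather"].value_counts()`):

```python
import pandas as pd

# Stand-in for train["weather"], reproducing the counts printed above
weather = pd.Series([1] * 7192 + [2] * 2834 + [3] * 859 + [4] * 1, name="weather")
counts = weather.value_counts().sort_index()
print(counts.to_dict())  # {1: 7192, 2: 2834, 3: 859, 4: 1}
```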
train_season = pd.get_dummies(train['season'], prefix='season')
train = pd.concat([train, train_season],axis=1)
train.head(3)
| | datetime | season | holiday | workingday | weather | temp | atemp | humidity | windspeed | casual | ... | datetime-hour | datetime-minute | datetime-second | datetime-dayofweek | datetime-dayofweek(str) | year-month | season_1 | season_2 | season_3 | season_4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011-01-01 00:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 81 | 0.0 | 3 | ... | 0 | 0 | 0 | 5 | Sat | 2011-1 | 1 | 0 | 0 | 0 |
| 1 | 2011-01-01 01:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 8 | ... | 1 | 0 | 0 | 5 | Sat | 2011-1 | 1 | 0 | 0 | 0 |
| 2 | 2011-01-01 02:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 5 | ... | 2 | 0 | 0 | 5 | Sat | 2011-1 | 1 | 0 | 0 | 0 |

3 rows × 28 columns
test_season = pd.get_dummies(test['season'], prefix='season')
test = pd.concat([test, test_season],axis=1)
test.head(3)
| | datetime | season | holiday | workingday | weather | temp | atemp | humidity | windspeed | datetime-year | ... | datetime-day | datetime-hour | datetime-minute | datetime-second | datetime-dayofweek | datetime-dayofweek(str) | season_1 | season_2 | season_3 | season_4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011-01-20 00:00:00 | 1 | 0 | 1 | 1 | 10.66 | 11.365 | 56 | 26.0027 | 2011 | ... | 20 | 0 | 0 | 0 | 3 | Thu | 1 | 0 | 0 | 0 |
| 1 | 2011-01-20 01:00:00 | 1 | 0 | 1 | 1 | 10.66 | 13.635 | 56 | 0.0000 | 2011 | ... | 20 | 1 | 0 | 0 | 3 | Thu | 1 | 0 | 0 | 0 |
| 2 | 2011-01-20 02:00:00 | 1 | 0 | 1 | 1 | 10.66 | 13.635 | 56 | 0.0000 | 2011 | ... | 20 | 2 | 0 | 0 | 3 | Thu | 1 | 0 | 0 | 0 |

3 rows × 21 columns
train_weather = pd.get_dummies(train['weather'], prefix='weather')
train = pd.concat([train, train_weather],axis=1)
train.head(3)
| | datetime | season | holiday | workingday | weather | temp | atemp | humidity | windspeed | casual | ... | datetime-dayofweek(str) | year-month | season_1 | season_2 | season_3 | season_4 | weather_1 | weather_2 | weather_3 | weather_4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011-01-01 00:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 81 | 0.0 | 3 | ... | Sat | 2011-1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 2011-01-01 01:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 8 | ... | Sat | 2011-1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 2011-01-01 02:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 5 | ... | Sat | 2011-1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |

3 rows × 32 columns
test_weather = pd.get_dummies(test['weather'], prefix='weather')
test = pd.concat([test, test_weather],axis=1)
test.head(3)
| | datetime | season | holiday | workingday | weather | temp | atemp | humidity | windspeed | datetime-year | ... | datetime-dayofweek | datetime-dayofweek(str) | season_1 | season_2 | season_3 | season_4 | weather_1 | weather_2 | weather_3 | weather_4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011-01-20 00:00:00 | 1 | 0 | 1 | 1 | 10.66 | 11.365 | 56 | 26.0027 | 2011 | ... | 3 | Thu | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 2011-01-20 01:00:00 | 1 | 0 | 1 | 1 | 10.66 | 13.635 | 56 | 0.0000 | 2011 | ... | 3 | Thu | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 2011-01-20 02:00:00 | 1 | 0 | 1 | 1 | 10.66 | 13.635 | 56 | 0.0000 | 2011 | ... | 3 | Thu | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |

3 rows × 25 columns
train_dayofweek = pd.get_dummies(train["datetime-dayofweek(str)"], prefix = "dayofweek")
train = pd.concat([train, train_dayofweek], axis = 1)
train.head(3)
| | datetime | season | holiday | workingday | weather | temp | atemp | humidity | windspeed | casual | ... | weather_2 | weather_3 | weather_4 | dayofweek_Fri | dayofweek_Mon | dayofweek_Sat | dayofweek_Sun | dayofweek_Thu | dayofweek_Tue | dayofweek_Wed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011-01-01 00:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 81 | 0.0 | 3 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | 2011-01-01 01:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 8 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 2011-01-01 02:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 5 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |

3 rows × 39 columns
test_dayofweek = pd.get_dummies(test["datetime-dayofweek(str)"], prefix = "dayofweek")
test = pd.concat([test, test_dayofweek], axis = 1)
test.head(3)
| | datetime | season | holiday | workingday | weather | temp | atemp | humidity | windspeed | datetime-year | ... | weather_2 | weather_3 | weather_4 | dayofweek_Fri | dayofweek_Mon | dayofweek_Sat | dayofweek_Sun | dayofweek_Thu | dayofweek_Tue | dayofweek_Wed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011-01-20 00:00:00 | 1 | 0 | 1 | 1 | 10.66 | 11.365 | 56 | 26.0027 | 2011 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 2011-01-20 01:00:00 | 1 | 0 | 1 | 1 | 10.66 | 13.635 | 56 | 0.0000 | 2011 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 2011-01-20 02:00:00 | 1 | 0 | 1 | 1 | 10.66 | 13.635 | 56 | 0.0000 | 2011 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |

3 rows × 32 columns
feature_list = ["holiday", "workingday", "temp", "atemp", "humidity", "windspeed", "datetime-year", "datetime-hour"]
feature_list = feature_list + list(train_dayofweek.columns)
feature_list = feature_list + list(train_weather.columns)
feature_list = feature_list + list(train_season.columns)
feature_list
['holiday', 'workingday', 'temp', 'atemp', 'humidity', 'windspeed', 'datetime-year', 'datetime-hour', 'dayofweek_Fri', 'dayofweek_Mon', 'dayofweek_Sat', 'dayofweek_Sun', 'dayofweek_Thu', 'dayofweek_Tue', 'dayofweek_Wed', 'weather_1', 'weather_2', 'weather_3', 'weather_4', 'season_1', 'season_2', 'season_3', 'season_4']
x_train = train[feature_list]
x_train.head(3)
| | holiday | workingday | temp | atemp | humidity | windspeed | datetime-year | datetime-hour | dayofweek_Fri | dayofweek_Mon | ... | dayofweek_Tue | dayofweek_Wed | weather_1 | weather_2 | weather_3 | weather_4 | season_1 | season_2 | season_3 | season_4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 9.84 | 14.395 | 81 | 0.0 | 2011 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 0 | 0 | 9.02 | 13.635 | 80 | 0.0 | 2011 | 1 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 0 | 0 | 9.02 | 13.635 | 80 | 0.0 | 2011 | 2 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |

3 rows × 23 columns
x_test = test[feature_list]
x_test.head(3)
| | holiday | workingday | temp | atemp | humidity | windspeed | datetime-year | datetime-hour | dayofweek_Fri | dayofweek_Mon | ... | dayofweek_Tue | dayofweek_Wed | weather_1 | weather_2 | weather_3 | weather_4 | season_1 | season_2 | season_3 | season_4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 10.66 | 11.365 | 56 | 26.0027 | 2011 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 0 | 1 | 10.66 | 13.635 | 56 | 0.0000 | 2011 | 1 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 0 | 1 | 10.66 | 13.635 | 56 | 0.0000 | 2011 | 2 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |

3 rows × 23 columns
y_train_c = train["casual_log"]
y_train_c.head(3)
0    1.386294
1    2.197225
2    1.791759
Name: casual_log, dtype: float64
y_train_r = train["registered_log"]
y_train_r.head(3)
0    2.639057
1    3.496508
2    3.332205
Name: registered_log, dtype: float64
from sklearn.metrics import make_scorer
def rmse(predict, actual):
    predict = np.array(predict)
    actual = np.array(actual)
    distance = predict - actual
    square_distance = distance ** 2
    mean_square_distance = square_distance.mean()
    score = np.sqrt(mean_square_distance)
    return score
rmse_score = make_scorer(rmse)
rmse_score
make_scorer(rmse)
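Note that `make_scorer` is called with its default `greater_is_better=True`, so `cross_val_score` will return the raw RMSE values and lower is better, which is why the results are sorted ascending below. A quick sanity check of the `rmse` helper against scikit-learn's `mean_squared_error` (the sample values are mine):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

def rmse(predict, actual):
    predict = np.array(predict)
    actual = np.array(actual)
    return np.sqrt(((predict - actual) ** 2).mean())

# The helper should agree with the square root of scikit-learn's MSE
p, a = [2.5, 0.0, 2.0], [3.0, -0.5, 2.0]
assert np.isclose(rmse(p, a), np.sqrt(mean_squared_error(a, p)))
```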
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
n_estimators = 300
num_epoch = 100
coarse_hyperparameters_list = []
for epoch in range(num_epoch):
    max_depth = np.random.randint(low=2, high=100)
    max_features = np.random.uniform(low=0.1, high=1.0)
    model = RandomForestRegressor(n_estimators=n_estimators,
                                  max_depth=max_depth,
                                  max_features=max_features,
                                  n_jobs=-1,
                                  random_state=37)
    score = cross_val_score(model, x_train, y_train_c, cv=20, scoring=rmse_score).mean()
    hyperparameters = {
        'epoch': epoch,
        'score': score,
        'n_estimators': n_estimators,
        'max_depth': max_depth,
        'max_features': max_features,
    }
    coarse_hyperparameters_list.append(hyperparameters)
    print(f"{epoch:2} n_estimators = {n_estimators}, max_depth = {max_depth:2}, max_features = {max_features:.6f}, Score = {score:.5f}")
coarse_hyperparameters_list = pd.DataFrame.from_dict(coarse_hyperparameters_list)
coarse_hyperparameters_list = coarse_hyperparameters_list.sort_values(by="score")
print(coarse_hyperparameters_list.shape)
coarse_hyperparameters_list.head(10)
 0 n_estimators = 300, max_depth = 16, max_features = 0.815879, Score = 0.55552
 1 n_estimators = 300, max_depth = 33, max_features = 0.363584, Score = 0.55461
 2 n_estimators = 300, max_depth = 94, max_features = 0.935101, Score = 0.55978
...
98 n_estimators = 300, max_depth = 95, max_features = 0.343703, Score = 0.56110
99 n_estimators = 300, max_depth = 50, max_features = 0.322365, Score = 0.56110
(100, 5)
epoch | max_depth | max_features | n_estimators | score | |
---|---|---|---|---|---|
49 | 49 | 68 | 0.565145 | 300 | 0.550381 |
72 | 72 | 85 | 0.564755 | 300 | 0.550381 |
53 | 53 | 81 | 0.529706 | 300 | 0.550381 |
79 | 79 | 56 | 0.550496 | 300 | 0.550381 |
16 | 16 | 76 | 0.557047 | 300 | 0.550381 |
15 | 15 | 74 | 0.525012 | 300 | 0.550381 |
22 | 22 | 93 | 0.527166 | 300 | 0.550381 |
81 | 81 | 16 | 0.565649 | 300 | 0.550933 |
34 | 34 | 25 | 0.566416 | 300 | 0.550995 |
76 | 76 | 93 | 0.567974 | 300 | 0.551382 |
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
n_estimators = 300
num_epoch = 100
finer_hyperparameters_list = []
for epoch in range(num_epoch):
    max_depth = np.random.randint(low=56, high=94)
    max_features = np.random.uniform(low=0.525012, high=0.565145)
    model = RandomForestRegressor(n_estimators=n_estimators,
                                  max_depth=max_depth,
                                  max_features=max_features,
                                  n_jobs=-1,
                                  random_state=37)
    score = cross_val_score(model, x_train, y_train_c, cv=20, scoring=rmse_score).mean()
    hyperparameters = {
        'epoch': epoch,
        'score': score,
        'n_estimators': n_estimators,
        'max_depth': max_depth,
        'max_features': max_features,
    }
    finer_hyperparameters_list.append(hyperparameters)
    print(f"{epoch:2} n_estimators = {n_estimators}, max_depth = {max_depth:2}, max_features = {max_features:.6f}, Score = {score:.5f}")
finer_hyperparameters_list = pd.DataFrame.from_dict(finer_hyperparameters_list)
finer_hyperparameters_list = finer_hyperparameters_list.sort_values(by="score")
print(finer_hyperparameters_list.shape)
finer_hyperparameters_list.head(10)
 0 n_estimators = 300, max_depth = 62, max_features = 0.547745, Score = 0.55038
 1 n_estimators = 300, max_depth = 57, max_features = 0.553065, Score = 0.55038
 2 n_estimators = 300, max_depth = 69, max_features = 0.532412, Score = 0.55038
...
99 n_estimators = 300, max_depth = 59, max_features = 0.535109, Score = 0.55038
(100, 5)
epoch | max_depth | max_features | n_estimators | score | |
---|---|---|---|---|---|
0 | 0 | 62 | 0.547745 | 300 | 0.550381 |
72 | 72 | 70 | 0.545285 | 300 | 0.550381 |
71 | 71 | 88 | 0.546806 | 300 | 0.550381 |
70 | 70 | 92 | 0.553062 | 300 | 0.550381 |
69 | 69 | 90 | 0.544063 | 300 | 0.550381 |
68 | 68 | 77 | 0.549623 | 300 | 0.550381 |
67 | 67 | 63 | 0.525115 | 300 | 0.550381 |
66 | 66 | 82 | 0.560943 | 300 | 0.550381 |
65 | 65 | 85 | 0.530366 | 300 | 0.550381 |
64 | 64 | 88 | 0.527318 | 300 | 0.550381 |
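The hand-rolled loop above can also be expressed with scikit-learn's `RandomizedSearchCV`, which samples hyperparameters from distributions and cross-validates in one call. A minimal sketch on synthetic data — the `make_regression` features and the inline RMSE scorer stand in for the notebook's `x_train`, `y_train_c`, and `rmse_score`, which are defined in an earlier section:

```python
import numpy as np
from scipy.stats import randint, uniform
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-ins for x_train / y_train_c from the notebook.
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=37)

# RMSE scorer; greater_is_better=False makes the search minimize RMSE.
rmse = make_scorer(lambda yt, yp: np.sqrt(mean_squared_error(yt, yp)),
                   greater_is_better=False)

search = RandomizedSearchCV(
    RandomForestRegressor(n_estimators=50, random_state=37, n_jobs=-1),
    param_distributions={
        "max_depth": randint(2, 100),       # same range as the coarse search
        "max_features": uniform(0.1, 0.9),  # uniform on [0.1, 1.0]
    },
    n_iter=10, cv=3, scoring=rmse, random_state=37, n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_)
```

The trade-off is convenience versus visibility: the manual loop prints every trial as it runs, while `RandomizedSearchCV` keeps the per-trial records in `search.cv_results_`.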
best_hyperparameters = finer_hyperparameters_list.iloc[0]
best_max_depth = int(best_hyperparameters["max_depth"])  # cast back to int; the DataFrame row stores it as float
best_max_features = best_hyperparameters["max_features"]
print(f"max_depth(best) = {best_max_depth}, max_features(best) = {best_max_features:.6f}")
max_depth(best) = 62, max_features(best) = 0.547745
best_n_estimators = 3000
model = RandomForestRegressor(n_estimators=best_n_estimators,
                              max_depth=best_max_depth,
                              max_features=best_max_features,
                              random_state=37,
                              n_jobs=-1)
model
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=62.0, max_features=0.5477452272116831, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=3000, n_jobs=-1, oob_score=False, random_state=37, verbose=0, warm_start=False)
model.fit(x_train, y_train_c)
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=62.0, max_features=0.5477452272116831, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=3000, n_jobs=-1, oob_score=False, random_state=37, verbose=0, warm_start=False)
logC_predictions = model.predict(x_test)
print(logC_predictions.shape)
logC_predictions
(6493,)
array([0.83244801, 0.41858681, 0.48794546, ..., 1.41082069, 1.39423664, 1.10825634])
predictions_c = np.exp(logC_predictions) - 1
print(predictions_c.shape)
predictions_c
(6493,)
array([1.29893969, 0.51981226, 0.628966 , ..., 3.0993183 , 3.03189561, 2.02907212])
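The model was trained on `log(casual + 1)`, so `np.exp(pred) - 1` undoes the transform. NumPy's `log1p`/`expm1` pair computes exactly the same thing with better numerical accuracy near zero; a quick check of the round trip:

```python
import numpy as np

counts = np.array([0.0, 3.0, 16.0, 977.0])

# Forward transform used on the training targets: log(count + 1).
logged = np.log1p(counts)            # same as np.log(counts + 1)

# Inverse transform applied to the predictions: exp(pred) - 1.
restored = np.expm1(logged)          # same as np.exp(logged) - 1

print(np.allclose(restored, counts))  # the round trip recovers the counts
```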
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
n_estimators = 300
num_epoch = 100
coarse_hyperparameters_list2 = []
for epoch in range(num_epoch):
    max_depth = np.random.randint(low=2, high=100)
    max_features = np.random.uniform(low=0.1, high=1.0)
    model = RandomForestRegressor(n_estimators=n_estimators,
                                  max_depth=max_depth,
                                  max_features=max_features,
                                  n_jobs=-1,
                                  random_state=37)
    score = cross_val_score(model, x_train, y_train_r, cv=20, scoring=rmse_score).mean()
    hyperparameters = {
        'epoch': epoch,
        'score': score,
        'n_estimators': n_estimators,
        'max_depth': max_depth,
        'max_features': max_features,
    }
    coarse_hyperparameters_list2.append(hyperparameters)
    print(f"{epoch:2} n_estimators = {n_estimators}, max_depth = {max_depth:2}, max_features = {max_features:.6f}, Score = {score:.5f}")
coarse_hyperparameters_list2 = pd.DataFrame.from_dict(coarse_hyperparameters_list2)
coarse_hyperparameters_list2 = coarse_hyperparameters_list2.sort_values(by="score")
print(coarse_hyperparameters_list2.shape)
coarse_hyperparameters_list2.head(10)
 0 n_estimators = 300, max_depth = 55, max_features = 0.741497, Score = 0.34181
 1 n_estimators = 300, max_depth = 6, max_features = 0.671060, Score = 0.53658
 2 n_estimators = 300, max_depth = 22, max_features = 0.470221, Score = 0.36214
...
99 n_estimators = 300, max_depth = 72, max_features = 0.921403, Score = 0.34524
(100, 5)
epoch | max_depth | max_features | n_estimators | score | |
---|---|---|---|---|---|
93 | 93 | 96 | 0.822352 | 300 | 0.341292 |
92 | 92 | 97 | 0.807412 | 300 | 0.341292 |
70 | 70 | 60 | 0.820772 | 300 | 0.341292 |
22 | 22 | 66 | 0.785142 | 300 | 0.341292 |
94 | 94 | 45 | 0.726982 | 300 | 0.341488 |
13 | 13 | 60 | 0.709664 | 300 | 0.341488 |
52 | 52 | 30 | 0.750781 | 300 | 0.341713 |
24 | 24 | 42 | 0.757356 | 300 | 0.341813 |
34 | 34 | 92 | 0.759502 | 300 | 0.341813 |
98 | 98 | 58 | 0.747926 | 300 | 0.341813 |
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
n_estimators = 300
num_epoch = 100
finer_hyperparameters_list2 = []
for epoch in range(num_epoch):
    max_depth = np.random.randint(low=60, high=98)
    max_features = np.random.uniform(low=0.785142, high=0.822352)
    model = RandomForestRegressor(n_estimators=n_estimators,
                                  max_depth=max_depth,
                                  max_features=max_features,
                                  n_jobs=-1,
                                  random_state=37)
    score = cross_val_score(model, x_train, y_train_r, cv=20, scoring=rmse_score).mean()
    hyperparameters = {
        'epoch': epoch,
        'score': score,
        'n_estimators': n_estimators,
        'max_depth': max_depth,
        'max_features': max_features,
    }
    finer_hyperparameters_list2.append(hyperparameters)
    print(f"{epoch:2} n_estimators = {n_estimators}, max_depth = {max_depth:2}, max_features = {max_features:.6f}, Score = {score:.5f}")
finer_hyperparameters_list2 = pd.DataFrame.from_dict(finer_hyperparameters_list2)
finer_hyperparameters_list2 = finer_hyperparameters_list2.sort_values(by="score")
print(finer_hyperparameters_list2.shape)
finer_hyperparameters_list2.head(10)
 0 n_estimators = 300, max_depth = 63, max_features = 0.791589, Score = 0.34129
 1 n_estimators = 300, max_depth = 95, max_features = 0.801994, Score = 0.34129
 2 n_estimators = 300, max_depth = 94, max_features = 0.814558, Score = 0.34129
...
99 n_estimators = 300, max_depth = 68, max_features = 0.820232, Score = 0.34129
(100, 5)
epoch | max_depth | max_features | n_estimators | score | |
---|---|---|---|---|---|
0 | 0 | 63 | 0.791589 | 300 | 0.341292 |
72 | 72 | 64 | 0.790990 | 300 | 0.341292 |
71 | 71 | 96 | 0.791569 | 300 | 0.341292 |
70 | 70 | 95 | 0.814474 | 300 | 0.341292 |
69 | 69 | 97 | 0.797391 | 300 | 0.341292 |
68 | 68 | 75 | 0.792803 | 300 | 0.341292 |
67 | 67 | 84 | 0.822164 | 300 | 0.341292 |
66 | 66 | 81 | 0.811274 | 300 | 0.341292 |
65 | 65 | 72 | 0.804726 | 300 | 0.341292 |
64 | 64 | 76 | 0.808495 | 300 | 0.341292 |
best_hyperparameters = finer_hyperparameters_list2.iloc[0]
best_max_depth = int(best_hyperparameters["max_depth"])  # cast back to int; the DataFrame row stores it as float
best_max_features = best_hyperparameters["max_features"]
print(f"max_depth(best) = {best_max_depth}, max_features(best) = {best_max_features:.6f}")
max_depth(best) = 63, max_features(best) = 0.791589
best_n_estimators = 3000
model = RandomForestRegressor(n_estimators=best_n_estimators,
                              max_depth=best_max_depth,
                              max_features=best_max_features,
                              random_state=37,
                              n_jobs=-1)
model
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=63.0, max_features=0.7915885049682436, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=3000, n_jobs=-1, oob_score=False, random_state=37, verbose=0, warm_start=False)
model.fit(x_train, y_train_r)
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=63.0, max_features=0.7915885049682436, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=3000, n_jobs=-1, oob_score=False, random_state=37, verbose=0, warm_start=False)
logR_predictions = model.predict(x_test)
print(logR_predictions.shape)
logR_predictions
(6493,)
array([2.36580644, 1.70954977, 1.04493393, ..., 4.60040191, 4.53352732, 3.8785016 ])
predictions_r = np.exp(logR_predictions) - 1
print(predictions_r.shape)
predictions_r
(6493,)
array([ 9.65262609, 4.52647275, 1.84321065, ..., 98.52430758, 92.08632794, 47.35171066])
predictions = predictions_c + predictions_r
predictions
array([ 10.95156578, 5.04628501, 2.47217666, ..., 101.62362588, 95.11822355, 49.38078278])
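Because `casual` and `registered` were modeled separately, the final count is simply their sum. One detail worth guarding against (an addition, not part of the original notebook): a random forest averaging non-negative log counts cannot predict below zero, but models that can (linear regression, for example) would make `exp(pred) - 1` negative, and the competition's RMSLE metric cannot score negative counts. Clipping at zero is a cheap safeguard; the numbers below are made up for illustration:

```python
import numpy as np

# Hypothetical per-model predictions on the original count scale.
predictions_c = np.array([1.3, 0.5, 0.1])   # casual riders
predictions_r = np.array([9.7, 4.5, -0.3])  # registered riders (one negative for illustration)

# Sum the two components, then clip: RMSLE is undefined for negative counts.
predictions = np.clip(predictions_c + predictions_r, 0, None)
print(predictions)
```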
submission = pd.read_csv("submit.csv")
submission.head()
datetime | count | |
---|---|---|
0 | 2011-01-20 00:00:00 | 0 |
1 | 2011-01-20 01:00:00 | 0 |
2 | 2011-01-20 02:00:00 | 0 |
3 | 2011-01-20 03:00:00 | 0 |
4 | 2011-01-20 04:00:00 | 0 |
submission["count"] = predictions
submission.head()
datetime | count | |
---|---|---|
0 | 2011-01-20 00:00:00 | 10.951566 |
1 | 2011-01-20 01:00:00 | 5.046285 |
2 | 2011-01-20 02:00:00 | 2.472177 |
3 | 2011-01-20 03:00:00 | 2.313137 |
4 | 2011-01-20 04:00:00 | 2.121748 |
submission.to_csv("bike-038205.csv", index=False)
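A quick sanity check before uploading never hurts. A sketch of the idea on an in-memory stand-in (in the notebook this would be `pd.read_csv` on the file just saved, compared against `test.shape[0]`, i.e. 6493):

```python
import pandas as pd

# Stand-in for the saved submission file, three rows for illustration.
submission = pd.DataFrame({
    "datetime": ["2011-01-20 00:00:00", "2011-01-20 01:00:00", "2011-01-20 02:00:00"],
    "count": [10.951566, 5.046285, 2.472177],
})

expected_rows = 3  # in the notebook: test.shape[0]

assert submission.shape[0] == expected_rows   # one prediction per test row
assert submission["count"].notna().all()      # no missing predictions
assert (submission["count"] >= 0).all()       # RMSLE needs non-negative counts
print("submission looks good")
```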