Problem link: https://www.kaggle.com/c/bike-sharing-demand
Bike Sharing Demand is a competition in which we predict how many bikes will be rented based on the given data.
If you have any questions or spot anything to correct while studying, please leave a comment on the blog.
I still have a lot to learn, but I will do my best to answer. Thank you, and happy studying!
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
plt.style.use('bmh')
%matplotlib inline
train = pd.read_csv("train.csv", parse_dates=["datetime"])
print(train.shape)
train.head(3)
(10886, 12)
| | datetime | season | holiday | workingday | weather | temp | atemp | humidity | windspeed | casual | registered | count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011-01-01 00:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 81 | 0.0 | 3 | 13 | 16 |
| 1 | 2011-01-01 01:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 8 | 32 | 40 |
| 2 | 2011-01-01 02:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 5 | 27 | 32 |
test = pd.read_csv("test.csv", parse_dates=["datetime"])
print(test.shape)
test.head(3)
(6493, 9)
| | datetime | season | holiday | workingday | weather | temp | atemp | humidity | windspeed |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011-01-20 00:00:00 | 1 | 0 | 1 | 1 | 10.66 | 11.365 | 56 | 26.0027 |
| 1 | 2011-01-20 01:00:00 | 1 | 0 | 1 | 1 | 10.66 | 13.635 | 56 | 0.0000 |
| 2 | 2011-01-20 02:00:00 | 1 | 0 | 1 | 1 | 10.66 | 13.635 | 56 | 0.0000 |
print("train data info")
print("-----------------------------------------------------------------------------------")
train.info()
train data info
-----------------------------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 12 columns):
datetime      10886 non-null datetime64[ns]
season        10886 non-null int64
holiday       10886 non-null int64
workingday    10886 non-null int64
weather       10886 non-null int64
temp          10886 non-null float64
atemp         10886 non-null float64
humidity      10886 non-null int64
windspeed     10886 non-null float64
casual        10886 non-null int64
registered    10886 non-null int64
count         10886 non-null int64
dtypes: datetime64[ns](1), float64(3), int64(8)
memory usage: 1020.6 KB
print("test data info")
print("-----------------------------------------------------------------------------------")
test.info()
test data info
-----------------------------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6493 entries, 0 to 6492
Data columns (total 9 columns):
datetime      6493 non-null datetime64[ns]
season        6493 non-null int64
holiday       6493 non-null int64
workingday    6493 non-null int64
weather       6493 non-null int64
temp          6493 non-null float64
atemp         6493 non-null float64
humidity      6493 non-null int64
windspeed     6493 non-null float64
dtypes: datetime64[ns](1), float64(3), int64(5)
memory usage: 456.6 KB
casual, registered, and count are the three columns present in the training data but absent from the test data. count is the target we need to predict, but it is simply the sum of rentals by non-members (casual) and members (registered).
Therefore, to reflect the characteristics of each group accurately, it seems better to predict casual and registered separately and then add the two predictions.
f, ax = plt.subplots(nrows=1, ncols=3, figsize=(18, 5))
sns.distplot(train["casual"], ax=ax[0])
ax[0].set_title("casual distribution")
sns.distplot(train["registered"], ax=ax[1])
ax[1].set_title("registered distribution")
sns.distplot(train["count"], ax=ax[2])
ax[2].set_title("count distribution")
plt.show()
f, ax = plt.subplots(nrows=1, ncols=3, figsize=(18, 5))
train["casual_log"] = np.log(train["casual"] + 1)
sns.distplot(train["casual_log"], ax=ax[0])
ax[0].set_title("casual_log distribution")
train["registered_log"] = np.log(train["registered"] + 1)
sns.distplot(train["registered_log"], ax=ax[1])
ax[1].set_title("registered_log distribution")
train["count_log"] = np.log(train["count"] + 1)
sns.distplot(train["count_log"], ax=ax[2])
ax[2].set_title("count_log distribution")
plt.show()
print("Number of rows where casual == 0")
print(train[train["casual"] == 0].shape)
Number of rows where casual == 0
(986, 15)
ㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡ
The upper plots show the distributions of casual, registered, and count. In all three, hours with zero or very few rentals are most common and values below 100 dominate, yet some hours approach 1,000 rentals, so the distributions are heavily right-skewed.
The lower plots show the same distributions after applying a log transform; they look much closer to normal. The competition is evaluated with RMSLE:

$$\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\log(p_i+1)-\log(a_i+1)\right)^2}$$

From this we can infer that when using casual and registered as labels, training on the log-transformed values should help.
ㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡ
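The metric above can be written as a small helper (the function name `rmsle` and the sample values below are mine, not from the notebook). Because the labels are stored as `np.log(x + 1)` (i.e. `np.log1p`), minimizing plain RMSE on that scale is the same as minimizing RMSLE on the original counts:

```python
import numpy as np

def rmsle(predicted, actual):
    """Root Mean Squared Logarithmic Error: RMSE of log1p-transformed values."""
    predicted, actual = np.asarray(predicted), np.asarray(actual)
    return np.sqrt(np.mean((np.log1p(predicted) - np.log1p(actual)) ** 2))

# A perfect prediction scores 0; errors on large counts are damped by the log
print(rmsle([10, 100], [10, 100]))  # 0.0
```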
train["datetime-year"] = train["datetime"].dt.year
train["datetime-month"] = train["datetime"].dt.month
train["datetime-day"] = train["datetime"].dt.day
train["datetime-hour"] = train["datetime"].dt.hour
train["datetime-minute"] = train["datetime"].dt.minute
train["datetime-second"] = train["datetime"].dt.second
train["datetime-dayofweek"] = train["datetime"].dt.dayofweek
train[["datetime", "datetime-year", "datetime-month", "datetime-day", "datetime-hour", "datetime-minute", "datetime-second", "datetime-dayofweek"]].head(3)
| | datetime | datetime-year | datetime-month | datetime-day | datetime-hour | datetime-minute | datetime-second | datetime-dayofweek |
|---|---|---|---|---|---|---|---|---|
| 0 | 2011-01-01 00:00:00 | 2011 | 1 | 1 | 0 | 0 | 0 | 5 |
| 1 | 2011-01-01 01:00:00 | 2011 | 1 | 1 | 1 | 0 | 0 | 5 |
| 2 | 2011-01-01 02:00:00 | 2011 | 1 | 1 | 2 | 0 | 0 | 5 |
test["datetime-year"] = test["datetime"].dt.year
test["datetime-month"] = test["datetime"].dt.month
test["datetime-day"] = test["datetime"].dt.day
test["datetime-hour"] = test["datetime"].dt.hour
test["datetime-minute"] = test["datetime"].dt.minute
test["datetime-second"] = test["datetime"].dt.second
test["datetime-dayofweek"] = test["datetime"].dt.dayofweek
test[["datetime", "datetime-year", "datetime-month", "datetime-day", "datetime-hour", "datetime-minute", "datetime-second", "datetime-dayofweek"]].head(3)
| | datetime | datetime-year | datetime-month | datetime-day | datetime-hour | datetime-minute | datetime-second | datetime-dayofweek |
|---|---|---|---|---|---|---|---|---|
| 0 | 2011-01-20 00:00:00 | 2011 | 1 | 20 | 0 | 0 | 0 | 3 |
| 1 | 2011-01-20 01:00:00 | 2011 | 1 | 20 | 1 | 0 | 0 | 3 |
| 2 | 2011-01-20 02:00:00 | 2011 | 1 | 20 | 2 | 0 | 0 | 3 |
train.loc[train["datetime-dayofweek"] == 0, "datetime-dayofweek(str)"] = "Mon"
train.loc[train["datetime-dayofweek"] == 1, "datetime-dayofweek(str)"] = "Tue"
train.loc[train["datetime-dayofweek"] == 2, "datetime-dayofweek(str)"] = "Wed"
train.loc[train["datetime-dayofweek"] == 3, "datetime-dayofweek(str)"] = "Thu"
train.loc[train["datetime-dayofweek"] == 4, "datetime-dayofweek(str)"] = "Fri"
train.loc[train["datetime-dayofweek"] == 5, "datetime-dayofweek(str)"] = "Sat"
train.loc[train["datetime-dayofweek"] == 6, "datetime-dayofweek(str)"] = "Sun"
train[["datetime", "datetime-dayofweek", "datetime-dayofweek(str)"]].head(3)
| | datetime | datetime-dayofweek | datetime-dayofweek(str) |
|---|---|---|---|
| 0 | 2011-01-01 00:00:00 | 5 | Sat |
| 1 | 2011-01-01 01:00:00 | 5 | Sat |
| 2 | 2011-01-01 02:00:00 | 5 | Sat |
test.loc[test["datetime-dayofweek"] == 0, "datetime-dayofweek(str)"] = "Mon"
test.loc[test["datetime-dayofweek"] == 1, "datetime-dayofweek(str)"] = "Tue"
test.loc[test["datetime-dayofweek"] == 2, "datetime-dayofweek(str)"] = "Wed"
test.loc[test["datetime-dayofweek"] == 3, "datetime-dayofweek(str)"] = "Thu"
test.loc[test["datetime-dayofweek"] == 4, "datetime-dayofweek(str)"] = "Fri"
test.loc[test["datetime-dayofweek"] == 5, "datetime-dayofweek(str)"] = "Sat"
test.loc[test["datetime-dayofweek"] == 6, "datetime-dayofweek(str)"] = "Sun"
test[["datetime", "datetime-dayofweek", "datetime-dayofweek(str)"]].head(3)
| | datetime | datetime-dayofweek | datetime-dayofweek(str) |
|---|---|---|---|
| 0 | 2011-01-20 00:00:00 | 3 | Thu |
| 1 | 2011-01-20 01:00:00 | 3 | Thu |
| 2 | 2011-01-20 02:00:00 | 3 | Thu |
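The fourteen `.loc` assignments above can be collapsed into one `Series.map` call per frame. A sketch (the `dayofweek_map` name and the small stand-in frame are mine; in the notebook this would be applied to `train` and `test` directly):

```python
import pandas as pd

# Lookup table for pandas' Monday=0 ... Sunday=6 dayofweek convention
dayofweek_map = {0: "Mon", 1: "Tue", 2: "Wed", 3: "Thu", 4: "Fri", 5: "Sat", 6: "Sun"}

# Stand-in for train/test; equivalent to the seven .loc assignments above
df = pd.DataFrame({"datetime-dayofweek": [5, 5, 3]})
df["datetime-dayofweek(str)"] = df["datetime-dayofweek"].map(dayofweek_map)
print(df["datetime-dayofweek(str)"].tolist())  # ['Sat', 'Sat', 'Thu']
```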
f, ax = plt.subplots(nrows=2, ncols=3, figsize=(18,10))
sns.barplot(data=train, x="datetime-year", y="count", ax=ax[0][0])
ax[0][0].set_title("Annual rental count")
sns.barplot(data=train, x="datetime-month", y="count", ax=ax[0][1])
ax[0][1].set_title("Monthly rent count")
sns.barplot(data=train, x="datetime-day", y="count", ax=ax[0][2])
ax[0][2].set_title("Daily rent count")
sns.barplot(data=train, x="datetime-hour", y="count", ax=ax[1][0])
ax[1][0].set_title("Rent count per hour")
sns.barplot(data=train, x="datetime-minute", y="count", ax=ax[1][1])
ax[1][1].set_title("Rent count per minute")
sns.barplot(data=train, x="datetime-second", y="count", ax=ax[1][2])
ax[1][2].set_title("Rent count per second")
plt.show()
ㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡ
The plots show count by year, month, day, hour, minute, and second. The useful columns will be examined in more detail below; here the goal is to spot columns we can discard. Minute and second are always 0 in this hourly data, and day carries no usable signal because of how the competition splits the data (train covers days 1-19 of each month, test day 20 onward). So day, minute, and second are best excluded from the features.
ㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡ
f, ax = plt.subplots(nrows=1, ncols=3, figsize=(18, 5))
sns.barplot(data=train, x="datetime-year", y="casual", ax=ax[0])
ax[0].set_title("Annual rental count: Non-member")
sns.barplot(data=train, x="datetime-year", y="registered", ax=ax[1])
ax[1].set_title("Annual rental count: Members")
sns.barplot(data=train, x="datetime-year", y="count", ax=ax[2])
ax[2].set_title("Annual rental count")
plt.show()
ㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡ
These plots show casual, registered, and count by year. Both non-member and member rentals increased from 2011 to 2012.
From this we can infer that the bike-sharing service grew between 2011 and 2012.
ㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡ
f, ax = plt.subplots(nrows=1, ncols=3, figsize=(18, 5))
sns.barplot(data=train, x="datetime-month", y="casual", ax=ax[0])
ax[0].set_title("Monthly rent count: Non-member")
sns.barplot(data=train, x="datetime-month", y="registered", ax=ax[1])
ax[1].set_title("Monthly rent count: Member")
sns.barplot(data=train, x="datetime-month", y="count", ax=ax[2])
ax[2].set_title("Monthly rent count")
f, ax = plt.subplots(nrows=1, ncols=2, figsize=(18, 5))
sns.barplot(data=train, x="datetime-month", y="temp", ax=ax[0])
ax[0].set_title("Monthly temperature")
sns.barplot(data=train, x="datetime-month", y="atemp", ax=ax[1])
ax[1].set_title("Monthly sensible temparature")
plt.show()
ㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡ
The top three plots show casual, registered, and count by month. Rentals peak from June through September and bottom out in January and February, which suggests a link with temperature.
The bottom two plots show monthly temperature and apparent temperature. Following up on that guess, the temperature curves mostly match the rental curves, with one oddity: the temperature difference between January and December is small, yet December has far more rentals.
Monthly rentals are clearly related to temperature, but why December is so high needs a closer look.
ㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡ
train["year-month"] = train["datetime-year"].astype('str') + "-" + train["datetime-month"].astype('str')
train[["datetime", "year-month"]].head(3)
| | datetime | year-month |
|---|---|---|
| 0 | 2011-01-01 00:00:00 | 2011-1 |
| 1 | 2011-01-01 01:00:00 | 2011-1 |
| 2 | 2011-01-01 02:00:00 | 2011-1 |
plt.figure(figsize=(18, 5))
sns.barplot(data=train, x="year-month", y="count").set_title("Year-Month rent count")
plt.show()
ㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡ
Combining the year and month columns gives the rental count per year-month. In the monthly plot above it seemed strange that January and December, both cold winter months with similar temperatures, differed so much in rentals.
Here we see that December 2011 and January 2012 differ little; the large gap is between January 2011 and December 2012, which, as the year plot already suggested, reflects the growth of the service and its user base.
So December does not inherently outrent January; the monthly gap is an artifact of the company's growth.
ㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡ
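One caveat with the string concatenation used for year-month: it produces labels like "2011-1" and "2011-10" which, if ever sorted as strings, put "2011-10" before "2011-2". A sketch of a zero-padded alternative using `dt.strftime` (the Series `s` is illustrative, not from the notebook):

```python
import pandas as pd

# Zero-padded "%Y-%m" labels sort in true chronological order
s = pd.Series(pd.to_datetime(["2011-01-01", "2011-10-01", "2011-02-01"]))
labels = s.dt.strftime("%Y-%m")
print(sorted(labels))  # ['2011-01', '2011-02', '2011-10']
```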
f, ax = plt.subplots(nrows=5, ncols=1, figsize=(18, 25))
plt.subplots_adjust(hspace = 0.3)
sns.pointplot(data=train, x="datetime-hour", y="casual", ax=ax[0])
ax[0].set_title("Rent count per hour: Non-member")
sns.pointplot(data=train, x="datetime-hour", y="registered", ax=ax[1])
ax[1].set_title("Rent count per hour: Member")
sns.pointplot(data=train, x="datetime-hour", y="count", ax=ax[2])
ax[2].set_title("Rent count per hour")
sns.pointplot(data=train, x="datetime-hour", y="count", hue="workingday", ax=ax[3])
ax[3].set_title("Rent count per hour by workingday")
sns.pointplot(data=train, x="datetime-hour", y="count", hue="datetime-dayofweek(str)", ax=ax[4])
ax[4].set_title("Rent count per hour by dayofweek")
plt.show()
ㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡ
The top three plots show casual, registered, and count by hour. The member curve (registered) closely resembles count, while the non-member curve (casual) has a completely different shape.
Since casual and registered sum to count, and count follows registered, members must account for far more rentals than non-members.
The fourth plot splits the hourly count by working day. On working days the curve resembles the member pattern; on non-working days it resembles the non-member pattern. So members rent mainly on working days and non-members on days off.
The last plot splits the hourly count by day of week. Monday through Friday, the usual working days, resemble the member and working-day curves, while Saturday and Sunday resemble the non-member and non-working-day curves.
In short, hourly rentals differ between members and non-members, and this is closely tied to working days and weekdays: members rent mostly on working days (Mon-Fri) and non-members on non-working days (Sat-Sun).
ㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡ
f, ax = plt.subplots(nrows=1, ncols=3, figsize=(18, 5))
sns.barplot(data=train, x="season", y="count", ax=ax[0])
ax[0].set_title("Seasonal rent count")
sns.barplot(data=train, x="season", y="temp", ax=ax[1])
ax[1].set_title("Seasonal temperature")
sns.barplot(data=train, x="season", y="atemp", ax=ax[2])
ax[2].set_title("Seasonal sensible temperature")
f, ax = plt.subplots(nrows=1, ncols=1, figsize=(18, 5))
sns.pointplot(data=train, x="datetime-month", y="count", hue="season", ax=ax)
ax.set_title("Monthly rent count: season")
plt.show()
ㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡ
The top plots show rentals, temperature, and apparent temperature by season. Rentals rank fall > summer > winter > spring. Intuitively spring should beat winter, but the seasonal temperature plot shows spring as colder than winter in this data.
The seasonal rental plot closely mirrors the seasonal temperature and apparent-temperature plots, so seasonal changes in rentals are tightly linked to both.
The bottom plot breaks monthly rentals down by season, revealing that the seasons are labeled January-March as spring (1), April-June as summer (2), July-September as fall (3), and October-December as winter (4).
Spring being colder than winter and fall hotter than summer is a bit odd, but rentals clearly differ by season, so the column should still be a meaningful feature.
Season, temperature, and apparent temperature are closely related, and all of them influence rentals.
ㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡ
f, ax = plt.subplots(nrows=2, ncols=1, figsize=(18, 10))
sns.pointplot(data=train, x="datetime-hour", y="count", hue="holiday", ax=ax[0])
ax[0].set_title("Rent count per hour by holiday")
sns.pointplot(data=train, x="datetime-hour", y="count", hue="workingday", ax=ax[1])
ax[1].set_title("Rent count per hour by workingday")
plt.show()
ㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡ
These plots show hourly rentals split by holiday and by working day.
Holiday and working day are not exact opposites, but the plots show that holidays behave like non-working days and non-holidays like working days.
So holiday, too, is a factor that affects rentals.
ㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡ
f, ax = plt.subplots(nrows=1, ncols=2, figsize=(18, 5))
sns.distplot(train["humidity"], ax=ax[0])
ax[0].set_title("humidity distribution")
sns.distplot(train["windspeed"], ax=ax[1])
ax[1].set_title("windspeed distribution")
plt.show()
ㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡ
The left plot shows the distribution of humidity: it has relatively few conspicuous values and is spread evenly in a roughly normal shape.
The right plot shows the distribution of windspeed: apart from the spike at 0 (days with no wind), the distribution is fairly even.
For both columns, the important steps will be to check for conspicuous outliers and to compare model performance with each column included and excluded.
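The include/exclude comparison can be done by cross-validating the same model on different feature subsets. A minimal sketch on synthetic stand-in data (in the notebook this would use `x_train`, the log labels, and the `rmse_score` scorer defined later; every name and value here is illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; the real comparison would use the actual frames
rng = np.random.RandomState(0)
X = pd.DataFrame({
    "temp": rng.uniform(0, 40, 300),
    "humidity": rng.uniform(0, 100, 300),
    "windspeed": rng.uniform(0, 50, 300),
})
y = 0.5 * X["temp"] + rng.normal(0, 1, 300)

model = RandomForestRegressor(n_estimators=20, random_state=37)
for features in (["temp", "humidity", "windspeed"], ["temp", "humidity"]):
    # Lower RMSE is better; compare the two subsets
    mse = -cross_val_score(model, X[features], y, cv=3,
                           scoring="neg_mean_squared_error").mean()
    print(features, round(float(np.sqrt(mse)), 4))
```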
f, ax = plt.subplots(nrows=1, ncols=2, figsize=(18, 5))
sns.countplot(data=train, x="weather", ax=ax[0])
ax[0].set_title("Amount of data by weather")
sns.barplot(data=train, x="weather", y="count", ax=ax[1])
ax[1].set_title("Rent count by weather")
plt.show()
print("Rows in train with weather == 1:", train[train["weather"] == 1].shape[0])
print("Rows in train with weather == 2:", train[train["weather"] == 2].shape[0])
print("Rows in train with weather == 3:", train[train["weather"] == 3].shape[0])
print("Rows in train with weather == 4:", train[train["weather"] == 4].shape[0])
Rows in train with weather == 1: 7192
Rows in train with weather == 2: 2834
Rows in train with weather == 3: 859
Rows in train with weather == 4: 1
ㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡ
The left plot shows the amount of data per weather category. Weather closer to 1 means better weather and closer to 4 means worse; category 1 is overwhelmingly common and the rest are much rarer.
Still, categories 2 and 3 are not tiny relative to the size of the dataset. The real issue is weather 4, which appears exactly once. But with a single row out of more than ten thousand, it is unlikely to have much effect.
The right plot shows rentals by weather. Better weather generally means more rentals, but the worst weather also shows a high rental count; as the left plot showed, that bar is an average over the single weather-4 row.
Weather 4 is a conspicuous outlier, but since it occurs only once in the whole dataset, it can safely be ignored.
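The four prints above can be replaced by a single `value_counts` call. A sketch, with the printed counts reproduced in a stand-in Series (in the notebook this would be `train["weather"].value_counts()`):

```python
import pandas as pd

# Stand-in for train["weather"], reproducing the counts printed above
weather = pd.Series([1] * 7192 + [2] * 2834 + [3] * 859 + [4] * 1, name="weather")
counts = weather.value_counts().sort_index()
print(counts.to_dict())  # {1: 7192, 2: 2834, 3: 859, 4: 1}
```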
train_season = pd.get_dummies(train['season'], prefix='season')
train = pd.concat([train, train_season],axis=1)
train.head(3)
| | datetime | season | holiday | workingday | weather | temp | atemp | humidity | windspeed | casual | ... | datetime-hour | datetime-minute | datetime-second | datetime-dayofweek | datetime-dayofweek(str) | year-month | season_1 | season_2 | season_3 | season_4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011-01-01 00:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 81 | 0.0 | 3 | ... | 0 | 0 | 0 | 5 | Sat | 2011-1 | 1 | 0 | 0 | 0 |
| 1 | 2011-01-01 01:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 8 | ... | 1 | 0 | 0 | 5 | Sat | 2011-1 | 1 | 0 | 0 | 0 |
| 2 | 2011-01-01 02:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 5 | ... | 2 | 0 | 0 | 5 | Sat | 2011-1 | 1 | 0 | 0 | 0 |

3 rows × 28 columns
test_season = pd.get_dummies(test['season'], prefix='season')
test = pd.concat([test, test_season],axis=1)
test.head(3)
| | datetime | season | holiday | workingday | weather | temp | atemp | humidity | windspeed | datetime-year | ... | datetime-day | datetime-hour | datetime-minute | datetime-second | datetime-dayofweek | datetime-dayofweek(str) | season_1 | season_2 | season_3 | season_4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011-01-20 00:00:00 | 1 | 0 | 1 | 1 | 10.66 | 11.365 | 56 | 26.0027 | 2011 | ... | 20 | 0 | 0 | 0 | 3 | Thu | 1 | 0 | 0 | 0 |
| 1 | 2011-01-20 01:00:00 | 1 | 0 | 1 | 1 | 10.66 | 13.635 | 56 | 0.0000 | 2011 | ... | 20 | 1 | 0 | 0 | 3 | Thu | 1 | 0 | 0 | 0 |
| 2 | 2011-01-20 02:00:00 | 1 | 0 | 1 | 1 | 10.66 | 13.635 | 56 | 0.0000 | 2011 | ... | 20 | 2 | 0 | 0 | 3 | Thu | 1 | 0 | 0 | 0 |

3 rows × 21 columns
train_weather = pd.get_dummies(train['weather'], prefix='weather')
train = pd.concat([train, train_weather],axis=1)
train.head(3)
| | datetime | season | holiday | workingday | weather | temp | atemp | humidity | windspeed | casual | ... | datetime-dayofweek(str) | year-month | season_1 | season_2 | season_3 | season_4 | weather_1 | weather_2 | weather_3 | weather_4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011-01-01 00:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 81 | 0.0 | 3 | ... | Sat | 2011-1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 2011-01-01 01:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 8 | ... | Sat | 2011-1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 2011-01-01 02:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 5 | ... | Sat | 2011-1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |

3 rows × 32 columns
test_weather = pd.get_dummies(test['weather'], prefix='weather')
test = pd.concat([test, test_weather],axis=1)
test.head(3)
| | datetime | season | holiday | workingday | weather | temp | atemp | humidity | windspeed | datetime-year | ... | datetime-dayofweek | datetime-dayofweek(str) | season_1 | season_2 | season_3 | season_4 | weather_1 | weather_2 | weather_3 | weather_4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011-01-20 00:00:00 | 1 | 0 | 1 | 1 | 10.66 | 11.365 | 56 | 26.0027 | 2011 | ... | 3 | Thu | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 2011-01-20 01:00:00 | 1 | 0 | 1 | 1 | 10.66 | 13.635 | 56 | 0.0000 | 2011 | ... | 3 | Thu | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 2011-01-20 02:00:00 | 1 | 0 | 1 | 1 | 10.66 | 13.635 | 56 | 0.0000 | 2011 | ... | 3 | Thu | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |

3 rows × 25 columns
train_dayofweek = pd.get_dummies(train["datetime-dayofweek(str)"], prefix = "dayofweek")
train = pd.concat([train, train_dayofweek], axis = 1)
train.head(3)
| | datetime | season | holiday | workingday | weather | temp | atemp | humidity | windspeed | casual | ... | weather_2 | weather_3 | weather_4 | dayofweek_Fri | dayofweek_Mon | dayofweek_Sat | dayofweek_Sun | dayofweek_Thu | dayofweek_Tue | dayofweek_Wed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011-01-01 00:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 81 | 0.0 | 3 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | 2011-01-01 01:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 8 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 2011-01-01 02:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 5 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |

3 rows × 39 columns
test_dayofweek = pd.get_dummies(test["datetime-dayofweek(str)"], prefix = "dayofweek")
test = pd.concat([test, test_dayofweek], axis = 1)
test.head(3)
| | datetime | season | holiday | workingday | weather | temp | atemp | humidity | windspeed | datetime-year | ... | weather_2 | weather_3 | weather_4 | dayofweek_Fri | dayofweek_Mon | dayofweek_Sat | dayofweek_Sun | dayofweek_Thu | dayofweek_Tue | dayofweek_Wed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011-01-20 00:00:00 | 1 | 0 | 1 | 1 | 10.66 | 11.365 | 56 | 26.0027 | 2011 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 2011-01-20 01:00:00 | 1 | 0 | 1 | 1 | 10.66 | 13.635 | 56 | 0.0000 | 2011 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 2011-01-20 02:00:00 | 1 | 0 | 1 | 1 | 10.66 | 13.635 | 56 | 0.0000 | 2011 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |

3 rows × 32 columns
feature_list = ["holiday", "workingday", "temp", "atemp", "humidity", "windspeed", "datetime-year", "datetime-hour"]
feature_list = feature_list + list(train_dayofweek.columns)
feature_list = feature_list + list(train_weather.columns)
feature_list = feature_list + list(train_season.columns)
feature_list
['holiday', 'workingday', 'temp', 'atemp', 'humidity', 'windspeed', 'datetime-year', 'datetime-hour', 'dayofweek_Fri', 'dayofweek_Mon', 'dayofweek_Sat', 'dayofweek_Sun', 'dayofweek_Thu', 'dayofweek_Tue', 'dayofweek_Wed', 'weather_1', 'weather_2', 'weather_3', 'weather_4', 'season_1', 'season_2', 'season_3', 'season_4']
x_train = train[feature_list]
x_train.head(3)
| | holiday | workingday | temp | atemp | humidity | windspeed | datetime-year | datetime-hour | dayofweek_Fri | dayofweek_Mon | ... | dayofweek_Tue | dayofweek_Wed | weather_1 | weather_2 | weather_3 | weather_4 | season_1 | season_2 | season_3 | season_4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 9.84 | 14.395 | 81 | 0.0 | 2011 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 0 | 0 | 9.02 | 13.635 | 80 | 0.0 | 2011 | 1 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 0 | 0 | 9.02 | 13.635 | 80 | 0.0 | 2011 | 2 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |

3 rows × 23 columns
x_test = test[feature_list]
x_test.head(3)
| | holiday | workingday | temp | atemp | humidity | windspeed | datetime-year | datetime-hour | dayofweek_Fri | dayofweek_Mon | ... | dayofweek_Tue | dayofweek_Wed | weather_1 | weather_2 | weather_3 | weather_4 | season_1 | season_2 | season_3 | season_4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 10.66 | 11.365 | 56 | 26.0027 | 2011 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 0 | 1 | 10.66 | 13.635 | 56 | 0.0000 | 2011 | 1 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 0 | 1 | 10.66 | 13.635 | 56 | 0.0000 | 2011 | 2 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |

3 rows × 23 columns
y_train_c = train["casual_log"]
y_train_c.head(3)
0    1.386294
1    2.197225
2    1.791759
Name: casual_log, dtype: float64
y_train_r = train["registered_log"]
y_train_r.head(3)
0    2.639057
1    3.496508
2    3.332205
Name: registered_log, dtype: float64
from sklearn.metrics import make_scorer
def rmse(predict, actual):
    predict = np.array(predict)
    actual = np.array(actual)
    distance = predict - actual
    square_distance = distance ** 2
    mean_square_distance = square_distance.mean()
    score = np.sqrt(mean_square_distance)
    return score
rmse_score = make_scorer(rmse)
rmse_score
make_scorer(rmse)
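Note that `make_scorer` is called with its default `greater_is_better=True`, so `cross_val_score` will return the raw RMSE values and lower is better, which is why the results are sorted ascending below. A quick sanity check of the `rmse` helper against scikit-learn's `mean_squared_error` (the sample values are mine):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

def rmse(predict, actual):
    predict = np.array(predict)
    actual = np.array(actual)
    return np.sqrt(((predict - actual) ** 2).mean())

# The helper should agree with the square root of scikit-learn's MSE
p, a = [2.5, 0.0, 2.0], [3.0, -0.5, 2.0]
assert np.isclose(rmse(p, a), np.sqrt(mean_squared_error(a, p)))
```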
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
n_estimators = 300
num_epoch = 100
coarse_hyperparameters_list = []
for epoch in range(num_epoch):
    max_depth = np.random.randint(low=2, high=100)
    max_features = np.random.uniform(low=0.1, high=1.0)
    model = RandomForestRegressor(n_estimators=n_estimators,
                                  max_depth=max_depth,
                                  max_features=max_features,
                                  n_jobs=-1,
                                  random_state=37)
    score = cross_val_score(model, x_train, y_train_c, cv=20, scoring=rmse_score).mean()
    hyperparameters = {
        'epoch': epoch,
        'score': score,
        'n_estimators': n_estimators,
        'max_depth': max_depth,
        'max_features': max_features,
    }
    coarse_hyperparameters_list.append(hyperparameters)
    print(f"{epoch:2} n_estimators = {n_estimators}, max_depth = {max_depth:2}, max_features = {max_features:.6f}, Score = {score:.5f}")
coarse_hyperparameters_list = pd.DataFrame.from_dict(coarse_hyperparameters_list)
coarse_hyperparameters_list = coarse_hyperparameters_list.sort_values(by="score")
print(coarse_hyperparameters_list.shape)
coarse_hyperparameters_list.head(10)
 0 n_estimators = 300, max_depth = 16, max_features = 0.815879, Score = 0.55552
 1 n_estimators = 300, max_depth = 33, max_features = 0.363584, Score = 0.55461
 2 n_estimators = 300, max_depth = 94, max_features = 0.935101, Score = 0.55978
...
98 n_estimators = 300, max_depth = 95, max_features = 0.343703, Score = 0.56110
99 n_estimators = 300, max_depth = 50, max_features = 0.322365, Score = 0.56110
(100, 5)
epoch | max_depth | max_features | n_estimators | score | |
---|---|---|---|---|---|
49 | 49 | 68 | 0.565145 | 300 | 0.550381 |
72 | 72 | 85 | 0.564755 | 300 | 0.550381 |
53 | 53 | 81 | 0.529706 | 300 | 0.550381 |
79 | 79 | 56 | 0.550496 | 300 | 0.550381 |
16 | 16 | 76 | 0.557047 | 300 | 0.550381 |
15 | 15 | 74 | 0.525012 | 300 | 0.550381 |
22 | 22 | 93 | 0.527166 | 300 | 0.550381 |
81 | 81 | 16 | 0.565649 | 300 | 0.550933 |
34 | 34 | 25 | 0.566416 | 300 | 0.550995 |
76 | 76 | 93 | 0.567974 | 300 | 0.551382 |
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
n_estimators = 300
num_epoch = 100
finer_hyperparameters_list = []
for epoch in range(num_epoch):
    max_depth = np.random.randint(low=56, high=94)
    max_features = np.random.uniform(low=0.525012, high=0.565145)
    model = RandomForestRegressor(n_estimators=n_estimators,
                                  max_depth=max_depth,
                                  max_features=max_features,
                                  n_jobs=-1,
                                  random_state=37)
    score = cross_val_score(model, x_train, y_train_c, cv=20, scoring=rmse_score).mean()
    hyperparameters = {
        'epoch': epoch,
        'score': score,
        'n_estimators': n_estimators,
        'max_depth': max_depth,
        'max_features': max_features,
    }
    finer_hyperparameters_list.append(hyperparameters)
    print(f"{epoch:2} n_estimators = {n_estimators}, max_depth = {max_depth:2}, max_features = {max_features:.6f}, Score = {score:.5f}")
finer_hyperparameters_list = pd.DataFrame.from_dict(finer_hyperparameters_list)
finer_hyperparameters_list = finer_hyperparameters_list.sort_values(by="score")
print(finer_hyperparameters_list.shape)
finer_hyperparameters_list.head(10)
 0 n_estimators = 300, max_depth = 62, max_features = 0.547745, Score = 0.55038
 1 n_estimators = 300, max_depth = 57, max_features = 0.553065, Score = 0.55038
 2 n_estimators = 300, max_depth = 69, max_features = 0.532412, Score = 0.55038
...
99 n_estimators = 300, max_depth = 59, max_features = 0.535109, Score = 0.55038
(100, 5)
epoch | max_depth | max_features | n_estimators | score | |
---|---|---|---|---|---|
0 | 0 | 62 | 0.547745 | 300 | 0.550381 |
72 | 72 | 70 | 0.545285 | 300 | 0.550381 |
71 | 71 | 88 | 0.546806 | 300 | 0.550381 |
70 | 70 | 92 | 0.553062 | 300 | 0.550381 |
69 | 69 | 90 | 0.544063 | 300 | 0.550381 |
68 | 68 | 77 | 0.549623 | 300 | 0.550381 |
67 | 67 | 63 | 0.525115 | 300 | 0.550381 |
66 | 66 | 82 | 0.560943 | 300 | 0.550381 |
65 | 65 | 85 | 0.530366 | 300 | 0.550381 |
64 | 64 | 88 | 0.527318 | 300 | 0.550381 |
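The hand-rolled loop above can also be expressed with scikit-learn's `RandomizedSearchCV`, which samples hyperparameters from distributions and cross-validates in one call. A minimal sketch on synthetic data — the `make_regression` features and the inline RMSE scorer stand in for the notebook's `x_train`, `y_train_c`, and `rmse_score`, which are defined in an earlier section:

```python
import numpy as np
from scipy.stats import randint, uniform
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-ins for x_train / y_train_c from the notebook.
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=37)

# RMSE scorer; greater_is_better=False makes the search minimize RMSE.
rmse = make_scorer(lambda yt, yp: np.sqrt(mean_squared_error(yt, yp)),
                   greater_is_better=False)

search = RandomizedSearchCV(
    RandomForestRegressor(n_estimators=50, random_state=37, n_jobs=-1),
    param_distributions={
        "max_depth": randint(2, 100),       # same range as the coarse search
        "max_features": uniform(0.1, 0.9),  # uniform on [0.1, 1.0]
    },
    n_iter=10, cv=3, scoring=rmse, random_state=37, n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_)
```

The trade-off is convenience versus visibility: the manual loop prints every trial as it runs, while `RandomizedSearchCV` keeps the per-trial records in `search.cv_results_`.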
best_hyperparameters = finer_hyperparameters_list.iloc[0]
best_max_depth = int(best_hyperparameters["max_depth"])  # cast back to int; the DataFrame row stores it as float
best_max_features = best_hyperparameters["max_features"]
print(f"max_depth(best) = {best_max_depth}, max_features(best) = {best_max_features:.6f}")
max_depth(best) = 62, max_features(best) = 0.547745
best_n_estimators = 3000
model = RandomForestRegressor(n_estimators=best_n_estimators,
                              max_depth=best_max_depth,
                              max_features=best_max_features,
                              random_state=37,
                              n_jobs=-1)
model
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=62.0, max_features=0.5477452272116831, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=3000, n_jobs=-1, oob_score=False, random_state=37, verbose=0, warm_start=False)
model.fit(x_train, y_train_c)
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=62.0, max_features=0.5477452272116831, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=3000, n_jobs=-1, oob_score=False, random_state=37, verbose=0, warm_start=False)
logC_predictions = model.predict(x_test)
print(logC_predictions.shape)
logC_predictions
(6493,)
array([0.83244801, 0.41858681, 0.48794546, ..., 1.41082069, 1.39423664, 1.10825634])
predictions_c = np.exp(logC_predictions) - 1
print(predictions_c.shape)
predictions_c
(6493,)
array([1.29893969, 0.51981226, 0.628966 , ..., 3.0993183 , 3.03189561, 2.02907212])
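The model was trained on `log(casual + 1)`, so `np.exp(pred) - 1` undoes the transform. NumPy's `log1p`/`expm1` pair computes exactly the same thing with better numerical accuracy near zero; a quick check of the round trip:

```python
import numpy as np

counts = np.array([0.0, 3.0, 16.0, 977.0])

# Forward transform used on the training targets: log(count + 1).
logged = np.log1p(counts)            # same as np.log(counts + 1)

# Inverse transform applied to the predictions: exp(pred) - 1.
restored = np.expm1(logged)          # same as np.exp(logged) - 1

print(np.allclose(restored, counts))  # the round trip recovers the counts
```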
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
n_estimators = 300
num_epoch = 100
coarse_hyperparameters_list2 = []
for epoch in range(num_epoch):
    max_depth = np.random.randint(low=2, high=100)
    max_features = np.random.uniform(low=0.1, high=1.0)
    model = RandomForestRegressor(n_estimators=n_estimators,
                                  max_depth=max_depth,
                                  max_features=max_features,
                                  n_jobs=-1,
                                  random_state=37)
    score = cross_val_score(model, x_train, y_train_r, cv=20, scoring=rmse_score).mean()
    hyperparameters = {
        'epoch': epoch,
        'score': score,
        'n_estimators': n_estimators,
        'max_depth': max_depth,
        'max_features': max_features,
    }
    coarse_hyperparameters_list2.append(hyperparameters)
    print(f"{epoch:2} n_estimators = {n_estimators}, max_depth = {max_depth:2}, max_features = {max_features:.6f}, Score = {score:.5f}")
coarse_hyperparameters_list2 = pd.DataFrame.from_dict(coarse_hyperparameters_list2)
coarse_hyperparameters_list2 = coarse_hyperparameters_list2.sort_values(by="score")
print(coarse_hyperparameters_list2.shape)
coarse_hyperparameters_list2.head(10)
 0 n_estimators = 300, max_depth = 55, max_features = 0.741497, Score = 0.34181
 1 n_estimators = 300, max_depth = 6, max_features = 0.671060, Score = 0.53658
 2 n_estimators = 300, max_depth = 22, max_features = 0.470221, Score = 0.36214
...
99 n_estimators = 300, max_depth = 72, max_features = 0.921403, Score = 0.34524
(100, 5)
epoch | max_depth | max_features | n_estimators | score | |
---|---|---|---|---|---|
93 | 93 | 96 | 0.822352 | 300 | 0.341292 |
92 | 92 | 97 | 0.807412 | 300 | 0.341292 |
70 | 70 | 60 | 0.820772 | 300 | 0.341292 |
22 | 22 | 66 | 0.785142 | 300 | 0.341292 |
94 | 94 | 45 | 0.726982 | 300 | 0.341488 |
13 | 13 | 60 | 0.709664 | 300 | 0.341488 |
52 | 52 | 30 | 0.750781 | 300 | 0.341713 |
24 | 24 | 42 | 0.757356 | 300 | 0.341813 |
34 | 34 | 92 | 0.759502 | 300 | 0.341813 |
98 | 98 | 58 | 0.747926 | 300 | 0.341813 |
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
n_estimators = 300
num_epoch = 100
finer_hyperparameters_list2 = []
for epoch in range(num_epoch):
    max_depth = np.random.randint(low=60, high=98)
    max_features = np.random.uniform(low=0.785142, high=0.822352)
    model = RandomForestRegressor(n_estimators=n_estimators,
                                  max_depth=max_depth,
                                  max_features=max_features,
                                  n_jobs=-1,
                                  random_state=37)
    score = cross_val_score(model, x_train, y_train_r, cv=20, scoring=rmse_score).mean()
    hyperparameters = {
        'epoch': epoch,
        'score': score,
        'n_estimators': n_estimators,
        'max_depth': max_depth,
        'max_features': max_features,
    }
    finer_hyperparameters_list2.append(hyperparameters)
    print(f"{epoch:2} n_estimators = {n_estimators}, max_depth = {max_depth:2}, max_features = {max_features:.6f}, Score = {score:.5f}")
finer_hyperparameters_list2 = pd.DataFrame.from_dict(finer_hyperparameters_list2)
finer_hyperparameters_list2 = finer_hyperparameters_list2.sort_values(by="score")
print(finer_hyperparameters_list2.shape)
finer_hyperparameters_list2.head(10)
 0 n_estimators = 300, max_depth = 63, max_features = 0.791589, Score = 0.34129
 1 n_estimators = 300, max_depth = 95, max_features = 0.801994, Score = 0.34129
 2 n_estimators = 300, max_depth = 94, max_features = 0.814558, Score = 0.34129
...
99 n_estimators = 300, max_depth = 68, max_features = 0.820232, Score = 0.34129
(100, 5)
epoch | max_depth | max_features | n_estimators | score | |
---|---|---|---|---|---|
0 | 0 | 63 | 0.791589 | 300 | 0.341292 |
72 | 72 | 64 | 0.790990 | 300 | 0.341292 |
71 | 71 | 96 | 0.791569 | 300 | 0.341292 |
70 | 70 | 95 | 0.814474 | 300 | 0.341292 |
69 | 69 | 97 | 0.797391 | 300 | 0.341292 |
68 | 68 | 75 | 0.792803 | 300 | 0.341292 |
67 | 67 | 84 | 0.822164 | 300 | 0.341292 |
66 | 66 | 81 | 0.811274 | 300 | 0.341292 |
65 | 65 | 72 | 0.804726 | 300 | 0.341292 |
64 | 64 | 76 | 0.808495 | 300 | 0.341292 |
best_hyperparameters = finer_hyperparameters_list2.iloc[0]
best_max_depth = int(best_hyperparameters["max_depth"])  # cast back to int; the DataFrame row stores it as float
best_max_features = best_hyperparameters["max_features"]
print(f"max_depth(best) = {best_max_depth}, max_features(best) = {best_max_features:.6f}")
max_depth(best) = 63, max_features(best) = 0.791589
best_n_estimators = 3000
model = RandomForestRegressor(n_estimators=best_n_estimators,
                              max_depth=best_max_depth,
                              max_features=best_max_features,
                              random_state=37,
                              n_jobs=-1)
model
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=63.0, max_features=0.7915885049682436, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=3000, n_jobs=-1, oob_score=False, random_state=37, verbose=0, warm_start=False)
model.fit(x_train, y_train_r)
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=63.0, max_features=0.7915885049682436, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=3000, n_jobs=-1, oob_score=False, random_state=37, verbose=0, warm_start=False)
logR_predictions = model.predict(x_test)
print(logR_predictions.shape)
logR_predictions
(6493,)
array([2.36580644, 1.70954977, 1.04493393, ..., 4.60040191, 4.53352732, 3.8785016 ])
predictions_r = np.exp(logR_predictions) - 1
print(predictions_r.shape)
predictions_r
(6493,)
array([ 9.65262609, 4.52647275, 1.84321065, ..., 98.52430758, 92.08632794, 47.35171066])
predictions = predictions_c + predictions_r
predictions
array([ 10.95156578, 5.04628501, 2.47217666, ..., 101.62362588, 95.11822355, 49.38078278])
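Because `casual` and `registered` were modeled separately, the final count is simply their sum. One detail worth guarding against (an addition, not part of the original notebook): a random forest averaging non-negative log counts cannot predict below zero, but models that can (linear regression, for example) would make `exp(pred) - 1` negative, and the competition's RMSLE metric cannot score negative counts. Clipping at zero is a cheap safeguard; the numbers below are made up for illustration:

```python
import numpy as np

# Hypothetical per-model predictions on the original count scale.
predictions_c = np.array([1.3, 0.5, 0.1])   # casual riders
predictions_r = np.array([9.7, 4.5, -0.3])  # registered riders (one negative for illustration)

# Sum the two components, then clip: RMSLE is undefined for negative counts.
predictions = np.clip(predictions_c + predictions_r, 0, None)
print(predictions)
```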
submission = pd.read_csv("submit.csv")
submission.head()
datetime | count | |
---|---|---|
0 | 2011-01-20 00:00:00 | 0 |
1 | 2011-01-20 01:00:00 | 0 |
2 | 2011-01-20 02:00:00 | 0 |
3 | 2011-01-20 03:00:00 | 0 |
4 | 2011-01-20 04:00:00 | 0 |
submission["count"] = predictions
submission.head()
datetime | count | |
---|---|---|
0 | 2011-01-20 00:00:00 | 10.951566 |
1 | 2011-01-20 01:00:00 | 5.046285 |
2 | 2011-01-20 02:00:00 | 2.472177 |
3 | 2011-01-20 03:00:00 | 2.313137 |
4 | 2011-01-20 04:00:00 | 2.121748 |
submission.to_csv("bike-038205.csv", index=False)
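A quick sanity check before uploading never hurts. A sketch of the idea on an in-memory stand-in (in the notebook this would be `pd.read_csv` on the file just saved, compared against `test.shape[0]`, i.e. 6493):

```python
import pandas as pd

# Stand-in for the saved submission file, three rows for illustration.
submission = pd.DataFrame({
    "datetime": ["2011-01-20 00:00:00", "2011-01-20 01:00:00", "2011-01-20 02:00:00"],
    "count": [10.951566, 5.046285, 2.472177],
})

expected_rows = 3  # in the notebook: test.shape[0]

assert submission.shape[0] == expected_rows   # one prediction per test row
assert submission["count"].notna().all()      # no missing predictions
assert (submission["count"] >= 0).all()       # RMSLE needs non-negative counts
print("submission looks good")
```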