In the name of God

Cryptocurrency Price Prediction

Dataset

The dataset can be downloaded from the following URL:

http://dataset.class.vision/rnn/crypto_data.zip
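
A minimal sketch of fetching and unpacking the archive with the standard library. The helper names `extract_zip` and `fetch_and_extract`, the local file name, and the destination directory are assumptions made for illustration. The internal layout of the archive is not stated here, so check the returned member names against the `crypto_data/...` paths used in the cells below.

```python
import urllib.request
import zipfile
from pathlib import Path

DATASET_URL = "http://dataset.class.vision/rnn/crypto_data.zip"  # URL given above


def extract_zip(zip_path, dest_dir="."):
    """Unpack zip_path into dest_dir and return the names of the archive members."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
        return archive.namelist()


def fetch_and_extract(url=DATASET_URL, zip_path="crypto_data.zip", dest_dir="."):
    """Download the archive if it is not already present, then unpack it (hypothetical helper)."""
    if not Path(zip_path).exists():  # skip the download when the file is already there
        urllib.request.urlretrieve(url, zip_path)
    return extract_zip(zip_path, dest_dir)
```

Calling `fetch_and_extract()` once before the first cell should be enough. If the member names it returns do not start with `crypto_data/`, passing `dest_dir="crypto_data"` is one way to match the paths used below.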

In [1]:

import pandas as pd

df = pd.read_csv("crypto_data/LTC-USD.csv", names=['time', 'low', 'high', 'open', 'close', 'volume'])

print(df.head())

         time        low       high       open      close      volume
0  1528968660  96.580002  96.589996  96.589996  96.580002    9.647200
1  1528968720  96.449997  96.669998  96.589996  96.660004  314.387024
2  1528968780  96.470001  96.570000  96.570000  96.570000   77.129799
3  1528968840  96.449997  96.570000  96.570000  96.500000    7.216067
4  1528968900  96.279999  96.540001  96.500000  96.389999  524.539978

In [2]:

main_df = pd.DataFrame() # begin empty

ratios = ["BTC-USD", "LTC-USD", "BCH-USD", "ETH-USD"]  # the 4 ratios we want to consider
for ratio in ratios:  # begin iteration
    print(ratio)
    dataset = f'crypto_data/{ratio}.csv'  # get the full path to the file.
    df = pd.read_csv(dataset, names=['time', 'low', 'high', 'open', 'close', 'volume'])  # read in specific file

    # rename volume and close to include the ticker so we can still tell which close/volume is which:
    df.rename(columns={"close": f"{ratio}_close", "volume": f"{ratio}_volume"}, inplace=True)

    df.set_index("time", inplace=True)  # set time as index so we can join them on this shared time
    df = df[[f"{ratio}_close", f"{ratio}_volume"]]  # ignore the other columns besides price and volume

    if len(main_df)==0:  # if the dataframe is empty
        main_df = df  # then it's just the current df
    else:  # otherwise, join this data to the main one
        main_df = main_df.join(df)

main_df.ffill(inplace=True)  # if there are gaps in data, use previously known values (fillna(method="ffill") is deprecated in recent pandas)
main_df.dropna(inplace=True)
print(main_df.head())  # how did we do??

BTC-USD
LTC-USD
BCH-USD
ETH-USD
            BTC-USD_close  BTC-USD_volume  LTC-USD_close  LTC-USD_volume  \
time                                                                       
1528968720    6487.379883        7.706374      96.660004      314.387024   
1528968780    6479.410156        3.088252      96.570000       77.129799   
1528968840    6479.410156        1.404100      96.500000        7.216067   
1528968900    6479.979980        0.753000      96.389999      524.539978   
1528968960    6480.000000        1.490900      96.519997       16.991997   

            BCH-USD_close  BCH-USD_volume  ETH-USD_close  ETH-USD_volume  
time                                                                      
1528968720     870.859985       26.856577      486.01001       26.019083  
1528968780     870.099976        1.124300      486.00000        8.449400  
1528968840     870.789978        1.749862      485.75000       26.994646  
1528968900     870.000000        1.680500      486.00000       77.355759  
1528968960     869.989990        1.669014      486.00000        7.503300

In [3]:

SEQ_LEN = 60  # how long of a preceding sequence to collect for RNN
FUTURE_PERIOD_PREDICT = 3  # how far into the future are we trying to predict?
RATIO_TO_PREDICT = "LTC-USD"

In [4]:

main_df['future'] = main_df[f'{RATIO_TO_PREDICT}_close'].shift(-FUTURE_PERIOD_PREDICT)

In [5]:

main_df.head()

Out[5]:

	BTC-USD_close	BTC-USD_volume	LTC-USD_close	LTC-USD_volume	BCH-USD_close	BCH-USD_volume	ETH-USD_close	ETH-USD_volume	future
time
1528968720	6487.379883	7.706374	96.660004	314.387024	870.859985	26.856577	486.01001	26.019083	96.389999
1528968780	6479.410156	3.088252	96.570000	77.129799	870.099976	1.124300	486.00000	8.449400	96.519997
1528968840	6479.410156	1.404100	96.500000	7.216067	870.789978	1.749862	485.75000	26.994646	96.440002
1528968900	6479.979980	0.753000	96.389999	524.539978	870.000000	1.680500	486.00000	77.355759	96.470001
1528968960	6480.000000	1.490900	96.519997	16.991997	869.989990	1.669014	486.00000	7.503300	96.400002

In [6]:

def classify(current, future):
    if float(future) > float(current):
        return 1
    else:
        return 0

In [7]:

main_df['target'] = list(map(classify, main_df[f'{RATIO_TO_PREDICT}_close'], main_df['future']))

In [8]:

main_df.head()

Out[8]:

	BTC-USD_close	BTC-USD_volume	LTC-USD_close	LTC-USD_volume	BCH-USD_close	BCH-USD_volume	ETH-USD_close	ETH-USD_volume	future	target
time
1528968720	6487.379883	7.706374	96.660004	314.387024	870.859985	26.856577	486.01001	26.019083	96.389999	0
1528968780	6479.410156	3.088252	96.570000	77.129799	870.099976	1.124300	486.00000	8.449400	96.519997	0
1528968840	6479.410156	1.404100	96.500000	7.216067	870.789978	1.749862	485.75000	26.994646	96.440002	0
1528968900	6479.979980	0.753000	96.389999	524.539978	870.000000	1.680500	486.00000	77.355759	96.470001	1
1528968960	6480.000000	1.490900	96.519997	16.991997	869.989990	1.669014	486.00000	7.503300	96.400002	0
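
One detail worth checking: `shift(-FUTURE_PERIOD_PREDICT)` leaves the last three rows without a future price (NaN), and `classify` labels those rows 0 because any comparison with NaN is false. The sketch below reproduces this with plain Python lists, using the LTC-USD closes that can be read off the `LTC-USD_close` and `future` columns above. `shift_back` is a hypothetical stand-in for the pandas shift, not part of the notebook.

```python
import math

FUTURE_PERIOD_PREDICT = 3


def shift_back(values, periods):
    """Mimic Series.shift(-periods): pull later values forward, pad the tail with NaN."""
    return values[periods:] + [math.nan] * periods


def classify(current, future):
    return 1 if float(future) > float(current) else 0


# Eight consecutive LTC-USD closes taken from the table above
closes = [96.660004, 96.570000, 96.500000, 96.389999,
          96.519997, 96.440002, 96.470001, 96.400002]

future = shift_back(closes, FUTURE_PERIOD_PREDICT)
targets = [classify(c, f) for c, f in zip(closes, future)]

print(targets)  # [0, 0, 0, 1, 0, 0, 0, 0]
```

The first five labels match `Out[8]`; the last three are zeros only because their future price is missing. Dropping those rows before labeling, for example with `main_df.dropna(subset=['future'])`, is one way to avoid the artificial labels.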

Splitting the training and validation data

In [9]:

times = sorted(main_df.index.values)  # all timestamps in ascending order
last_5pct = times[-int(0.05*len(times))]  # the timestamp at which the last 5% of the data begins

validation_main_df = main_df[(main_df.index >= last_5pct)]  # validation data: rows at or after that timestamp
main_df = main_df[(main_df.index < last_5pct)]  # training data: everything before that timestamp
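
The split above is chronological rather than random: everything at or after a cutoff timestamp becomes validation data, so the model is never validated on minutes that precede its training data. A small sketch with ten made-up one-minute timestamps and a 20% hold-out (both values are illustrative assumptions, chosen so the arithmetic is easy to follow; `chronological_split` is a hypothetical helper):

```python
def chronological_split(times, holdout_fraction):
    """Return (train_times, validation_times), split at a cutoff timestamp."""
    ordered = sorted(times)
    n_holdout = int(holdout_fraction * len(ordered))
    if n_holdout == 0:
        # ordered[-0] is ordered[0], which would put every row into validation
        raise ValueError("holdout_fraction is too small for this many rows")
    cutoff = ordered[-n_holdout]  # first timestamp of the hold-out block
    train = [t for t in ordered if t < cutoff]
    validation = [t for t in ordered if t >= cutoff]
    return train, validation


times = [1528968720 + 60 * i for i in range(10)]  # made-up timestamps, one per minute
train, validation = chronological_split(times, 0.2)
print(len(train), len(validation))  # 8 2
```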

Next, we need to balance and normalize this data.

Balancing means making sure both classes are equally represented during training, so the model can't lower its loss by always predicting one class.

One way to counteract an imbalance is to use class weights, which weight the loss more heavily for the less frequent class. That said, I've never personally seen this work as well as a truly balanced dataset.
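
For reference, a minimal sketch of how such class weights are commonly computed. It uses the "balanced" heuristic, weight = n_samples / (n_classes * class_count), which is the formula scikit-learn documents for `class_weight='balanced'`. The function name and the label list are made up for illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


labels = [0] * 6 + [1] * 4  # made-up, mildly imbalanced targets
weights = balanced_class_weights(labels)
print(weights)  # class 0 gets 10/12, class 1 gets 1.25
```

If the model is trained with Keras, as in the linked source, a dictionary of this shape can be passed as the `class_weight` argument of `model.fit`.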

We also need to turn our data into sequences.

So... we've got some work to do! We'll start by making a function that processes a dataframe, so that we can simply write:

train_x, train_y = preprocess_df(main_df)
validation_x, validation_y = preprocess_df(validation_main_df)

Let's start by removing the future column: the actual target is the column literally named target, and the future column was only needed temporarily to create it.

Then, we need to scale our data:

In [10]:

from sklearn import preprocessing  # pip install scikit-learn ... if you don't have it!

def preprocess_df(df):
    df = df.drop(columns="future")  # don't need this anymore.

    for col in df.columns:  # go through all of the columns
        if col != "target":  # normalize all ... except for the target itself!
            df[col] = df[col].pct_change()  # pct change "normalizes" the different currencies (each crypto coin has vastly diff values, we're really more interested in the other coin's movements)
            df.dropna(inplace=True)  # remove the NaNs created by pct_change
            df[col] = preprocessing.scale(df[col].values)  # standardize to zero mean and unit variance (not a 0-to-1 range)

    df.dropna(inplace=True)  # cleanup again... jic. Those nasty NaNs love to creep in.

Alright, we've normalized and scaled the data!
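
To make the two steps concrete, here is a plain-Python sketch on five of the LTC-USD closes shown earlier. `pct_change` and `standardize` are hypothetical stand-ins for `Series.pct_change()` and `preprocessing.scale`. The point is that the result has zero mean and unit variance, and it is not confined to the range 0 to 1.

```python
import statistics


def pct_change(values):
    """Fractional change from each value to the next; the first row has no predecessor and is dropped."""
    return [(cur - prev) / prev for prev, cur in zip(values, values[1:])]


def standardize(values):
    """Subtract the mean and divide by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


closes = [96.660004, 96.570000, 96.500000, 96.389999, 96.519997]  # LTC-USD closes from the first output
changes = pct_change(closes)   # four values: one fewer than the number of closes
scaled = standardize(changes)  # contains negative values and values above 1

print(scaled)
```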

Next up, we need to create our actual sequences.

To do this:

In [11]:

import numpy as np
from collections import deque
import random

# NOTE: `df` is expected to be the scaled dataframe produced by the preprocessing step above
# (feature columns followed by target), not one of the raw per-coin dataframes.

sequential_data = []  # list that will hold [sequence, target] pairs
prev_days = deque(maxlen=SEQ_LEN)  # sliding window; deque keeps at most SEQ_LEN rows by dropping the oldest as new ones come in

for i in df.values:  # iterate over the rows
    prev_days.append([n for n in i[:-1]])  # store all columns but the target
    if len(prev_days) == SEQ_LEN:  # once the window holds SEQ_LEN (60) rows
        sequential_data.append([np.array(prev_days), i[-1]])  # save the window with the target of its last row

random.shuffle(sequential_data)  # shuffle for good measure.
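
The balancing step described earlier can then be applied to `sequential_data`. The sketch below follows the general idea of splitting by label, trimming both groups to the size of the smaller one, reshuffling, and separating features from labels. The function name `balance_and_split`, the `seed` parameter, and the toy input are assumptions for illustration.

```python
import random


def balance_and_split(sequential_data, seed=None):
    """Downsample the larger class, shuffle, and return (X, y) as lists."""
    rng = random.Random(seed)
    buys = [item for item in sequential_data if item[1] == 1]
    sells = [item for item in sequential_data if item[1] == 0]

    rng.shuffle(buys)
    rng.shuffle(sells)

    smaller = min(len(buys), len(sells))  # size of the minority class
    balanced = buys[:smaller] + sells[:smaller]
    rng.shuffle(balanced)  # avoid having all of one class first

    X = [seq for seq, _ in balanced]
    y = [target for _, target in balanced]
    return X, y


# Toy input: nine [sequence, target] pairs, three of them labeled 1
toy = [[[[float(i)]], 1 if i % 3 == 0 else 0] for i in range(9)]
X, y = balance_and_split(toy, seed=0)
print(len(X), sum(y))  # 6 3
```

With the real data, `X` would typically be converted with `np.array(X)` before being passed to a model.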

Source:

https://becominghuman.ai/recurrent-neural-networks-rnn-deep-learning-w-python-tensorflow-keras-p-7-c21bc374d4dc

Advanced Deep Learning Course
Alireza Akhavan Pour
Aban and Azar 1399 (October to December 2020)

Class.Vision - AkhavanPour.ir - GitHub
