For my BTC price prediction project, I had to pull data from several sources using different methods (a Python package, web scraping, and an API).
Here is the code I use to build a single CSV file:
# Main
import pandas as pd
import numpy as np
import math
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# Web
import requests
from bs4 import BeautifulSoup
#from IPython.display import display_html
# More
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()
pd.set_option('display.max_rows', 400)
# Date format (ex: 190913)
from datetime import date
#today = date.today().strftime("%Y%m%d")[2:]
# PyTrends
from pytrends.request import TrendReq
pytrend = TrendReq()
Since February 2018, Alternative.me has provided a crypto Fear & Greed Index based on several market signals (volatility, momentum/volume, social media, surveys, Bitcoin dominance, and search trends).
Note: the API can return the data as CSV, but I didn't find an easy way to read/convert it directly.
PS: there are some duplicates and missing values in this data. See below in 1.2.4.
url = 'https://api.alternative.me/fng/?limit=0&format=csv&date_format=cn'
r = requests.get(url)
fgi = str(BeautifulSoup(r.text, 'lxml'))
# Keep only the CSV rows between the header line and the closing bracket
fgi = fgi[fgi.find('date\n')+5:fgi.find('\n\t]')]
# Flatten into one list of fields: date, score, category, date, score, ...
fgi = fgi.replace('\n',',').split(',')
# Rebuild the rows by taking every third field
fgi = pd.DataFrame({'Date':fgi[::3],'Score':fgi[1:][::3],'Category':fgi[2:][::3]})
fgi['Date'] = pd.to_datetime(fgi['Date'])
fgi = fgi.set_index('Date').sort_index()
fgi.head()
| Date | Score | Category |
|---|---|---|
| 2018-02-01 | 30 | Fear |
| 2018-02-02 | 15 | Extreme Fear |
| 2018-02-03 | 40 | Fear |
| 2018-02-04 | 24 | Extreme Fear |
| 2018-02-05 | 11 | Extreme Fear |
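Since the CSV rows are embedded in a JSON-like envelope, another option is to cut out the CSV substring and hand it to `pd.read_csv` instead of slicing fields by hand. A minimal sketch, assuming the envelope still wraps the rows like this; the `raw` string below is a hypothetical mock of the response, not real API output:

```python
import io
import pandas as pd

# Hypothetical mock of the API payload: CSV rows inside a JSON-like wrapper
raw = (
    '{\n\t"data": [\n'
    'date,fng_value,fng_classification\n'
    '2018-02-01,30,Fear\n'
    '2018-02-02,15,Extreme Fear\n'
    '\n\t]\n}'
)

# Cut out the CSV part (header included), then let pandas do the parsing
start = raw.find('date')
end = raw.find('\n\t]')
csv_text = raw[start:end]

fgi = pd.read_csv(io.StringIO(csv_text), parse_dates=['date'])
fgi = fgi.rename(columns={'date': 'Date',
                          'fng_value': 'Score',
                          'fng_classification': 'Category'})
fgi = fgi.set_index('Date').sort_index()
```

This keeps the type conversion (dates, integer scores) inside `read_csv` rather than doing it manually afterwards.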
In a recent notebook, I discovered Pytrends (https://github.com/GeneralMills/pytrends), a great tool for scraping Google Trends data.
(Thanks to the team for sharing it.)
You can get data back to 2004, but note that if you request too long a period (more than about 8 months), Google Trends returns weekly scores instead of daily ones.
The period from today (2019-09-13) back to the first Fear & Greed Index data point (2018-02-01) exceeds that limit, so interest_over_time()
would return weekly data. And here is the trick we're interested in: Pytrends provides build_payload()
to set the timeframe, localisation, etc. (more in the documentation).
(More: Pytrends also has a get_historical_interest() function
for hourly data, among many other possibilities.)
So what we need to do is split the period into chunks of at most 8 months, so that each request returns daily trend scores.
# Keywords list (max 5 words/expressions)
kw_list = ['bitcoin trading']
# 2018-02-01 to 2018-05-11
pytrend.build_payload(kw_list, cat=0, timeframe='2018-02-01 2018-05-11', geo='', gprop='')
gst_1 = pytrend.interest_over_time()
# 2018-05-12 to 2019-01-22
pytrend.build_payload(kw_list, cat=0, timeframe='2018-05-12 2019-01-22', geo='', gprop='')
gst_2 = pytrend.interest_over_time()
# 2019-01-23 to 2019-09-23
pytrend.build_payload(kw_list, cat=0, timeframe='2019-01-23 2019-09-23', geo='', gprop='')
gst_3 = pytrend.interest_over_time()
# Concat
frames = [gst_1, gst_2, gst_3]
gst = pd.concat(frames)
gst.head()
| date | bitcoin trading | isPartial |
|---|---|---|
| 2018-02-01 | 82 | False |
| 2018-02-02 | 98 | False |
| 2018-02-03 | 64 | False |
| 2018-02-04 | 67 | False |
| 2018-02-05 | 83 | False |
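The three hand-written timeframes above could also be generated programmatically, which helps if the date range changes. A sketch; the 240-day cap and the `daily_chunks` helper are my own assumptions, not part of Pytrends:

```python
from datetime import date, timedelta

def daily_chunks(start, end, max_days=240):
    """Split [start, end] into consecutive windows of at most
    max_days days (~8 months), formatted for build_payload()."""
    chunks = []
    cur = start
    while cur <= end:
        stop = min(cur + timedelta(days=max_days - 1), end)
        chunks.append(f'{cur:%Y-%m-%d} {stop:%Y-%m-%d}')
        cur = stop + timedelta(days=1)
    return chunks

# The full Fear & Greed period in daily-resolution windows
timeframes = daily_chunks(date(2018, 2, 1), date(2019, 9, 23))

# Hypothetical usage (network calls, not run here):
# frames = []
# for tf in timeframes:
#     pytrend.build_payload(kw_list, timeframe=tf)
#     frames.append(pytrend.interest_over_time())
# gst = pd.concat(frames)
```

Each window stays under the daily-data limit, and the windows are contiguous, so the concatenated result covers the whole period without gaps or overlaps.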
No surprise here: I've already scraped the BTC price from coinmarketcap.com a few times.
def get_btc_price(start_date='20170101', end_date='20190913'):
'''
Webscraping BTC price on coinmarketcap.com
'''
url = f'https://coinmarketcap.com/currencies/bitcoin/historical-data/?start={start_date}&end={end_date}'
    btc = BeautifulSoup(requests.get(url).text, 'lxml')
    # Parse the first HTML table on the page into a dataframe
    btc = pd.read_html(str(btc.find_all('table', class_='table')[0]))[0]
btc['Date'] = pd.to_datetime(btc['Date'])
return btc
bhp = get_btc_price(start_date='20180201', end_date='20190923')
bhp = bhp.set_index('Date').sort_index()
bhp.head()
| Date | Open* | High | Low | Close** | Volume | Market Cap |
|---|---|---|---|---|---|---|
| 2018-02-01 | 10237.30 | 10288.80 | 8812.28 | 9170.54 | 9959400448 | 154428564694 |
| 2018-02-02 | 9142.28 | 9142.28 | 7796.49 | 8830.75 | 12726899712 | 148725283812 |
| 2018-02-03 | 8852.12 | 9430.75 | 8251.63 | 9174.91 | 7263790080 | 154540000411 |
| 2018-02-04 | 9175.70 | 9334.87 | 8031.22 | 8277.01 | 7073549824 | 139433682759 |
| 2018-02-05 | 8270.54 | 8364.84 | 6756.68 | 6955.27 | 9285289984 | 117184385122 |
Let's join our three dataframes. For each, we need to check the period and the shape.
This is a necessary step, because one of the dataframes contains duplicates and missing values.
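Before joining, a quick way to surface duplicates and holes is to compare the index against a full daily `pd.date_range`. A sketch on a toy dataframe; `index_gaps` is a hypothetical helper, not part of the notebook:

```python
import pandas as pd

def index_gaps(df):
    """Report duplicated dates and missing calendar days
    in a DatetimeIndex-ed dataframe."""
    dupes = df.index[df.index.duplicated()]
    full = pd.date_range(df.index.min(), df.index.max(), freq='D')
    missing = full.difference(df.index)
    return dupes, missing

# Toy example: 2018-04-02 is duplicated, 2018-04-04 is missing
toy = pd.DataFrame(
    {'Score': [30, 15, 16, 40, 24]},
    index=pd.to_datetime(['2018-04-01', '2018-04-02', '2018-04-02',
                          '2018-04-03', '2018-04-05']))
dupes, missing = index_gaps(toy)
```

Running the same check on the real Fear & Greed dataframe would flag the duplicated and missing dates handled manually below.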
df_fgi = fgi.copy()
df_gst = gst.copy()
df_bhp = bhp.copy()
#print(df_fgi.index[-1]) >>> 2019-09-23 (+3 days)
#print(df_gst.index[-1]) >>> 2019-09-20
#print(df_bhp.index[-1]) >>> 2019-09-22 (+2 days)
df_fgi = df_fgi[:-3]
df_bhp = df_bhp[:-2]
# df_gst already ends on 2019-09-20, nothing to trim
#print(df_fgi.shape) >>> (596, 2) (Something weird with this dataframe)
#print(df_gst.shape) >>> (597, 2)
#print(df_bhp.shape) >>> (597, 6)
#df_fgi.reset_index().shape >>> (596, 3)
#df_fgi.reset_index().drop_duplicates(subset='Date', keep='last').shape >>> (594, 3)
# Found two duplicated dates: 2019-02-17, 2019-01-30
df_fgi = df_fgi.reset_index().drop_duplicates(subset='Date', keep='last')
# Found three missing dates: 2018-04-14, 2018-04-15, 2018-04-16
# I filled them with realistic values
# and reported the bug to the API team; maybe it will be fixed
val = pd.DataFrame({'Date': ['2018-04-14', '2018-04-15', '2018-04-16'],
                    'Score': ['25', '26', '24'],
                    'Category': ['Extreme Fear', 'Fear', 'Extreme Fear']})
val['Date'] = pd.to_datetime(val['Date'])
# pd.concat instead of the deprecated DataFrame.append
df_fgi = pd.concat([df_fgi, val]).set_index('Date').sort_index()
# Okay
df = df_fgi.join(df_gst).join(df_bhp)
df.head()
| Date | Score | Category | bitcoin trading | isPartial | Open* | High | Low | Close** | Volume | Market Cap |
|---|---|---|---|---|---|---|---|---|---|---|
| 2018-02-01 | 30 | Fear | 82 | False | 10237.30 | 10288.80 | 8812.28 | 9170.54 | 9959400448 | 154428564694 |
| 2018-02-02 | 15 | Extreme Fear | 98 | False | 9142.28 | 9142.28 | 7796.49 | 8830.75 | 12726899712 | 148725283812 |
| 2018-02-03 | 40 | Fear | 64 | False | 8852.12 | 9430.75 | 8251.63 | 9174.91 | 7263790080 | 154540000411 |
| 2018-02-04 | 24 | Extreme Fear | 67 | False | 9175.70 | 9334.87 | 8031.22 | 8277.01 | 7073549824 | 139433682759 |
| 2018-02-05 | 11 | Extreme Fear | 83 | False | 8270.54 | 8364.84 | 6756.68 | 6955.27 | 9285289984 | 117184385122 |
# Let's save this dataframe to CSV (date format: 190913)
today = date.today().strftime("%Y%m%d")[2:]
df.to_csv(f'{today}_bitcoin_predictions_data.csv')
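To reload the saved CSV later with the datetime index restored, `read_csv` can parse the Date column back on the way in. A minimal round-trip sketch; the sample frame and file path are illustrative, not the notebook's real output:

```python
import os
import tempfile
import pandas as pd

# Save a small dataframe the same way as above, then reload it
sample = pd.DataFrame({'Score': [30, 15]},
                      index=pd.to_datetime(['2018-02-01', '2018-02-02']))
sample.index.name = 'Date'

path = os.path.join(tempfile.gettempdir(), 'sample_bitcoin_data.csv')
sample.to_csv(path)

# index_col + parse_dates rebuild the DatetimeIndex from the CSV
reloaded = pd.read_csv(path, index_col='Date', parse_dates=True)
```

Without `parse_dates=True`, the index would come back as plain strings and date-based slicing would break.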