Collecting information for machine learning purposes. Parsing and Grabbing

When studing machine learning we mainly concentrate on algorithms of proccessing data rather than on collecting that data. And this is naturally because there are so many databases available for downloading: any types, any sizes, for any ML algorithms. But in real life we are given particular goals and of course any data science processing starts with collecting/getting of information.

Today our life is directly connected with internet and web sites: almost any text information that we could need is available online. So in this tutorial we'l consider how to collect particular information from web sites. And first of all we`l look a little inside html code to understand better how to extract information from it.

HTML language "tells" web browsers when, where and what element to show at the html page. We can imagine it as map that specifies the route for drivers: when to start, where to turn left or right and what place to go. That`s why html structure of web pages are so convinient to grab information. Here is the simple piece of HTML code:

In [5]:
'''
<h2> Hello! This is the first text paragraph</h2>
<p><font size="+1" color="red"> and below how this code is interpreted by web browser</font></p>
'''
Out[5]:
'\n<h2> Hello! This is the first text paragraph</h2>\n<p><font size="+1"> and below how this code is interpreted by web browser</font></p>\n'

Hello! This is the first text paragraph

and below how this code is interpreted by web browser

Two markers 'h1' and 'p' point browser what and how to show on the page, and thus this markers are keys for us that help to get exactly that information we need. There are a lot of information about html language and its main tags ('h1', 'p', 'html', etc. - this are all tags), so you can learn it more deeply, because we will focus on the parsing process. And for this purpose we will use BeautifulSoup Python library:

In [6]:
from bs4 import BeautifulSoup

Before learning how to grab information directly online, let's load a little html page as text file from our folder. (right click on this link and download to folder with yupiter notebook), as sometimes we can work with already downloaded files.


In [7]:
source_text = open('toy.html', 'r')
html = ''
for line in source_text:    
    html = html+line
print(html)
<html><body><p class='main 1'>This is first paragraph.</p><p class='main 2'>This is second paragraph.</p><p id='third'>This is third paragraph.</p></body></html>

Our file consists of several tags and the information we need to get is contained in the second 'p' tag. To get it, we have to "feed" the BeautifulSoup module with total html code, that it could parse through it and find what we need.

In [8]:
# "feed" total page
soup = BeautifulSoup(html, "lxml")
# Find all <p> blocks in the page 
all_p = soup.find_all('p')
print(all_p)
[<p class="main 1">This is first paragraph.</p>, <p class="main 2">This is second paragraph.</p>, <p id="third">This is third paragraph.</p>]

Function '.find_all' collects for us all 'p' blocks in our file. We simply choose the necessary p-element from list by certain index and leave only text inside this tag:


In [25]:
#extract tags text
all_p[1].text
Out[25]:
'This is second paragraph.'

But when there are many uniform tags (like 'p') in the code or when from page to page the index number of needed paragraph changes, such aprroach will not do right things. Today almost any tag contains special attributes like 'id', 'class', 'title' etc. To know more about it you can search for CSS style sheet language. For us this attributes are additional anchors to pull from page source exactly the right paragraph (in our case). Using function '.find' we'l get not list but only 1 element (be sure that such element is only one in the page, because you can miss some information in other case.)

In [10]:
# We know the needed paragraph contains attribute id='third'. Let`s grab it 
p_third = soup.find('p', id='third').text
print(p_third)
# In case of searching for class attribute, we have to use another way of coding,
# because of reserved word "class" in Python
p_first = soup.find('p', {'class': 'main 1'}).text
print(p_first)
This is third paragraph.
This is first paragraph.

In the times of dynamic pages which has various CSS styles for different types of devices, you will be very often facing the problem of changing names of tag attributes. Or it could change a little from page to page accroding to other content on it. In case when names of needed tag blocks are totaly different - we have to setup complex grabbing "architecture". But usually there are common words in that differing names. In our toy case we have two paragraphs with word "main"in 'class' attribute in both of them.

In [14]:
# find all paragraphs where class attribute contains word "main" 
p_main = soup.find_all('p', {'class': 'main'})
p_main = [p.text for p in p_main]
print(p_main)
# find all paragraphs with word "2"
p_second = soup.find_all('p', {'class': '2'})
p_second = [p.text for p in p_second]
print(p_second)
['This is first paragraph.', 'This is second paragraph.']
['This is second paragraph.']

Also you can get code between two tags which contains inside smaller tag blocks and clean them out, leaving only text inside them.


In [15]:
# get all html code inside tag "html"
html_tag = soup.find('html')
print(html_tag)
# clear it from inside tags, leaving text.
html_tag = html_tag.text
print(html_tag)
<html><body><p class="main 1">This is first paragraph.</p><p class="main 2">This is second paragraph.</p><p id="third">This is third paragraph.</p></body></html>
This is first paragraph.This is second paragraph.This is third paragraph.

Let's do things more complicated. First grabbing task.

Imagine we have the task to analyze if there was correlation between the title of major news and price for Bitcoin? On the one hand we need to collect news about bitcoin for defined period and on the another - price. To do this we need to connect a few extra libraries including "selenium". Then download chromedriver and put it in folder with yupiter notebook. Selenium connects python script with chrome browser and let us to send commands to it and recieve html code of loaded pages.

In [17]:
from selenium import webdriver
from datetime import datetime
from datetime import timedelta
import time
import random

One of the way to get necessery news is to use Google Search. First of all because it grabs news headlines from many sites and we don't need to tune our script for that every news portal. The second thing - we can browse news by dates. What we have to do is to understand how the link of google news section works:

https://www.google.com/search?q=bitcoin&num=100&biw=1920&bih=938&source=lnt&tbs=cdr%3A1%2Ccd_min%3A12%2F11%2F2018%2Ccd_max%3A12%2F11%2F2018&tbm=nws

"search?q=bitcoin" - what we are searching for

"num=100" - number of headlines

"cd_min%3A12%2F11%2F2018" - start date (cd_min%3A [12] %2F [11] %2F [2018] %2C - 12/11/2018 - MM/DD/YYYY)

"cd_max%3A12%2F11%2F2018" - end date

<img src = 'http://umachka.net/ml/parsing_01.png'>


Let's try to load news for word bitcoin for 01/15/2018


In [18]:
# start chrome web browser
driver = webdriver.Chrome()
# set settings
cur_year = 2018
cur_month = 1
cur_day = 1
news_word = 'bitcoin'
# set up url
cur_url = 'https://www.google.com/search?q='+str(news_word)+'&num=100&biw=1920&bih=929&source=lnt&tbs=cdr:1,cd_min:'\
       +str(cur_month)+'/'+str(cur_day)+'/'+str(cur_year)+\
       ',cd_max:'+str(cur_month)+'/'+str(cur_day)+'/'+str(cur_year)+'&tbm=nws'
# load url to the chromedriver
driver.get(cur_url)
# wait a little to let the page load fully
time.sleep(random.uniform(5, 10))
# read html code of loaded to chrome web page.
html = driver.page_source
soup = BeautifulSoup(html, "lxml")

We are lucky): correct page, correct search word and correct date. To move on, we need to examine html code and to find that tags (anchors) which let us grab necessary information. The most convenient way is to use "inspect" button of the right-click menu of Google Chrome web browser (or similar in other browsers). See the screenshot.

As we see 'h3' tag is resposible for block with news titles. This tag has attribute class="r dO0Ag". But in this case we can use only 'h3' tag as anchor because it used only to highlight titles.

<img src = 'http://umachka.net/ml/parsing_02.png'>

In [19]:
# collect all h3 tags in the code
titles = soup.find_all('h3')
print(titles)
[<h3 class="r dO0Ag"><a class="l lLrAF" href="https://news.bitcoin.com/after-ripples-rise-btc-dominance-falls-below-40/" ping="/url?sa=t&amp;source=web&amp;rct=j&amp;url=https://news.bitcoin.com/after-ripples-rise-btc-dominance-falls-below-40/&amp;ved=0ahUKEwjKia-71ZzfAhUXiIMKHcEFBbEQqQIIJygAMAA">After Ripple's Rise BTC Dominance Falls Below 40%</a></h3>, <h3 class="r dO0Ag"><a class="l lLrAF" href="https://www.theglobeandmail.com/report-on-business/how-is-the-growth-of-bitcoin-affecting-the-environment/article37468173/" ping="/url?sa=t&amp;source=web&amp;rct=j&amp;url=https://www.theglobeandmail.com/report-on-business/how-is-the-growth-of-bitcoin-affecting-the-environment/article37468173/&amp;ved=0ahUKEwjKia-71ZzfAhUXiIMKHcEFBbEQqQIIKigAMAE">How is the growth of <em>bitcoin</em> affecting the environment?</a></h3>, <h3 class="r dO0Ag"><a class="l lLrAF" href="https://www.investopedia.com/news/two-things-we-learned-bitcoins-price-moves/" ping="/url?sa=t&amp;source=web&amp;rct=j&amp;url=https://www.investopedia.com/news/two-things-we-learned-bitcoins-price-moves/&amp;ved=0ahUKEwjKia-71ZzfAhUXiIMKHcEFBbEQqQIILygAMAI">Two Things We Learned From <em>Bitcoin's</em> Price Moves</a></h3>, <h3 class="r dO0Ag"><a class="l lLrAF" href="https://cointelegraph.com/news/india-falsely-condemns-bitcoin-as-ponzi-scheme-flawed-logic" ping="/url?sa=t&amp;source=web&amp;rct=j&amp;url=https://cointelegraph.com/news/india-falsely-condemns-bitcoin-as-ponzi-scheme-flawed-logic&amp;ved=0ahUKEwjKia-71ZzfAhUXiIMKHcEFBbEQqQIIMigAMAM">India Falsely Condemns <em>Bitcoin</em> as Ponzi Scheme, Flawed Logic</a></h3>, <h3 class="r dO0Ag"><a class="l lLrAF" href="https://www.forbes.com/sites/petertchir/2018/01/01/bitcoin-futures-fail-to-live-up-to-the-hype/" ping="/url?sa=t&amp;source=web&amp;rct=j&amp;url=https://www.forbes.com/sites/petertchir/2018/01/01/bitcoin-futures-fail-to-live-up-to-the-hype/&amp;ved=0ahUKEwjKia-71ZzfAhUXiIMKHcEFBbEQqQIINSgAMAQ"><em>Bitcoin</em> Futures Fail To Live Up To The Hype</a></h3>, <h3 class="r dO0Ag"><a class="l lLrAF" href="https://forklog.com/stanet-li-lightning-network-otvetom-bitkoina-na-nizkie-komissii-altkoinov/" ping="/url?sa=t&amp;source=web&amp;rct=j&amp;url=https://forklog.com/stanet-li-lightning-network-otvetom-bitkoina-na-nizkie-komissii-altkoinov/&amp;ved=0ahUKEwjKia-71ZzfAhUXiIMKHcEFBbEQqQIIOCgAMAU">Станет ли Lightning Network ответом биткоина на низкие комиссии ...</a></h3>, <h3 class="r dO0Ag"><a class="l lLrAF" href="https://www.cnbc.com/2017/12/29/bitcoin-fever-to-burn-out-in-spectacular-crash-david-stockman-warned.html" ping="/url?sa=t&amp;source=web&amp;rct=j&amp;url=https://www.cnbc.com/2017/12/29/bitcoin-fever-to-burn-out-in-spectacular-crash-david-stockman-warned.html&amp;ved=0ahUKEwjKia-71ZzfAhUXiIMKHcEFBbEQqQIIOygAMAY"><em>Bitcoin</em> fever to burn out in 'spectacular crash,' David Stockman warned</a></h3>, <h3 class="r dO0Ag"><a class="l lLrAF" href="https://cointelegraph.com/news/bitcoin-adds-03-to-japans-gdp-claim-nomura-analysts" ping="/url?sa=t&amp;source=web&amp;rct=j&amp;url=https://cointelegraph.com/news/bitcoin-adds-03-to-japans-gdp-claim-nomura-analysts&amp;ved=0ahUKEwjKia-71ZzfAhUXiIMKHcEFBbEQqQIIQCgAMAc"><em>Bitcoin</em> 'Adds 0.3%' To Japan's GDP, Claim Nomura Analysts</a></h3>, <h3 class="r dO0Ag"><a class="l lLrAF" href="https://news.bitcoin.com/bank-of-england-could-issue-bitcoin-style-digital-currency-by-2018/" ping="/url?sa=t&amp;source=web&amp;rct=j&amp;url=https://news.bitcoin.com/bank-of-england-could-issue-bitcoin-style-digital-currency-by-2018/&amp;ved=0ahUKEwjKia-71ZzfAhUXiIMKHcEFBbEQqQIIRSgAMAg">Bank of England Could Issue “<em>Bitcoin</em>-style Digital Currency” in 2018</a></h3>, <h3 class="r dO0Ag"><a class="l lLrAF" href="https://news.bitcoin.com/privacy-coin-verge-allegedly-leaking-users-ip-addresses/" ping="/url?sa=t&amp;source=web&amp;rct=j&amp;url=https://news.bitcoin.com/privacy-coin-verge-allegedly-leaking-users-ip-addresses/&amp;ved=0ahUKEwjKia-71ZzfAhUXiIMKHcEFBbEQqQIISCgAMAk">'Privacy Coin' Verge is Allegedly Leaking Users' IP Addresses</a></h3>, <h3 class="r dO0Ag"><a class="l lLrAF" href="https://www.thenational.ae/world/how-the-world-hijacked-bitcoin-in-2017-1.692055" ping="/url?sa=t&amp;source=web&amp;rct=j&amp;url=https://www.thenational.ae/world/how-the-world-hijacked-bitcoin-in-2017-1.692055&amp;ved=0ahUKEwjKia-71ZzfAhUXiIMKHcEFBbEQqQIISygAMAo">How the world hijacked <em>bitcoin</em> in 2017</a></h3>, <h3 class="r dO0Ag"><a class="l lLrAF" href="https://news.bitcoin.com/public-firm-faces-class-action-lawsuit-for-falsely-claiming-link-to-bitcoin/" ping="/url?sa=t&amp;source=web&amp;rct=j&amp;url=https://news.bitcoin.com/public-firm-faces-class-action-lawsuit-for-falsely-claiming-link-to-bitcoin/&amp;ved=0ahUKEwjKia-71ZzfAhUXiIMKHcEFBbEQqQIITigAMAs">Public Firm Faces Class Action Lawsuit for Falsely Claiming Link to ...</a></h3>, <h3 class="r dO0Ag"><a class="l lLrAF" href="https://www.forbes.com/sites/francescoppola/2018/01/01/the-illogical-value-proposition-of-bitcoin/" ping="/url?sa=t&amp;source=web&amp;rct=j&amp;url=https://www.forbes.com/sites/francescoppola/2018/01/01/the-illogical-value-proposition-of-bitcoin/&amp;ved=0ahUKEwjKia-71ZzfAhUXiIMKHcEFBbEQqQIIUSgAMAw">The Illogical Value Proposition Of <em>Bitcoin</em></a></h3>, <h3 class="r dO0Ag"><a class="l lLrAF" href="https://qz.com/1169000/ripple-was-the-best-performing-cryptocurrency-of-2017-beating-bitcoin/" ping="/url?sa=t&amp;source=web&amp;rct=j&amp;url=https://qz.com/1169000/ripple-was-the-best-performing-cryptocurrency-of-2017-beating-bitcoin/&amp;ved=0ahUKEwjKia-71ZzfAhUXiIMKHcEFBbEQqQIIVCgAMA0">Here are the top 10 cryptoassets of 2017 (and <em>bitcoin's</em> 1000% rise ...</a></h3>, <h3 class="r dO0Ag"><a class="l lLrAF" href="https://psmag.com/economics/bitcoin-in-your-bodega" ping="/url?sa=t&amp;source=web&amp;rct=j&amp;url=https://psmag.com/economics/bitcoin-in-your-bodega&amp;ved=0ahUKEwjKia-71ZzfAhUXiIMKHcEFBbEQqQIIVygAMA4"><em>Bitcoin</em> Is Coming to Your Bodega</a></h3>, <h3 class="r dO0Ag"><a class="l lLrAF" href="https://www.coindesk.com/video-bitcoin-sign-guy-tells-infamous-janet-yellen-photobomb" ping="/url?sa=t&amp;source=web&amp;rct=j&amp;url=https://www.coindesk.com/video-bitcoin-sign-guy-tells-infamous-janet-yellen-photobomb&amp;ved=0ahUKEwjKia-71ZzfAhUXiIMKHcEFBbEQqQIIWigAMA8">Video: <em>Bitcoin</em> Sign Guy Tells All About Infamous Janet Yellen ...</a></h3>, <h3 class="r dO0Ag"><a class="l lLrAF" href="https://www.cbc.ca/news/business/bitcoin-s-gender-divide-could-be-a-bad-sign-experts-say-1.4458884" ping="/url?sa=t&amp;source=web&amp;rct=j&amp;url=https://www.cbc.ca/news/business/bitcoin-s-gender-divide-could-be-a-bad-sign-experts-say-1.4458884&amp;ved=0ahUKEwjKia-71ZzfAhUXiIMKHcEFBbEQqQIIXSgAMBA"><em>Bitcoin's</em> gender divide could be a bad sign, experts say</a></h3>, <h3 class="r dO0Ag"><a class="l lLrAF" href="https://cointelegraph.com/news/bitcoin-adoption-by-businesses-in-2017" ping="/url?sa=t&amp;source=web&amp;rct=j&amp;url=https://cointelegraph.com/news/bitcoin-adoption-by-businesses-in-2017&amp;ved=0ahUKEwjKia-71ZzfAhUXiIMKHcEFBbEQqQIIYCgAMBE"><em>Bitcoin</em> Adoption by Businesses in 2017</a></h3>, <h3 class="r dO0Ag"><a class="l lLrAF" href="https://news.bitcoin.com/pr-promising-icos-and-how-to-spot-them-icotoinvest-com/" ping="/url?sa=t&amp;source=web&amp;rct=j&amp;url=https://news.bitcoin.com/pr-promising-icos-and-how-to-spot-them-icotoinvest-com/&amp;ved=0ahUKEwjKia-71ZzfAhUXiIMKHcEFBbEQqQIIYygAMBI">PR: Promising ICOs and How to Spot Them – ICOtoInvest.com</a></h3>, <h3 class="r dO0Ag"><a class="l lLrAF" href="https://www.coindesk.com/video-bitcoin-litecoin-charlie-lee-crypto-better-2017" ping="/url?sa=t&amp;source=web&amp;rct=j&amp;url=https://www.coindesk.com/video-bitcoin-litecoin-charlie-lee-crypto-better-2017&amp;ved=0ahUKEwjKia-71ZzfAhUXiIMKHcEFBbEQqQIIZigAMBM">Video: <em>Bitcoin</em> or Litecoin? Charlie Lee on Which Crypto Had a Better ...</a></h3>, <h3 class="r dO0Ag"><a class="l lLrAF" href="https://www.rt.com/business/414725-china-russia-bitcoin-oil/" ping="/url?sa=t&amp;source=web&amp;rct=j&amp;url=https://www.rt.com/business/414725-china-russia-bitcoin-oil/&amp;ved=0ahUKEwjKia-71ZzfAhUXiIMKHcEFBbEQqQIIaSgAMBQ">China &amp; Russia to crash <em>bitcoin</em> &amp; trade oil in yuan: Saxo Bank's ...</a></h3>, <h3 class="r dO0Ag"><a class="l lLrAF" href="https://news.bitcoin.com/south-korean-exchanges-policies-comply-crypto-regulation/" ping="/url?sa=t&amp;source=web&amp;rct=j&amp;url=https://news.bitcoin.com/south-korean-exchanges-policies-comply-crypto-regulation/&amp;ved=0ahUKEwjKia-71ZzfAhUXiIMKHcEFBbEQqQIIbCgAMBU">South Korean Exchanges Revise Policies to Comply with Crypto ...</a></h3>]

There are a lot of additional tags inside 'h3' blocks, that why we use loop to clear them and leave only text.


In [31]:
titles = [title.text for title in titles]
In [32]:
print(titles)
print(len(titles))
["After Ripple's Rise BTC Dominance Falls Below 40%", 'How is the growth of bitcoin affecting the environment?', 'India Falsely Condemns Bitcoin as Ponzi Scheme, Flawed Logic', "Two Things We Learned From Bitcoin's Price Moves", 'Bitcoin Futures Fail To Live Up To The Hype', 'Станет ли Lightning Network ответом биткоина на низкие комиссии ...', "Bitcoin fever to burn out in 'spectacular crash,' David Stockman warned", "Bitcoin 'Adds 0.3%' To Japan's GDP, Claim Nomura Analysts", 'Bank of England Could Issue “Bitcoin-style Digital Currency” in 2018', "'Privacy Coin' Verge is Allegedly Leaking Users' IP Addresses", 'How the world hijacked bitcoin in 2017', 'Public Firm Faces Class Action Lawsuit for Falsely Claiming Link to ...', 'The Illogical Value Proposition Of Bitcoin', "Here are the top 10 cryptoassets of 2017 (and bitcoin's 1000% rise ...", 'Bitcoin Is Coming to Your Bodega', 'Video: Bitcoin Sign Guy Tells All About Infamous Janet Yellen ...', "Bitcoin's gender divide could be a bad sign, experts say", 'Bitcoin Adoption by Businesses in 2017', 'PR: Promising ICOs and How to Spot Them – ICOtoInvest.com', 'Video: Bitcoin or Litecoin? Charlie Lee on Which Crypto Had a Better ...', "China & Russia to crash bitcoin & trade oil in yuan: Saxo Bank's ...", 'South Korean Exchanges Revise Policies to Comply with Crypto ...']
22

That's all. We get 21 news titles dated 1 january 2018. Also we can grab a few starting sentences from that news and use them in future analysis.


In [33]:
# alternative way to set class attribute is to use "class_"
news = soup.find_all('div', class_='st')
news = [new.text for new in news]
print(news)
print(len(news))
['Despite being one of the most amazing periods of sustained gains for bitcoin, 2017 ends with the first cryptocurrency losing market share to altcoins. After falling\xa0...', 'Bitcoin is the most popular virtual currency in the world, and it has grown in value this year. It was created in 2009 as a new way of paying for things that would\xa0...', 'Recently the Indian finance ministry criticized Bitcoin and the rest of the digital currencies in the market for their lack of intrinsic value. The Indian finance ministry\xa0...', "Even though bitcoin's price jumped by more than 1,400 percent in 2017, observers have yet to come up with a good explanation for its rise.", "After the first week I gave the CBOE's Bitcoin Futures contract a grade of C- (link). That contract has been trading for 3 weeks and the CME's contract has now\xa0...", 'В начале декабря 2017 года, когда основное внимание биткоин-сообщества было приковано к стремительному росту цены криптовалюты, многие\xa0...', "First it was the stock market. Now, it's bitcoin. David Stockman, President Ronald Reagan's former director of the Office of Management and a relentless Wall\xa0...", "The optimism capitalizes on Japan's Bitcoin breakout this year, which saw the country take pole position in Bitcoin-to-fiat trading, as well as adopt nurturing\xa0...", "The Bank of England might have “its own Bitcoin-style digital currency” this year, according to the country's legacy media. The more than three hundred year old\xa0...", "... article on privacy coins, news.Bitcoin.com wrote: “The general consensus is that verge isn't as private as some of its competitors, so don't trust it with your life.", 'Hearing Jamie Dimon talk out the side of his mouth, calling bitcoin a “fraud” in September, less than a month before the bank he runs, JP Morgan Chase, said it\xa0...', 'With bitcoin at over $19,000 and ether close to $800 that day, the stock price of Apollo shot up in price from 4.3 shekels to over 10 shekels immediately after this\xa0...', 'This paper has been doing the rounds on Twitter. Written by the pseudonymous “Mr. Game & Watch”, it purports to “demystify” the “value proposition” of Bitcoin.', "Bitcoin's value grew by more than 1,000% in 2017, but that wasn't enough to even place it among the 10 best-performing cryptoassets of the year. In a breakout\xa0...", 'This year, Bitcoin has increased in value more than a thousand percent, with the price of a single Bitcoin recently (albeit briefly) breaking $17,000. Entrepreneurs\xa0...', "This is an entry in CoinDesk's Most Influential in Blockchain 2017 series. He might never spend a summer in D.C. again, but he made this one count.", "Bitcoin, and the world of cryptocurrency, is a boys' club, say some experts, and that should be cause for concern. Cryptocurrency is a form of digital currency\xa0...", "2017 was a big year for Bitcoin. CBOE launched the first Bitcoin futures market, the NYSE filed for two Bitcoin ETF's, and Bitcoin price rose over 1,300 percent.", 'ICOtoInvest.com successfully recognised the potential of trade.io, the project has moved forward to form partnerships with HitBTC, the renown Bitcoin university\xa0...', "While litecoin outperformed in the markets, Lee says not to write off bitcoin's technical improvements. In this exclusive CoinDesk interview, Lee talks about ICOs,\xa0...", "Danish Saxo Bank is famous for making 'outrageous' predictions for the year ahead. This year, the bank predicts Russia and China will crack down on bitcoin\xa0...", "South Korea's cryptocurrency exchanges have implemented changes to comply with the government's mandates announced last week. In addition to restricting\xa0..."]
22

But 1 day in history is nothing for correlation detection . That's why we'l create the list of 10 dates (for educational purposes) and set up loop grabbing to get news for all of them. Tip: if you want to change the language of news, when the first page is loaded during script execution, change the language in the settings manually or in few minutes you'l learn how to do this by algorithm.

In [26]:
#Create the list of dates
dates = [datetime.strptime('01/25/2018', '%m/%d/%Y')+timedelta(days=n) 
         for n in range(10)]
print(dates)
[datetime.datetime(2018, 1, 25, 0, 0), datetime.datetime(2018, 1, 26, 0, 0), datetime.datetime(2018, 1, 27, 0, 0), datetime.datetime(2018, 1, 28, 0, 0), datetime.datetime(2018, 1, 29, 0, 0), datetime.datetime(2018, 1, 30, 0, 0), datetime.datetime(2018, 1, 31, 0, 0), datetime.datetime(2018, 2, 1, 0, 0), datetime.datetime(2018, 2, 2, 0, 0), datetime.datetime(2018, 2, 3, 0, 0)]
In [27]:
# create lists to save grabbed by dates information 
news=[]
titles=[]
# start loop
for date in dates:
    cur_year = date.year
    cur_month = date.month
    cur_day = date.day
    news_word = 'bitcoin'    
    cur_url = 'https://www.google.com/search?q='+str(news_word)+'&num=100&biw=1920&bih=929&source=lnt&tbs=cdr:1,cd_min:'\
           +str(cur_month)+'/'+str(cur_day)+'/'+str(cur_year)+\
           ',cd_max:'+str(cur_month)+'/'+str(cur_day)+'/'+str(cur_year)+'&tbm=nws'
    driver.get(cur_url)
    # we have to increase the pause between loadings of pages to avoid detection 
    # our activity as robots doing. So you have time for a cup of coffee while waiting.
    time.sleep(random.uniform(60, 120))
        
    html = driver.page_source
    soup = BeautifulSoup(html, "lxml")
    cur_titles = soup.find_all('h3')
    cur_titles = [title.text for title in cur_titles]
    titles.append(cur_titles)
    cur_news = soup.find_all('h3')
    cur_news = [new.text for new in cur_news]
    news.append(cur_news)
print(len(dates))
print(len(titles))
print(len(news))
#chech if the script works properly
print(dates[5])
print(titles[5][:5])
print(news[5][:5])
driver.quit()
10
10
10
2018-01-30 00:00:00
['Facebook is banning all ads promoting cryptocurrencies — including ...', 'Bitcoin drops 12%, falls below $10000 amid broad cryptocurrency sell ...', "Turkish soccer club completes first Bitcoin transfer in country's history ...", 'Facebook ban on bitcoin ads is the latest in a very bad day for ...', "Coincheck to Repay Hack Victims' XEM Balances at 81 US Cents Each"]
['Facebook is banning all ads promoting cryptocurrencies — including ...', 'Bitcoin drops 12%, falls below $10000 amid broad cryptocurrency sell ...', "Turkish soccer club completes first Bitcoin transfer in country's history ...", 'Facebook ban on bitcoin ads is the latest in a very bad day for ...', "Coincheck to Repay Hack Victims' XEM Balances at 81 US Cents Each"]

But if it is so simply, it wouldn't be so interesting.

First of all such grabbing could be detected by web sites algorithms like the robots activity and could be banned. Secondly, some web sites hide all content of their pages and show it only when you scroll down the page. Thirdly, very often we need to input values in input boxes, click links to open next/previous page or click download button. To solve these problems we can use special methods to control browser.

Let's open the example page at "Yahoo! Finance": https://finance.yahoo.com/quote/SNA/history?p=SNA. If you scroll down the page you'l see that the content loads up periodically, and then finally it reaches the last row "Dec 13, 2017". But when the page is just opened and we view page source (Ctrl+U for Google Chrome), we'l not find there "Dec 13, 2017". So to get data with all dates for this symbol, first of all we have to scroll down to the end and after that parse the page. Such code will help us to solve this problem (to learn different ways of scrolling look here https://goo.gl/JdSvR4):

In [28]:
import pandas as pd

driver = webdriver.Chrome()

cur_url = 'https://finance.yahoo.com/quote/SNA/history?p=SNA'
driver.get(cur_url)

SCROLL_PAUSE_TIME = 0.5

equal = 0
html_len_list = []
while True:
    window_size = driver.get_window_size()
    # get the size of loaded page
    html_len = len(driver.page_source.encode('utf-8'))
    html_len_list.append(html_len)
    # scroll down 
    driver.execute_script("window.scrollTo(0, window.scrollY +" 
                          +str(window_size['height'])+")")
    # time to load all content
    time.sleep(1)
    # get the size of newly loaded content
    new_html_len = len(driver.page_source.encode('utf-8'))
    # check if size of content not equal before and after scrolling. 
    # if they are equal: add 1 to "equal" and scroll down.
    # if they are equal more than 4 last scrollings: break
    # if they are not "equal": reset equal to 0 and scroll down again
    if html_len == new_html_len:
        equal += 1
        if equal > 4:
            break
    else:
        equal = 0

print(html_len_list)
html = driver.page_source
soup = BeautifulSoup(html, "lxml")
table = soup.find('table', {'data-test':'historical-prices'})
table = pd.read_html(str(table))[0]
print(table.head())
print(table.tail())
driver.quit()
[706249, 707803, 720356, 720356, 768578, 768578, 768578, 768578, 794404, 794404, 794404, 794404, 794404]
           Date    Open    High     Low  Close*  Adj Close**    Volume
0  Dec 12, 2018  149.40  150.82  148.25  148.39       148.39  468400.0
1  Dec 11, 2018  150.74  151.82  146.87  147.34       147.34  538500.0
2  Dec 10, 2018  150.53  151.00  145.94  148.71       148.71  553600.0
3  Dec 07, 2018  156.17  157.50  149.83  151.00       151.00  643200.0
4  Dec 06, 2018  154.19  155.98  151.31  155.97       155.97  859300.0
                                                  Date    Open    High  \
250                                       Dec 19, 2017  172.66  173.49   
251                                       Dec 18, 2017  169.30  173.25   
252                                       Dec 15, 2017  166.71  168.97   
253                                       Dec 14, 2017  170.37  170.37   
254  *Close price adjusted for splits.**Adjusted cl...     NaN     NaN   

        Low  Close*  Adj Close**    Volume  
250  170.11  172.35       168.75  584800.0  
251  169.30  172.62       169.01  656000.0  
252  166.09  168.20       164.69  761900.0  
253  165.45  165.50       162.04  702900.0  
254     NaN     NaN          NaN       NaN  

There are many web sites that prefer to devide one article for two and more parts, so you have to click 'next' or 'previous' buttons. For us it is the task to open all that pages and grab them). The same task is with multi page catalogues. Here is example: we will open several pages at stackoverflow tags catalogue and collect top tag words with their occurancies through the portal. To do this we will use find_element_by_css_selector() method to locate certain element on the page and click on it with click() method. To read more about locating elements open this: https://goo.gl/PyzbBN

In [29]:
from selenium.webdriver.common.keys import Keys
import pandas as pd

def page_parse(html):
    soup = BeautifulSoup(html, 'lxml')
    tags = soup.find_all('div', class_='grid-layout--cell tag-cell')
    tag_text_list = []
    tag_count_list = []
    for tag in tags:
        tag_text = tag.find('a').text
        tag_text_list.append(tag_text)
        tag_count = tag.find('span', class_='item-multiplier-count').text
        tag_count_list.append(tag_count)
    
    return tag_text_list, tag_count_list


driver = webdriver.Chrome()
cur_url = 'https://stackoverflow.com/tags'
driver.get(cur_url)
tag_names, tag_counts = [], []

for i in range(3):
    if i == 0:
        html = driver.page_source
        cur_tag_names, cur_tag_counts = page_parse(html)
        tag_names = tag_names+cur_tag_names
        tag_counts = tag_counts+cur_tag_counts
    else:   
        # find necessery element to click
        next = driver.find_element_by_css_selector('.page-numbers.next')
        # in some cases it would be enough to run next.click() but sometimes it doesn`t work
        # for more information about possible troubles of using click() read here:
        # https://goo.gl/kUGvsC
        driver.execute_script("arguments[0].click();", next)
        time.sleep(2)
        html = driver.page_source
        cur_tag_names, cur_tag_counts = page_parse(html)
        tag_names = tag_names+cur_tag_names
        tag_counts = tag_counts+cur_tag_counts
        
tag_table = pd.DataFrame({'tag': tag_names,
                          'count': tag_counts})

print(tag_table.head())
print(tag_table.tail())
driver.quit()
          tag    count
0  javascript  1730078
1        java  1492129
2          c#  1268430
3         php  1248340
4     android  1158141
                 tag  count
103  ruby-on-rails-3  55908
104       validation  54970
105            numpy  54786
106             tsql  54530
107          sorting  54524

Or here is another example: medium.com site hides the part of comments below articles. But if we need to analyze the "reasons" of popularity of the page, comments can play a great role in this analysis and it better to grab all of them. Open this page and scroll to the bottom - you'll find that there is "Show all responses" button as "div" element there. Let's click on it and open all comments.

In [30]:
driver = webdriver.Chrome()
cur_url = 'https://medium.com/@pdquant/all-the-backpropagation-derivatives-d5275f727f60'
driver.get(cur_url)
# locate div container that contains button
find_div = driver.find_element_by_css_selector(".container.js-showOtherResponses")
# locate button inside container
button = find_div.find_element_by_tag_name("button")
driver.execute_script("arguments[0].click();", button)
# check the page before and after running script - in second case all 
# comments are opened 

Authorization and input boxes

A lot of information is available only after authorization. So let's learn how to log in at Facebook. Algorithm is the same: find input boxes for login and password, insert into them text and after submit it. To send text to the inputs we will use .send_keys() method and to submit: .submit() method.

In [32]:
driver = webdriver.Chrome()
cur_url = 'https://www.facebook.com/'
driver.get(cur_url)
username_field = driver.find_element_by_name("email") # get the username field by name
password_field = driver.find_element_by_name("pass") # get the password field by name
username_field.send_keys("your_email") # insert your email
password_field.send_keys("your_password") # insert your password
password_field.submit()

But this methods are also very usefull when we need to change dates or insert values into input boxes to get certain information. For example, here is the "one page tool" to recieve etf fund flows information: Etf Fund Flows. There are no special pages for each ETF (as Yahoo! has) to view or download desired values. All you can do: to enter ETF symbol, start and end dates and click button "Submit". But if your boss set atask to obtain historical data for 500 etfs and 10 last years (120 months), you'l have to click 60000 the button "submit". What's a dull amusement... So let's make an algorithm that can collect this information while you'l be raving somewhere at Ibiza party.

In [33]:
# a function to get year, date and month for start and end date inputs
def convert_date(date):    
    year = date.split("/")[2]
    month = date.split("/")[0]
    day = date.split("/")[1]
    # here we have to add zero before the month if it is less than 10
    # because the input form requires such format of data: 01/01/2018 
    if len(month)<2:
        month = str("0")+str(month)
    
    return day, month, month

driver = webdriver.Chrome()
# set some dates
dates = ['6/30/2018', '3/31/2018', '12/31/2017', '9/30/2017']
# set a few etfs
etfs = ['SPY', 'QQQ']
# create empty dataframe to store values
export_table = pd.DataFrame({})

# start ticker loop
for ticker in etfs:
    loginUrl = "http://www.etf.com/etfanalytics/etf-fund-flows-tool"    
    # set loop for dates without last date, because we need to set a period of 
    # 'start' and 'end' date
    for i in range(0,len(dates)-1):
        #create current pair of dates
        start_day, start_month, start_year = convert_date(dates[i+1])
        end_day, end_month, end_year = convert_date(dates[i])
        date_1 = str(start_year)+str("-")+str(start_month)+str("-") + str(start_day)
        date_2 = str(end_year)+str("-")+str(end_month)+str("-") + str(end_day)
        driver.get(loginUrl)
        # locate the field to input etf symbol
        ticker_field = driver.find_element_by_id("edit-tickers")        
        ticker_field.send_keys(str(ticker))
        # locate the field to input start date
        start_date_field = driver.find_element_by_id("edit-startdate-datepicker-popup-0")
        start_date_field.send_keys(str(date_1))
        # locate the field to input end date
        end_date_field = driver.find_element_by_id("edit-enddate-datepicker-popup-0")
        end_date_field.send_keys(str(date_2))
        # submit form
        end_date_field.submit()
        # read the page souse
        html = driver.page_source        
        soup = BeautifulSoup(html, "lxml")
        # find certain table with etf flows information
        table = soup.find_all('table', id='topTenTable')
        # some transformations to get html code readable by pd.read_html() method
        table = table[0].find_all('tbody')
        table = str(table[0]).split("<tbody>")[1]
        table = table.split("</tbody>")[0]
        data = "<table>" + str(table) + "</table>"
        soup = BeautifulSoup(data, "lxml")
        # convert html code to pandas dataframe
        df = pd.read_html(str(soup))
        current_table = df[0]
        current_table.columns = ["Ticker", "Fund Name", "Net Flows", "Details"]
        current_table["Start Date"] = [date_1]
        current_table["End Date"] = [date_2]
        # concatenate current inflow table with main dataframe.
        export_table = pd.concat([export_table, current_table], ignore_index=True)
        # let the algorithm rest for a while
        time.sleep(random.uniform(5, 10))

# some magic and we get the information assigned by task.         
print(export_table)
driver.quit()
  Ticker               Fund Name  Net Flows  Details Start Date  End Date
0    SPY  SPDR S&P 500 ETF Trust    1714.15      NaN   03-03-31  06-06-30
1    SPY  SPDR S&P 500 ETF Trust    1714.15      NaN   12-12-31  03-03-31
2    SPY  SPDR S&P 500 ETF Trust    1714.15      NaN   09-09-30  12-12-31
3    QQQ       Invesco QQQ Trust     342.83      NaN   03-03-31  06-06-30
4    QQQ       Invesco QQQ Trust     342.83      NaN   12-12-31  03-03-31
5    QQQ       Invesco QQQ Trust     342.83      NaN   09-09-30  12-12-31

There are enormous amount of sites, each of them has its own design, access to information, protection against robots, etc. That's why this tutorial could be as a little book. But at least one more approach of grabbing information we'l discover. It is connected with parsing dynamic graphs as like www.google/trends uses. Interestingly that Google's programmers don't allow to parse the code of trend graphs (the div tag which contains the code of graph is hidden) but let you download csv file with information (so we can use one of the above algorithms to find this button, to click and download file). <img src = 'http://umachka.net/ml/parsing_10.png'>


Let's take another site where we can parse similar graphs: Portfolio Visualizer. Scroll down this page and you'l find graph as at screenshot. The worth of this graph is that the historical prices for Us Treasury Notes are not freely available - you have to buy it. But here we can grab it either manually (rewrite dates and values correspondigly), or to write code which "rewrites" the values for us and not only from this page...

In [85]:
loginUrl="https://www.portfoliovisualizer.com/backtest-asset-class-allocation?s=y&mode=2&startYear=1972&endYear=2018&initialAmount=10000&annualOperation=0&annualAdjustment=0&inflationAdjusted=true&annualPercentage=0.0&frequency=4&rebalanceType=1&portfolio1=Custom&portfolio2=Custom&portfolio3=Custom&TreasuryNotes1=100"
driver = webdriver.Chrome()
driver.get(loginUrl)
html = driver.page_source
soup = BeautifulSoup(html, "lxml")
# find the div with chart values
chart = soup.find_all('div', id='chartDiv2')
table = str(chart[0]).split("<tbody>")[1]
table = table.split("</tbody>")[0]
data = "<table>"+str(table)+"</table>"
soup = BeautifulSoup(data, "lxml")
df = pd.read_html(str(soup))[0]
print(df.head())
print(df.tail())
driver.quit()
              0        1
0  Dec 31, 1971  $10,000
1  Jan 31, 1972   $9,901
2  Feb 29, 1972   $9,988
3  Mar 31, 1972   $9,979
4  Apr 30, 1972  $10,015
                0         1
559  Jul 31, 2018  $239,204
560  Aug 31, 2018  $241,631
561  Sep 30, 2018  $238,720
562  Oct 31, 2018  $237,998
563  Nov 30, 2018  $241,163

In coclusion

It's important to admit that parsing activity can be easily determined as robot's activity and you we'l be asked to pass the "antirobot's" captcha. On the one hand you can find solutions how to give the right answers to it, but on the other (i think more natural), you can set up such algorithms that will be similar to human activity, when they use web sites. You are lucky, when the website has no protection against parsing. But in case with Google News - after 10 or 20 loadings of page, you'l meet google's captcha. So try to make your algorithm more humanlike: scroll up and down, click links or buttons, be on the page for at list 10-15 seconds or more, especially when you need to download several thousands of pages, take breaks for hour and for night, etc.

And good luck!