Notebook

Scoring Company Headline Sentiment¶

This is a quick example of scraping headlines and then using VADER to assign a sentiment score. These scores are fitted into a time series, which can then be combined with a time series of closing prices.

Getting headlines from single sources such as Bloomberg or Reuters is usually a lot easier, especially when they have APIs. But Yahoo Finance headlines are semi-curated from multiple sources and may offer a better cross-section of news articles, so it could be worth the trouble.

In [40]:

from bs4 import BeautifulSoup
import urllib
import re
from time import sleep
import pandas as pd
import numpy as np
import datetime
import json
import sys
import time
import requests
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import os
%matplotlib inline
today = datetime.datetime.now()

This code combines multiple json files. It could be useful to keep separate files from each scrape in case the combined file is corrupted for whatever reason.

In [146]:

# The financials argument is for combining financial statement jsons, not shown here

def merge_json_headlines (input_files = (), *args, output_file = 'data/all_headlines.json'):
    head_list = []
    for i in range(len(input_files)):
        with open (input_files[i]) as current_file:
            head_list.append(json.load(current_file))
    
    collector = head_list[0]
    for i in range(1, len(input_files)):
        for k_time,v_time in head_list[i].items():
            if k_time in collector:
                for k_head, v_head in v_time.items():
                    collector[k_time][k_head] = head_list[i][k_time][k_head]
            else:
                collector[k_time] = head_list[i][k_time]

    with open('data/' + output_file, 'w') as outfile:
        json.dump(collector, outfile)

This was a fun scraping exercise. The headline api calls are not only not paginated, but coded. So we have to log network traffic to get the specific api calls and then access them directly to get the json headline data. This is not, strictly speaking, necessary, since we can simply scrape the list items added to the page, but there is more precision in the dates from the json.

The Selenium scrolling is also not strictly necessary, but it reduces the need for contant scraping for fear that a flurry of headlines will cause one to miss a few.

In [200]:

# We might want to pay special attention to some companies that have more headlines.
# Make sure we get everything by scrolling down on the news page and finding the json injections directly.
# That is, until I can decipher the API codes. Then there'll be no need for any of this.

import re
import pyautogui

# The special list of companies we pay extra attention to. While it would be possible to do this for every company,
# it would be a waste of time.

def get_more_headlines():
    # It makes more sense to have a limited list since most companies don't have nearly as many stories about them
    # as high profile companies like Apple or Facebook.
   # with open('companies.json') as company_data:
    #    data = json.load(company_data)
    #tickers = [key.replace('.','-') for key,value in data.items()]
    
    tickers = ['AAPL','AMZN','FB','GOOG','NFLX']

    # Each page usually requires about 38-40 pagedowns to get everything. 50 would be very safe.
    # 10 should probably be enough as long as this is run regularly.
    pagedowns = 50
    
    chromeOptions = webdriver.ChromeOptions()
    
    # disable images, or sometimes not
    
    prefs = {"profile.managed_default_content_settings.images":2}
    chromeOptions.add_experimental_option("prefs",prefs)
    chromeOptions.add_argument('--ignore-certificate-errors')
    
    # For whatever reason, the log file doesn't complete in headless mode, so we use the "next best"
    # option of moving the window out of the way quickly.
    
    #chromeOptions.add_argument("--headless")
    
    extra_headlines = {}

    # First we trigger the json injections into the page while logging network traffic through Chrome
    
    for ticker in tickers:
        
        # Log files stored in d:\jsontemp\
        
        chromeOptions.add_argument('--log-net-log=d:\\jsontemp\\'+ticker+'.json')        
        driver = webdriver.Chrome(chrome_options=chromeOptions)
        
        # Shift the window waaaaay to the left, out of sight
        
        driver.set_window_position(-3000, 0)
        driver.set_script_timeout(15)

        url = 'https://finance.yahoo.com/quote/'+ticker+'/news?p='+ticker
        driver.get(url)
    
        sleep(0.5)

        elem = driver.find_element_by_tag_name("body")

        for i in range(pagedowns):
            elem.send_keys(Keys.PAGE_DOWN)
            
            # On a slow connection, this time should be increased to allow for things to load properly
            
            time.sleep(0.1)


        # Closing the tab rather than the entire browser first might increase the likelihood of the log saving properly.
        
        driver.find_element_by_tag_name('body').send_keys(Keys.CONTROL + 'w')
        driver.close()
        
        # It seems like there should be a pause to 
        
        sleep(1)

    # With the net logs in hand, get the relevant json urls from them. These files are rather large.
    
    for ticker in tickers:
        with open('d:\\jsontemp\\'+ticker+'.json','r') as infile:
            data = json.load(infile)
        
        # Get the relevant json urls which contain "/content;", but ignore repeats
        
        json_list = []
        for i,e in enumerate(data['events']):
            for k1,v1 in e.items():
                if type(v1) == dict:
                    for k2,v2 in v1.items():
                        if type(v2) == str:
                             if '/content;' in v2:
                                if v2 not in json_list:
                                    json_list.append(v2)

        # 
        for js in json_list:
            json_news = requests.get(js).json()['data']['items']
             
            try:
                for d in json_news:
                    news_array.append(d)
            except:
                news_array = json_news

        for i,d in enumerate(news_array):
            if type(d) == dict:
                d['pubtime'] = int(time.mktime(datetime.datetime.strptime(d['publishDateStr'],'%B %d, %Y').timetuple()))*1000
                if d['provider'] != None:
                    d['publisher'] = d['provider']['name']
                else:
                    d['publisher'] = 'N/A'
        news_array = process_headlines (news_array)
        extra_headlines[ticker] = news_array
                
    with open('data/extraheadlines_'+datetime.datetime.now().strftime('%Y-%m-%d')+'.json', 'w') as outfile:
        json.dump(extra_headlines, outfile)
        
def process_headlines(heads):
    # We need to account for editor's picks articles which have a different forma
    reorg = {}
    for head in heads:
        for e in ['clusterInfo','editorsPick','format','storyline','i13n','id',
                  'idPrefix','images','is_eligible','link','off_network','type']:
            try:
                head.pop(e)
            except:
                pass 
        
        head['pubtime'] = datetime.datetime.fromtimestamp(int(head['pubtime'])//1000).strftime('%Y-%m-%d')
        if 'summary' not in head:
            head['summary'] = '(NO SUMMARY)'
        
        if head['pubtime'] in reorg:
            reorg[head['pubtime']][head['title']] = {'summary':head['summary'],
                                                   'publisher': head['publisher'],
                                                   'url': head['url']}
        else:
            reorg[head['pubtime']]={head['title']:{'summary':head['summary'],
                                                   'publisher': head['publisher'],
                                                   'url': head['url']}}
    
    # Store the number of headlines each day, as a large number of headlines MIGHT have a higher influence
    # with price movement in either direction.
    for day in reorg:
        daily_heads = len(reorg[day])
        reorg[day]['daily_headlines'] = daily_heads
        
    return reorg

In [201]:

get_more_headlines()

In [202]:

files = [os.path.join('data', f) for f in sorted(os.listdir('data')) if str(f).endswith('json')]

In [204]:

merge_json_headlines (input_files = files, output_file = 'all_headlines.json')

In [205]:

with open('data/all_headlines.json','r') as infile:
    data = json.load(infile)

In [206]:

fb = data['FB']

Text to sentiment score¶

SentimentIntensityAnalyzer in vader assigns compound, positive, neutral and negative scores for inputted text. Like the name suggests, the compound score is an overall score $\in (-1,1)$, where lower scores are more negative, and higher scores are more positive.

In [207]:

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer as SIA
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer, word_tokenize, sent_tokenize

We'll try two methods: headlines only and headlines with summaries. Summaries usually give siginificantly more relevant information than headlines alone. However, studies have shown that most people only read headlines, so the results might shock you! Data scientists hate them! Actually, because Yahoo displays the summary together with the headline, it is difficult to avoid the summaries.

The daily_headlines dictionary key that gives the number of daily headlines is not used in this example. Among other things, it could be used to try to predict the impact of sentiment with regards to a percentage change in price. A plethora of positive or negative coverage could signal major movements.

In [208]:

def get_hl_summary (ticker):
    text_dict = {}
    for k,v in ticker.items():
        text = []
        for hl in ticker[k]:
            if hl != 'daily_headlines':
                text.append('. '.join([hl, ticker[k][hl]['summary']]))
        text_dict[k] = text
    return text_dict

def get_hl (ticker):
    text_dict = {}
    for k,v in ticker.items():
        text = []
        for hl in ticker[k]:
            if hl != 'daily_headlines':
                text.append(hl)
        text_dict[k] = text
    return text_dict

In [209]:

# Gets the mean compound score of the inputted lines.

def get_score(lines):
    sid = SIA()
    score_list = []
    for line in lines:
        scores = sid.polarity_scores(line)
        current_scores = []
        
        # only using the combined score for now, but other scores could be useful
        
        for s in sorted(scores):
            current_scores.append(scores[s])
        score_list.append(current_scores)
    score_list = pd.DataFrame(score_list)
    mean_score = np.mean(score_list,axis=0)[0]
    return mean_score

In [210]:

def get_daily_scores(data):
    all_scores = {}
    for k, v in data.items():
        scores = []
        for i,lines in enumerate(v):
            lines = sent_tokenize(data[k][i])
            scores.append(get_score(lines))
        all_scores[k] = np.mean(scores)
    return all_scores

In [211]:

hl_summary = get_hl_summary(fb)

In [212]:

all_scores = get_daily_scores(hl_summary)

In [213]:

all_scores

Out[213]:

{'2018-01-10': 0.0,
 '2018-01-12': 0.49698333333333333,
 '2018-01-13': 0.14999166666666663,
 '2018-01-14': 0.41883333333333334,
 '2018-01-15': 0.12347166666666665,
 '2018-01-16': 0.091691000000000009,
 '2018-01-17': 0.059152642276422765,
 '2018-01-18': 0.1148108163265306,
 '2018-01-19': 0.068206269841269831}

In [214]:

hl_only = get_hl(fb)
all_scores_hl_only = get_daily_scores(hl_only)

In [215]:

all_scores_hl_only

Out[215]:

{'2018-01-10': 0.0,
 '2018-01-12': 0.06133333333333333,
 '2018-01-13': 0.214725,
 '2018-01-14': 0.42149999999999999,
 '2018-01-15': -0.013089999999999996,
 '2018-01-16': 0.066098000000000004,
 '2018-01-17': 0.063718292682926822,
 '2018-01-18': 0.079093877551020403,
 '2018-01-19': -0.016459523809523806}

The news has been slightly positive during this time period. The big discrepancy between the summary and no summary scores was on January 12, so let's take a look at that.

In [216]:

fb['2018-01-12']

Out[216]:

{"Cramer's game plan: JP Morgan set the benchmark. Now watc...": {'publisher': 'CNBC Videos',
  'summary': "Jim Cramer laid out the stocks and events he'll be watching as earnings season kicks into high gear with the big banks' reports.",
  'url': 'https://finance.yahoo.com/video/cramers-game-plan-jp-morgan-233900741.html'},
 "Cramer: 'This time it's different' can actually make you ...": {'publisher': 'CNBC Videos',
  'summary': 'Jim Cramer said the success of Facebook, Amazon, Netflix and Alphabet proves why "this time it\'s different" can help you, not hurt you.',
  'url': 'https://finance.yahoo.com/video/cramer-time-different-actually-234100884.html'},
 "Make money with 'this time it's different'": {'publisher': 'CNBC Videos',
  'summary': 'Jim Cramer said the success of Facebook, Amazon, Netflix and Alphabet proves why "this time it\'s different" can help you, not hurt you.',
  'url': 'https://finance.yahoo.com/video/money-time-different-234900972.html'},
 'daily_headlines': 3}

All Jim Cramer stories, and two of them are repeats. Sometimes less is more.

Anyway, we can then classify these scores as follows:

0 negative, $<$ -0.5
1 neutral, [-0.5,0.5]
2 positive, $>$ 0.5

They can be adjusted, but these are the guideline thresholds. We can then do the usual things with regression and ensembles to predict the direction of the stock.

In [ ]: