Knowledge of webscraping gives access to the largest bank of data available. Almost any website can become a source of data. Its use can range from analyzing competitors to learning more about a user base.
It would make sense to scrape through comments of each post too, but that would take far too long. GameFaqs is at least very casual so posts there will be as opinionated as comments.
The slightly long-winded title explains most of what this notebook is about. My goal is to scrape the post titles from Reddit and GameFaqs and perform sentiment analysis on them, on top of some data exploration.
I will be using selenium to scroll through all reddit posts and to do some other automation used for clicking buttons. BeautifulSoup will be used to scrape and retrieve the actual data.
I will need to scrape a list of current common user agents and lists of free, recent proxies to rotate through for GameFaqs. They use pages and not infinite scrolling so too many http requests will result in a ban. To play it safe and uninterrupted a few measures will be taken. These IPs and user agents could be used for scraping any other website as well.
I will obtain all posts from the past year and filter out the ones that don't mention a developer.
import requests
import random
import time
import pandas as pd
from bs4 import BeautifulSoup as soup
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
After the entire page is loaded, we can scrape all the text of each post title.
def reddit_scraper(url):
    '''
    Webscrapes all reddit posts from the given link by scrolling through the "infinite scrolling"
    Args:
        url: The url of the subreddit or other reddit page you'd like to scrape
    Returns:
        A list of all post titles on that page
    '''
    driver = webdriver.Chrome(ChromeDriverManager().install())
    driver.get(url)
    for n in range(600):
        driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
        time.sleep(0.5)
    page_html = soup(driver.page_source, 'lxml')
    driver.close()
    containers = page_html.findAll("a", {'data-click-id' : 'body'})
    post_titles = []
    for container in containers:
        titles = container.find_all("h2", recursive=True)
        for title_tag in titles:
            post_titles.append(title_tag.text)
    return post_titles
r_games_url = 'https://www.reddit.com/r/games/top/?t=year'
r_games_posts = reddit_scraper(r_games_url)
Checking for mac64 chromedriver:2.46 in cache Driver found in /Users/adrianherrmann/.wdm/chromedriver/2.46/mac64/chromedriver
r_gaming_url = 'https://www.reddit.com/r/gaming/top/?t=year'
r_gaming_posts = reddit_scraper(r_gaming_url)
Checking for mac64 chromedriver:2.46 in cache Driver found in /Users/adrianherrmann/.wdm/chromedriver/2.46/mac64/chromedriver
r_truegaming_url = 'https://www.reddit.com/r/truegaming/top/?t=year'
r_truegaming_posts = reddit_scraper(r_truegaming_url)
Checking for mac64 chromedriver:2.46 in cache Driver found in /Users/adrianherrmann/.wdm/chromedriver/2.46/mac64/chromedriver
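The scraper above always scrolls a fixed 600 times. A possible alternative, sketched here under my own assumptions, is to stop as soon as the page height stops growing. The function name `scroll_until_stable` and its parameters are hypothetical; it only relies on the driver's `execute_script` method, which the scraper above already uses. Note that a slow connection could make the page look "finished" too early, so the pause may need tuning.

```python
import time

def scroll_until_stable(driver, pause=0.5, max_scrolls=600):
    """Scroll to the bottom until the page height stops changing.

    `driver` is any object with Selenium's execute_script() method.
    Returns the number of scrolls that were performed.
    """
    last_height = driver.execute_script('return document.body.scrollHeight')
    for n in range(max_scrolls):
        driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
        time.sleep(pause)  # give the feed time to load more posts
        new_height = driver.execute_script('return document.body.scrollHeight')
        if new_height == last_height:
            # Nothing new was loaded, so assume we reached the end of the feed
            return n + 1
        last_height = new_height
    return max_scrolls
```

Inside `reddit_scraper`, this would replace the `for n in range(600)` loop with a single call such as `scroll_until_stable(driver)`.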
# r/games
r_games_forums = pd.concat([pd.DataFrame([[title, 'Reddit', 'r/games']], columns=['Post', 'Website', 'Board'])
for title in r_games_posts],
ignore_index=True)
# r/gaming
r_gaming_forums = pd.concat([pd.DataFrame([[title, 'Reddit', 'r/gaming']], columns=['Post', 'Website', 'Board'])
for title in r_gaming_posts],
ignore_index=True)
# r/truegaming
r_truegaming_forums = pd.concat([pd.DataFrame([[title, 'Reddit', 'r/truegaming']], columns=['Post', 'Website', 'Board'])
for title in r_truegaming_posts],
ignore_index=True)
# Join all to post_titles (pd.concat is used because DataFrame.append was removed in pandas 2.0)
post_titles = pd.concat([r_games_forums, r_gaming_forums, r_truegaming_forums], ignore_index=True)
Let's check out the DataFrame so far
print('Shape: {}'.format(post_titles.shape))
post_titles.head(5)
Shape: (3007, 3)
| | Post | Website | Board |
|---|---|---|---|
0 | John @Totalbiscuit Bain July 8, 1984 - May 24,... | Reddit | r/games |
1 | Bungie Splits With Activision | Reddit | r/games |
2 | Totalbiscuit hospitalized, his cancer is sprea... | Reddit | r/games |
3 | [E3 2018] Cyberpunk 2077 | Reddit | r/games |
4 | Sony faces growing Fortnite backlash at E3 | Reddit | r/games |
GameFaqs will be a different challenge. Rather than infinite scrolling, this website uses pages. This means many HTTP requests will need to be made, so in order to avoid an IP ban and not strain their servers, a few things must be done: rotate user agents, rotate proxies, and pause between requests.
To accomplish this we have to make use of whatismybrowser.com's list of common current user agents, which means webscraping the page to stay up to date.
def get_agents(browser, num_agents=10, offset=0):
    '''
    Webscrapes whatismybrowser.com for new user agents
    Args:
        browser: the browser you want the agent from
        num_agents: number of agents to return
        offset: get agents starting from offset num. on page
    Returns:
        A list of user agents from the given browser
    '''
    if offset + num_agents > 50:
        return []
    try:
        chrome_url = requests.get('https://developers.whatismybrowser.com/useragents/explore/software_name/'
                                  + browser)
    except requests.RequestException:
        print('Request failed. Check your connection and try again')
        return []
    # requests does not raise on a 404, so check the status explicitly
    if not chrome_url.ok:
        print('Browser does not exist. Try lower case')
        return []
    chrome_html = soup(chrome_url.content, 'lxml')
    chrome_containers = chrome_html.findAll('td', {'class' : 'useragent'})
    user_agents = []
    # Slicing avoids an IndexError when the page lists fewer agents than requested
    for container in chrome_containers[offset:offset + num_agents]:
        user_agents.append(container.a.text)
    return user_agents
Get 10 user agents for chrome and 10 for firefox
user_agents = []
user_agents.extend(get_agents('chrome'))
user_agents.extend(get_agents('firefox'))
We need a similar function for retrieving new proxies. This one is more important to call often, since free proxies go stale quickly.
def get_ips(num_addresses=20):
    '''
    Webscrapes free-proxy-list.net for new free proxies. This is important because these proxies
    could go bad after just a couple hours.
    Args:
        num_addresses: The number of IPs you want returned. If fewer than requested are available,
                       return the available amount
    Returns:
        A list of new proxies
    '''
    driver = webdriver.Chrome(ChromeDriverManager().install())
    driver.get('https://free-proxy-list.net/')
    page_html = soup(driver.page_source, 'lxml')
    containers = page_html.findAll('tr', {'role' : 'row'})
    ips = []
    ip_num = 0
    page_num = 1
    next_set_btn = driver.find_element_by_xpath('//*[@id="proxylisttable_next"]/a')
    while len(ips) < num_addresses:
        ip_num += 1
        # Click next button to get more ips if the current page doesn't have enough
        if ((ip_num % 20) - 1 == 0) and ip_num != 1:
            # If reached the last page, return what we have
            if page_num >= 15:
                driver.close()
                return ips
            next_set_btn.click()
            next_set_btn = driver.find_element_by_xpath('//*[@id="proxylisttable_next"]/a')
            ip_num = 1
            page_html = soup(driver.page_source, 'lxml')
            containers = page_html.findAll('tr', {'role' : 'row'})
            page_num += 1
        row = containers[ip_num].find_all('td')
        ip = row[0].text
        port = row[1].text
        if row[6].text == 'yes':
            ips.append(':'.join([ip, port]))
    driver.close()
    return ips
ips = get_ips(20)
Checking for mac64 chromedriver:2.46 in cache Driver found in /Users/adrianherrmann/.wdm/chromedriver/2.46/mac64/chromedriver
# Print out the proxies to see what they look like
ips
['51.68.112.254:3128', '45.32.42.234:8080', '178.128.54.73:8080', '104.248.16.45:8080', '177.38.66.255:45235', '95.47.180.171:53484', '138.186.23.9:40340', '182.160.119.254:56229', '103.250.157.43:38641', '115.127.39.66:55474', '88.210.71.234:46626', '177.94.206.67:60666', '1.10.186.157:55129', '176.197.103.210:53281', '109.201.97.235:39125', '31.43.143.15:8181', '193.213.89.72:51024', '183.82.118.87:8080', '41.84.131.78:53281', '93.77.78.123:42803']
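With the proxies and user agents in hand, the rotation itself is just a list of (proxy, headers) pairs to pick from at random. The sketch below isolates that idea; the helper names `build_rotation`, `pick_pair` and `drop_pair` are hypothetical, and the `{'https': ip}` proxy format simply mirrors the one used in this notebook.

```python
import random

def build_rotation(ips, user_agents, rng=random):
    """Pair every proxy with a randomly chosen user agent."""
    return [({'https': ip}, {'User-Agent': rng.choice(user_agents)}) for ip in ips]

def pick_pair(rotation, rng=random):
    """Choose a random (proxy, headers) pair for the next request."""
    return rng.choice(rotation)

def drop_pair(rotation, pair):
    """Remove a pair that appears banned or dead.

    Returns True when the rotation is now empty and should be refilled.
    """
    if pair in rotation:
        rotation.remove(pair)
    return not rotation
```

Keeping the rotation in one place means a refill after a ban is a single `build_rotation(get_ips(20), user_agents)` call rather than a repeated loop.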
def gamefaqs_scraper(url, num_pages, ips, user_agents, offset=0, start_page=0):
    '''
    Scrape GameFaqs forums for post titles
    Args:
        url: The url to the first page of GameFaqs
        num_pages: Number of pages to scrape
        ips: List of 'ip:port' proxies to rotate through
        user_agents: List of user agent strings to rotate through
        offset: Unused
        start_page: Page to start scraping from
    Returns:
        A list of post titles
    '''
    rot_list = []
    for ip in ips:
        rot_list.append([{'https' : ip}, {'User-Agent' : random.choice(user_agents)}])
    # Fetch the first page, retrying with a new proxy/agent pair until one works
    req = ''
    i = 0
    while req == '':
        try:
            agent_proxy_pair = random.choice(rot_list)
            proxy = agent_proxy_pair[0]
            headers = agent_proxy_pair[1]
            if start_page == 0:
                req = requests.get(url, headers=headers, proxies=proxy, timeout=10)
            else:
                req = requests.get(url + '?page=' + str(start_page+1), headers=headers, proxies=proxy, timeout=10)
            print('Success with IP, ' + proxy['https'])
            page_html = soup(req.content, 'lxml')
            containers = page_html.findAll('td', {'class' : 'topic'})
            if not containers:
                print('Agent may be banned, removing agent and trying a new one...')
                print(page_html, headers['User-Agent'])
                try:
                    user_agents.remove(headers['User-Agent'])
                except:
                    pass
                req = ''
        except Exception as e:
            i += 1
            print('Error with IP, ' + proxy['https'] + ' requesting a new one...')
            if i % 20 == 0:
                ips = get_ips(20)
                rot_list = []
                for ip in ips:
                    rot_list.append([{'https' : ip}, {'User-Agent' : random.choice(user_agents)}])
    post_titles = []
    for page in range(start_page, num_pages):
        # Collect the titles from the page fetched last, then request the next page
        for container in containers:
            title = container.a.text
            post_titles.append(title)
        time.sleep(3)
        req = ''
        i = 0
        while req == '':
            try:
                agent_proxy_pair = random.choice(rot_list)
                proxy = agent_proxy_pair[0]
                headers = agent_proxy_pair[1]
                req = requests.get(url + '?page=' + str(page + 1), headers=headers, proxies=proxy, timeout=10)
                page_html = soup(req.content, 'lxml')
                containers = page_html.findAll('td', {'class' : 'topic'})
                if not containers:
                    print('Agent may be banned, removing agent and trying a new one...')
                    try:
                        rot_list.remove(agent_proxy_pair)
                        if not rot_list:
                            print('Loading in new IPs...')
                            ips = get_ips(20)
                            rot_list = []
                            for ip in ips:
                                rot_list.append([{'https' : ip}, {'User-Agent' : random.choice(user_agents)}])
                        user_agents.remove(headers['User-Agent'])
                    except:
                        pass
                    req = ''
                    time.sleep(2)
                    if len(user_agents) == 0:
                        print('No more agents, ended at page,', page+1)
                        return post_titles
                else:
                    print('Success with IP ' + proxy['https'] + ', now onto page ', page + 2)
            except Exception as e:
                i += 1
                time.sleep(2)
                print('Error with IP ' + proxy['https'] + ', requesting a new one...')
                if i % 20 == 0:
                    print('Loading in new IPs...')
                    ips = get_ips(20)
                    rot_list = []
                    for ip in ips:
                        rot_list.append([{'https' : ip}, {'User-Agent' : random.choice(user_agents)}])
        # Refresh the proxies every 100 pages since free proxies go stale quickly
        if page % 100 == 0:
            print('Loading in new IPs...')
            ips = get_ips(20)
            rot_list = []
            for ip in ips:
                rot_list.append([{'https' : ip}, {'User-Agent' : random.choice(user_agents)}])
    return post_titles
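The retry logic in the scraper loops until some proxy works. As a rough sketch of the same idea with a cap on attempts, the helper below takes the fetching function as a parameter so it can be tried without touching the network. Both `page_url` and `fetch_with_retries` are hypothetical names, and the paging scheme (bare URL first, then `?page=N`) is only what the scraper above appears to rely on.

```python
def page_url(base_url, page):
    """Build the URL for a zero-based page index: the bare URL, then ?page=N."""
    return base_url if page == 0 else '{}?page={}'.format(base_url, page)

def fetch_with_retries(fetch, url, pairs, max_attempts=5):
    """Try fetch(url, proxy, headers) with successive pairs.

    Returns the first non-empty result, or None if every attempt fails
    or comes back empty (which the scraper treats as a likely ban).
    """
    for attempt, (proxy, headers) in enumerate(pairs):
        if attempt >= max_attempts:
            break
        try:
            result = fetch(url, proxy, headers)
        except Exception:
            continue  # dead proxy or timeout: move on to the next pair
        if result:
            return result
    return None
```

In the notebook, `fetch` would wrap `requests.get` plus the BeautifulSoup lookup for `td` elements with class `topic`, and a `None` result would be the signal to reload proxies.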
Note: I am only showing the last 10 lines of output for each forum scraped, after realizing the output couldn't be shrunk when uploaded.
switch_url = 'https://gamefaqs.gamespot.com/boards/189706-nintendo-switch'
switch_posts = gamefaqs_scraper(switch_url, num_pages=1700, ips=ips, user_agents=user_agents)
Success with IP 118.174.233.33:54705, now onto page 1692
Success with IP 203.128.94.102:60152, now onto page 1693
Success with IP 182.52.238.111:45639, now onto page 1694
Success with IP 116.203.1.177:1994, now onto page 1695
Success with IP 217.17.38.245:41506, now onto page 1696
Success with IP 203.128.94.102:60152, now onto page 1697
Success with IP 180.180.156.35:49510, now onto page 1698
Success with IP 1.20.97.4:46965, now onto page 1699
Success with IP 203.128.94.102:60152, now onto page 1700
Success with IP 180.180.156.45:32355, now onto page 1701
# Use new agents to avoid a temporary ban
user_agents = []
user_agents.extend(get_agents('chrome/2', num_agents=50))
user_agents.extend(get_agents('firefox/2', num_agents=50))
user_agents.extend(get_agents('safari/2', num_agents=50))
ips = get_ips(20)
playstation_url = 'https://gamefaqs.gamespot.com/boards/691087-playstation-4'
playstation_posts = gamefaqs_scraper(playstation_url, num_pages=1935, ips=ips, user_agents=user_agents)
Success with IP 119.192.179.46:55012, now onto page 1932
Success with IP 1.20.101.150:41904, now onto page 1933
Error with IP 103.194.192.29:49202, requesting a new one...
Agent may be banned, removing agent and trying a new one...
Error with IP 213.14.32.75:47442, requesting a new one...
Error with IP 103.220.28.180:51493, requesting a new one...
Error with IP 103.220.28.180:51493, requesting a new one...
Success with IP 111.91.225.2:8080, now onto page 1934
Success with IP 119.192.179.46:55012, now onto page 1935
Success with IP 1.20.101.150:41904, now onto page 1936
# Use new agents to avoid a temporary ban
user_agents = []
user_agents.extend(get_agents('chrome/3', num_agents=50))
user_agents.extend(get_agents('firefox/3', num_agents=50))
user_agents.extend(get_agents('safari/3', num_agents=50))
ips = get_ips(20)
pc_url = 'https://gamefaqs.gamespot.com/boards/916373-pc'
pc_posts = gamefaqs_scraper(pc_url, num_pages=1065, ips=ips, user_agents=user_agents)
Success with IP 176.98.95.247:31955, now onto page 1061
Error with IP 45.6.100.250:48214, requesting a new one...
Error with IP 75.98.119.13:57859, requesting a new one...
Error with IP 45.6.100.250:48214, requesting a new one...
Success with IP 41.215.81.170:59959, now onto page 1062
Success with IP 41.215.81.170:59959, now onto page 1063
Error with IP 45.6.100.250:48214, requesting a new one...
Success with IP 87.26.3.40:8080, now onto page 1064
Success with IP 203.205.29.106:39191, now onto page 1065
Success with IP 87.26.3.40:8080, now onto page 1066
We will just reuse the same user agents here
ips = get_ips(20)
xbox_url = 'https://gamefaqs.gamespot.com/boards/691088-xbox-one'
xbox_posts = gamefaqs_scraper(xbox_url, num_pages=710, ips=ips, user_agents=user_agents)
Error with IP 210.11.181.221:55331, requesting a new one...
Error with IP 178.128.217.99:8080, requesting a new one...
Error with IP 31.209.110.159:39494, requesting a new one...
Error with IP 210.11.181.221:55331, requesting a new one...
Error with IP 202.91.92.21:43576, requesting a new one...
Error with IP 5.2.200.145:44508, requesting a new one...
Success with IP 109.201.142.14:3128, now onto page 710
Agent may be banned, removing agent and trying a new one...
Error with IP 124.41.240.191:38167, requesting a new one...
Success with IP 109.201.142.14:3128, now onto page 711
switch_posts = list(set(switch_posts))
playstation_posts = list(set(playstation_posts))
pc_posts = list(set(pc_posts))
xbox_posts = list(set(xbox_posts))
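Converting to a set removes duplicate titles (pages shift while scraping, so the same thread can be picked up twice) but it also discards the order the posts were scraped in. If that order matters, one option is the small helper below; `dedupe_keep_order` is a hypothetical name, and it relies on dictionaries preserving insertion order, which Python guarantees from version 3.7.

```python
def dedupe_keep_order(titles):
    """Drop duplicate titles while keeping the first occurrence of each, in order."""
    return list(dict.fromkeys(titles))
```

For example, `switch_posts = dedupe_keep_order(switch_posts)` would be a drop-in replacement for the `list(set(...))` call.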
# Switch Boards
switch_forums = pd.concat([pd.DataFrame([[title, 'GameFaqs', 'Switch']], columns=['Post', 'Website', 'Board'])
for title in switch_posts],
ignore_index=True)
# PS4 Boards
playstation_forums = pd.concat([pd.DataFrame([[title, 'GameFaqs', 'Playstation 4']], columns=['Post', 'Website', 'Board'])
for title in playstation_posts],
ignore_index=True)
# Xbox One Boards
xbox_forums = pd.concat([pd.DataFrame([[title, 'GameFaqs', 'Xbox One']], columns=['Post', 'Website', 'Board'])
for title in xbox_posts],
ignore_index=True)
# PC Boards
pc_forums = pd.concat([pd.DataFrame([[title, 'GameFaqs', 'PC']], columns=['Post', 'Website', 'Board'])
for title in pc_posts],
ignore_index=True)
# Join all to post_titles
post_titles = pd.concat([post_titles, switch_forums, playstation_forums, xbox_forums, pc_forums], ignore_index=True)
We now have all the posts we want and could display the final results
post_titles
| | Post | Website | Board |
|---|---|---|---|
0 | John @Totalbiscuit Bain July 8, 1984 - May 24,... | Reddit | r/games |
1 | Bungie Splits With Activision | Reddit | r/games |
2 | Totalbiscuit hospitalized, his cancer is sprea... | Reddit | r/games |
3 | [E3 2018] Cyberpunk 2077 | Reddit | r/games |
4 | Sony faces growing Fortnite backlash at E3 | Reddit | r/games |
5 | John “TotalBiscuit” Bain to be inducted into E... | Reddit | r/games |
6 | Later today, Red Dead 2 gets a new trailer. Be... | Reddit | r/games |
7 | List of Video Games where you can pet the dogs | Reddit | r/games |
8 | It's time video game makers unionize. | Reddit | r/games |
9 | Bethesda Support Leaks Fallout 76 Customer Nam... | Reddit | r/games |
10 | Ubisoft will now ban players for racist, homop... | Reddit | r/games |
11 | Fallout 76 – Official Teaser Trailer | Reddit | r/games |
12 | Nintendo of America’s Reggie Fils-Aime to Reti... | Reddit | r/games |
13 | Obsidian's The Outer Worlds blends Firefly and... | Reddit | r/games |
14 | Bethesda offering 500 atoms ($5 ingame store c... | Reddit | r/games |
15 | [E3 2018] The Elder Scrolls VI | Reddit | r/games |
16 | Giantbomb Unlikely to Review Fallout 76. Gerst... | Reddit | r/games |
17 | Report: The Walking Dead developer Telltale Ga... | Reddit | r/games |
18 | Introducing the Xbox Adaptive Controller | Reddit | r/games |
19 | Game dev: Linux users were only 0.1% of sales ... | Reddit | r/games |
20 | Sony's Stubborn Stance on Cross-Play Is Embarr... | Reddit | r/games |
21 | Metro dev: 'if at all all the PC players annou... | Reddit | r/games |
22 | Blizzard Says It Wasn't Expecting Fans To Be T... | Reddit | r/games |
23 | Black Ops 4 adds microtransactions, requiring ... | Reddit | r/games |
24 | This takes it to the next level, Are we really... | Reddit | r/games |
25 | EA Cancels Open-World Star Wars Game | Reddit | r/games |
26 | In the crazy economy of Red Dead Online, baked... | Reddit | r/games |
27 | PlayStation Skipping E3 For First Time in Show... | Reddit | r/games |
28 | Cyberpunk 2077 is a First-Person RPG | Reddit | r/games |
29 | Belgian government opens criminal investigatio... | Reddit | r/games |
... | ... | ... | ... |
107471 | Skyrim vs Kingdom Come Deliverance, which game... | GameFaqs | PC |
107472 | MH World PC port planned for autumn 2018 | GameFaqs | PC |
107473 | Best place to buy cheap Steam key? | GameFaqs | PC |
107474 | Looking to get my A+ certification. | GameFaqs | PC |
107475 | Gears of war 4 help plz | GameFaqs | PC |
107476 | WD Easystore 4TB External Drive $120 At Best Buy | GameFaqs | PC |
107477 | With Injustice 1 and MKXL being the highest se... | GameFaqs | PC |
107478 | Best sandbox building game? Preferably with mu... | GameFaqs | PC |
107479 | Switch to a 2.4ghz connection instead of 5ghz ... | GameFaqs | PC |
107480 | Over-Ear Headphones under $150? | GameFaqs | PC |
107481 | What are your opinions on possible FO3 remaster? | GameFaqs | PC |
107482 | Most enjoyable fighter on pc so far? | GameFaqs | PC |
107483 | Monitor recommendations? | GameFaqs | PC |
107484 | What E3 games will YOU be buying? | GameFaqs | PC |
107485 | Your five most played Steam games? | GameFaqs | PC |
107486 | New Fire Pro Wrestling game coming to Steam | GameFaqs | PC |
107487 | Oculus Go is trash right now. | GameFaqs | PC |
107488 | Need help getting Sonic Heroes to work | GameFaqs | PC |
107489 | Anyone ever bypass Rockstar Social s*** on Steam? | GameFaqs | PC |
107490 | So does Win10 still have that mandatory update... | GameFaqs | PC |
107491 | Windows 10 update broke my graphics driver. AM... | GameFaqs | PC |
107492 | There is a lot of fear mongering by net neutra... | GameFaqs | PC |
107493 | Fallout 76 rust clone? | GameFaqs | PC |
107494 | Please help me out here. | GameFaqs | PC |
107495 | Climbed from bronze 4 to gold 5 | GameFaqs | PC |
107496 | Fellow 2500K users, when are you upgrading/hav... | GameFaqs | PC |
107497 | About to grab the Sennheiser 598, couple last ... | GameFaqs | PC |
107498 | So is it possible to do full body tracking wit... | GameFaqs | PC |
107499 | Modular PSU | GameFaqs | PC |
107500 | Steam dropped Bitcoin payments | GameFaqs | PC |
107501 rows × 3 columns
First I have to make a list of relevant developers and their different nicknames.
# Each inner list starts with the developer's canonical name, followed by any nicknames or
# alternate spellings. Every match is labeled with the first entry in the next step,
# e.g. 'Activision' and 'Blizzard' both map to 'Activision Blizzard'.
developers = [['Tencent'], ['Rockstar'], ['Valve'], ['Sony'], ['Microsoft'], ['Nintendo'], ['Bungie', 'Bunjie'],
              ['Activision Blizzard', 'Activision', 'Activi$ion', 'Blizzard'], ['Electronic Arts', 'EA'],
              ['Bandai Namco', 'Bandai', 'Namco'], ['Ubisoft'], ['Nexon'], ['Telltale'],
              ['Epic Games', 'Epic'], ['BioWare'], ['Naughty Dog'], ['Square Enix', 'Square'],
              ['Insomniac'], ['Bethesda'], ['Capcom'], ['Take-Two', 'Take Two', 'Take 2', 'Take2'],
              ['Sega'], ['Devolver Digital', 'Devolver'], ['Konami'], ['Apple']]
import re
dev_posts = pd.DataFrame(columns=['Post', 'Website', 'Board', 'Developer'])
index = 0
post_dict = {}
for i in range(len(post_titles)):
    all_developers = []
    for dev in developers:
        for nickname in dev:
            # Special case for EA. Common nickname but could also be mixed with common words like "each".
            match = False
            if nickname == 'EA':
                post_title = post_titles['Post'].loc[i]
                # Regex to match EA outside of other words. re.search scans the whole title,
                # whereas re.match would only catch titles that begin with EA.
                if re.search(r'([^a-zA-Z]|^)EA([^a-zA-Z]|$)', post_title):
                    all_developers += [dev[0]]
                    match = True
            else:
                post_title = post_titles['Post'].loc[i].lower()
                if nickname.lower() in post_title:
                    all_developers += [dev[0]]
                    match = True
            if match:
                # Keep the raw titles per developer for the deeper dives later on
                if post_dict.get(dev[0]):
                    post_dict[dev[0]].append(post_titles['Post'].loc[i])
                else:
                    post_dict[dev[0]] = [post_titles['Post'].loc[i]]
                # One matching nickname is enough, move on to the next developer
                break
    if all_developers:
        row = post_titles.loc[i].values.tolist() + [', '.join(all_developers)]
        dev_posts.loc[index] = row
        index += 1
print('Shape, {}'.format(dev_posts.shape))
dev_posts.head()
Shape, (8493, 4)
| | Post | Website | Board | Developer |
|---|---|---|---|---|
0 | Bungie Splits With Activision | Reddit | r/games | Bungie, Activision Blizzard |
1 | Sony faces growing Fortnite backlash at E3 | Reddit | r/games | Sony |
2 | Later today, Red Dead 2 gets a new trailer. Be... | Reddit | r/games | Rockstar, Take-Two |
3 | Bethesda Support Leaks Fallout 76 Customer Nam... | Reddit | r/games | Bethesda |
4 | Ubisoft will now ban players for racist, homop... | Reddit | r/games | Ubisoft |
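The labeling step boils down to two rules: most nicknames are matched as case-insensitive substrings, while "EA" has to stand alone so that words like "each" or "REAL" don't count. Here is a compact sketch of those two rules on a single title. The function name `label_title` and the constant `EA_PATTERN` are my own, and the lookaround-based regex is a swapped-in equivalent of the character-class pattern used in the notebook.

```python
import re

# "EA" not directly preceded or followed by another letter
EA_PATTERN = re.compile(r'(?<![A-Za-z])EA(?![A-Za-z])')

def label_title(title, developers):
    """Return the canonical names of all developers mentioned in a title."""
    found = []
    lowered = title.lower()
    for aliases in developers:
        for alias in aliases:
            if alias == 'EA':
                hit = EA_PATTERN.search(title) is not None
            else:
                hit = alias.lower() in lowered
            if hit:
                found.append(aliases[0])  # first alias is the canonical name
                break
    return found

sample_devs = [['Bungie'], ['Activision Blizzard', 'Activision', 'Blizzard'],
               ['Electronic Arts', 'EA'], ['BioWare']]
print(label_title('Bungie Splits With Activision', sample_devs))
print(label_title('Did EA ruin Bioware or did Bioware ruin itself?', sample_devs))
```

Substring matching is forgiving (it catches "Insomniacs" and "Nintendo's") but it can also over-match short names, which is why only EA gets the stricter treatment here.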
Now we can finally analyze our data and figure out how favorable public opinion is toward each of these developers.
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')
[nltk_data] Downloading package vader_lexicon to [nltk_data] /Users/adrianherrmann/nltk_data...
True
We want to judge sentiments based on the compound score, which is the sum of all lexicon ratings normalized to fall within the range from -1 to 1.
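To make "normalized" a bit more concrete: as far as I understand the VADER reference implementation, the summed word ratings x are squashed with x / sqrt(x² + α), using α = 15. The snippet below is a standalone sketch of that formula only (the function name `normalize` is mine); it leaves out the punctuation and capitalization adjustments VADER applies before summing.

```python
import math

def normalize(score, alpha=15):
    """Squash an unbounded sum of word ratings into the open interval (-1, 1)."""
    return score / math.sqrt(score * score + alpha)

for raw in (-8, -2, 0, 2, 8):
    print(raw, round(normalize(raw), 4))
```

The curve is steep near zero and flattens toward ±1, so one or two strongly rated words are enough to push a short title close to an extreme.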
dev_sentiments = pd.DataFrame(columns=['Mean Sentiment', 'Developer',
                                       'Most Negative Sentence', 'Most Positive Sentence',
                                       'Most Negative Score', 'Most Positive Score',
                                       'Number of Posts'])
index = 0
sid = SentimentIntensityAnalyzer()
for dev in developers:
    titles = dev_posts[dev_posts['Developer'].str.contains(dev[0])]
    if not titles.values.tolist():
        continue
    tot_sentiment = 0
    most_neg_sent = ''
    most_pos_sent = ''
    most_neg_score = 1
    most_pos_score = -1
    for title in titles['Post'].values:
        sentiment = sid.polarity_scores(title)['compound']
        tot_sentiment += sentiment
        if sentiment < most_neg_score:
            most_neg_score = sentiment
            most_neg_sent = title
        if sentiment > most_pos_score:
            most_pos_score = sentiment
            most_pos_sent = title
    mean_sentiment = tot_sentiment / len(titles)
    dev_sentiments.loc[index] = [mean_sentiment, dev[0],
                                 most_neg_sent, most_pos_sent,
                                 most_neg_score, most_pos_score,
                                 len(titles)]
    index += 1
Now that the sentiments are analyzed we can view the important details.
It is important to dive deeper so that you can apply even more specific filtering and sentiment analysis when analyzing one company, which I will be doing a bit of. The post_dict created earlier will help.
dev_sentiments
| | Mean Sentiment | Developer | Most Negative Sentence | Most Positive Sentence | Most Negative Score | Most Positive Score | Number of Posts |
---|---|---|---|---|---|---|---|
0 | 0.138989 | Tencent | Does anyone actually play the crappy F2P games... | Should superior Chinese companies like Tencent... | -0.2960 | 0.8176 | 9 |
1 | -0.043847 | Rockstar | Rockstar Lies & Red Dead Online Economy Is A G... | Ubisoft is a BETTER company than Rockstar! LOL... | -0.8625 | 0.8419 | 167 |
2 | -0.018581 | Valve | Dead before it even released? A valve game?? A... | Artifact is so good, Kotaku writer wants to re... | -0.7041 | 0.8147 | 94 |
3 | 0.031590 | Sony | Sony's Devil May Cry has arrived. Lost Souls A... | Sony wins best Float at PRIDE 2018 | -0.8689 | 0.9008 | 1627 |
4 | 0.054865 | Microsoft | NO! BAD MICROSOFT! I'm so ashamed of you! | Amazing show Microsoft!! My brother even said ... | -0.9191 | 0.8798 | 708 |
5 | 0.072509 | Nintendo | Resident Evil, Resident Evil 0, and Resident E... | Discovered a Nintendo office close to where I ... | -0.9349 | 0.9273 | 3969 |
6 | 0.009765 | Bungie | Activision currently under investigation for f... | LMAO Anthem is the exact same hustle Bungie us... | -0.5859 | 0.6841 | 37 |
7 | 0.006517 | Activision Blizzard | Heroes of the Storm pros vent sadness, anger a... | Thank you Activision for CoD Black Ops 4 Black... | -0.7783 | 0.8555 | 183 |
8 | -0.105926 | Electronic Arts | EA Head Fired For Gross Misconduct | EA are an excellent company that provides chea... | -0.7717 | 0.8316 | 130 |
9 | 0.038390 | Bandai Namco | WTF were Namco Bandai thinking? | Bandai Namco proves to be the best third party... | -0.6739 | 0.7845 | 73 |
10 | 0.057659 | Ubisoft | Ubisoft will now ban players for racist, homop... | Ubisoft is a BETTER company than Rockstar! LOL... | -0.8225 | 0.8419 | 201 |
11 | 0.000000 | Nexon | What Nexon games use NX? | What Nexon games use NX? | 0.0000 | 0.0000 | 1 |
12 | -0.043905 | Telltale | No wonder Telltale Games died a slow painful d... | Would Telltale Be The Best Developer If They H... | -0.9118 | 0.7964 | 109 |
13 | 0.028201 | Epic Games | God of War Has An Epic Avengers Infinity War R... | The Epic Games store is now live - giving away... | -0.7717 | 0.8625 | 189 |
14 | -0.061895 | BioWare | Did EA ruin Bioware or did Bioware ruin itself? | Former Bioware legend Mike Laidlaw praises wit... | -0.8225 | 0.8122 | 60 |
15 | 0.051743 | Naughty Dog | Naughty Dog's lead animator explains in-depth ... | I for one am proud Naughty Dog is displaying E... | -0.6696 | 0.7712 | 72 |
16 | 0.037446 | Square Enix | kingdom hearts 3 deluxe edition....what the HE... | Did Square Enix ever state why they skipped ou... | -0.8793 | 0.8788 | 286 |
17 | 0.064921 | Insomniac | Santa Monica, Gorilla, Insomniac, Sucker Punch... | Ratchet and Clank PS4 is Insomniacs BEST SELLI... | -0.5267 | 0.7125 | 33 |
18 | 0.034053 | Bethesda | Bethesda worst dev/pub of all time,nothing was... | Call me crazy, but Fallout 4's the best Bethes... | -0.8271 | 0.8481 | 247 |
19 | 0.068994 | Capcom | As a Devil My Cry fan, I'm jealous of the way ... | Operation Make DMC Great again is a success! T... | -0.9274 | 0.8999 | 386 |
20 | -0.113245 | Take-Two | Take two/Rockstar will be VERY hard to stop wi... | Take2 CEO on Epic Store: Competition is a good... | -0.6756 | 0.7003 | 20 |
21 | 0.014805 | Sega | Damn sega!!!! Killing it this week!! | Virtua Fighter 5 Final Showdown is free with G... | -0.8507 | 0.8932 | 220 |
22 | 0.016383 | Devolver Digital | Not a Hero hitting Switch Aug 2nd. 12 more Dev... | So who is Devolver Digital and should I care a... | -0.4449 | 0.4939 | 12 |
23 | 0.020019 | Konami | Death Stranding will prove that Konami was rig... | What is the best Konami game you've ever played? | -0.7430 | 0.7650 | 134 |
24 | 0.036826 | Apple | Went shopping for apple products, it's a horri... | Apple Finally Caves, promises to support Steam... | -0.5423 | 0.6486 | 31 |
Now let's take a deeper look at a couple of companies with scores on two opposite ends of the spectrum, Electronic Arts and Nintendo. These two have the second worst and second best scores respectively, but they also have plenty of posts, which the developers with the worst and best scores (Take-Two, 20 posts and Tencent, 9 posts) don't have.
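The choice of EA and Nintendo amounts to ranking developers by mean sentiment after discarding those with too few posts. A small sketch of that filter follows; the cutoff of 100 posts and the function name `rank_with_min_posts` are my own assumptions, while the numbers are copied from a few rows of the table above.

```python
def rank_with_min_posts(rows, min_posts=100):
    """Sort (developer, mean_sentiment, num_posts) rows by sentiment,
    keeping only developers with at least min_posts posts."""
    eligible = [row for row in rows if row[2] >= min_posts]
    return sorted(eligible, key=lambda row: row[1])

rows = [
    ('Tencent', 0.138989, 9),
    ('Take-Two', -0.113245, 20),
    ('Electronic Arts', -0.105926, 130),
    ('Rockstar', -0.043847, 167),
    ('Capcom', 0.068994, 386),
    ('Nintendo', 0.072509, 3969),
]
ranked = rank_with_min_posts(rows)
print('Lowest:', ranked[0][0])    # Lowest: Electronic Arts
print('Highest:', ranked[-1][0])  # Highest: Nintendo
```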
dev_sentiments[dev_sentiments['Developer'] == 'Nintendo']
| | Mean Sentiment | Developer | Most Negative Sentence | Most Positive Sentence | Most Negative Score | Most Positive Score | Number of Posts |
---|---|---|---|---|---|---|---|
5 | 0.072509 | Nintendo | Resident Evil, Resident Evil 0, and Resident E... | Discovered a Nintendo office close to where I ... | -0.9349 | 0.9273 | 3969 |
print('Nintendo\'s most negative sentence:\n' +
dev_sentiments[dev_sentiments['Developer'] == 'Nintendo']['Most Negative Sentence'].values[0] + '\n')
print('Nintendo\'s most positive sentence:\n' +
dev_sentiments[dev_sentiments['Developer'] == 'Nintendo']['Most Positive Sentence'].values[0] + '\n')
Nintendo's most negative sentence:
Resident Evil, Resident Evil 0, and Resident Evil 4 coming to Nintendo Switch in 2019

Nintendo's most positive sentence:
Discovered a Nintendo office close to where I live and asked if they had any kind of tour or something. Lady told me they hadn’t but she handed me a bag full of cool souvenirs. This coin is definitely the best of all!
For Nintendo, the worst post, which is in fact the most negatively rated title across all developers, appears to be rated that way only because it mentions the game "Resident Evil" multiple times. The title itself is just a release announcement, which only reinforces their high overall score.
Nintendo being so well liked comes as no surprise. They arguably have the most devout following of any modern gaming company. Many people grew up on Nintendo as children and continue to play their games as adults, and some stick strictly to Nintendo.
Let's get the word frequencies from Nintendo posts.
from collections import Counter
neg_titles = []
for title in post_dict['Nintendo']:
    sentiment = sid.polarity_scores(title)['compound']
    if sentiment <= -0.5:
        neg_titles.append(title)
nintendo_words = ' '.join(neg_titles).split(' ')
neu_words=[]
for word in nintendo_words:
    sentiment = sid.polarity_scores(word)['compound']
    if (sentiment >= -0.4 and sentiment <= 0.4):
        neu_words.append(word.lower())
neu_freq = Counter(neu_words)
print('Most common neutral words in negative titles: ', neu_freq.most_common(50))
Most common neutral words in negative titles: [('nintendo', 187), ('the', 71), ('to', 52), ('is', 46), ('switch', 35), ('a', 32), ('of', 29), ('and', 26), ('for', 26), ('why', 26), ('in', 19), ("nintendo's", 19), ('on', 19), ('online', 19), ('you', 19), ('it', 16), ('has', 15), ('have', 14), ('do', 14), ('that', 13), ('so', 13), ('i', 12), ('are', 12), ('what', 11), ('-', 10), ('up', 10), ('games', 10), ('with', 10), ('this', 10), ('will', 9), ('sony', 9), ('if', 9), ('console', 9), ('does', 8), ('e3', 8), ('be', 8), ('an', 7), ('about', 7), ('at', 7), ('get', 7), ('nintendo?', 7), ('game', 7), ('think', 6), ('was', 6), ('or', 6), ('would', 6), ('not', 6), ('how', 6), ('most', 6), ('did', 6)]
Going down the list we see some common words, but then notice one which should definitely not be common: "online".
Here are some of the titles with negative sentiment that contain "online". There are 19 such posts in total.
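index = 0
bad_count = 0
# Print the first 10 matching titles, stopping early if the list runs out
while bad_count < 10 and index < len(neg_titles):
    if 'online' in neg_titles[index].lower():
        print(neg_titles[index])
        bad_count += 1
    index += 1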
Jim Sterling: The Online system makes nintendo look weak and stupid
Nintendo Switch Paid Online Still a Disaster? - Nintendo Direct Review
So nintendo online was a scam
Everytime I finish a mission in Resident Evil a Nintendo Online message appears
Would it of killed Nintendo to add promotion SNES titles to new Online Subs?
Nintendo's paid online is bad. FACT.
scumbag nintendo wont let me try the darksouls demo without online.
Will Nintendo Switch Online kill multiplayer lobbies?
Nintendo would be dumb to not have an online paywall TBH.
What the hell does Nintendo online even include?
From this it's clear that one major complaint about Nintendo is the online system they have in place. If there was one thing they could do to please their base, it would be to address the paywall and offer more with their online subscription, which these posts describe as lackluster. We can figure all of this out just from the post titles.
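The frequency list above is dominated by filler such as "the", "to" and "is", and punctuation splits the same word into several entries ("nintendo" and "nintendo?"). One way to surface words like "online" faster is to strip punctuation and drop a stop-word list before counting. This is only a sketch: `content_word_counts` and the tiny `STOPWORDS` set are hypothetical, and unlike the notebook's version it does not use VADER to filter out sentiment-bearing words.

```python
from collections import Counter
import string

# A deliberately small, hand-picked stop-word list for illustration
STOPWORDS = {'the', 'to', 'is', 'a', 'of', 'and', 'for', 'in', 'on', 'it',
             'so', 'was', 'what', 'does', 'even', 'be', 'not', 'an', 'would'}

def content_word_counts(titles, stopwords=STOPWORDS):
    """Count lower-cased words, ignoring surrounding punctuation and stop words."""
    counts = Counter()
    for title in titles:
        for raw in title.lower().split():
            word = raw.strip(string.punctuation)
            if word and word not in stopwords:
                counts[word] += 1
    return counts

sample = ["Nintendo's paid online is bad. FACT.",
          "So nintendo online was a scam",
          "What the hell does Nintendo online even include?"]
print(content_word_counts(sample).most_common(3))
```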
dev_sentiments[dev_sentiments['Developer'] == 'Electronic Arts']
| | Mean Sentiment | Developer | Most Negative Sentence | Most Positive Sentence | Most Negative Score | Most Positive Score | Number of Posts |
---|---|---|---|---|---|---|---|
8 | -0.105926 | Electronic Arts | EA Head Fired For Gross Misconduct | EA are an excellent company that provides chea... | -0.7717 | 0.8316 | 130 |
print('Electronic Arts\'s most negative sentence:\n' +
dev_sentiments[dev_sentiments['Developer'] == 'Electronic Arts']['Most Negative Sentence'].values[0] + '\n')
print('Electronic Arts\'s most positive sentence:\n' +
dev_sentiments[dev_sentiments['Developer'] == 'Electronic Arts']['Most Positive Sentence'].values[0] + '\n')
Electronic Arts's most negative sentence:
EA Head Fired For Gross Misconduct

Electronic Arts's most positive sentence:
EA are an excellent company that provides cheap access to a lot of great games
Unlike Nintendo, many people online really hate EA. In fact, they have the most downvoted comment of any post in Reddit history, which should be a testament to how negatively they are seen. Still, they continue to be pretty successful. Apex Legends, a new game they recently released, seems to be gaining rapid popularity. Public opinion on the way they monetize their games seems to be changing, which may be a good indicator that people will once again have a positive attitude towards the company.
Similarly, I am going to check the word frequencies in EA's negative titles.
neg_titles = []
for title in post_dict['Electronic Arts']:
    sentiment = sid.polarity_scores(title)['compound']
    if sentiment <= -0.5:
        neg_titles.append(title)
ea_words = ' '.join(neg_titles).split(' ')
neu_words=[]
for word in ea_words:
    sentiment = sid.polarity_scores(word)['compound']
    if (sentiment >= -0.4 and sentiment <= 0.4):
        neu_words.append(word.lower())
neu_freq = Counter(neu_words)
print('Most common neutral words in negative titles: ', neu_freq.most_common(75))
Most common neutral words in negative titles: [('ea', 21), ('for', 9), ('star', 6), ('game', 6), ('is', 6), ('of', 4), ('open', 4), ('world', 4), ('hiring', 3), ('under', 3), ('investigation', 3), ('to', 3), ('the', 3), ('cancels', 2), ('open-world', 2), ('an', 2), ('anthem', 2), ('video', 2), ('removed', 2), ('another', 2), ('sell', 2), ('lootboxes', 2), ('in', 2), ('belgium', 2), ('games', 2), ('are', 2), ('stocks', 2), ('by', 2), ('bfv', 2), ('sales', 2), ('people', 2), ('ea:', 1), ('youtube', 1), ("creator's", 1), ('disclosure', 1), ('not', 1), ('content', 1), ('head', 1), ('misconduct', 1), ('zelda', 1), ('botw', 1), ('mass', 1), ("effect's", 1), ('franchise', 1), ('continuing', 1), ('automatically', 1), ('loses.', 1), ('conference.', 1), ('and', 1), ('got', 1), ('downgraded', 1), ('says', 1), ('singleplayer', 1), ('god', 1), ('goty.', 1), ('massive', 1), ('blow', 1), ('&', 1), ('plummet!', 1), ('should', 1), ('space', 1), ('ip', 1), ('falling', 1), ('apart.', 1), ('we', 1), ('plummet', 1), ('$21', 1), ('billion.', 1), ('blame', 1), ("ea's", 1), ('cancelled', 1), ('vancouver', 1), ('execs', 1), ('dump', 1), ('millions', 1)]
Because EA has a smaller sample size we should look at multiple words to get some more intuition.
index = 0
bad_count = 0
# Print the first 10 matching titles, stopping early if the list runs out
while bad_count < 10 and index < len(neg_titles):
    if 'star' in neg_titles[index].lower() \
       or 'anthem' in neg_titles[index].lower() \
       or 'lootboxes' in neg_titles[index].lower():
        print(neg_titles[index])
        bad_count += 1
    index += 1
EA Cancels Open-World Star Wars Game
EA Is Hiring For An Open-World Star Wars Game
EA: YouTube creator's Anthem video removed for disclosure failure, not content
EA is under criminal investigation for continuing to sell Lootboxes in Belgium
EA Automatically loses. Horrible Conference. And Anthem got DOWNGRADED
EA's Open World Star Wars Game Cancelled
EA Vancouver hiring for open world Star Wars game
EA is hiring people for an open world star wars game...
EA cancels open world Star Wars game
EA is under criminal investigation by the Belgium government for FIFA lootboxes
The most critical complaints are about Anthem and lootboxes, while the negative sentiment around Star Wars seems to be more of a disappointment that a game was cancelled, given the several posts on the topic. Again, the word counts are small because there weren't many posts, but we can still extract a good amount of information.
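The keyword lookups used for both companies follow the same pattern, so they could be folded into one small helper. The sketch below is one way to write it; `titles_mentioning` is a hypothetical name, and the sample titles are taken from the output above. Keep in mind that this is plain substring matching, so a keyword like "star" would also match "Rockstar".

```python
from itertools import islice

def titles_mentioning(titles, keywords, limit=10):
    """Return up to `limit` titles containing any keyword (case-insensitive substring)."""
    lowered_keywords = [k.lower() for k in keywords]
    matches = (t for t in titles
               if any(k in t.lower() for k in lowered_keywords))
    # islice stops early and never raises if there are fewer matches than limit
    return list(islice(matches, limit))

sample = ['EA Cancels Open-World Star Wars Game',
          'EA Head Fired For Gross Misconduct',
          'EA Automatically loses. Horrible Conference. And Anthem got DOWNGRADED']
for title in titles_mentioning(sample, ['star', 'anthem', 'lootboxes']):
    print(title)
```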
There is plenty more that can be done, like getting more data from the comments. This would give much more input and allow us to view even more opinionated posts, meaning a better picture of how people feel about different companies. This is at least a taste of what can be done.