runtime_meta()
Updated 2018-02-27 14:50:36.395425 By leonyin Using Python 3.6.1 On Darwin-17.4.0-x86_64-i386-64bit
A scraper for the usnpl site for newspapers, magazines, and college papers.
The scraper collects the following fields:

Geography - The state of the media source.
Name - The name of the media source.
Facebook - The URL of the Facebook account.
Twitter_Name - The screen name of the Twitter account.
Twitter_ID - The numeric user ID of the Twitter account.
Website - The URL of the media source's website.
Medium - The format of the media source (newspaper, magazine, or college newspaper).
The output is publicly available as a csv here.
... and you can access it from the Web using Pandas:
import pandas as pd
link = 'https://raw.githubusercontent.com/yinleon/usnpl/master/data/usnpl_newspapers_twitter_ids.csv'
df = pd.read_csv(link, dtype={'Twitter_ID' : str})
df.head()
| Name | Medium | Website | Facebook | Twitter_Name | Twitter_ID | Geography
---|---|---|---|---|---|---|---
0 | Alaska Dispatch News | Newspapers | http://www.adn.com | https://www.facebook.com/akdispatch | adndotcom | 15828025 | AK
1 | Alaska Journal of Commerce | Newspapers | http://www.alaskajournal.com | https://www.facebook.com/AlaskaJournal | alaskajournal | 341639834 | AK
2 | Anchorage Press | Newspapers | http://www.anchoragepress.com | https://www.facebook.com/anchoragepress | anchoragepress | 17761344 | AK
3 | Petroleum News | Newspapers | http://www.petroleumnews.com | https://www.facebook.com/PetroleumNews | NaN | NaN | AK
4 | Delta Discovery | Newspapers | http://www.deltadiscovery.com | https://www.facebook.com/deltadiscovery | NaN | NaN | AK
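The dtype={'Twitter_ID' : str} argument matters: some outlets have no Twitter account, and the resulting missing values would otherwise force pandas to read the ID column as floats. A small illustration, using a two-row CSV made up for this example:

```python
import io

import pandas as pd

# A made-up two-row CSV with one missing Twitter_ID.
csv = "Name,Twitter_ID\nAlaska Dispatch News,15828025\nPetroleum News,\n"

# Without a dtype, the missing value forces the column to float64,
# so the ID becomes 15828025.0.
df_float = pd.read_csv(io.StringIO(csv))

# Reading the column as str keeps the ID intact (the missing value stays NaN).
df_str = pd.read_csv(io.StringIO(csv), dtype={'Twitter_ID': str})

print(df_float['Twitter_ID'].dtype)   # float64
print(df_str['Twitter_ID'].iloc[0])   # 15828025
```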
The Python modules required to run this notebook are listed in requirements.txt, and can be installed from within the notebook with the command below:
!pip install -r requirements.txt
With the requirements installed, you can use the code below to scrape the website.
import os
import time
import requests
import tweepy
import pandas as pd
from tqdm import tqdm_notebook as tqdm
from bs4 import BeautifulSoup
states = '''ak al ar az ca co ct dc de fl ga hi ia id il in ks ky la ma md me mi mn mo ms mt nc nd ne nh nj nm nv ny oh ok or pa ri sc sd tn tx ut va vt wa wi wv wy '''
states = states.split()
def parse_row(soup):
    '''
    For each media publication in the html,
    we're going to strip the city name, the publication name,
    the website url, and social links (if they exist).

    The input `soup` is a BeautifulSoup object.
    The output is a dict of the parsed fields.
    '''
    city = soup.find('b').text
    name = soup.find('a').text
    web = soup.find('a').get('href')
    fb = soup.find('a', text='F')
    if fb:
        fb = fb.get('href')
    tw = soup.find('a', text='T')
    if tw:
        tw = tw.get('href').replace('http://www.twitter.com/', '').rstrip('/')
    return {
        'Facebook': fb,
        'Twitter_Name': tw,
        'Name': name,
        'Website': web
    }
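To sanity-check the parser, we can feed it a hand-written row in the same shape as the usnpl markup (the HTML below is a made-up example, not real markup from the site, and the function body is repeated so the snippet runs standalone):

```python
from bs4 import BeautifulSoup


def parse_row(soup):
    # Same logic as above, repeated so this snippet is self-contained.
    name = soup.find('a').text
    web = soup.find('a').get('href')
    fb = soup.find('a', text='F')
    if fb:
        fb = fb.get('href')
    tw = soup.find('a', text='T')
    if tw:
        tw = tw.get('href').replace('http://www.twitter.com/', '').rstrip('/')
    return {'Facebook': fb, 'Twitter_Name': tw, 'Name': name, 'Website': web}


# A hand-written row mimicking the site's structure: city in <b>,
# the publication link first, then "F" and "T" social links.
row = ('<b>Anchorage</b> '
       '<a href="http://www.adn.com">Alaska Dispatch News</a> '
       '<a href="https://www.facebook.com/akdispatch">F</a> '
       '<a href="http://www.twitter.com/adndotcom/">T</a>')

entry = parse_row(BeautifulSoup(row, 'html.parser'))
print(entry['Name'], entry['Twitter_Name'])  # Alaska Dispatch News adndotcom
```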
We will apply that function to the sections of each page where the data lives. The output of each section is appended to the list sites, building a list of dictionaries.

Line by line: we loop through each 2-letter state abbreviation, fetch the HTML of that state's page, soupify it (so that we can parse it), parse out the fields we're interested in, and append those results to a list. The if/elif branches on medium differ because the html that holds each medium's listings is structured differently.
sites = []
for state in states:
    url = 'http://www.usnpl.com/{}news.php'.format(state)
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'lxml')
    data_possibilities = soup.find_all('div', {"id": 'data_box'})
    for i, raw_table in enumerate(data_possibilities[1:]):
        j = 1 if i == 0 else 0
        medium = raw_table.find('h3').text
        if medium == 'Newspapers':
            data_table = str(raw_table).split('<br/><br/>\n</div>\n')[j]
            entries_to_parse = data_table.rstrip('</div>').split('\n<br/>\n')
        elif medium in ['Magazines', 'College Newspapers']:
            data_table = str(raw_table).split('<title>Untitled Document</title>')[1]
            entries_to_parse = data_table.rstrip('</div>').split('\n<br/>\n')
        else:
            break
        for row in tqdm(entries_to_parse):
            row = row.strip('\r').strip('\n')
            if row:
                entry = parse_row(BeautifulSoup(row, 'lxml'))
                entry['Geography'] = state.upper()
                entry['Medium'] = medium
                sites.append(entry)
    time.sleep(1)
That list of dictionaries is the perfect format to load into a Pandas dataframe.
df = pd.DataFrame(sites)
df['Website'] = df['Website'].str.rstrip('/')
df.head()
| Facebook | Geography | Medium | Name | Twitter_Name | Website
---|---|---|---|---|---|---
0 | https://www.facebook.com/akdispatch | AK | Newspapers | Alaska Dispatch News | adndotcom | http://www.adn.com
1 | https://www.facebook.com/AlaskaJournal | AK | Newspapers | Alaska Journal of Commerce | alaskajournal | http://www.alaskajournal.com
2 | https://www.facebook.com/anchoragepress | AK | Newspapers | Anchorage Press | anchoragepress | http://www.anchoragepress.com
3 | https://www.facebook.com/PetroleumNews | AK | Newspapers | Petroleum News | None | http://www.petroleumnews.com
4 | https://www.facebook.com/deltadiscovery | AK | Newspapers | Delta Discovery | None | http://www.deltadiscovery.com
# Let's save it
df.to_csv('data/usnpl_newspapers.csv', index=False)
We want to get the Twitter ID in addition to the screen name. This section uses Tweepy to fetch that info from the Twitter API.
# fill these in!
consumer_key = ''
consumer_secret = ''
access_key = ''
access_secret = ''
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_key, access_secret)
api = tweepy.API(auth,
                 wait_on_rate_limit=True,
                 wait_on_rate_limit_notify=True)
# filter the media outlets for unique Twitter Names to query.
twitter_names = df[~df['Twitter_Name'].isnull()]['Twitter_Name'].unique()
This is an example of an API call for user metadata, and what kind of info we get back.
user = api.get_user(screen_name=twitter_names[0])
user._json
{'contributors_enabled': False, 'created_at': 'Tue Aug 12 20:59:03 +0000 2008', 'default_profile': False, 'default_profile_image': False, 'description': "Alaska's largest news site and newspaper. (Formerly Alaska Dispatch News) Newstips@adn.com or call 907-257-4301.", 'entities': {'description': {'urls': []}, 'url': {'urls': [{'display_url': 'adn.com', 'expanded_url': 'http://adn.com', 'indices': [0, 23], 'url': 'https://t.co/brBsUYJrYV'}]}}, 'favourites_count': 215, 'follow_request_sent': False, 'followers_count': 70902, 'following': False, 'friends_count': 6093, 'geo_enabled': True, 'has_extended_profile': False, 'id': 15828025, 'id_str': '15828025', 'is_translation_enabled': False, 'is_translator': False, 'lang': 'en', 'listed_count': 1383, 'location': 'Anchorage, Alaska', 'name': 'Anchorage Daily News', 'notifications': False, 'profile_background_color': 'FFFFFF', 'profile_background_image_url': 'http://pbs.twimg.com/profile_background_images/15686233/1-tweet.jpg', 'profile_background_image_url_https': 'https://pbs.twimg.com/profile_background_images/15686233/1-tweet.jpg', 'profile_background_tile': True, 'profile_banner_url': 'https://pbs.twimg.com/profile_banners/15828025/1458879223', 'profile_image_url': 'http://pbs.twimg.com/profile_images/932121310108442624/NY7s0rqO_normal.jpg', 'profile_image_url_https': 'https://pbs.twimg.com/profile_images/932121310108442624/NY7s0rqO_normal.jpg', 'profile_link_color': '0084B4', 'profile_location': None, 'profile_sidebar_border_color': 'FFFFFF', 'profile_sidebar_fill_color': 'DDFFCC', 'profile_text_color': '333333', 'profile_use_background_image': False, 'protected': False, 'screen_name': 'adndotcom', 'status': {'contributors': None, 'coordinates': None, 'created_at': 'Tue Mar 06 06:20:46 +0000 2018', 'entities': {'hashtags': [{'indices': [15, 24], 'text': 'Iditarod'}], 'symbols': [], 'urls': [], 'user_mentions': [{'id': 919941379, 'id_str': '919941379', 'indices': [3, 13], 'name': 'Sports at ADN', 'screen_name': 
'sportsadn'}]}, 'favorite_count': 0, 'favorited': False, 'geo': None, 'id': 970907016972480512, 'id_str': '970907016972480512', 'in_reply_to_screen_name': None, 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'is_quote_status': False, 'lang': 'en', 'place': None, 'retweet_count': 10, 'retweeted': False, 'retweeted_status': {'contributors': None, 'coordinates': None, 'created_at': 'Tue Mar 06 06:14:24 +0000 2018', 'entities': {'hashtags': [{'indices': [0, 9], 'text': 'Iditarod'}], 'symbols': [], 'urls': [{'display_url': 'twitter.com/i/web/status/9…', 'expanded_url': 'https://twitter.com/i/web/status/970905415071363072', 'indices': [117, 140], 'url': 'https://t.co/qvX7Q3dBns'}], 'user_mentions': []}, 'favorite_count': 18, 'favorited': False, 'geo': None, 'id': 970905415071363072, 'id_str': '970905415071363072', 'in_reply_to_screen_name': None, 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'is_quote_status': False, 'lang': 'en', 'place': None, 'possibly_sensitive': False, 'retweet_count': 10, 'retweeted': False, 'source': '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>', 'text': "#Iditarod musher Mitch Seavey is second musher to leave Rohn, but he's running without one of his top dogs from las… https://t.co/qvX7Q3dBns", 'truncated': True}, 'source': '<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>', 'text': "RT @sportsadn: #Iditarod musher Mitch Seavey is second musher to leave Rohn, but he's running without one of his top dogs from last year's…", 'truncated': False}, 'statuses_count': 50552, 'time_zone': 'Alaska', 'translator_type': 'none', 'url': 'https://t.co/brBsUYJrYV', 'utc_offset': -32400, 'verified': True}
I wonder what we can win in the raffle w/ our hunting license?
You don't need a hunting license to scrape the web and do some wacky analysis!
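That payload is large, but only a couple of its fields end up in our table. A minimal sketch with a stand-in dict (trimmed from the JSON above) in place of user._json:

```python
# Stand-in for user._json, trimmed to a few fields from the output above.
user_json = {'id_str': '15828025',
             'screen_name': 'adndotcom',
             'followers_count': 70902,
             'verified': True}

# The record shape we will collect for each outlet.
record = {'Twitter_ID': user_json['id_str'],
          'Twitter_Name': user_json['screen_name']}
print(record)  # {'Twitter_ID': '15828025', 'Twitter_Name': 'adndotcom'}
```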
Let's iterate through the unique Twitter usernames from our media outlets, make the API call, and store the results we're interested in as a dictionary.
from tweepy import TweepError

user_ids = []
for screen_name in tqdm(twitter_names):
    try:
        user = api.get_user(screen_name=screen_name)
        user_id = user.id_str
    except TweepError:
        user_id = None
    user_ids.append({
        'Twitter_ID': user_id,
        'Twitter_Name': screen_name
    })
Exception in thread Thread-158: Traceback (most recent call last): File "/anaconda3/lib/python3.6/threading.py", line 916, in _bootstrap_inner self.run() File "/anaconda3/lib/python3.6/site-packages/tqdm/_tqdm.py", line 144, in run for instance in self.tqdm_cls._instances: File "/anaconda3/lib/python3.6/_weakrefset.py", line 60, in __iter__ for itemref in self.data: RuntimeError: Set changed size during iteration
Rate limit reached. Sleeping for: 605
Rate limit reached. Sleeping for: 664
Rate limit reached. Sleeping for: 578
Rate limit reached. Sleeping for: 650
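As the output shows, one user per call burns through the rate limit quickly. Twitter's users/lookup endpoint accepts up to 100 screen names per request, which tweepy wraps as api.lookup_users. A hedged sketch of the batching (the chunks helper below is our own, and the API call is commented out because it needs live credentials; note lookup_users silently skips names that don't resolve):

```python
def chunks(seq, size=100):
    # Yield successive slices of at most `size` items.
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

# With credentials configured, the per-user loop above could become:
# for batch in chunks(list(twitter_names)):
#     for user in api.lookup_users(screen_names=batch):
#         user_ids.append({'Twitter_ID': user.id_str,
#                          'Twitter_Name': user.screen_name})

print([len(b) for b in chunks(list(range(250)))])  # [100, 100, 50]
```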
Let's save the output, and merge it back into the dataframe of USNPL fields we parsed out.
df_users = pd.DataFrame(user_ids)
df_users.to_csv('data/twitter_users.csv', index=False)
df_merge = df.merge(df_users, on='Twitter_Name', how='left')[['Name', 'Medium', 'Website', 'Facebook', 'Twitter_Name','Twitter_ID', 'Geography']]
df_merge.to_csv('data/usnpl_newspapers_twitter_ids.csv', index=False)
And that is how we scrape a website and get a nice csv.
df_merge[df_merge['Name'] == 'Fairbanks Daily News-Miner']
| Name | Medium | Website | Facebook | Twitter_Name | Twitter_ID | Geography
---|---|---|---|---|---|---|---
12 | Fairbanks Daily News-Miner | Newspapers | http://www.newsminer.com | https://www.facebook.com/fairbanksDNM | newsminer | 16555200 | AK
We can aggregate this data to get an idea of the media landscape across states.
%matplotlib inline
df_ = df_merge[df_merge['Geography'].isin(['WY', 'AZ', 'MA', 'NY'])]
df_ = pd.crosstab(df_['Geography'], df_['Medium'])
df_.plot(kind='barh', title="Num of State-Level Media Outlets");