In [104]:
Updated 2018-02-27 14:50:36.395425
By leonyin
Using Python 3.6.1
On Darwin-17.4.0-x86_64-i386-64bit


A scraper for the USNPL site, which lists newspapers, magazines, and college papers.

View this on Github.
View this on NBViewer.
Visit my lab's website.

The output has the following fields:

Geography - The state of the media source.
Name - The name of the media source.
Facebook - The URL of the Facebook account.
Twitter_Name - The screen name of the Twitter account.
Twitter_ID - The numeric user ID of the Twitter account.
Website - The URL of the media source's website.
Medium - The format of the media source (newspaper, magazine, or college newspaper).

The output is publicly available as a CSV here.
... and you can access it from the web using Pandas:

In [46]:
import pandas as pd

link = ''
df = pd.read_csv(link, dtype={'Twitter_ID' : str})
df.head()

Name Medium Website Facebook Twitter_Name Twitter_ID Geography
0 Alaska Dispatch News Newspapers adndotcom 15828025 AK
1 Alaska Journal of Commerce Newspapers alaskajournal 341639834 AK
2 Anchorage Press Newspapers anchoragepress 17761344 AK
3 Petroleum News Newspapers NaN NaN AK
4 Delta Discovery Newspapers NaN NaN AK

The Python modules required to run this notebook are listed in this file, and can be installed from within the notebook with the command below:

In [41]:
!pip install -r requirements.txt

With the requirements downloaded, you can use the code below to scrape the website.

In [1]:
import os
import time
import requests

import tweepy
import pandas as pd
from tqdm import tqdm_notebook as tqdm
from bs4 import BeautifulSoup
In [2]:
states = '''ak al ar az ca co ct dc de fl ga hi ia id il in ks ky la ma md me mi mn mo ms mt nc nd ne nh nj nm nv ny oh ok or pa ri sc sd tn tx ut va vt wa wi wv wy'''
states = states.split()
In [3]:
def parse_row(soup):
    '''
    For each media publication in the html,
    we're going to strip the city name, the publication name,
    the website url, and social links (if they exist).
    The input `soup` is a BeautifulSoup object.
    The output is a dict of the parsed fields.
    '''
    city = soup.find('b').text
    name = soup.find('a').text
    web = soup.find('a').get('href')
    fb = soup.find('a', text='F')
    if fb:
        fb = fb.get('href')
    tw = soup.find('a', text='T')
    if tw:
        # the href is a full Twitter URL; keep just the screen name
        tw = tw.get('href').rstrip('/').split('/')[-1]
    return {
        'Facebook' : fb,
        'Twitter_Name' : tw,
        'Name' : name,
        'Website' : web
    }

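As a quick sanity check, here is a minimal sketch of what parse_row returns for a made-up row of markup. The HTML below is an invented example that mimics the fields parse_row looks for, not the actual USNPL markup:

demo = ('<b>Anchorage</b> <a href="https://www.adn.com/">Alaska Dispatch News</a> '
        '<a href="https://www.facebook.com/adndotcom">F</a> '
        '<a href="https://twitter.com/adndotcom">T</a>')
parse_row(BeautifulSoup(demo, 'lxml'))
# {'Facebook': 'https://www.facebook.com/adndotcom',
#  'Twitter_Name': 'adndotcom',
#  'Name': 'Alaska Dispatch News',
#  'Website': 'https://www.adn.com/'}
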
We will apply that function to the sections of the website where the data lives. The parsed output of each section is appended to the list sites, creating a list of dictionaries.

Line-by-line, we loop through each 2-letter state abbreviation, collect the HTML of that state's page, soupify it (so that we can parse it), parse out the fields we're interested in, and append the results to a list. The if/elif branches on medium differ because the HTML holding that information also differs.

In [45]:
sites = []
for state in states:
    url = '{}news.php'.format(state)
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'lxml')
    data_possibilities = soup.find_all('div', {'id': 'data_box'})
    for i, raw_table in enumerate(data_possibilities[1:]):
        j = 1 if i == 0 else 0
        medium = raw_table.find('h3').text
        if medium == 'Newspapers':
            data_table = str(raw_table).split('<br/><br/>\n</div>\n')[j]
            entries_to_parse = data_table.rstrip('</div>').split('\n<br/>\n')
        elif medium in ['Magazines', 'College Newspapers']:
            data_table = str(raw_table).split('<title>Untitled Document</title>')[1]
            entries_to_parse = data_table.rstrip('</div>').split('\n<br/>\n')
        for row in tqdm(entries_to_parse):
            row = row.strip('\r').strip('\n')
            if row:
                entry = parse_row(BeautifulSoup(row, 'lxml'))
                entry['Geography'] = state.upper()
                entry['Medium'] = medium
                sites.append(entry)

That list of dictionaries is now a perfect format to place into a Pandas dataframe.

In [5]:
df = pd.DataFrame(sites)
In [8]:
df['Website'] = df['Website'].str.rstrip('/')
In [9]:
df.head()
Facebook Geography Medium Name Twitter_Name Website
0 AK Newspapers Alaska Dispatch News adndotcom
1 AK Newspapers Alaska Journal of Commerce alaskajournal
2 AK Newspapers Anchorage Press anchoragepress
3 AK Newspapers Petroleum News None
4 AK Newspapers Delta Discovery None
In [10]:
# Let's save it
df.to_csv('data/usnpl_newspapers.csv', index=False)

Getting Twitter User IDs

We want to get the Twitter ID in addition to the screen name. This section uses Tweepy to get that info.

In [20]:
# fill these in!
consumer_key = ''
consumer_secret = ''
access_key = ''
access_secret = ''
In [22]:
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_key, access_secret)
api = tweepy.API(auth,
                 wait_on_rate_limit=True,
                 wait_on_rate_limit_notify=True)
In [23]:
# filter the media outlets for unique Twitter Names to query.
twitter_names = df[~df['Twitter_Name'].isnull()]['Twitter_Name'].unique()

This is an example of an API call for user metadata, and what kind of info we get back.

In [24]:
user = api.get_user(twitter_names[0])
In [25]:
user._json
{'contributors_enabled': False,
 'created_at': 'Tue Aug 12 20:59:03 +0000 2008',
 'default_profile': False,
 'default_profile_image': False,
 'description': "Alaska's largest news site and newspaper. (Formerly Alaska Dispatch News) [email protected] or call 907-257-4301.",
 'entities': {'description': {'urls': []},
  'url': {'urls': [{'display_url': '',
     'expanded_url': '',
     'indices': [0, 23],
     'url': ''}]}},
 'favourites_count': 215,
 'follow_request_sent': False,
 'followers_count': 70902,
 'following': False,
 'friends_count': 6093,
 'geo_enabled': True,
 'has_extended_profile': False,
 'id': 15828025,
 'id_str': '15828025',
 'is_translation_enabled': False,
 'is_translator': False,
 'lang': 'en',
 'listed_count': 1383,
 'location': 'Anchorage, Alaska',
 'name': 'Anchorage Daily News',
 'notifications': False,
 'profile_background_color': 'FFFFFF',
 'profile_background_image_url': '',
 'profile_background_image_url_https': '',
 'profile_background_tile': True,
 'profile_banner_url': '',
 'profile_image_url': '',
 'profile_image_url_https': '',
 'profile_link_color': '0084B4',
 'profile_location': None,
 'profile_sidebar_border_color': 'FFFFFF',
 'profile_sidebar_fill_color': 'DDFFCC',
 'profile_text_color': '333333',
 'profile_use_background_image': False,
 'protected': False,
 'screen_name': 'adndotcom',
 'status': {'contributors': None,
  'coordinates': None,
  'created_at': 'Tue Mar 06 06:20:46 +0000 2018',
  'entities': {'hashtags': [{'indices': [15, 24], 'text': 'Iditarod'}],
   'symbols': [],
   'urls': [],
   'user_mentions': [{'id': 919941379,
     'id_str': '919941379',
     'indices': [3, 13],
     'name': 'Sports at ADN',
     'screen_name': 'sportsadn'}]},
  'favorite_count': 0,
  'favorited': False,
  'geo': None,
  'id': 970907016972480512,
  'id_str': '970907016972480512',
  'in_reply_to_screen_name': None,
  'in_reply_to_status_id': None,
  'in_reply_to_status_id_str': None,
  'in_reply_to_user_id': None,
  'in_reply_to_user_id_str': None,
  'is_quote_status': False,
  'lang': 'en',
  'place': None,
  'retweet_count': 10,
  'retweeted': False,
  'retweeted_status': {'contributors': None,
   'coordinates': None,
   'created_at': 'Tue Mar 06 06:14:24 +0000 2018',
   'entities': {'hashtags': [{'indices': [0, 9], 'text': 'Iditarod'}],
    'symbols': [],
    'urls': [{'display_url': '…',
      'expanded_url': '',
      'indices': [117, 140],
      'url': ''}],
    'user_mentions': []},
   'favorite_count': 18,
   'favorited': False,
   'geo': None,
   'id': 970905415071363072,
   'id_str': '970905415071363072',
   'in_reply_to_screen_name': None,
   'in_reply_to_status_id': None,
   'in_reply_to_status_id_str': None,
   'in_reply_to_user_id': None,
   'in_reply_to_user_id_str': None,
   'is_quote_status': False,
   'lang': 'en',
   'place': None,
   'possibly_sensitive': False,
   'retweet_count': 10,
   'retweeted': False,
   'source': '<a href="" rel="nofollow">Twitter Web Client</a>',
   'text': "#Iditarod musher Mitch Seavey is second musher to leave Rohn, but he's running without one of his top dogs from las…",
   'truncated': True},
  'source': '<a href="" rel="nofollow">TweetDeck</a>',
  'text': "RT @sportsadn: #Iditarod musher Mitch Seavey is second musher to leave Rohn, but he's running without one of his top dogs from last year's…",
  'truncated': False},
 'statuses_count': 50552,
 'time_zone': 'Alaska',
 'translator_type': 'none',
 'url': '',
 'utc_offset': -32400,
 'verified': True}

I wonder what we can win in the raffle w/ our hunting license?
You don't need a hunting license to scrape the web and do some wacky analysis!
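
The two fields we need for the next step can be read straight off the user object, for example:

user.id_str       # '15828025'
user.screen_name  # 'adndotcom'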

Let's iterate through the unique Twitter usernames from our media outlets, make the API call, and store the results we're interested in as a dictionary.

In [27]:
from tweepy import TweepError
In [28]:
user_ids = []
for screen_name in tqdm(twitter_names):
    try:
        user = api.get_user(screen_name=screen_name)
        user_id = user.id_str
    except TweepError:
        # account not found, suspended, etc.
        user_id = None
    user_ids.append({
        'Twitter_ID' : user_id,
        'Twitter_Name' : screen_name
    })

Exception in thread Thread-158:
Traceback (most recent call last):
  File "/anaconda3/lib/python3.6/", line 916, in _bootstrap_inner
  File "/anaconda3/lib/python3.6/site-packages/tqdm/", line 144, in run
    for instance in self.tqdm_cls._instances:
  File "/anaconda3/lib/python3.6/", line 60, in __iter__
    for itemref in
RuntimeError: Set changed size during iteration

Rate limit reached. Sleeping for: 605
Rate limit reached. Sleeping for: 664
Rate limit reached. Sleeping for: 578
Rate limit reached. Sleeping for: 650

Let's save the output, and merge it back into the dataframe of USNPL fields we parsed out.

In [34]:
df_users = pd.DataFrame(user_ids)
In [35]:
df_users.to_csv('data/twitter_users.csv', index=False)
In [40]:
df_merge = df.merge(df_users, on='Twitter_Name', how='left')
df_merge = df_merge[['Name', 'Medium', 'Website', 'Facebook',
                     'Twitter_Name', 'Twitter_ID', 'Geography']]
df_merge.to_csv('data/usnpl_newspapers_twitter_ids.csv', index=False)

And that is how we scrape a website and get a nice CSV.

In [39]:
df_merge[df_merge['Name'] == 'Fairbanks Daily News-Miner']
Name Medium Website Facebook Twitter_Name Twitter_ID Geography
12 Fairbanks Daily News-Miner Newspapers newsminer 16555200 AK

We can aggregate this data to get an idea of the media landscape across states.

In [41]:
%matplotlib inline
In [43]:
df_ = df_merge[df_merge['Geography'].isin(['WY', 'AZ', 'MA', 'NY'])]
df_ = pd.crosstab(df_['Geography'], df_['Medium'])
In [44]:
df_.plot(kind='barh', title="Num of State-Level Media Outlets");
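
The same breakdown extends to every state. As a quick sketch (reusing the df_merge frame from above), here is the full count of outlets per state and medium, with the five largest states first:

counts = pd.crosstab(df_merge['Geography'], df_merge['Medium'])
counts.sum(axis=1).sort_values(ascending=False).head()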