runtime_meta()
Updated 2018-02-27 14:50:36.395425 By leonyin Using Python 3.6.1 On Darwin-17.4.0-x86_64-i386-64bit
A scraper for the usnpl site for newspapers, magazines, and college papers.
The scraper collects the following fields:

Geography - The state of the media source.
Name - The name of the media source.
Facebook - The URL of the Facebook account.
Twitter_Name - The screen name of the Twitter account.
Twitter_ID - The numeric user ID of the Twitter account.
Website - The URL of the media source's website.
Medium - The format of the media source (newspaper, magazine, or college newspaper).
The output is publicly available as a csv here.
... and you can access it from the Web using Pandas:
import pandas as pd
link = 'https://raw.githubusercontent.com/yinleon/usnpl/master/data/usnpl_newspapers_twitter_ids.csv'
df = pd.read_csv(link, dtype={'Twitter_ID' : str})
df.head()
| Name | Medium | Website | Facebook | Twitter_Name | Twitter_ID | Geography
---|---|---|---|---|---|---|---
0 | Alaska Dispatch News | Newspapers | http://www.adn.com | https://www.facebook.com/akdispatch | adndotcom | 15828025 | AK
1 | Alaska Journal of Commerce | Newspapers | http://www.alaskajournal.com | https://www.facebook.com/AlaskaJournal | alaskajournal | 341639834 | AK
2 | Anchorage Press | Newspapers | http://www.anchoragepress.com | https://www.facebook.com/anchoragepress | anchoragepress | 17761344 | AK
3 | Petroleum News | Newspapers | http://www.petroleumnews.com | https://www.facebook.com/PetroleumNews | NaN | NaN | AK
4 | Delta Discovery | Newspapers | http://www.deltadiscovery.com | https://www.facebook.com/deltadiscovery | NaN | NaN | AK
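The dtype={'Twitter_ID' : str} argument matters: some outlets have no Twitter account, and the resulting missing values would otherwise force pandas to read the ID column as floats. A small illustration, using a two-row CSV made up for this example:

```python
import io

import pandas as pd

# A made-up two-row CSV with one missing Twitter_ID.
csv = "Name,Twitter_ID\nAlaska Dispatch News,15828025\nPetroleum News,\n"

# Without a dtype, the missing value forces the column to float64,
# so the ID becomes 15828025.0.
df_float = pd.read_csv(io.StringIO(csv))

# Reading the column as str keeps the ID intact (the missing value stays NaN).
df_str = pd.read_csv(io.StringIO(csv), dtype={'Twitter_ID': str})

print(df_float['Twitter_ID'].dtype)   # float64
print(df_str['Twitter_ID'].iloc[0])   # 15828025
```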
The Python modules required to run this notebook are listed in requirements.txt, and can be installed from within the notebook with the command below:
!pip install -r requirements.txt
With the requirements installed, you can use the code below to scrape the website.
import os
import time
import requests
import tweepy
import pandas as pd
from tqdm import tqdm_notebook as tqdm
from bs4 import BeautifulSoup
states = '''ak al ar az ca co ct dc de fl ga hi ia id il in ks ky la ma md me mi mn mo ms mt nc nd ne nh nj nm nv ny oh ok or pa ri sc sd tn tx ut va vt wa wi wv wy '''
states = states.split()
def parse_row(soup):
    '''
    For each media publication in the html,
    we're going to strip the city name, the publication name,
    the website url, and social links (if they exist).

    The input `soup` is a BeautifulSoup object.
    The output is a dict of the parsed fields.
    '''
    city = soup.find('b').text
    name = soup.find('a').text
    web = soup.find('a').get('href')
    fb = soup.find('a', text='F')
    if fb:
        fb = fb.get('href')
    tw = soup.find('a', text='T')
    if tw:
        tw = tw.get('href').replace('http://www.twitter.com/', '').rstrip('/')
    return {
        'Facebook': fb,
        'Twitter_Name': tw,
        'Name': name,
        'Website': web
    }
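To sanity-check the parser, we can feed it a hand-written row in the same shape as the usnpl markup (the HTML below is a made-up example, not real markup from the site, and the function body is repeated so the snippet runs standalone):

```python
from bs4 import BeautifulSoup


def parse_row(soup):
    # Same logic as above, repeated so this snippet is self-contained.
    name = soup.find('a').text
    web = soup.find('a').get('href')
    fb = soup.find('a', text='F')
    if fb:
        fb = fb.get('href')
    tw = soup.find('a', text='T')
    if tw:
        tw = tw.get('href').replace('http://www.twitter.com/', '').rstrip('/')
    return {'Facebook': fb, 'Twitter_Name': tw, 'Name': name, 'Website': web}


# A hand-written row mimicking the site's structure: city in <b>,
# the publication link first, then "F" and "T" social links.
row = ('<b>Anchorage</b> '
       '<a href="http://www.adn.com">Alaska Dispatch News</a> '
       '<a href="https://www.facebook.com/akdispatch">F</a> '
       '<a href="http://www.twitter.com/adndotcom/">T</a>')

entry = parse_row(BeautifulSoup(row, 'html.parser'))
print(entry['Name'], entry['Twitter_Name'])  # Alaska Dispatch News adndotcom
```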
We will apply that function to the sections of each page where the data lives. The output of each section is appended to the list sites, building a list of dictionaries.

Line by line: we loop through each 2-letter state abbreviation, fetch the HTML of that state's page, soupify it (so that we can parse it), parse out the fields we're interested in, and append those results to a list. The if/elif branches on medium differ because the html that holds each medium's listings is structured differently.
sites = []
for state in states:
    url = 'http://www.usnpl.com/{}news.php'.format(state)
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'lxml')
    data_possibilities = soup.find_all('div', {"id": 'data_box'})
    for i, raw_table in enumerate(data_possibilities[1:]):
        j = 1 if i == 0 else 0
        medium = raw_table.find('h3').text
        if medium == 'Newspapers':
            data_table = str(raw_table).split('<br/><br/>\n</div>\n')[j]
            entries_to_parse = data_table.rstrip('</div>').split('\n<br/>\n')
        elif medium in ['Magazines', 'College Newspapers']:
            data_table = str(raw_table).split('<title>Untitled Document</title>')[1]
            entries_to_parse = data_table.rstrip('</div>').split('\n<br/>\n')
        else:
            break
        for row in tqdm(entries_to_parse):
            row = row.strip('\r').strip('\n')
            if row:
                entry = parse_row(BeautifulSoup(row, 'lxml'))
                entry['Geography'] = state.upper()
                entry['Medium'] = medium
                sites.append(entry)
    time.sleep(1)
That list of dictionaries is the perfect format to load into a Pandas dataframe.
df = pd.DataFrame(sites)
df['Website'] = df['Website'].str.rstrip('/')
df.head()
| Facebook | Geography | Medium | Name | Twitter_Name | Website
---|---|---|---|---|---|---
0 | https://www.facebook.com/akdispatch | AK | Newspapers | Alaska Dispatch News | adndotcom | http://www.adn.com
1 | https://www.facebook.com/AlaskaJournal | AK | Newspapers | Alaska Journal of Commerce | alaskajournal | http://www.alaskajournal.com
2 | https://www.facebook.com/anchoragepress | AK | Newspapers | Anchorage Press | anchoragepress | http://www.anchoragepress.com
3 | https://www.facebook.com/PetroleumNews | AK | Newspapers | Petroleum News | None | http://www.petroleumnews.com
4 | https://www.facebook.com/deltadiscovery | AK | Newspapers | Delta Discovery | None | http://www.deltadiscovery.com
# Let's save it
df.to_csv('data/usnpl_newspapers.csv', index=False)
We want to get the Twitter ID in addition to the screen name. This section uses Tweepy to fetch that info from the Twitter API.
# fill these in!
consumer_key = ''
consumer_secret = ''
access_key = ''
access_secret = ''
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_key, access_secret)
api = tweepy.API(auth,
                 wait_on_rate_limit=True,
                 wait_on_rate_limit_notify=True)
# filter the media outlets for unique Twitter Names to query.
twitter_names = df[~df['Twitter_Name'].isnull()]['Twitter_Name'].unique()
This is an example of an API call for user metadata, and what kind of info we get back.
user = api.get_user(screen_name=twitter_names[0])
user._json
{'contributors_enabled': False, 'created_at': 'Tue Aug 12 20:59:03 +0000 2008', 'default_profile': False, 'default_profile_image': False, 'description': "Alaska's largest news site and newspaper. (Formerly Alaska Dispatch News) Newstips@adn.com or call 907-257-4301.", 'entities': {'description': {'urls': []}, 'url': {'urls': [{'display_url': 'adn.com', 'expanded_url': 'http://adn.com', 'indices': [0, 23], 'url': 'https://t.co/brBsUYJrYV'}]}}, 'favourites_count': 215, 'follow_request_sent': False, 'followers_count': 70902, 'following': False, 'friends_count': 6093, 'geo_enabled': True, 'has_extended_profile': False, 'id': 15828025, 'id_str': '15828025', 'is_translation_enabled': False, 'is_translator': False, 'lang': 'en', 'listed_count': 1383, 'location': 'Anchorage, Alaska', 'name': 'Anchorage Daily News', 'notifications': False, 'profile_background_color': 'FFFFFF', 'profile_background_image_url': 'http://pbs.twimg.com/profile_background_images/15686233/1-tweet.jpg', 'profile_background_image_url_https': 'https://pbs.twimg.com/profile_background_images/15686233/1-tweet.jpg', 'profile_background_tile': True, 'profile_banner_url': 'https://pbs.twimg.com/profile_banners/15828025/1458879223', 'profile_image_url': 'http://pbs.twimg.com/profile_images/932121310108442624/NY7s0rqO_normal.jpg', 'profile_image_url_https': 'https://pbs.twimg.com/profile_images/932121310108442624/NY7s0rqO_normal.jpg', 'profile_link_color': '0084B4', 'profile_location': None, 'profile_sidebar_border_color': 'FFFFFF', 'profile_sidebar_fill_color': 'DDFFCC', 'profile_text_color': '333333', 'profile_use_background_image': False, 'protected': False, 'screen_name': 'adndotcom', 'status': {'contributors': None, 'coordinates': None, 'created_at': 'Tue Mar 06 06:20:46 +0000 2018', 'entities': {'hashtags': [{'indices': [15, 24], 'text': 'Iditarod'}], 'symbols': [], 'urls': [], 'user_mentions': [{'id': 919941379, 'id_str': '919941379', 'indices': [3, 13], 'name': 'Sports at ADN', 'screen_name': 
'sportsadn'}]}, 'favorite_count': 0, 'favorited': False, 'geo': None, 'id': 970907016972480512, 'id_str': '970907016972480512', 'in_reply_to_screen_name': None, 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'is_quote_status': False, 'lang': 'en', 'place': None, 'retweet_count': 10, 'retweeted': False, 'retweeted_status': {'contributors': None, 'coordinates': None, 'created_at': 'Tue Mar 06 06:14:24 +0000 2018', 'entities': {'hashtags': [{'indices': [0, 9], 'text': 'Iditarod'}], 'symbols': [], 'urls': [{'display_url': 'twitter.com/i/web/status/9…', 'expanded_url': 'https://twitter.com/i/web/status/970905415071363072', 'indices': [117, 140], 'url': 'https://t.co/qvX7Q3dBns'}], 'user_mentions': []}, 'favorite_count': 18, 'favorited': False, 'geo': None, 'id': 970905415071363072, 'id_str': '970905415071363072', 'in_reply_to_screen_name': None, 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'is_quote_status': False, 'lang': 'en', 'place': None, 'possibly_sensitive': False, 'retweet_count': 10, 'retweeted': False, 'source': '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>', 'text': "#Iditarod musher Mitch Seavey is second musher to leave Rohn, but he's running without one of his top dogs from las… https://t.co/qvX7Q3dBns", 'truncated': True}, 'source': '<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>', 'text': "RT @sportsadn: #Iditarod musher Mitch Seavey is second musher to leave Rohn, but he's running without one of his top dogs from last year's…", 'truncated': False}, 'statuses_count': 50552, 'time_zone': 'Alaska', 'translator_type': 'none', 'url': 'https://t.co/brBsUYJrYV', 'utc_offset': -32400, 'verified': True}
I wonder what we can win in the raffle w/ our hunting license?
You don't need a hunting license to scrape the web and do some wacky analysis!
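That payload is large, but only a couple of its fields end up in our table. A minimal sketch with a stand-in dict (trimmed from the JSON above) in place of user._json:

```python
# Stand-in for user._json, trimmed to a few fields from the output above.
user_json = {'id_str': '15828025',
             'screen_name': 'adndotcom',
             'followers_count': 70902,
             'verified': True}

# The record shape we will collect for each outlet.
record = {'Twitter_ID': user_json['id_str'],
          'Twitter_Name': user_json['screen_name']}
print(record)  # {'Twitter_ID': '15828025', 'Twitter_Name': 'adndotcom'}
```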
Let's iterate through the unique Twitter usernames from our media outlets, make the API call, and store the results we're interested in as a dictionary.
from tweepy import TweepError

user_ids = []
for screen_name in tqdm(twitter_names):
    try:
        user = api.get_user(screen_name=screen_name)
        user_id = user.id_str
    except TweepError:
        user_id = None
    user_ids.append({
        'Twitter_ID': user_id,
        'Twitter_Name': screen_name
    })
Exception in thread Thread-158: Traceback (most recent call last): File "/anaconda3/lib/python3.6/threading.py", line 916, in _bootstrap_inner self.run() File "/anaconda3/lib/python3.6/site-packages/tqdm/_tqdm.py", line 144, in run for instance in self.tqdm_cls._instances: File "/anaconda3/lib/python3.6/_weakrefset.py", line 60, in __iter__ for itemref in self.data: RuntimeError: Set changed size during iteration
Rate limit reached. Sleeping for: 605
Rate limit reached. Sleeping for: 664
Rate limit reached. Sleeping for: 578
Rate limit reached. Sleeping for: 650
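As the output shows, one user per call burns through the rate limit quickly. Twitter's users/lookup endpoint accepts up to 100 screen names per request, which tweepy wraps as api.lookup_users. A hedged sketch of the batching (the chunks helper below is our own, and the API call is commented out because it needs live credentials; note lookup_users silently skips names that don't resolve):

```python
def chunks(seq, size=100):
    # Yield successive slices of at most `size` items.
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

# With credentials configured, the per-user loop above could become:
# for batch in chunks(list(twitter_names)):
#     for user in api.lookup_users(screen_names=batch):
#         user_ids.append({'Twitter_ID': user.id_str,
#                          'Twitter_Name': user.screen_name})

print([len(b) for b in chunks(list(range(250)))])  # [100, 100, 50]
```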
Let's save the output, and merge it back into the dataframe of USNPL fields we parsed out.
df_users = pd.DataFrame(user_ids)
df_users.to_csv('data/twitter_users.csv', index=False)
df_merge = df.merge(df_users, on='Twitter_Name', how='left')[['Name', 'Medium', 'Website', 'Facebook', 'Twitter_Name','Twitter_ID', 'Geography']]
df_merge.to_csv('data/usnpl_newspapers_twitter_ids.csv', index=False)
And that is how we scrape a website and get a nice csv.
df_merge[df_merge['Name'] == 'Fairbanks Daily News-Miner']
| Name | Medium | Website | Facebook | Twitter_Name | Twitter_ID | Geography
---|---|---|---|---|---|---|---
12 | Fairbanks Daily News-Miner | Newspapers | http://www.newsminer.com | https://www.facebook.com/fairbanksDNM | newsminer | 16555200 | AK
We can aggregate this data to get an idea of the media landscape across states.
%matplotlib inline
df_ = df_merge[df_merge['Geography'].isin(['WY', 'AZ', 'MA', 'NY'])]
df_ = pd.crosstab(df_['Geography'], df_['Medium'])
df_.plot(kind='barh', title="Num of State-Level Media Outlets");