urlExpander Quickstart

View this notebook on NBViewer or GitHub | Run it interactively on Binder
By Leon Yin for SMaPP NYU

urlExpander is a Python package for quickly and thoroughly expanding URLs.

You can install the package using pip:
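
pip install urlexpander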

In [1]:
import urlexpander
from runtimestamp.runtimestamp import runtimestamp
runtimestamp('QuickStart User')
print(f"This notebook is using urlExpander v{urlexpander.__version__}")
Updated 2018-10-02 13:58:26.922288
By QuickStart User
Using Python 3.6.1
On Darwin-17.7.0-x86_64-i386-64bit
This notebook is using urlExpander v0.0.33

Here is a toy example of some URLs taken from Congressional Twitter accounts:

In [3]:
urls = [
    'https://trib.al/xXI5ruM',
    'http://bit.ly/1Sv81cj',
    'https://www.youtube.com/watch?v=8NwKcfXvGl4',
    'https://t.co/zNU1eHhQRn',
]

We can use the expand function (see the code) to unshorten any link:

In [4]:
urlexpander.expand(urls[0])
Out[4]:
'https://www.breitbart.com/video/2017/12/31/lindsey-graham-trump-just-cant-tweet-iran/'

It also works on any list of URLs.

In [5]:
urlexpander.expand(urls)
Out[5]:
['https://www.breitbart.com/video/2017/12/31/lindsey-graham-trump-just-cant-tweet-iran/',
 'http://www.billshusterforcongress.com/congressman-shuster-endorses-donald-trump/',
 'https://www.youtube.com/watch?v=8NwKcfXvGl4',
 'http://www.nfib.com/content/press-release/elections/small-business-endorses-shuster-for-reelection-73730/?utm_campaign=Advocacy&utm_source=Twitter&utm_medium=Social']

To save compute time, we can skip links that don't need to be expanded.
The is_short function takes any URL and checks whether the domain is on a known list of link shorteners.

In [6]:
print(f"{urls[1]} returns:")
urlexpander.is_short(urls[1])
http://bit.ly/1Sv81cj returns:
Out[6]:
True

bit.ly is probably the best-known link shortener; youtube.com, however, is not a link shortener!

In [7]:
print(f"{urls[2]} returns:")
urlexpander.is_short(urls[2])
https://www.youtube.com/watch?v=8NwKcfXvGl4 returns:
Out[7]:
False

urlExpander takes advantage of a list of known domains that offer link shortening services.

In [8]:
known_shorteners = urlexpander.constants.all_short_domains.copy()
print(len(known_shorteners))
85

You can make modifications or use your own list_of_domains as an argument for the is_short function or is_short_domain (which is faster and operates at the domain level).

In [9]:
known_shorteners += ['youtube.com']
In [10]:
print(f"Now {urls[2]} returns:")
urlexpander.is_short(urls[2], list_of_domains=known_shorteners) # list_of_domains defaults to the built-in shortener list
Now https://www.youtube.com/watch?v=8NwKcfXvGl4 returns:
Out[10]:
True
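
The is_short_domain function mentioned above skips URL parsing and operates on the domain directly. A minimal sketch, assuming it accepts a bare domain plus the same list_of_domains keyword as is_short (check the API reference to confirm):

urlexpander.is_short_domain('bit.ly', list_of_domains=known_shorteners)
# expected to return True, since bit.ly is on the list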

Now we can shorten our workload:

In [11]:
# keep only the links that actually need to be expanded
urls_to_expand = [link for link in urls if urlexpander.is_short(link)]
urls_to_expand
Out[11]:
['https://trib.al/xXI5ruM', 'http://bit.ly/1Sv81cj', 'https://t.co/zNU1eHhQRn']

urlExpander's expand() function does the heavy lifting to quickly and thoroughly expand a list of links:

In [12]:
expanded_urls = urlexpander.expand(urls_to_expand)
expanded_urls
Out[12]:
['https://www.breitbart.com/video/2017/12/31/lindsey-graham-trump-just-cant-tweet-iran/',
 'http://www.billshusterforcongress.com/congressman-shuster-endorses-donald-trump/',
 'http://www.nfib.com/content/press-release/elections/small-business-endorses-shuster-for-reelection-73730/?utm_campaign=Advocacy&utm_source=Twitter&utm_medium=Social']

Instead of filtering the inputs ahead of time, you can pass a filter to the expand function.
expand will apply the filter_function, which can be any boolean function, to each element of the input.

In [13]:
def custom_filter(url):
    '''Returns True if the url is a shortened Twitter (t.co) URL.'''
    return urlexpander.get_domain(url) == 't.co'
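
The helper used here, get_domain, strips a URL down to its registered domain. A quick sanity check (this should hold for the links in this notebook):

urlexpander.get_domain(urls[3])
# expected to return 't.co'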
In [14]:
resolved_links = urlexpander.expand(urls, 
                                    filter_function=custom_filter, 
                                    verbose=1)
resolved_links
1it [00:00,  3.39it/s]
Out[14]:
['https://trib.al/xXI5ruM',
 'http://bit.ly/1Sv81cj',
 'https://www.youtube.com/watch?v=8NwKcfXvGl4',
 'http://www.nfib.com/content/press-release/elections/small-business-endorses-shuster-for-reelection-73730/?utm_campaign=Advocacy&utm_source=Twitter&utm_medium=Social']

Although filtering within the expand function is convenient, the filter you choose will affect the runtime:

In [15]:
resolved_links = urlexpander.expand(urls,  
                                    filter_function=urlexpander.is_short,
                                    verbose=1)
resolved_links
1it [00:01,  1.84s/it]
Out[15]:
['https://www.breitbart.com/video/2017/12/31/lindsey-graham-trump-just-cant-tweet-iran/',
 'http://www.billshusterforcongress.com/congressman-shuster-endorses-donald-trump/',
 'https://www.youtube.com/watch?v=8NwKcfXvGl4',
 'http://www.nfib.com/content/press-release/elections/small-business-endorses-shuster-for-reelection-73730/?utm_campaign=Advocacy&utm_source=Twitter&utm_medium=Social']

But that was a toy example; let's see how this fares with a larger dataset.
This package comes with a sampled dataset of links extracted from Twitter accounts of the 115th Congress.
If you work with Twitter data, you'll be glad to know there is a function, urlexpander.tweet_utils.get_link, for creating a similar dataset from Tweets.
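
A rough sketch of that workflow, under the assumption that each Tweet is a dict in the raw Twitter API format and that get_link yields one record per embedded URL (check the function's docstring for the exact fields):

import pandas as pd

rows = []
for tweet in my_tweets:  # my_tweets is a hypothetical list of Tweet dicts
    rows.extend(urlexpander.tweet_utils.get_link(tweet))
df_links = pd.DataFrame(rows)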

In [16]:
df_congress = urlexpander.datasets.load_congress_twitter_links(nrows=10000)

print(f'The dataset has {len(df_congress)} rows')
df_congress.tail(2)
The dataset has 10000 rows
Out[16]:
link_domain link_url_long link_url_short tweet_created_at tweet_id tweet_text user_id
9998 facebook.com https://www.facebook.com/theDanRather/posts/10... https://t.co/VOiuOXFi1P Tue Jun 20 21:36:04 +0000 2017 877278904846888965 RT @DanRather: Nothing I have ever seen approa... 15808765
9999 bit.ly http://bit.ly/1YWRIXg https://t.co/Hz8RojBqOy Tue Dec 08 19:34:38 +0000 2015 674311141527560197 We need to get people off the sidelines & ... 733751245
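
Before expanding anything, we can check how many of these links still point at a known shortener. A quick sketch using the same is_short function from earlier (assuming link_url_long has no missing values):

df_congress['link_url_long'].apply(urlexpander.is_short).mean()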

About 30% of the links are short!
The performance of the next script is dependent on your internet connection:

In [17]:
!curl -s https://raw.githubusercontent.com/sivel/speedtest-cli/master/speedtest.py | python -
Retrieving speedtest.net configuration...
Testing from New York University (128.122.215.16)...
Retrieving speedtest.net server list...
Selecting best server based on ping...
Hosted by Speedtest.net (New York City, NY) [2.57 km]: 4.263 ms
Testing download speed................................................................................
Download: 422.94 Mbit/s
Testing upload speed......................................................................................................
Upload: 320.82 Mbit/s

Let's see how long it takes to expand these 10,000 links.

This is where the optional parameters for expand shine. We can create multiple threads for requests (using n_workers), cache results into a JSON file (cache_file), and chunk the input into smaller pieces (using chunksize). Why does this last part matter? Something I noticed when expanding links en masse is that performance degrades over time. Chunking the input prevents this from happening (I'm not sure why, though)!

In [17]:
short_list = urlexpander.constants.all_short_domains + urlexpander.constants.short_domain_media
def custom_filter(url):
    return urlexpander.is_short(url, list_of_domains=short_list)
In [18]:
resolved_links = urlexpander.expand(df_congress['link_url_long'], 
                                    chunksize=1280,
                                    n_workers=64, 
                                    cache_file='temp.json', 
                                    verbose=1,
                                    filter_function=custom_filter)
0it [00:00, ?it/s]
http://nyti.ms/1AfrSPE failed to resolve due to error: <class 'requests.exceptions.ConnectionError'>
http://fxn.ws/1LwmSyn failed to resolve due to error: <class 'requests.exceptions.TooManyRedirects'>
1it [00:09,  9.14s/it]

At SMaPP, the process of link expansion has been a burden on our research.
We hope that this software helps you overcome similar obstacles!

In [19]:
df_congress['expanded_url'] = resolved_links
df_congress['resolved_domain'] = df_congress['expanded_url'].apply(urlexpander.get_domain)
df_congress.tail(2)
Out[19]:
link_domain link_url_long link_url_short tweet_created_at tweet_id tweet_text user_id expanded_url resolved_domain
9998 facebook.com https://www.facebook.com/theDanRather/posts/10... https://t.co/VOiuOXFi1P Tue Jun 20 21:36:04 +0000 2017 877278904846888965 RT @DanRather: Nothing I have ever seen approa... 15808765 https://www.facebook.com/theDanRather/posts/10... facebook.com
9999 bit.ly http://bit.ly/1YWRIXg https://t.co/Hz8RojBqOy Tue Dec 08 19:34:38 +0000 2015 674311141527560197 We need to get people off the sidelines &amp; ... 733751245 http://speakerryan.com/__CLIENT_ERROR__ speakerryan.com

Here are the top 25 shared domains from this sampled Congress dataset:

In [20]:
df_congress.resolved_domain.value_counts().head(25)
Out[20]:
twitter.com                 1492
youtube.com                  574
facebook.com                 519
instagram.com                176
washingtonpost.com           157
nytimes.com                  152
thehill.com                  132
politico.com                  83
amp.twimg.com                 56
wsj.com                       53
foxnews.com                   50
cnn.com                       47
washingtonexaminer.com        46
ow.ly                         46
medium.com                    43
huffingtonpost.com            43
usatoday.com                  42
energycommerce.house.gov      36
c-span.org                    33
gop.gov                       32
pscp.tv                       31
healthcare.gov                31
speaker.gov                   30
rollcall.com                  26
mn.gov                        23
Name: resolved_domain, dtype: int64

Bonus Round!

You can count the number of resolved_domains for each user_id using count_matrix().
You can even choose which domains are counted by modifying the domain_list arg:

In [25]:
count_matrix = urlexpander.tweet_utils.count_matrix(df_congress,
                                                    user_col='user_id', 
                                                    domain_col='resolved_domain', 
                                                    unique_count_col='tweet_id',
                                                    domain_list=['youtube.com','facebook.com', 'google.com', 'twitter.com'])

count_matrix.tail(3)
Out[25]:
facebook.com youtube.com twitter.com google.com
user_id
941000686275387392 1 0 2 0
941080085121175552 0 0 0 0
948946378939609089 0 1 0 0

One domain list you might be interested in is US national media outlets: datasets.load_us_national_media_outlets(), compiled by Gregory Eady (Forthcoming).

In [26]:
urlexpander.datasets.load_us_national_media_outlets()[:5]
Out[26]:
array(['abcnews.go.com', 'aim.org', 'alternet.org',
       'theamericanconservative.com', 'prospect.org'], dtype=object)
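
For example, you could restrict the Bonus Round count matrix to just these outlets. A sketch reusing the same count_matrix arguments as above (converting the array to a plain list for domain_list):

media_outlets = list(urlexpander.datasets.load_us_national_media_outlets())
media_matrix = urlexpander.tweet_utils.count_matrix(df_congress,
                                                    user_col='user_id',
                                                    domain_col='resolved_domain',
                                                    unique_count_col='tweet_id',
                                                    domain_list=media_outlets)
media_matrix.head(3)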


We also built a one-size-fits-all scraper that returns the title, description, and/or paragraphs from any given URL.

In [27]:
urlexpander.html_utils.get_webpage_title(urls[0])
Out[27]:
"Lindsey Graham to Trump: 'You Just Can't Tweet' About Iran | Breitbart"
In [28]:
urlexpander.html_utils.get_webpage_description(urls[0])
Out[28]:
'Sunday CBS\'s "Face the Nation," while discussing the last several\xa0days of protests in Iran over\xa0government corruption, Sen. Lindsey Graham (R-SC) warned | Breitbart TV'
In [29]:
urlexpander.html_utils.get_webpage_meta(urls[0])
Out[29]:
OrderedDict([('url', 'https://trib.al/xXI5ruM'),
             ('title',
              "Lindsey Graham to Trump: 'You Just Can't Tweet' About Iran | Breitbart"),
             ('description',
              'Sunday CBS\'s "Face the Nation," while discussing the last several\xa0days of protests in Iran over\xa0government corruption, Sen. Lindsey Graham (R-SC) warned | Breitbart TV'),
             ('paragraphs',
              ['Sunday CBS’s “Face the Nation,” while discussing the last several\xa0days of protests in Iran over\xa0government corruption, Sen. Lindsey Graham (R-SC) warned President Donald Trump that he couldn’t “just tweet” about the protests.',
               'Graham said, “The Iranian people are not our enemy. The Ayatollah is the enemy of the world. Here is what I would do if I were President Trump. I would explain what a better deal would look like. It’s not enough to watch. President Trump is tweeting very sympathetically to the Iranian people. But you just can’t tweet here. You have to lay out a plan.”',
               '<em><span>Follow Pam Key on Twitter <a href="https://twitter.com/pamkeyNEN">@pamkeyNEN</a> </span></em>',
               '<a href="https://www.facebook.com/Breitbart"></a>.',
               '<small>Copyright © 2018 Breitbart</small>'])])

Conclusion

Thanks for stumbling upon this package; we hope that it leads to more research around links.
We're working on some projects in this vein and would love to know if you are too!

urlExpander is an open-source package, so please feel free to reach out about bugs, feature requests, or collaboration!