import urlexpander
# runtimestamp stamps this notebook run with user and time metadata
from runtimestamp.runtimestamp import runtimestamp
runtimestamp('QuickStart User')
print(f"This notebook is using urlExpander v{urlexpander.__version__}")
Here is a toy example of some URLs taken from Congressional Twitter accounts:
urls = [
'https://trib.al/xXI5ruM',
'http://bit.ly/1Sv81cj',
'https://www.youtube.com/watch?v=8NwKcfXvGl4',
'https://t.co/zNU1eHhQRn',
]
We can use the `expand` function (see the code) to unshorten any link:
urlexpander.expand(urls[0])
It also works on any list of URLs.
urlexpander.expand(urls)
To save compute time, we can skip links that don't need to be expanded.
The `is_short` function takes any URL and checks whether its domain is on a known list of link shorteners:
print(f"{urls[1]} returns:")
urlexpander.is_short(urls[1])
bit.ly is probably the best-known link shortener. YouTube.com, however, is not a link shortener!
print(f"{urls[2]} returns:")
urlexpander.is_short(urls[2])
urlExpander takes advantage of a list of known domains that offer link shortening services.
known_shorteners = urlexpander.constants.all_short_domains.copy()
print(len(known_shorteners))
You can make modifications or use your own list of domains via the `list_of_domains` argument to the `is_short` function, or use `is_short_domain` (which is faster and operates at the domain level).
known_shorteners += ['youtube.com']
print(f"Now {urls[2]} returns:")
urlexpander.is_short(urls[2], list_of_domains=known_shorteners) # pass the modified list; by default is_short uses the built-in list of shorteners
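The domain-level variant isn't demonstrated above, so here is a minimal sketch; it assumes `is_short_domain` accepts a bare domain string (extracted with `get_domain`) along with the same `list_of_domains` argument:
# a minimal sketch (assumed signature): check a bare domain instead of a full URL
domain = urlexpander.get_domain(urls[1])  # 'bit.ly'
urlexpander.is_short_domain(domain, list_of_domains=known_shorteners)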
Now we can shorten our workload:
# keep only the links that need to be expanded
urls_to_expand = [link for link in urls if urlexpander.is_short(link)]
urls_to_expand
urlExpander's `expand` function (which multithreads requests under the hood) does the heavy lifting to quickly and thoroughly expand a list of links:
expanded_urls = urlexpander.expand(urls_to_expand)
expanded_urls
Note that URLs that resolve to defunct pages still return the domain name, followed by the type of error surrounded by double underscores, e.g. `http://www.billshusterforcongress.com/__CONNECTIONPOOL_ERROR__`.
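If you'd rather treat those failures as missing values, one option (a sketch, not a built-in part of the library) is to screen expansions for the double-underscore error token:
# a sketch: drop expansions that ended in an error token such as __CONNECTIONPOOL_ERROR__
def is_expansion_error(expanded_url):
    '''Returns True if the expanded URL carries a double-underscore error token.'''
    return '_ERROR__' in expanded_url

[url for url in expanded_urls if not is_expansion_error(url)]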
Instead of filtering the inputs before running the `expand` function, you can assign a filter using the `filter_function` argument.
Filter functions can be any boolean function that operates on a string. Below is an example function that filters for t.co links:
def custom_filter(url):
    '''Returns True if the URL is a shortened Twitter (t.co) link.'''
    return urlexpander.get_domain(url) == 't.co'
resolved_links = urlexpander.expand(urls,
filter_function=custom_filter,
verbose=1)
resolved_links
Although filtering within the `expand` function is convenient, it does affect running time.
resolved_links = urlexpander.expand(urls,
filter_function=urlexpander.is_short,
verbose=1)
resolved_links
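To quantify that difference on the toy list, here is a quick timing sketch (absolute numbers will depend on your connection):
import time

# time expansion with pre-filtered input vs. filtering inside expand()
start = time.perf_counter()
urlexpander.expand([link for link in urls if urlexpander.is_short(link)])
print(f"pre-filtered: {time.perf_counter() - start:.2f} seconds")

start = time.perf_counter()
urlexpander.expand(urls, filter_function=urlexpander.is_short)
print(f"filter_function: {time.perf_counter() - start:.2f} seconds")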
But that was a toy example; let's see how this fares with a larger dataset.
This package comes with a sampled dataset of links extracted from Twitter accounts from the 115th Congress.
If you work with Twitter data, you'll be glad to know there is a function, `urlexpander.tweet_utils.get_link`, for creating a similar dataset from Tweets.
df_congress = urlexpander.datasets.load_congress_twitter_links(nrows=10000)
print(f'The dataset has {len(df_congress)} rows')
df_congress.tail(2)
# share of tweets containing at least one link from a known shortener
shortened_urls = df_congress[df_congress.link_domain.apply(urlexpander.is_short)].tweet_id.nunique()
all_urls = df_congress.tweet_id.nunique()
shortened_urls / all_urls
About 28% of the links are short!
The performance of the next step depends on your internet connection, so here's a quick speed test:
!curl -s https://raw.githubusercontent.com/sivel/speedtest-cli/master/speedtest.py | python -
Let's see how long it takes to expand these 10k links.
This is where the optional parameters for `expand` shine. We can create multiple threads for requests (using `n_workers`), cache results into a JSON file (`cache_file`), and chunk the input into smaller pieces (using `chunksize`). Why does this last part matter? Something I noticed when expanding links en masse is that performance degrades over time. Chunking the input prevents this from happening (I'm not sure why, though)!
resolved_links = urlexpander.expand(df_congress['link_url_long'],
chunksize=1280,
n_workers=64,
cache_file='temp.json',
verbose=1,
filter_function=urlexpander.is_short)
At SMaPP, the process of link expansion has been a burden on our research.
We hope that this software helps you overcome similar obstacles!
df_congress['expanded_url'] = resolved_links
df_congress['resolved_domain'] = df_congress['expanded_url'].apply(urlexpander.get_domain)
df_congress.tail(2)
Here are the top 25 shared domains from this sampled Congress dataset:
df_congress.resolved_domain.value_counts().head(25)
You can count the number of `resolved_domain`s for each `user_id` using `count_matrix()`. You can even choose which domains are counted by modifying the `domain_list` argument:
count_matrix = urlexpander.tweet_utils.count_matrix(df_congress,
user_col='user_id',
domain_col='resolved_domain',
unique_count_col='tweet_id',
domain_list=['youtube.com','facebook.com', 'google.com', 'twitter.com'])
count_matrix.tail(3)
One domain list you might be interested in is US national media outlets, `datasets.load_us_national_media_outlets()`, compiled by Gregory Eady (Forthcoming).
urlexpander.datasets.load_us_national_media_outlets()[:5]
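For example, you could plug this list into the `count_matrix()` call from above (a sketch reusing the same column arguments):
media_matrix = urlexpander.tweet_utils.count_matrix(df_congress,
                                                    user_col='user_id',
                                                    domain_col='resolved_domain',
                                                    unique_count_col='tweet_id',
                                                    domain_list=urlexpander.datasets.load_us_national_media_outlets())
media_matrix.tail(3)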
# Bonus: urlExpander also ships html_utils helpers for fetching webpage metadata
urlexpander.html_utils.get_webpage_title(urls[0])
urlexpander.html_utils.get_webpage_description(urls[0])
urlexpander.html_utils.get_webpage_meta(urls[0])
Thanks for stumbling upon this package; we hope that it will lead to more research around links.
We're working on some projects in this vein and would love to know if you are too!
As an open-source package, please feel free to reach out about bugs, feature requests, or collaboration!