#!/usr/bin/env python
# coding: utf-8
# # urlExpander Quickstart
# View this notebook on [NBViewer](http://nbviewer.jupyter.org/github/SMAPPNYU/urlExpander/blob/master/examples/quickstart.ipynb?flush_cache=true) or [Github](https://github.com/SMAPPNYU/urlExpander/blob/master/examples/quickstart.ipynb), or run it interactively on
# [Binder](https://mybinder.org/v2/gh/SMAPPNYU/urlExpander/master?filepath=examples%2Fquickstart.ipynb)
# By [Leon Yin](https://leonyin.org) for [SMaPP NYU](https://wp.nyu.edu/smapp/)
#
#
# [urlExpander](https://github.com/SMAPPNYU/urlExpander) is a Python package for quickly and thoroughly expanding URLs.
#
# You can install the software using pip:
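# For example, from the command line (assuming the package is published on PyPI under the name `urlexpander`):

```shell
pip install urlexpander
```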
# In[1]:
import urlexpander
from runtimestamp.runtimestamp import runtimestamp
runtimestamp('QuickStart User')
print(f"This notebook is using urlExpander v{urlexpander.__version__}")
# Here is a toy example of some URLs taken from Congressional Twitter accounts:
# In[2]:
urls = [
'https://trib.al/xXI5ruM',
'http://bit.ly/1Sv81cj',
'https://www.youtube.com/watch?v=8NwKcfXvGl4',
'https://t.co/zNU1eHhQRn',
]
# We can use the `expand` function to unshorten any link:
# In[3]:
urlexpander.expand(urls[0])
# It also works on any list of URLs.
# In[4]:
urlexpander.expand(urls)
# To save compute time, we can skip links that don't need to be expanded.
# The `is_short` function takes any URL and checks whether its domain is on a known list of link shorteners.
# In[5]:
print(f"{urls[1]} returns:")
urlexpander.is_short(urls[1])
# bit.ly is probably the best-known link shortener; youtube.com, however, is not a link shortener!
# In[6]:
print(f"{urls[2]} returns:")
urlexpander.is_short(urls[2])
# urlExpander relies on a list of domains known to offer link-shortening services.
# In[7]:
known_shorteners = urlexpander.constants.all_short_domains.copy()
print(len(known_shorteners))
# You can modify this list, or pass your own `list_of_domains` as an argument to the `is_short` function or to `is_short_domain` (which is faster and operates at the domain level).
# In[8]:
known_shorteners += ['youtube.com']
# In[9]:
print(f"Now {urls[2]} returns:")
urlexpander.is_short(urls[2], list_of_domains=known_shorteners) # list_of_domains defaults to urlExpander's built-in list
# Now we can shorten our workload:
# In[10]:
# keep only the links that need to be expanded
urls_to_shorten = [link for link in urls if urlexpander.is_short(link)]
urls_to_shorten
# urlExpander's `expand` function does the heavy lifting, multithreading requests to quickly and thoroughly expand a list of links:
# In[11]:
expanded_urls = urlexpander.expand(urls_to_shorten)
expanded_urls
# Note that URLs resolving to defunct pages still return the domain name, followed by the type of error wrapped in double underscores, e.g. `http://www.billshusterforcongress.com/__CONNECTIONPOOL_ERROR__`.
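# Since the error type is embedded in the returned string, you can flag these failures with a simple pattern check. A minimal sketch (the `__...__` marker format is inferred from the example above; `has_expansion_error` is a hypothetical helper, not part of urlExpander):

```python
import re

# Matches a trailing __ERROR_TYPE__ marker, as in
# 'http://www.billshusterforcongress.com/__CONNECTIONPOOL_ERROR__'
ERROR_PATTERN = re.compile(r'__[A-Z_]+__$')

def has_expansion_error(expanded_url):
    """Return True if the expanded URL ends with an error marker."""
    return bool(ERROR_PATTERN.search(expanded_url))

print(has_expansion_error('http://www.billshusterforcongress.com/__CONNECTIONPOOL_ERROR__'))  # True
print(has_expansion_error('https://www.youtube.com/watch?v=8NwKcfXvGl4'))  # False
```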
#
# Instead of filtering the inputs before calling `expand`, you can pass a filter via the `filter_function` argument.
# A filter function can be any boolean function that operates on a string. Below is an example that filters for t.co links:
# In[12]:
def custom_filter(url):
    '''Returns True if the URL is a shortened Twitter (t.co) URL.'''
    return urlexpander.get_domain(url) == 't.co'
# In[13]:
resolved_links = urlexpander.expand(urls,
                                    filter_function=custom_filter,
                                    verbose=1)
resolved_links
# Although filtering within the `expand` function is convenient, you may see a difference in run time compared to pre-filtering the list yourself.
# In[15]:
resolved_links = urlexpander.expand(urls,
                                    filter_function=urlexpander.is_short,
                                    verbose=1)
resolved_links