Where would you start if you wanted to understand a website's content, especially that of a large publisher? I'm usually interested in the following questions:
In their most basic form, sitemaps are required to have only the "loc" tag (under the parent "url" tag). Essentially, a sitemap is allowed to simply be a list of URLs. Other optional tags are allowed, most importantly "lastmod", as well as "changefreq", "priority", and in some cases "alternate".
If you have "lastmod" in the sitemap (and most reputable sites do), then you can get all the information related to publishing activity and trends. Then the richness of URLs determines how much information you can extract (if the URLs are structured with no real information like www.example.com/product/12345 then you won't be able to get much from the sitemap).
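To make the structure concrete, here is a minimal, hypothetical sitemap (the URLs are made up) parsed with Python's standard library, extracting "loc" and the optional "lastmod":

```python
import xml.etree.ElementTree as ET

# A minimal, hypothetical sitemap: "loc" is required, "lastmod" is optional
sitemap_xml = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/some-article</loc>
    <lastmod>2020-03-31</lastmod>
  </url>
  <url>
    <loc>https://www.example.com/another-article</loc>
  </url>
</urlset>"""

ns = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
root = ET.fromstring(sitemap_xml)
for url in root.findall('sm:url', ns):
    loc = url.find('sm:loc', ns).text
    lastmod = url.find('sm:lastmod', ns)  # may be absent
    print(loc, lastmod.text if lastmod is not None else None)
```

This is essentially the extraction that a sitemap-to-table tool automates for you, at scale and across sitemap indexes.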
The goal of this tutorial is to make sitemaps less boring!
I'll be analyzing the sitemaps of BuzzFeed, and since they have "lastmod" as well as consistent and rich URLs, we will be able to answer all of the questions raised above.
I'll be using Python for the analysis and an interactive version of the article is available here. I encourage you to check it out if you want to follow along. This way you can make changes and explore other things that you might be curious about. The data visualizations are also interactive, so you will be able to zoom, hover, and explore a little better.
If you don't know any programming, you can safely ignore all the code snippets (which I will be explaining anyway).
To get the sitemaps in a table format, I'll use the sitemap_to_df function from the advertools package. "df" is short for DataFrame, which is basically a data table. You simply pass the URL of a sitemap (or a sitemap index URL) to the function, and it returns the sitemap(s) in tabular format. If you give it a sitemap index, it will go through all the sub-sitemaps and extract the URLs and whatever other data is available.
In addition to advertools, I'll be using pandas for data manipulation, as well as plotly for data visualization.
import advertools as adv
import plotly.graph_objects as go
import pandas as pd
# buzzfeed_generic = adv.sitemap_to_df('https://www.buzzfeed.com/sitemap/buzzfeed.xml')
# buzzfeed_tasty = adv.sitemap_to_df('https://www.buzzfeed.com/sitemap/tasty.xml')
# buzzfeed_video = adv.sitemap_to_df('https://www.buzzfeed.com/sitemap/video.xml')
# buzzfeed_shopping = adv.sitemap_to_df('https://www.buzzfeed.com/sitemap/shopping.xml')
# buzzfeed = pd.concat([buzzfeed_generic, buzzfeed_tasty, buzzfeed_video, buzzfeed_shopping],
#                      ignore_index=True)
Since I have saved the results to CSV files, you don't have to re-import them; we'll open the files directly and put them in one big DataFrame. We will set the "lastmod" column as the index and convert its type to "datetime" so we can access special date and time functionality.
import os
buzzfeed = pd.concat((pd.read_csv('data/' + file) for file in os.listdir('data/')),
                     ignore_index=True)
buzzfeed['lastmod'] = pd.to_datetime(buzzfeed['lastmod'])
buzzfeed = buzzfeed.set_index('lastmod')
buzzfeed = buzzfeed.drop(columns=['video'])
buzzfeed
loc | sitemap | |
---|---|---|
lastmod | ||
NaT | https://www.buzzfeed.com/watch/video/1961 | https://www.buzzfeed.com/sitemap/video/2016_28... |
NaT | https://www.buzzfeed.com/watch/video/1503 | https://www.buzzfeed.com/sitemap/video/2016_28... |
NaT | https://www.buzzfeed.com/watch/video/1741 | https://www.buzzfeed.com/sitemap/video/2016_28... |
NaT | https://www.buzzfeed.com/watch/video/108 | https://www.buzzfeed.com/sitemap/video/2016_28... |
NaT | https://www.buzzfeed.com/watch/video/1975 | https://www.buzzfeed.com/sitemap/video/2016_28... |
... | ... | ... |
2020-03-31 00:00:00+00:00 | https://www.buzzfeed.com/jp/sonomishimada/chin... | https://www.buzzfeed.com/sitemap/tasty/2020_13... |
2020-03-27 00:00:00+00:00 | https://www.buzzfeed.com/br/agathadahora/recei... | https://www.buzzfeed.com/sitemap/tasty/2020_13... |
2020-03-28 00:00:00+00:00 | https://www.buzzfeed.com/jp/redkikuchi/potato-... | https://www.buzzfeed.com/sitemap/tasty/2020_13... |
2020-03-30 00:00:00+00:00 | https://www.buzzfeed.com/br/agathadahora/recei... | https://www.buzzfeed.com/sitemap/tasty/2020_13... |
2020-03-31 00:00:00+00:00 | https://www.buzzfeed.com/jp/yuittakahashi/baby... | https://www.buzzfeed.com/sitemap/tasty/2020_13... |
512320 rows × 2 columns
The above is what the DataFrame looks like. "lastmod" is the index, and we have two columns: "loc", which contains the URLs, and "sitemap", which is the URL of the sitemap from which each URL was retrieved. NaT stands for "not-a-time", which is the missing value representation for datetime objects.
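As a quick, standalone illustration of how NaT arises (the values below are made up, not from the BuzzFeed data):

```python
import pandas as pd

# Unparseable or missing values become NaT when converting to datetime
dates = pd.to_datetime(pd.Series(['2020-03-31', None, 'not a date']), errors='coerce')
print(dates)
print('Missing:', dates.isna().sum())
```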
As you can see, we have around half a million URLs to go through.
If you look at the URLs of the sitemaps, you will see that they contain the website category, for example:
https://www.buzzfeed.com/sitemap/buzzfeed/2019_5.xml
https://www.buzzfeed.com/sitemap/shopping/2018_13.xml
This can be helpful in understanding which category the URL falls under.
To extract the category from those URLs, the following line splits the XML URLs by the forward slash character, and takes the fifth element (index 4) of the resulting list. The extracted text is assigned to a new column called sitemap_cat.
buzzfeed['sitemap_cat'] = buzzfeed['sitemap'].str.split('/').str[4]
buzzfeed.sample(5)
loc | sitemap | sitemap_cat | |
---|---|---|---|
lastmod | |||
2012-05-25 00:00:00+00:00 | https://www.buzzfeed.com/expresident/otter-swi... | https://www.buzzfeed.com/sitemap/buzzfeed/2012... | buzzfeed |
2016-04-12 00:00:00+00:00 | https://www.buzzfeed.com/fabordrabfeed/helen-m... | https://www.buzzfeed.com/sitemap/buzzfeed/2016... | buzzfeed |
2019-01-12 00:00:00+00:00 | https://www.buzzfeed.com/nataliebrown/gardenin... | https://www.buzzfeed.com/sitemap/buzzfeed/2018... | buzzfeed |
2012-02-20 00:00:00+00:00 | https://www.buzzfeed.com/flavorwire/watch-roll... | https://www.buzzfeed.com/sitemap/buzzfeed/2012... | buzzfeed |
2017-02-15 00:00:00+00:00 | https://www.buzzfeed.com/andreborges/delicious... | https://www.buzzfeed.com/sitemap/buzzfeed/2017... | buzzfeed |
Now that we have a column showing the categories, we can count how many URLs they have and get an overview of the relative volume of content under each. The following code simply counts the values in that column and formats the resulting DataFrame.
(buzzfeed['sitemap_cat']
.value_counts()
.to_frame()
.assign(percentage=lambda df: df['sitemap_cat'].div(df['sitemap_cat'].sum()))
.style.format(dict(sitemap_cat='{:,}', percentage='{:.1%}')))
sitemap_cat | percentage | |
---|---|---|
buzzfeed | 478,430 | 93.4% |
shopping | 13,774 | 2.7% |
video | 10,657 | 2.1% |
tasty | 5,337 | 1.0% |
asis | 4,122 | 0.8% |
It's clear that "buzzfeed" is the major category, which is basically the main site; the others are very small in comparison.
Before proceeding further, it's important to get a better understanding of the NaT values that we saw at the beginning. Let's see which category they fall under.
buzzfeed[buzzfeed.index.isna()]['sitemap_cat'].head()
lastmod
NaT    video
NaT    video
NaT    video
NaT    video
NaT    video
Name: sitemap_cat, dtype: object
The first five fall under "video", but is that true for all the missing values?
The following line takes a subset of the DataFrame buzzfeed
(the subset where the index contains missing values), then takes the sitemap_cat
column, and counts the number of unique values. Since we saw that some values are "video", if the number of unique values is one, then all categories of missing dates fall under "video".
buzzfeed[buzzfeed.index.isna()]['sitemap_cat'].nunique()
1
We have now uncovered a limitation in our dataset, which we know affects 2.1% of the URLs. We will not be able to analyze date-related issues with the video URLs. Nor will we be able to get any information about the content of those URLs for that matter:
buzzfeed[buzzfeed['sitemap_cat']=='video']['loc'].sample(10)
lastmod
NaT    https://www.buzzfeed.com/watch/video/38468
NaT    https://www.buzzfeed.com/watch/video/8753
NaT    https://www.buzzfeed.com/watch/video/19053
NaT    https://www.buzzfeed.com/watch/video/18874
NaT    https://www.buzzfeed.com/watch/video/52612
NaT    https://www.buzzfeed.com/watch/video/96088
NaT    https://www.buzzfeed.com/watch/video/21987
NaT    https://www.buzzfeed.com/watch/video/76722
NaT    https://www.buzzfeed.com/watch/video/15605
NaT    https://www.buzzfeed.com/watch/video/7388
Name: loc, dtype: object
Let's check how many articles they publish per year, and whether some years were noticeably higher or lower than others.
The following code resamples the DataFrame by "A" (for annual), and counts the rows.
articles_per_year = buzzfeed.resample('A')['loc'].count()
articles_per_year.to_frame()
loc | |
---|---|
lastmod | |
2008-12-31 00:00:00+00:00 | 2646 |
2009-12-31 00:00:00+00:00 | 3514 |
2010-12-31 00:00:00+00:00 | 11994 |
2011-12-31 00:00:00+00:00 | 46974 |
2012-12-31 00:00:00+00:00 | 62006 |
2013-12-31 00:00:00+00:00 | 61941 |
2014-12-31 00:00:00+00:00 | 62563 |
2015-12-31 00:00:00+00:00 | 56018 |
2016-12-31 00:00:00+00:00 | 49835 |
2017-12-31 00:00:00+00:00 | 38084 |
2018-12-31 00:00:00+00:00 | 40318 |
2019-12-31 00:00:00+00:00 | 54470 |
2020-12-31 00:00:00+00:00 | 11300 |
from IPython.display import HTML
fig = go.Figure()
fig.add_bar(x=articles_per_year.index, y=articles_per_year.values)
fig.layout.title = 'BuzzFeed Articles per Year (excluding video)'
fig.layout.yaxis.title = 'Number of articles'
fig.layout.paper_bgcolor = '#E5ECF6'
HTML(fig.to_html())
We can see dramatic increases in articles from 2009 (3,514) to 2010 (12k), and from 2010 to 2011 (47k). It's extremely unlikely that a website can almost quadruple its publishing activity twice, in two consecutive years. They might have made some acquisitions or content partnerships, or maybe there are issues with the dataset. When we check the authors later, we will see a possible answer to this sudden increase. Let's zoom in further and look at the monthly trend.
articles_per_month = buzzfeed.resample('M')['loc'].count()
articles_per_month.to_frame()
loc | |
---|---|
lastmod | |
2008-01-31 00:00:00+00:00 | 163 |
2008-02-29 00:00:00+00:00 | 159 |
2008-03-31 00:00:00+00:00 | 158 |
2008-04-30 00:00:00+00:00 | 170 |
2008-05-31 00:00:00+00:00 | 216 |
... | ... |
2019-12-31 00:00:00+00:00 | 4245 |
2020-01-31 00:00:00+00:00 | 3562 |
2020-02-29 00:00:00+00:00 | 3483 |
2020-03-31 00:00:00+00:00 | 4202 |
2020-04-30 00:00:00+00:00 | 53 |
148 rows × 1 columns
fig = go.Figure()
fig.add_bar(x=articles_per_month.index, y=articles_per_month.values)
fig.layout.title = 'BuzzFeed Articles per Month (excluding video)'
fig.layout.yaxis.title = 'Number of articles'
fig.layout.paper_bgcolor = '#E5ECF6'
HTML(fig.to_html())
This confirms the trend above and shows an even more sudden change. In April 2010, they published 1,249 articles, after having published 354 the previous month. Something similar happens in April 2011. Now it is almost certain that this is not organic, natural growth in their publishing activity.
We can also take a look at the trend by day of the week.
(buzzfeed
.groupby(buzzfeed.index.weekday)['loc']
.count().to_frame()
.rename(columns=dict(loc='count'))
.assign(day=['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'])
.style.bar(color='lightblue').format(dict(count='{:,}'))
)
count | day | |
---|---|---|
lastmod | ||
0.0 | 82,469 | Mon |
1.0 | 86,661 | Tue |
2.0 | 85,311 | Wed |
3.0 | 82,199 | Thu |
4.0 | 81,362 | Fri |
5.0 | 43,487 | Sat |
6.0 | 40,174 | Sun |
Nothing very surprising here. They produce a fairly consistent number of articles on weekdays, which is almost double what they produce on weekends.
We can also take a look at the annual trends by category and see if something pops out.
The following code goes through all categories, and creates a plot for the number of articles per year.
from plotly.subplots import make_subplots
categories = buzzfeed['sitemap_cat'][buzzfeed['sitemap_cat'] != 'video'].unique()
fig = make_subplots(rows=1, cols=len(categories), subplot_titles=categories,
shared_yaxes=True)
for i, cat in enumerate(categories):
    df = buzzfeed[buzzfeed['sitemap_cat']==cat].resample('A')['loc'].count()
    fig.add_bar(x=df.index.year, y=df.values, row=1, col=i+1, showlegend=False, name=cat)
fig.layout.height = 550
fig.layout.paper_bgcolor = '#E5ECF6'
fig.layout.title = 'Buzzfeed Articles per Year - By Category'
HTML(fig.to_html())
I can see two things here. First is the jump in "shopping" articles from 1,732 to 6,845 in 2019, with 2020 on track to top that. It seems to be working well for them; checking some of those articles, you can see they are running affiliate programs and promoting products. Second is how misleading this chart can be. For example, Tasty was acquired by BuzzFeed, and here it occupies a tiny portion of the content, but if you check their Facebook page you'll see they have almost one hundred million followers. So keep this in mind, be skeptical, and try to verify the information from other sources where possible.
We can now move to analyze whatever information we can get from the URLs, and here is a random sample:
buzzfeed['loc'].sample(15).tolist()
['https://www.buzzfeed.com/buzzfeedpress/buzzfeeds-ceo-on-the-future-of-media-the-internet-always', 'https://www.buzzfeed.com/thestreet/5-stocks-under-10-set-to-soar-xh5', 'https://www.buzzfeed.com/huffpost/nick-cannons-autoimmune-disease-america-7p0', 'https://www.buzzfeed.com/mjs538/heres-what-pete-buttigieg-has-to-say-to-all-the-mike-pences', 'https://www.buzzfeed.com/alexandranapoli/products-that-must-have-been-designed-by-geniuses', 'https://www.buzzfeed.com/ryanschocket2/john-mayer-keeps-tryna-slide-into-halseys-dms-on-insta-and', 'https://www.buzzfeed.com/kimberleydadds/giovanna-fletcher-has-shared-this-inspiring-message-for-crit', 'https://www.buzzfeed.com/briangalindo/15-great-celebrity-tbt-photos-august-8', 'https://www.buzzfeed.com/samir/video-of-trucks-crashing-into-low-bridges', 'https://www.buzzfeed.com/ktlincoln/the-new-york-knicks-basketball-trolls', 'https://www.buzzfeed.com/slate/todd-akin-missouri-gop-nominee-prompts-stir-with-126l', 'https://www.buzzfeed.com/abagg/this-three-and-a-half-minute-music-video-was-shot-in-only-fi', 'https://www.buzzfeed.com/hollywoodlife/kristen-stewart-worried-robert-pattinson-wont-u0l', 'https://www.buzzfeed.com/soft/unbelievably-cheap-prices-for-public-pc-desktop-6-2v4d', 'https://www.buzzfeed.com/starpulse/dave-grohl-krist-novoselic-play-smells-rsh']
The general template seems to be of the form buzzfeed.com/{language}/{author}/{article-title}, with English articles omitting the language code (there is no "/en/").
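As an illustration, a small helper function (hypothetical, not part of the analysis code) could apply that template, treating a missing two-letter segment as English:

```python
import re

def parse_buzzfeed_url(url):
    """Split a URL of the form buzzfeed.com/{language}/{author}/{article-title}
    into its parts; the language segment is optional and defaults to 'en'."""
    path = url.split('buzzfeed.com/')[-1]
    parts = path.split('/')
    if re.fullmatch('[a-z]{2}', parts[0]):
        lang, author, slug = parts[0], parts[1], parts[-1]
    else:
        lang, author, slug = 'en', parts[0], parts[-1]
    return {'lang': lang, 'author': author, 'slug': slug}

print(parse_buzzfeed_url('https://www.buzzfeed.com/br/davirocha/qual-musica'))
print(parse_buzzfeed_url('https://www.buzzfeed.com/mjs538/some-article-title'))
```

Note the ambiguity this sketch glosses over: an author whose username happens to be two lowercase letters would be misread as a language code.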
Let's now create a new column for languages, which can be done by extracting the pattern of two letters between two slashes. If nothing is available, it will be filled with "en". Now we can see the number of articles per language.
buzzfeed['lang'] = buzzfeed['loc'].str.extract('/([a-z]{2})/')[0].fillna('en')
by_lang = buzzfeed['lang'].value_counts()
by_lang.to_frame().style.format('{:,}')
lang | |
---|---|
en | 441,283 |
br | 21,104 |
jp | 16,226 |
mx | 12,714 |
de | 10,942 |
fr | 10,051 |
fig = go.Figure()
fig.add_bar(x=by_lang.index, y=by_lang.values)
fig.layout.title = 'BuzzFeed Number of Articles per Language (excluding video)'
fig.layout.paper_bgcolor = '#E5ECF6'
HTML(fig.to_html())
We can also see the monthly number of articles per language for a better view.
fig = make_subplots(cols=1, rows=len(by_lang), subplot_titles=by_lang.index.str.upper(),
shared_xaxes=True, y_title='Number of articles')
for i, lang in enumerate(by_lang.index):
    df = buzzfeed[buzzfeed['lang']==lang].resample('M')['loc'].count()
    fig.add_bar(x=df.index, y=df.values, row=i+1, col=1, showlegend=False, name=lang)
fig.layout.height = 650
fig.layout.title = 'BuzzFeed Monthly Articles by Language (excluding video)'
fig.layout.paper_bgcolor = '#E5ECF6'
HTML(fig.to_html())
Now let's do the same for authors. It's the same process: we split the "loc" column by "/", extract the second-to-last element, and place it in a new "author" column. After that, we can count the articles by author.
buzzfeed['author'] = buzzfeed['loc'].str.split('/').str[-2]
buzzfeed.sample(10)
loc | sitemap | sitemap_cat | lang | author | |
---|---|---|---|---|---|
lastmod | |||||
2012-05-24 00:00:00+00:00 | https://www.buzzfeed.com/current/15-pictures-o... | https://www.buzzfeed.com/sitemap/buzzfeed/2012... | buzzfeed | en | current |
2013-02-06 00:00:00+00:00 | https://www.buzzfeed.com/aolweirdnews/timothy-... | https://www.buzzfeed.com/sitemap/buzzfeed/2013... | buzzfeed | en | aolweirdnews |
2018-10-12 00:00:00+00:00 | https://www.buzzfeed.com/br/davirocha/qual-mus... | https://www.buzzfeed.com/sitemap/buzzfeed/2018... | buzzfeed | br | davirocha |
2012-10-10 00:00:00+00:00 | https://www.buzzfeed.com/ivillage/how-to-make-... | https://www.buzzfeed.com/sitemap/buzzfeed/2012... | buzzfeed | en | ivillage |
2011-09-06 00:00:00+00:00 | https://www.buzzfeed.com/pharmacy/buy-authenti... | https://www.buzzfeed.com/sitemap/buzzfeed/2011... | buzzfeed | en | pharmacy |
2010-06-11 00:00:00+00:00 | https://www.buzzfeed.com/limelife/do-you-have-... | https://www.buzzfeed.com/sitemap/buzzfeed/2010... | buzzfeed | en | limelife |
2016-10-15 00:00:00+00:00 | https://www.buzzfeed.com/laraparker/can-you-gu... | https://www.buzzfeed.com/sitemap/buzzfeed/2016... | buzzfeed | en | laraparker |
2019-11-30 00:00:00+00:00 | https://www.buzzfeed.com/jp/michelleno/newborn... | https://www.buzzfeed.com/sitemap/buzzfeed/2019... | buzzfeed | jp | michelleno |
2015-09-16 00:00:00+00:00 | https://www.buzzfeed.com/robertcrauder/muslim-... | https://www.buzzfeed.com/sitemap/buzzfeed/2015... | buzzfeed | en | robertcrauder |
2012-08-29 00:00:00+00:00 | https://www.buzzfeed.com/whitneyjefferson/eliz... | https://www.buzzfeed.com/sitemap/buzzfeed/2012... | buzzfeed | en | whitneyjefferson |
print('Number of authors:', buzzfeed['author'].nunique(), '\n')
(buzzfeed['author']
.value_counts()
.to_frame()
.assign(perc=lambda df: df['author'].div(df['author'].sum()),
cum_perc=lambda df: df['perc'].cumsum())[:20]
.reset_index()
.rename(columns=dict(index='author', author='count', perc='%', cum_perc='cum. %'))
.style.set_caption('Top Authors by Number of Articles')
.format({'count': '{:,}', '%': '{:.1%}', 'cum. %': '{:.1%}'})
.background_gradient(cmap='cividis'))
Number of authors: 6834
author | count | % | cum. % | |
---|---|---|---|---|
0 | fabordrabfeed | 11,999 | 2.3% | 2.3% |
1 | huffpost | 11,064 | 2.2% | 4.5% |
2 | video | 10,657 | 2.1% | 6.6% |
3 | hollywoodreporter | 8,492 | 1.7% | 8.2% |
4 | soft | 5,636 | 1.1% | 9.3% |
5 | whitneyjefferson | 5,405 | 1.1% | 10.4% |
6 | flavorwire | 5,334 | 1.0% | 11.4% |
7 | mjs538 | 5,253 | 1.0% | 12.5% |
8 | lyapalater | 5,218 | 1.0% | 13.5% |
9 | justjared | 4,965 | 1.0% | 14.4% |
10 | slate | 4,944 | 1.0% | 15.4% |
11 | glamour | 4,302 | 0.8% | 16.3% |
12 | avclub | 4,204 | 0.8% | 17.1% |
13 | nowthisnews | 4,085 | 0.8% | 17.9% |
14 | hollywoodlife | 3,666 | 0.7% | 18.6% |
15 | time | 3,660 | 0.7% | 19.3% |
16 | collegehumor | 3,394 | 0.7% | 20.0% |
17 | usmagazine | 3,376 | 0.7% | 20.6% |
18 | nypost | 3,374 | 0.7% | 21.3% |
19 | donnad | 3,292 | 0.6% | 21.9% |
"cum. %" shows the cumulative percentage of articles by the authors up to the current row; the first three authors, for example, generated 6.6% of the total. You can also see that the first few "authors" are actually other news organizations, not people. I manually checked a few articles by "huffpost", and got a 404 error.
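To make the cumulative percentage column concrete, here is the same calculation on toy data (hypothetical counts, not the BuzzFeed numbers):

```python
import pandas as pd

# Toy author counts: compute each author's share, then the running total of shares
counts = pd.Series({'author_a': 50, 'author_b': 25, 'author_c': 25})
df = (counts.to_frame('count')
      .assign(perc=lambda d: d['count'] / d['count'].sum(),
              cum_perc=lambda d: d['perc'].cumsum()))
print(df)  # cum_perc reaches 1.0 (100%) at the last row
```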
The following code snippet goes through a random sample of URLs where the author is "huffpost", and prints the URL along with the response.
import requests
for url in buzzfeed[buzzfeed['author']=='huffpost']['loc'].sample(10).tolist():
    resp = requests.get(url)
    print(url, '|', resp.status_code, resp.reason)
https://www.buzzfeed.com/huffpost/heidi-montag-undergoes-breast-reduction-surgery-7p0 | 404 Not Found
https://www.buzzfeed.com/huffpost/new-photography-book-chicks-with-guns-sh-7p0 | 404 Not Found
https://www.buzzfeed.com/huffpost/liza-monroy-marriage-changes-things-7p0 | 404 Not Found
https://www.buzzfeed.com/huffpost/blue-ivy-beyonce-jay-z-grab-lunch-in-paris-phot-7p0 | 404 Not Found
https://www.buzzfeed.com/huffpost/mexico-meth-bust-army-finds-15-tons-of-pure-metha-7p0 | 404 Not Found
https://www.buzzfeed.com/huffpost/teen-jeopardy-contestant-leonard-cooper-7p0 | 404 Not Found
https://www.buzzfeed.com/huffpost/elin-nordegren-returns-to-college-as-divorce-rumor-7p0 | 404 Not Found
https://www.buzzfeed.com/huffpost/the-17-weirdest-things-schools-have-banned-photos-7p0 | 404 Not Found
https://www.buzzfeed.com/huffpost/boston-mayor-vows-to-block-chick-fil-a-from-openin-7p0 | 404 Not Found
https://www.buzzfeed.com/huffpost/glory-jay-zs-song-for-daughter-blu-7p0 | 404 Not Found
And this is another issue in the dataset: the articles of the top contributors don't exist anymore. I didn't check them all; the proper way to quantify this issue would be to go through all half a million URLs.
The existence of such a large number of articles by large news organizations might be the answer to the sudden increase in the volume of content on BuzzFeed. It's definitely a problem to have 404s in your sitemap, but in our case it's great that they didn't remove them, because it gives us a better view of the history of the site, even though many URLs no longer exist. It also means there might be other dead URLs that were removed and that we don't know about.
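If you did want to quantify the 404 problem, one approach (a sketch, with the network part left as a comment so the logic stays self-contained and testable) is to collect status codes for a sample of URLs and summarize them:

```python
from collections import Counter

def summarize_statuses(status_codes):
    """Summarize a list of HTTP status codes: counts per code and share of 404s."""
    counts = Counter(status_codes)
    not_found = counts.get(404, 0) / len(status_codes) if status_codes else 0
    return counts, not_found

# In practice you would collect the codes with something like:
# codes = [requests.head(url).status_code for url in sample_of_urls]
codes = [200, 404, 404, 301, 200, 404]  # example values, not real measurements
counts, pct_404 = summarize_statuses(codes)
print(counts, f'{pct_404:.0%} not found')
```

Running this over a random sample rather than all half a million URLs would give a quick estimate of how widespread the dead links are.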
With such a large website, you can expect some issues, especially going back seven or eight years, where many things changed and many things are no longer relevant. So let's do the same exercise for a recent period: 2019 and the first quarter of 2020.
print('Number of authors:', format(buzzfeed[buzzfeed.index.year > 2018]['author'].nunique(), ','))
print('Number of articles:', format(len(buzzfeed[buzzfeed.index.year > 2018]), ','), '\n')
(buzzfeed[buzzfeed.index.year > 2018]['author']
.value_counts()
.to_frame()
.assign(perc=lambda df: df['author'].div(df['author'].sum()),
cum_perc=lambda df: df['perc'].cumsum())
.reset_index()[:20]
.rename(columns=dict(index='author', author='count', perc='%', cum_perc='cum. %'))
.style.set_caption('Top Authors by Number of Articles 2019-2020')
.format({'count': '{:,}', '%': '{:.1%}', 'cum. %': '{:.1%}'})
.background_gradient(cmap='cividis'))
Number of authors: 1,582
Number of articles: 65,770
author | count | % | cum. % | |
---|---|---|---|---|
0 | ryanschocket2 | 914 | 1.4% | 1.4% |
1 | daves4 | 831 | 1.3% | 2.7% |
2 | noradominick | 792 | 1.2% | 3.9% |
3 | ehisosifo1 | 784 | 1.2% | 5.0% |
4 | sydrobinson1 | 745 | 1.1% | 6.2% |
5 | mjs538 | 727 | 1.1% | 7.3% |
6 | sarahaspler | 709 | 1.1% | 8.4% |
7 | mikespohr | 697 | 1.1% | 9.4% |
8 | briangalindo | 696 | 1.1% | 10.5% |
9 | mireyagonzalez | 695 | 1.1% | 11.5% |
10 | audreyworboys | 686 | 1.0% | 12.6% |
11 | stephenlaconte | 685 | 1.0% | 13.6% |
12 | kristatorres | 664 | 1.0% | 14.6% |
13 | luisdelvalle | 664 | 1.0% | 15.6% |
14 | ainamaruyama | 658 | 1.0% | 16.6% |
15 | andyneuenschwander | 655 | 1.0% | 17.6% |
16 | crystalro | 647 | 1.0% | 18.6% |
17 | hannahloewentheil | 632 | 1.0% | 19.6% |
18 | farrahpenn | 628 | 1.0% | 20.5% |
19 | alliehayes | 613 | 0.9% | 21.5% |
Now all the top authors seem to be people and not organizations. We can also see that the top twenty produced 21.5% of the content in this period. And we can see how many articles each author produced, as well as the percentage of that number out of the total articles for the period.
In case you were wondering how many articles per month each author produced:
top16authors = buzzfeed[buzzfeed.index.year > 2018]['author'].value_counts()[:16]
df = buzzfeed[buzzfeed.index.year > 2018]
fig = make_subplots(cols=4, rows=4, subplot_titles=top16authors.index, shared_xaxes=True, shared_yaxes=True)
for i in range(4):
    for j in range(4):
        author = top16authors.index[i*4 + j]  # row-major, matching subplot_titles order
        temp_df = df[df['author']==author].resample('M')['loc'].count()
        fig.add_bar(x=temp_df.index, y=temp_df.values, row=i+1, col=j+1,
                    name=author, showlegend=False)
fig.layout.paper_bgcolor = '#E5ECF6'
fig.layout.height = 650
fig.layout.title = 'Buzzfeed Top Sixteen Authors Articles per Month 2019-2020'
HTML(fig.to_html())
The above was an exploratory approach, where we didn't know anything about the authors. Now that we know a little, we can use a top-down approach. The following function takes an arbitrary number of author names, and plots the monthly number of articles for them, so you can compare any two or more authors. So let's start with the top news organizations.
def compare_authors(*authors):
    fig = go.Figure()
    for author in authors:
        df = buzzfeed[buzzfeed['author']==author].resample('M')['loc'].count()
        fig.add_scatter(x=df.index, y=df.values, name=author, mode='markers+lines')
    fig.layout.title = 'Articles per Month by: ' + ', '.join(authors)
    fig.layout.legend = dict(orientation='h')
    fig.layout.paper_bgcolor = '#E5ECF6'
    return HTML(fig.to_html())
compare_authors('fabordrabfeed', 'huffpost', 'hollywoodreporter', 'soft')
Now it seems more likely that the jump in articles in April 2011 was due to content partnerships. We can also see that the partnership with HuffingtonPost ended in November 2013, according to the sitemap at least.
Below are the trends for the top three authors in the last five quarters.
compare_authors('ryanschocket2', 'daves4', 'noradominick')
We now get to the final part of the URL: the slug, which contains the title of the article. Everything up to this point was basically creating metadata by categorizing the content by date, category, language, and author. The slugs can also be extracted into their own column using the same approach. I also replaced the dashes with spaces to make them easier to split and analyze.
buzzfeed['slugs'] = buzzfeed['loc'].str.split('/').str[-1].str.replace('-', ' ')
buzzfeed.sample(7)
loc | sitemap | sitemap_cat | lang | author | slugs | |
---|---|---|---|---|---|---|
lastmod | ||||||
2011-03-16 00:00:00+00:00 | https://www.buzzfeed.com/soft/indigo-rose-visu... | https://www.buzzfeed.com/sitemap/buzzfeed/2011... | buzzfeed | en | soft | indigo rose visual patch 3 5 full download cra... |
2013-05-20 00:00:00+00:00 | https://www.buzzfeed.com/ryanhatesthis/the-sin... | https://www.buzzfeed.com/sitemap/buzzfeed/2013... | buzzfeed | en | ryanhatesthis | the singer miguel fell on a womans head at the... |
2020-03-20 00:00:00+00:00 | https://www.buzzfeed.com/briangalindo/12-celeb... | https://www.buzzfeed.com/sitemap/buzzfeed/2020... | buzzfeed | en | briangalindo | 12 celebrity tbt photos march 19 2020 |
2011-05-25 00:00:00+00:00 | https://www.buzzfeed.com/huffpost/christopher-... | https://www.buzzfeed.com/sitemap/buzzfeed/2011... | buzzfeed | en | huffpost | christopher meloni out at law order sv 7p0 |
2012-04-09 00:00:00+00:00 | https://www.buzzfeed.com/celebuzz/demi-lovato-... | https://www.buzzfeed.com/sitemap/buzzfeed/2012... | buzzfeed | en | celebuzz | demi lovato news demi lovato goes au naturel o... |
2018-01-05 00:00:00+00:00 | https://www.buzzfeed.com/tastyeditor/pigs-in-a... | https://www.buzzfeed.com/sitemap/tasty/2018_1.xml | tasty | en | tastyeditor | pigs in a blanket stadium |
2012-08-27 00:00:00+00:00 | https://www.buzzfeed.com/glamour/the-25-best-p... | https://www.buzzfeed.com/sitemap/buzzfeed/2012... | buzzfeed | en | glamour | the 25 best places for the rich and single or ... |
To take a look at the slugs, I created a subset of them containing only English articles.
slugs = buzzfeed[buzzfeed['lang']=='en']['slugs']
slugs.sample(10).to_frame()
slugs | |
---|---|
lastmod | |
2016-01-06 00:00:00+00:00 | mindy kaling knows rom coms |
2013-09-25 00:00:00+00:00 | newsweek pakistans controversial cover features l |
2020-02-29 00:00:00+00:00 | worst romantic leads add yours |
2017-06-22 00:00:00+00:00 | miranda lambert answers fan questions while pl... |
2016-05-13 00:00:00+00:00 | mac and cheeeeese |
2012-10-24 00:00:00+00:00 | groupon coupons that no one uses teamcococom 25qa |
2014-10-06 00:00:00+00:00 | man swarmed by pug puppy dog pile experiences ... |
2011-03-28 00:00:00+00:00 | killink csv 1 1 buy cheap downloadable oem sof... |
2018-07-16 00:00:00+00:00 | literal inception level shit |
2011-06-28 00:00:00+00:00 | collector has 350000 pieces of erotica sxr |
The simplest thing to do is to count the words in the slugs. The word_frequency function does that for us.
word_counts = adv.word_frequency(slugs)
(word_counts
.iloc[:20, :2]
.style
.format(dict(abs_freq='{:,}'))
.background_gradient(cmap='cividis'))
word | abs_freq | |
---|---|---|
0 | things | 12,651 |
1 | new | 11,240 |
2 | 7p0 | 11,013 |
3 | best | 10,508 |
4 | que | 9,429 |
5 | people | 8,670 |
6 | 30ks | 8,492 |
7 | de | 8,280 |
8 | photos | 6,279 |
9 | like | 6,042 |
10 | 2v4d | 5,636 |
11 | fbl | 5,334 |
12 | 10 | 5,066 |
13 | video | 5,032 |
14 | ozb | 4,965 |
15 | 126l | 4,944 |
16 | know | 4,935 |
17 | 5 | 4,705 |
18 | day | 4,550 |
19 | products | 4,467 |
If one word doesn't convey much information, we can set the phrase_len argument to 2 to count two-word phrases (also known as bigrams).
word_counts2 = adv.word_frequency(slugs, phrase_len=2)
(word_counts2.iloc[:20, :2]
.style.format(dict(abs_freq='{:,}'))
.background_gradient(cmap='cividis'))
word | abs_freq | |
---|---|---|
0 | at the | 11,442 |
1 | of the | 6,858 |
2 | the best | 4,855 |
3 | in the | 4,661 |
4 | are you | 3,969 |
5 | the most | 3,710 |
6 | is the | 3,575 |
7 | can you | 3,029 |
8 | how to | 3,007 |
9 | on the | 2,898 |
10 | that will | 2,837 |
11 | this is | 2,646 |
12 | do you | 2,629 |
13 | make you | 2,493 |
14 | that are | 2,454 |
15 | from the | 2,446 |
16 | to be | 2,406 |
17 | for the | 2,367 |
18 | kim kardashian | 2,352 |
19 | things you | 2,233 |
Just like we compared the authors, we can create a similar function for words, which will serve as topics to be analyzed.
def compare_topics(*topics):
    fig = go.Figure()
    for topic in topics:
        df = slugs[slugs.str.contains(topic)].resample('M').count()
        fig.add_scatter(x=df.index, y=df.values, name=topic, mode='markers+lines')
    fig.layout.title = 'Articles per Month on: ' + ', '.join(topics)
    fig.layout.legend = dict(orientation='h')
    fig.layout.paper_bgcolor = '#E5ECF6'
    return HTML(fig.to_html())
These are the three most frequently appearing names of celebrities, and "quiz" seems also popular, so I compared them to each other.
compare_topics('kim kardashian', 'miley cyrus', 'justin bieber', 'quiz')
This suggests that the content HuffingtonPost and the others were publishing was celebrity-heavy. It also shows how popular quizzes have been, and the massive focus BuzzFeed is giving them. This raises the question of what those quizzes are about.
We can count the words in the slugs of a subset of the URLs that contain "quiz" in them. This way we can tell what topics they are using for their quizzes.
(adv.word_frequency(slugs[slugs.str.contains('quiz')])
.iloc[:15, :2].style.format(dict(abs_freq='{:,}')))
word | abs_freq | |
---|---|---|
0 | quiz | 4,121 |
1 | trivia | 275 |
2 | food | 200 |
3 | personality | 185 |
4 | character | 176 |
5 | reveal | 166 |
6 | disney | 138 |
7 | movie | 131 |
8 | quizzes | 120 |
9 | youll | 113 |
10 | pass | 110 |
11 | youre | 108 |
12 | age | 107 |
13 | hardest | 107 |
14 | true | 94 |
We now have a good overview of the size and structure of the dataset, and we spotted a few issues in the data. To better structure it, we created a few columns so we can more easily aggregate by language, category, author, date, and, finally, article titles.
Obviously you don't get the full view of a website from its sitemaps alone, but they provide a very fast way to get a lot of information about publishing activity and content, as we've seen.
The way we dealt with "lastmod" is pretty standard (many sites also provide the time of publishing, not just the date), but the URLs are different for every site.
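For other sites, a generic starting point (a sketch using the standard library; what each segment means will differ from site to site) is to look at the raw path segments of the URLs:

```python
from urllib.parse import urlparse

def path_segments(url):
    """Return the non-empty path segments of a URL."""
    return [seg for seg in urlparse(url).path.split('/') if seg]

# The meaning of each segment (category, date, slug, ...) depends on the site
print(path_segments('https://www.example.com/blog/2020/03/some-article-title'))
```

Once you know which positions hold the category, date, or slug on a given site, you can assign them to columns exactly as we did above.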
After this preparation, and having become familiar with some of the possible pitfalls, you can now start a proper analysis of the content. Some ideas you might want to explore: topic modelling, word co-occurrence, entity extraction, document clustering, and running these across different time ranges and any of the other columns we created.