Content Analysis with Sitemaps and Python¶

Where would you start if you wanted to get an understanding of a website's content, especially large publishers? I'm usually interested in the following questions:

How often and how much do they publish?
Are there daily, weekly, monthly, or annual trends in their publishing activity?
What topics to they write about or what products do they sell?
What are the trends in their topics? Which topics are gaining in volume and which are not?
How is the content or product split across languages, regions, categories, or authors?

In their most basic form, sitemaps are required to only have the "loc" tag (under the parent "url" tag). Essentially, a sitemap is allowed to simply be list of URLs. Other optional tags are allowed, most importantly "lastmod", as well as "changefreq", "priority", and in some cases "alternate". If you have "lastmod" in the sitemap (and most reputable sites do), then you can get all the information related to publishing activity and trends. Then the richness of URLs determines how much information you can extract (if the URLs are structured with no real information like www.example.com/product/12345 then you won't be able to get much from the sitemap).
The goal of this tutorial is to make sitemaps less boring objects!

I'll be analyzing the sitemaps of BuzzFeed, and since they have "lastmod" as well as consistent and rich URLs, we will be able to answer all of the questions raised above.

I'll be using Python for the analysis and an interactive version of the article is available here. I encourage you to check it out if you want to follow along. This way you can make changes and explore other things that you might be curious about. The data visualizations are also interactive, so you will be able to zoom, hover, and explore a little better.
If you don't know any programming, you can safely ignore all the code snippets (which I will be explaining anyways).

To get the sitemaps in a table format, I'll use the sitemap_to_df function from the advertools package. "df" is short for DataFrame, which is basically a data table. You simply pass the URL of a sitemap (or a sitemap index URL) to the function, and it returns the sitemap(s) in tabular format. If you give it a sitemap index, then it will go through all the sub-sitemaps and extract the URLs and whatever other data is available.
In addition to advertools, I'll be using pandas for data manipulation, as well as plotly for data visualization.

In [1]:

import advertools as adv
import plotly.graph_objects as go
import pandas as pd

# buzzfeed_generic = adv.sitemap_to_df('https://www.buzzfeed.com/sitemap/buzzfeed.xml')
# buzzfeed_tasty = adv.sitemap_to_df('https://www.buzzfeed.com/sitemap/tasty.xml')
# buzzfeed_video = adv.sitemap_to_df('https://www.buzzfeed.com/sitemap/video.xml')
# buzzfeed_shopping = adv.sitemap_to_df('https://www.buzzfeed.com/sitemap/shopping.xml')
# buzzfeed = pd.concat([buzzfeed_generic, buzzfeed_tasty, buzzfeed_video, buzzfeed_shopping],
#                      ignore_index=True)

Since I have saved them to CSV files so you don't have to re-import them, we'll open the files directly, and put them in one big DataFrame. We will set the "lastmod" column to become the index and set its type as "datetime" so we can access special date and time functionality.

In [2]:

import os

In [3]:

buzzfeed = pd.concat((pd.read_csv('data/' + file) for file in os.listdir('data/')), 
                     ignore_index=True)
buzzfeed['lastmod'] = pd.to_datetime(buzzfeed['lastmod'])
buzzfeed = buzzfeed.set_index('lastmod')
buzzfeed = buzzfeed.drop(columns=['video'])
buzzfeed

Out[3]:

	loc	sitemap
lastmod
NaT	https://www.buzzfeed.com/watch/video/1961	https://www.buzzfeed.com/sitemap/video/2016_28...
NaT	https://www.buzzfeed.com/watch/video/1503	https://www.buzzfeed.com/sitemap/video/2016_28...
NaT	https://www.buzzfeed.com/watch/video/1741	https://www.buzzfeed.com/sitemap/video/2016_28...
NaT	https://www.buzzfeed.com/watch/video/108	https://www.buzzfeed.com/sitemap/video/2016_28...
NaT	https://www.buzzfeed.com/watch/video/1975	https://www.buzzfeed.com/sitemap/video/2016_28...
...	...	...
2020-03-31 00:00:00+00:00	https://www.buzzfeed.com/jp/sonomishimada/chin...	https://www.buzzfeed.com/sitemap/tasty/2020_13...
2020-03-27 00:00:00+00:00	https://www.buzzfeed.com/br/agathadahora/recei...	https://www.buzzfeed.com/sitemap/tasty/2020_13...
2020-03-28 00:00:00+00:00	https://www.buzzfeed.com/jp/redkikuchi/potato-...	https://www.buzzfeed.com/sitemap/tasty/2020_13...
2020-03-30 00:00:00+00:00	https://www.buzzfeed.com/br/agathadahora/recei...	https://www.buzzfeed.com/sitemap/tasty/2020_13...
2020-03-31 00:00:00+00:00	https://www.buzzfeed.com/jp/yuittakahashi/baby...	https://www.buzzfeed.com/sitemap/tasty/2020_13...

512320 rows × 2 columns

The above is how the DataFrame looks like. "lastmod" is the index, and we have two columns; "loc" which is the URLs, and "sitemap", which is the URL of the siteamp from which the URL was retreived.
NaT stands for "not-a-time", which is the missing value representation of datetime objects.
As you can see, we have around half a million URLs to go through.

Sitemap Categories¶

If you look at the URLs of the sitemaps, you will see that they contain the website category, for example:

https://www.buzzfeed.com/sitemap/buzzfeed/2019_5.xml
https://www.buzzfeed.com/sitemap/shopping/2018_13.xml

This can be helpful in understanding which category the URL falls under. To extract the category from those URLs, the following line splits the XML URLs by the forward slash character, and takes the fifth element (index 4) of the resulting list. The extracted text will be assigned to a new column called sitemap_cat.

In [4]:

buzzfeed['sitemap_cat'] = buzzfeed['sitemap'].str.split('/').str[4]
buzzfeed.sample(5)

Out[4]:

	loc	sitemap	sitemap_cat
lastmod
2012-05-25 00:00:00+00:00	https://www.buzzfeed.com/expresident/otter-swi...	https://www.buzzfeed.com/sitemap/buzzfeed/2012...	buzzfeed
2016-04-12 00:00:00+00:00	https://www.buzzfeed.com/fabordrabfeed/helen-m...	https://www.buzzfeed.com/sitemap/buzzfeed/2016...	buzzfeed
2019-01-12 00:00:00+00:00	https://www.buzzfeed.com/nataliebrown/gardenin...	https://www.buzzfeed.com/sitemap/buzzfeed/2018...	buzzfeed
2012-02-20 00:00:00+00:00	https://www.buzzfeed.com/flavorwire/watch-roll...	https://www.buzzfeed.com/sitemap/buzzfeed/2012...	buzzfeed
2017-02-15 00:00:00+00:00	https://www.buzzfeed.com/andreborges/delicious...	https://www.buzzfeed.com/sitemap/buzzfeed/2017...	buzzfeed

Now that we have a column showing the categories, we can count how many URLs they have and get an overview of the relative volume of content under each. The following code simply counts the values in that column and formats the resulting DataFrame.

In [5]:

(buzzfeed['sitemap_cat']
 .value_counts()
 .to_frame()
 .assign(percentage=lambda df: df['sitemap_cat'].div(df['sitemap_cat'].sum()))
 .style.format(dict(sitemap_cat='{:,}', percentage='{:.1%}')))

Out[5]:

	sitemap_cat	percentage
buzzfeed	478,430	93.4%
shopping	13,774	2.7%
video	10,657	2.1%
tasty	5,337	1.0%
asis	4,122	0.8%

It's clear the "buzzfeed" is the major category, which is basically the main site, and the others are very small in comparison.

Before proceeding further, it's important to get a better understanding of the NaT values that we saw at the beginning. Let's see what category they fall under.

In [6]:

buzzfeed[buzzfeed.index.isna()]['sitemap_cat'].head()

Out[6]:

lastmod
NaT    video
NaT    video
NaT    video
NaT    video
NaT    video
Name: sitemap_cat, dtype: object

The first five fall under "video", but is that true for all the missing values?
The following line takes a subset of the DataFrame buzzfeed (the subset where the index contains missing values), then takes the sitemap_cat column, and counts the number of unique values. Since we saw that some values are "video", if the number of unique values is one, then all categories of missing dates fall under "video".

In [7]:

buzzfeed[buzzfeed.index.isna()]['sitemap_cat'].nunique()

Out[7]:

We have now uncovered a limitation in our dataset, which we know affects 2.1% of the URLs. We will not be able to analyze date-related issues with the video URLs. Nor will we be able to get any information about the content of those URLs for that matter:

In [8]:

buzzfeed[buzzfeed['sitemap_cat']=='video']['loc'].sample(10)

Out[8]:

lastmod
NaT    https://www.buzzfeed.com/watch/video/38468
NaT     https://www.buzzfeed.com/watch/video/8753
NaT    https://www.buzzfeed.com/watch/video/19053
NaT    https://www.buzzfeed.com/watch/video/18874
NaT    https://www.buzzfeed.com/watch/video/52612
NaT    https://www.buzzfeed.com/watch/video/96088
NaT    https://www.buzzfeed.com/watch/video/21987
NaT    https://www.buzzfeed.com/watch/video/76722
NaT    https://www.buzzfeed.com/watch/video/15605
NaT     https://www.buzzfeed.com/watch/video/7388
Name: loc, dtype: object

Publishing Trends¶

Let's check how many articles they publish per year, and whether or not there were higher/lower years.
The following code resamples the DataFrame by "A" (for annual), and counts the rows.

In [9]:

articles_per_year = buzzfeed.resample('A')['loc'].count()
articles_per_year.to_frame()

Out[9]:

	loc
lastmod
2008-12-31 00:00:00+00:00	2646
2009-12-31 00:00:00+00:00	3514
2010-12-31 00:00:00+00:00	11994
2011-12-31 00:00:00+00:00	46974
2012-12-31 00:00:00+00:00	62006
2013-12-31 00:00:00+00:00	61941
2014-12-31 00:00:00+00:00	62563
2015-12-31 00:00:00+00:00	56018
2016-12-31 00:00:00+00:00	49835
2017-12-31 00:00:00+00:00	38084
2018-12-31 00:00:00+00:00	40318
2019-12-31 00:00:00+00:00	54470
2020-12-31 00:00:00+00:00	11300

In [10]:

from IPython.display import HTML
fig = go.Figure()
fig.add_bar(x=articles_per_year.index, y=articles_per_year.values)
fig.layout.title = 'BuzzFeed Articles per Year (excluding video)'
fig.layout.yaxis.title = 'Number of articles'
fig.layout.paper_bgcolor = '#E5ECF6'
HTML(fig.to_html())

Out[10]:

We can see dramatic increases in articles from 2010 (3,514) to 2011 (12k), and from 2011 to 2012 (46k). It's extremely unlikely that a website can increase it's publishing activity almost fourfold, twice, and in two consecutive years. They might have made some acquisitions, content partnerships, or maybe there are issues with the dataset. When we check the authors later, we will see a possible answer to this sudden increase. Let's zoom in further. Let's do the same, but look at the monthly trend.

In [11]:

articles_per_month = buzzfeed.resample('M')['loc'].count()
articles_per_month.to_frame()

Out[11]:

	loc
lastmod
2008-01-31 00:00:00+00:00	163
2008-02-29 00:00:00+00:00	159
2008-03-31 00:00:00+00:00	158
2008-04-30 00:00:00+00:00	170
2008-05-31 00:00:00+00:00	216
...	...
2019-12-31 00:00:00+00:00	4245
2020-01-31 00:00:00+00:00	3562
2020-02-29 00:00:00+00:00	3483
2020-03-31 00:00:00+00:00	4202
2020-04-30 00:00:00+00:00	53

148 rows × 1 columns

In [12]:

fig = go.Figure()
fig.add_bar(x=articles_per_month.index, y=articles_per_month.values)
fig.layout.title = 'BuzzFeed Articles per Month (excluding video)'
fig.layout.yaxis.title = 'Number of articles'
fig.layout.paper_bgcolor = '#E5ECF6'
HTML(fig.to_html())

Out[12]:

This confirms the trend above and shows a more sudden change. In April 2010, they published 1,249 articles, after having published 354 the previous month. We have something similar happening in April of 2011. Now it is almost certain that this is not an organic natural growth in their publishing activity.
We can also take a look at the trend by day of the week.

In [13]:

(buzzfeed
 .groupby(buzzfeed.index.weekday)['loc']
 .count().to_frame()
 .rename(columns=dict(loc='count'))
 .assign(day=['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'])
 .style.bar(color='lightblue').format(dict(count='{:,}'))
)

Out[13]:

	count	day
lastmod
0.0	82,469	Mon
1.0	86,661	Tue
2.0	85,311	Wed
3.0	82,199	Thu
4.0	81,362	Fri
5.0	43,487	Sat
6.0	40,174	Sun

Nothing very surprising here. They produce a fairly consistent number of articles on weekdays, which is almost double what they produce on weekends.

Categories' Trends¶

We can also take a look at the annual trends by category and see if something pops out.
The following code goes through all categories, and creates a plot for the number of articles per year.

In [14]:

from plotly.subplots import make_subplots
categories = buzzfeed['sitemap_cat'][buzzfeed['sitemap_cat'] != 'video'].unique()
fig = make_subplots(rows=1, cols=len(categories), subplot_titles=categories,
                    shared_yaxes=True)
for i, cat in enumerate(categories):
    df = buzzfeed[buzzfeed['sitemap_cat']==cat].resample('A')['loc'].count()
    fig.add_bar(x=df.index.year, y=df.values, row=1, col=i+1, showlegend=False, name=cat)
fig.layout.height = 550
fig.layout.paper_bgcolor = '#E5ECF6'
fig.layout.title = 'Buzzfeed Articles per Year - By Category'
HTML(fig.to_html())

Out[14]:

I can see two things here. First is the jump in "shopping" articles from 1,732 to 6,845 in 2019, and 2020 is on track to top that. It seems that it is working well for them. Checking some of those articles you can see that they are running affiliate programs and promoting some products. Second, is how misleading this chart can be. For example Tasty has been acquired by BuzzFeed, and here you can see it occupying a tiny portion of the content. But if you check their Facebook page you'll see that they have almost one hundred million followers. So keep this in mind, be skeptical, and try to verify the information from other sources where possible.

URL Structure¶

We can now move to analyze whatever information we can get from the URLs, and here is a random sample:

In [15]:

buzzfeed['loc'].sample(15).tolist()

Out[15]:

['https://www.buzzfeed.com/buzzfeedpress/buzzfeeds-ceo-on-the-future-of-media-the-internet-always',
 'https://www.buzzfeed.com/thestreet/5-stocks-under-10-set-to-soar-xh5',
 'https://www.buzzfeed.com/huffpost/nick-cannons-autoimmune-disease-america-7p0',
 'https://www.buzzfeed.com/mjs538/heres-what-pete-buttigieg-has-to-say-to-all-the-mike-pences',
 'https://www.buzzfeed.com/alexandranapoli/products-that-must-have-been-designed-by-geniuses',
 'https://www.buzzfeed.com/ryanschocket2/john-mayer-keeps-tryna-slide-into-halseys-dms-on-insta-and',
 'https://www.buzzfeed.com/kimberleydadds/giovanna-fletcher-has-shared-this-inspiring-message-for-crit',
 'https://www.buzzfeed.com/briangalindo/15-great-celebrity-tbt-photos-august-8',
 'https://www.buzzfeed.com/samir/video-of-trucks-crashing-into-low-bridges',
 'https://www.buzzfeed.com/ktlincoln/the-new-york-knicks-basketball-trolls',
 'https://www.buzzfeed.com/slate/todd-akin-missouri-gop-nominee-prompts-stir-with-126l',
 'https://www.buzzfeed.com/abagg/this-three-and-a-half-minute-music-video-was-shot-in-only-fi',
 'https://www.buzzfeed.com/hollywoodlife/kristen-stewart-worried-robert-pattinson-wont-u0l',
 'https://www.buzzfeed.com/soft/unbelievably-cheap-prices-for-public-pc-desktop-6-2v4d',
 'https://www.buzzfeed.com/starpulse/dave-grohl-krist-novoselic-play-smells-rsh']

The general template seems to be in the form buzzfeed.com/{language}/{author}/{article-title} and the English articles don't have "/en/" in them.
Let's now create a new column for languages, which can be done by extracting the pattern of two letters between two slashes. If nothing is available, it will be filled with "en". Now we can see the number of articles per language.

In [16]:

buzzfeed['lang'] = buzzfeed['loc'].str.extract('/([a-z]{2})/')[0].fillna('en')
by_lang = buzzfeed['lang'].value_counts()
by_lang.to_frame().style.format('{:,}')

Out[16]:

	lang
en	441,283
br	21,104
jp	16,226
mx	12,714
de	10,942
fr	10,051

In [17]:

fig = go.Figure()
fig.add_bar(x=by_lang.index, y=by_lang.values)
fig.layout.title = 'BuzzFeed Number of Articles per Language (excluding video)'
fig.layout.paper_bgcolor = '#E5ECF6'
HTML(fig.to_html())

Out[17]:

We can also see the monthly number of articles per language for a better view.

In [18]:

fig = make_subplots(cols=1, rows=len(by_lang), subplot_titles=by_lang.index.str.upper(),
                    shared_xaxes=True, y_title='Number of articles')
for i, lang in enumerate(by_lang):
    df = buzzfeed[buzzfeed['lang']==by_lang.index[i]].resample('M')['loc'].count()
    fig.add_bar(x=df.index, y=df.values, row=i+1, col=1, showlegend=False, name=lang)
fig.layout.height = 650
fig.layout.title = 'BuzzFeed Monthly Articles by Language (excluding video)'
fig.layout.paper_bgcolor = '#E5ECF6'
HTML(fig.to_html())

Out[18]:

Authors¶

Now let's do the same for authors. It's the same process, we split the "loc" column by "/" and extract the second to last element, and place it in a new "author" column. After that we can count the articles by author.

In [19]:

buzzfeed['author'] = buzzfeed['loc'].str.split('/').str[-2]
buzzfeed.sample(10)

Out[19]:

	loc	sitemap	sitemap_cat	lang	author
lastmod
2012-05-24 00:00:00+00:00	https://www.buzzfeed.com/current/15-pictures-o...	https://www.buzzfeed.com/sitemap/buzzfeed/2012...	buzzfeed	en	current
2013-02-06 00:00:00+00:00	https://www.buzzfeed.com/aolweirdnews/timothy-...	https://www.buzzfeed.com/sitemap/buzzfeed/2013...	buzzfeed	en	aolweirdnews
2018-10-12 00:00:00+00:00	https://www.buzzfeed.com/br/davirocha/qual-mus...	https://www.buzzfeed.com/sitemap/buzzfeed/2018...	buzzfeed	br	davirocha
2012-10-10 00:00:00+00:00	https://www.buzzfeed.com/ivillage/how-to-make-...	https://www.buzzfeed.com/sitemap/buzzfeed/2012...	buzzfeed	en	ivillage
2011-09-06 00:00:00+00:00	https://www.buzzfeed.com/pharmacy/buy-authenti...	https://www.buzzfeed.com/sitemap/buzzfeed/2011...	buzzfeed	en	pharmacy
2010-06-11 00:00:00+00:00	https://www.buzzfeed.com/limelife/do-you-have-...	https://www.buzzfeed.com/sitemap/buzzfeed/2010...	buzzfeed	en	limelife
2016-10-15 00:00:00+00:00	https://www.buzzfeed.com/laraparker/can-you-gu...	https://www.buzzfeed.com/sitemap/buzzfeed/2016...	buzzfeed	en	laraparker
2019-11-30 00:00:00+00:00	https://www.buzzfeed.com/jp/michelleno/newborn...	https://www.buzzfeed.com/sitemap/buzzfeed/2019...	buzzfeed	jp	michelleno
2015-09-16 00:00:00+00:00	https://www.buzzfeed.com/robertcrauder/muslim-...	https://www.buzzfeed.com/sitemap/buzzfeed/2015...	buzzfeed	en	robertcrauder
2012-08-29 00:00:00+00:00	https://www.buzzfeed.com/whitneyjefferson/eliz...	https://www.buzzfeed.com/sitemap/buzzfeed/2012...	buzzfeed	en	whitneyjefferson

In [20]:

print('Number of authors:', buzzfeed['author'].nunique(), '\n')
(buzzfeed['author']
 .value_counts()
 .to_frame()
 .assign(perc=lambda df: df['author'].div(df['author'].sum()),
         cum_perc=lambda df: df['perc'].cumsum())[:20]
 .reset_index()
 .rename(columns=dict(index='author', author='count', perc='%', cum_perc='cum. %'))
 .style.set_caption('Top Authors by Number of Articles')
 .format({'count': '{:,}', '%': '{:.1%}', 'cum. %': '{:.1%}'})
 .background_gradient(cmap='cividis'))

Number of authors: 6834

Out[20]:

Top Authors by Number of Articles
	author	count	%	cum. %
0	fabordrabfeed	11,999	2.3%	2.3%
1	huffpost	11,064	2.2%	4.5%
2	video	10,657	2.1%	6.6%
3	hollywoodreporter	8,492	1.7%	8.2%
4	soft	5,636	1.1%	9.3%
5	whitneyjefferson	5,405	1.1%	10.4%
6	flavorwire	5,334	1.0%	11.4%
7	mjs538	5,253	1.0%	12.5%
8	lyapalater	5,218	1.0%	13.5%
9	justjared	4,965	1.0%	14.4%
10	slate	4,944	1.0%	15.4%
11	glamour	4,302	0.8%	16.3%
12	avclub	4,204	0.8%	17.1%
13	nowthisnews	4,085	0.8%	17.9%
14	hollywoodlife	3,666	0.7%	18.6%
15	time	3,660	0.7%	19.3%
16	collegehumor	3,394	0.7%	20.0%
17	usmagazine	3,376	0.7%	20.6%
18	nypost	3,374	0.7%	21.3%
19	donnad	3,292	0.6%	21.9%

"cum. %" shows the cumulative percentage of the number of articles by the authors up to the current row. The first three authors generated 6.6% of the total articles for example. You can also see that the first few are actually other news organizations, and not people. I manually checked a few articles by "huffpost", and got a 404 error.
The following code snippet goes through a random sample of URLs where the author is "huffpost", and prints the URL along with the response.

In [21]:

import requests
for url in buzzfeed[buzzfeed['author']=='huffpost']['loc'].sample(10).tolist():    
    resp = requests.get(url)
    print(url,'|', resp.status_code, resp.reason)

https://www.buzzfeed.com/huffpost/heidi-montag-undergoes-breast-reduction-surgery-7p0 | 404 Not Found
https://www.buzzfeed.com/huffpost/new-photography-book-chicks-with-guns-sh-7p0 | 404 Not Found
https://www.buzzfeed.com/huffpost/liza-monroy-marriage-changes-things-7p0 | 404 Not Found
https://www.buzzfeed.com/huffpost/blue-ivy-beyonce-jay-z-grab-lunch-in-paris-phot-7p0 | 404 Not Found
https://www.buzzfeed.com/huffpost/mexico-meth-bust-army-finds-15-tons-of-pure-metha-7p0 | 404 Not Found
https://www.buzzfeed.com/huffpost/teen-jeopardy-contestant-leonard-cooper-7p0 | 404 Not Found
https://www.buzzfeed.com/huffpost/elin-nordegren-returns-to-college-as-divorce-rumor-7p0 | 404 Not Found
https://www.buzzfeed.com/huffpost/the-17-weirdest-things-schools-have-banned-photos-7p0 | 404 Not Found
https://www.buzzfeed.com/huffpost/boston-mayor-vows-to-block-chick-fil-a-from-openin-7p0 | 404 Not Found
https://www.buzzfeed.com/huffpost/glory-jay-zs-song-for-daughter-blu-7p0 | 404 Not Found

And this is another issue in the dataset. The articles of the top contributors don't exist anymore. I didn't check them all, and the proper way is to go through all of the half-a-million URLs to quantify this issue.
The existence of such a large number of articles by large news organizations might be the answer to the question of the sudden increase in the volume of content on BuzzFeed. It's definitely a problem to have 404's in your sitemap, but in our case it is great that they didn't remove them, because we have a better view on the history of the site, even though many URLs don't exist anymore. This also means that there might be other non-existent URLs that were removed and we don't know about.

With such a large website, you can expect some issues especially going back seven or eight years were many things change, and many things are not relevant anymore. So let's do the same exercise for a recent period, the years 2019 and 2020 (first quarter).

In [22]:

print('Number of authors:', format(buzzfeed[buzzfeed.index.year > 2018]['author'].nunique(), ','))
print('Number of articles:', format(buzzfeed[buzzfeed.index.year > 2018].__len__(), ','), '\n')
(buzzfeed[buzzfeed.index.year > 2018]['author']
 .value_counts()
 .to_frame()
 .assign(perc=lambda df: df['author'].div(df['author'].sum()),
         cum_perc=lambda df: df['perc'].cumsum())
 .reset_index()[:20]
 .rename(columns=dict(index='author', author='count', perc='%', cum_perc='cum. %'))
 .style.set_caption('Top Authors by Number of Articles 2019-2020')
 .format({'count': '{:,}', '%': '{:.1%}', 'cum. %': '{:.1%}'})
 .background_gradient(cmap='cividis'))

Number of authors: 1,582
Number of articles: 65,770

Out[22]:

Top Authors by Number of Articles 2019-2020
	author	count	%	cum. %
0	ryanschocket2	914	1.4%	1.4%
1	daves4	831	1.3%	2.7%
2	noradominick	792	1.2%	3.9%
3	ehisosifo1	784	1.2%	5.0%
4	sydrobinson1	745	1.1%	6.2%
5	mjs538	727	1.1%	7.3%
6	sarahaspler	709	1.1%	8.4%
7	mikespohr	697	1.1%	9.4%
8	briangalindo	696	1.1%	10.5%
9	mireyagonzalez	695	1.1%	11.5%
10	audreyworboys	686	1.0%	12.6%
11	stephenlaconte	685	1.0%	13.6%
12	kristatorres	664	1.0%	14.6%
13	luisdelvalle	664	1.0%	15.6%
14	ainamaruyama	658	1.0%	16.6%
15	andyneuenschwander	655	1.0%	17.6%
16	crystalro	647	1.0%	18.6%
17	hannahloewentheil	632	1.0%	19.6%
18	farrahpenn	628	1.0%	20.5%
19	alliehayes	613	0.9%	21.5%

Now all the top authors seem to be people and not organizations. We can also see that the top twenty produced 21.5% of the content in this period. And we can see how many articles each author produced, as well as the percentage of that number out of the total articles for the period.
In case you were wondering how many articles per month each author produced:

In [23]:

top16authors = buzzfeed[buzzfeed.index.year > 2018]['author'].value_counts()[:16]
df = buzzfeed[buzzfeed.index.year > 2018]
fig = make_subplots(cols=4, rows=4, subplot_titles=top16authors.index, shared_xaxes=True, shared_yaxes=True)
count = 1
for i in range(4):
    for j in range(4):
        temp_df = df[df['author']==top16authors.index[i+j]].resample('M')['loc'].count()
        fig.add_bar(x=temp_df.index, y=temp_df.values, col=i+1, row=j+1,
                    name=top16authors.index[i+j], showlegend=False)

fig.layout.paper_bgcolor = '#E5ECF6'
fig.layout.height = 650
fig.layout.title = 'Buzzfeed Top Sixteen Authors Articles per Month 2019-2020'
HTML(fig.to_html())

Out[23]:

The above was an exploratory approach, where we didn't know anything about the authors. Now that we know a little, we can use a top-down approach. The following function takes an arbitrary number of author names, and plots the monthly number of articles for them, so you can compare any two or more authors. So let's start with the top news organizations.

In [24]:

def compare_authors(*authors):
    fig = go.Figure()
    for author in authors:
        df = buzzfeed[buzzfeed['author']==author].resample('M')['loc'].count()
        fig.add_scatter(x=df.index, y=df.values, name=author, mode='markers+lines')
    fig.layout.title = 'Articles per Month by: '  + ', '.join(authors)
    fig.layout.legend = dict(orientation='h')
    fig.layout.paper_bgcolor = '#E5ECF6'
    return HTML(fig.to_html())

In [25]:

compare_authors('fabordrabfeed', 'huffpost', 'hollywoodreporter', 'soft')

Out[25]:

Now it's getting more likely that the jump in articles in April 2011 was due to content partnerships, we can also see that the partnership with HuffingtonPost was ended on November of 2013, according to the sitemap at least.
Below are the trends for the top three authors in the last five quarters.

In [26]:

compare_authors('ryanschocket2', 'daves4', 'noradominick')

Out[26]:

Content Analysis¶

We now get to the final part of the URL, the slug that contains the titles of the articles. Everything up to this point was basically creating meta data by categorizing the content by date, category, language, and author. The slugs can also be extracted into their own column using the same approach. I also replaced the dashes with spaces to more easily split and analyze.

In [27]:

buzzfeed['slugs'] =  buzzfeed['loc'].str.split('/').str[-1].str.replace('-', ' ')
buzzfeed.sample(7)

Out[27]:

	loc	sitemap	sitemap_cat	lang	author	slugs
lastmod
2011-03-16 00:00:00+00:00	https://www.buzzfeed.com/soft/indigo-rose-visu...	https://www.buzzfeed.com/sitemap/buzzfeed/2011...	buzzfeed	en	soft	indigo rose visual patch 3 5 full download cra...
2013-05-20 00:00:00+00:00	https://www.buzzfeed.com/ryanhatesthis/the-sin...	https://www.buzzfeed.com/sitemap/buzzfeed/2013...	buzzfeed	en	ryanhatesthis	the singer miguel fell on a womans head at the...
2020-03-20 00:00:00+00:00	https://www.buzzfeed.com/briangalindo/12-celeb...	https://www.buzzfeed.com/sitemap/buzzfeed/2020...	buzzfeed	en	briangalindo	12 celebrity tbt photos march 19 2020
2011-05-25 00:00:00+00:00	https://www.buzzfeed.com/huffpost/christopher-...	https://www.buzzfeed.com/sitemap/buzzfeed/2011...	buzzfeed	en	huffpost	christopher meloni out at law order sv 7p0
2012-04-09 00:00:00+00:00	https://www.buzzfeed.com/celebuzz/demi-lovato-...	https://www.buzzfeed.com/sitemap/buzzfeed/2012...	buzzfeed	en	celebuzz	demi lovato news demi lovato goes au naturel o...
2018-01-05 00:00:00+00:00	https://www.buzzfeed.com/tastyeditor/pigs-in-a...	https://www.buzzfeed.com/sitemap/tasty/2018_1.xml	tasty	en	tastyeditor	pigs in a blanket stadium
2012-08-27 00:00:00+00:00	https://www.buzzfeed.com/glamour/the-25-best-p...	https://www.buzzfeed.com/sitemap/buzzfeed/2012...	buzzfeed	en	glamour	the 25 best places for the rich and single or ...

To take a look at the slugs, I created a subset of them containing only English articles.

In [28]:

slugs = buzzfeed[buzzfeed['lang']=='en']['slugs']
slugs.sample(10).to_frame()

Out[28]:

	slugs
lastmod
2016-01-06 00:00:00+00:00	mindy kaling knows rom coms
2013-09-25 00:00:00+00:00	newsweek pakistans controversial cover features l
2020-02-29 00:00:00+00:00	worst romantic leads add yours
2017-06-22 00:00:00+00:00	miranda lambert answers fan questions while pl...
2016-05-13 00:00:00+00:00	mac and cheeeeese
2012-10-24 00:00:00+00:00	groupon coupons that no one uses teamcococom 25qa
2014-10-06 00:00:00+00:00	man swarmed by pug puppy dog pile experiences ...
2011-03-28 00:00:00+00:00	killink csv 1 1 buy cheap downloadable oem sof...
2018-07-16 00:00:00+00:00	literal inception level shit
2011-06-28 00:00:00+00:00	collector has 350000 pieces of erotica sxr

The simplest thing to do is to count the words in the slugs. The word_frequency function does that for us.

In [29]:

word_counts = adv.word_frequency(slugs)
(word_counts
 .iloc[:20, :2]
 .style
 .format(dict(abs_freq='{:,}'))
 .background_gradient(cmap='cividis'))

Out[29]:

	word	abs_freq
0	things	12,651
1	new	11,240
2	7p0	11,013
3	best	10,508
4	que	9,429
5	people	8,670
6	30ks	8,492
7	de	8,280
8	photos	6,279
9	like	6,042
10	2v4d	5,636
11	fbl	5,334
12	10	5,066
13	video	5,032
14	ozb	4,965
15	126l	4,944
16	know	4,935
17	5	4,705
18	day	4,550
19	products	4,467

If one word is not conveying much information, we can specify the phrase_len value as 2 to count the two-word phrases (tokens is another name for that).

In [30]:

word_counts2 = adv.word_frequency(slugs, phrase_len=2)
(word_counts2.iloc[:20, :2]
 .style.format(dict(abs_freq='{:,}'))
 .background_gradient(cmap='cividis'))

Out[30]:

	word	abs_freq
0	at the	11,442
1	of the	6,858
2	the best	4,855
3	in the	4,661
4	are you	3,969
5	the most	3,710
6	is the	3,575
7	can you	3,029
8	how to	3,007
9	on the	2,898
10	that will	2,837
11	this is	2,646
12	do you	2,629
13	make you	2,493
14	that are	2,454
15	from the	2,446
16	to be	2,406
17	for the	2,367
18	kim kardashian	2,352
19	things you	2,233

Just like we compared the authors, we can use the same approach by creating a similar function for words, which will serve as topics to be analyzed.

In [31]:

def compare_topics(*topics):
    fig = go.Figure()
    for topic in topics:
        df = slugs[slugs.str.contains(topic)].resample('M').count()
        fig.add_scatter(x=df.index, y=df.values, name=topic, mode='markers+lines')
    fig.layout.title = 'Articles per Month on: '  + ', '.join(topics)
    fig.layout.legend = dict(orientation='h')
    fig.layout.paper_bgcolor = '#E5ECF6'
    return HTML(fig.to_html())

These are the three most frequently appearing names of celebrities, and "quiz" seems also popular, so I compared them to each other.

In [32]:

compare_topics('kim kardashian', 'miley cyrus', 'justin bieber', 'quiz')

Out[32]:

This shows that probably the content HuffingtonPost and the others were publishing was celebrity heavy. It also shows how popular quizzes have been, and the massive focus they are giving to them. This raises the question of what those quizzes are about.
We can count the words in the slugs of a subset of the URLs that contain "quiz" in them. This way we can tell what topics they are using for their quizzes.

In [33]:

(adv.word_frequency(slugs[slugs.str.contains('quiz')])
 .iloc[:15, :2].style.format(dict(abs_freq='{:,}')))

Out[33]:

	word	abs_freq
0	quiz	4,121
1	trivia	275
2	food	200
3	personality	185
4	character	176
5	reveal	166
6	disney	138
7	movie	131
8	quizzes	120
9	youll	113
10	pass	110
11	youre	108
12	age	107
13	hardest	107
14	true	94

Summary¶

We now have a good overview on the size and structure of the dataset. We spotted a few issues in the data. To better structure it, we created a few columns so we can more easily aggregate by language, category, author, date, and finally the titles of articles.
Obviously you don't get the full view on the website by the sitemaps alone, but they provide a very fast way to get a lot of information on publishing activity and content as seen above. The way we dealt with "lastmod" is pretty standard (many sites also provide the time of publishing and not just the date), but the URLs are different for every site.

After this preparation and getting familiar with some of the possible pitfalls you may face, you can now start a proper analysis of the content. Some ideas you might want to explore: topic modelling, word co-occurrence, entity extraction, document clustering, and doing these for different time ranges and for any of the other available parameters we created.