Content Analysis with Sitemaps and Python


Where would you start if you wanted to get an understanding of a website's content, especially large publishers? I'm usually interested in the following questions:

  • How often and how much do they publish?
  • Are there daily, weekly, monthly, or annual trends in their publishing activity?
  • What topics do they write about, or what products do they sell?
  • What are the trends in their topics? Which topics are gaining in volume and which are not?
  • How is the content or product split across languages, regions, categories, or authors?

In their most basic form, sitemaps are only required to have the "loc" tag (under the parent "url" tag). Essentially, a sitemap is allowed to simply be a list of URLs. Other optional tags are allowed, most importantly "lastmod", as well as "changefreq", "priority", and in some cases "alternate". If the sitemap has "lastmod" (and most reputable sites do), then you can get all the information related to publishing activity and trends. The richness of the URLs then determines how much else you can extract (if the URLs carry no real information, you won't be able to get much from the sitemap).
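As a sketch, a minimal sitemap with the optional "lastmod" tag might look like the following (the URLs are placeholders, not real BuzzFeed URLs):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/2020/03/some-article-title</loc>
    <lastmod>2020-03-31</lastmod>
  </url>
  <url>
    <!-- "lastmod" is optional; only "loc" is required -->
    <loc>https://www.example.com/2020/03/another-article</loc>
  </url>
</urlset>
```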
The goal of this tutorial is to make sitemaps less boring!

I'll be analyzing the sitemaps of BuzzFeed, and since they have "lastmod" as well as consistent and rich URLs, we will be able to answer all of the questions raised above.

I'll be using Python for the analysis and an interactive version of the article is available here. I encourage you to check it out if you want to follow along. This way you can make changes and explore other things that you might be curious about. The data visualizations are also interactive, so you will be able to zoom, hover, and explore a little better.
If you don't know any programming, you can safely ignore all the code snippets (which I will be explaining anyway).

To get the sitemaps in a table format, I'll use the sitemap_to_df function from the advertools package. "df" is short for DataFrame, which is basically a data table. You simply pass the URL of a sitemap (or a sitemap index URL) to the function, and it returns the sitemap(s) in tabular format. If you give it a sitemap index, then it will go through all the sub-sitemaps and extract the URLs and whatever other data is available.
In addition to advertools, I'll be using pandas for data manipulation, as well as plotly for data visualization.

In [1]:
import advertools as adv
import plotly.graph_objects as go
import pandas as pd

# buzzfeed_generic = adv.sitemap_to_df('')
# buzzfeed_tasty = adv.sitemap_to_df('')
# buzzfeed_video = adv.sitemap_to_df('')
# buzzfeed_shopping = adv.sitemap_to_df('')
# buzzfeed = pd.concat([buzzfeed_generic, buzzfeed_tasty, buzzfeed_video, buzzfeed_shopping],
#                      ignore_index=True)

I have saved them to CSV files so you don't have to re-import them; we'll open the files directly and put them in one big DataFrame. We will set the "lastmod" column as the index and set its type to "datetime" so we can access special date and time functionality.

In [2]:
import os
In [3]:
buzzfeed = pd.concat((pd.read_csv('data/' + file) for file in os.listdir('data/')),
                     ignore_index=True)
buzzfeed['lastmod'] = pd.to_datetime(buzzfeed['lastmod'])
buzzfeed = buzzfeed.set_index('lastmod')
buzzfeed = buzzfeed.drop(columns=['video'])
                           loc  sitemap
lastmod
...                        ...  ...
2020-03-31 00:00:00+00:00  ...  ...
2020-03-27 00:00:00+00:00  ...  ...
2020-03-28 00:00:00+00:00  ...  ...
2020-03-30 00:00:00+00:00  ...  ...
2020-03-31 00:00:00+00:00  ...  ...

512320 rows × 2 columns

The above is how the DataFrame looks. "lastmod" is the index, and we have two columns: "loc", which contains the URLs, and "sitemap", which is the URL of the sitemap from which each URL was retrieved.
NaT stands for "not a time", which is the missing value representation for datetime objects.
As you can see, we have around half a million URLs to go through.
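As a quick standalone illustration (not part of the BuzzFeed data), this is how pandas produces NaT when a date is missing:

```python
import pandas as pd

# Parsing date strings where one value is missing:
# the missing entry becomes NaT ("not a time")
dates = pd.to_datetime(pd.Series(['2020-03-31', None]))
print(dates)
print(dates.isna().sum())  # number of missing dates
```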

Sitemap Categories

If you look at the URLs of the sitemaps, you will see that they contain the website category, for example:

This can be helpful in understanding which category the URL falls under. To extract the category from those URLs, the following line splits the XML URLs by the forward slash character, and takes the fifth element (index 4) of the resulting list. The extracted text will be assigned to a new column called sitemap_cat.

In [4]:
buzzfeed['sitemap_cat'] = buzzfeed['sitemap'].str.split('/').str[4]
                           loc  sitemap  sitemap_cat
lastmod
2012-05-25 00:00:00+00:00  ...  ...      buzzfeed
2016-04-12 00:00:00+00:00  ...  ...      buzzfeed
2019-01-12 00:00:00+00:00  ...  ...      buzzfeed
2012-02-20 00:00:00+00:00  ...  ...      buzzfeed
2017-02-15 00:00:00+00:00  ...  ...      buzzfeed
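To see why index 4 picks out the category, here is a standalone sketch with a hypothetical sitemap URL following the pattern described above:

```python
# Hypothetical sitemap URL; the real BuzzFeed URLs follow the same pattern
url = 'https://www.buzzfeed.com/sitemap/buzzfeed/sitemap_0.xml'
parts = url.split('/')
# ['https:', '', 'www.buzzfeed.com', 'sitemap', 'buzzfeed', 'sitemap_0.xml']
print(parts[4])  # → buzzfeed
```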

Now that we have a column showing the categories, we can count how many URLs they have and get an overview of the relative volume of content under each. The following code simply counts the values in that column and formats the resulting DataFrame.

In [5]:
(buzzfeed['sitemap_cat']
 .value_counts()
 .to_frame()
 .assign(percentage=lambda df: df['sitemap_cat'].div(df['sitemap_cat'].sum()))
 .style.format(dict(sitemap_cat='{:,}', percentage='{:.1%}')))
sitemap_cat percentage
buzzfeed 478,430 93.4%
shopping 13,774 2.7%
video 10,657 2.1%
tasty 5,337 1.0%
asis 4,122 0.8%

It's clear that "buzzfeed", which is basically the main site, is the major category, and the others are very small in comparison.

Before proceeding further, it's important to get a better understanding of the NaT values that we saw at the beginning. Let's see what category they fall under.

In [6]:
buzzfeed.loc[buzzfeed.index.isna(), 'sitemap_cat'].head()
NaT    video
NaT    video
NaT    video
NaT    video
NaT    video
Name: sitemap_cat, dtype: object

The first five fall under "video", but is that true for all the missing values?
The following line takes a subset of the DataFrame buzzfeed (the subset where the index contains missing values), then takes the sitemap_cat column, and counts the number of unique values. Since we saw that some values are "video", if the number of unique values is one, then all categories of missing dates fall under "video".

In [7]:
buzzfeed.loc[buzzfeed.index.isna(), 'sitemap_cat'].nunique()
1
We have now uncovered a limitation in our dataset, which we know affects 2.1% of the URLs. We will not be able to analyze date-related issues with the video URLs. Nor will we be able to get any information about the content of those URLs for that matter:

In [8]:
buzzfeed.loc[buzzfeed.index.isna(), 'loc'].head()
Name: loc, dtype: object

Let's check how many articles they publish per year, and whether some years were noticeably higher or lower than others.
The following code resamples the DataFrame by "A" (annual) and counts the rows.

In [9]:
articles_per_year = buzzfeed.resample('A')['loc'].count()
2008-12-31 00:00:00+00:00 2646
2009-12-31 00:00:00+00:00 3514
2010-12-31 00:00:00+00:00 11994
2011-12-31 00:00:00+00:00 46974
2012-12-31 00:00:00+00:00 62006
2013-12-31 00:00:00+00:00 61941
2014-12-31 00:00:00+00:00 62563
2015-12-31 00:00:00+00:00 56018
2016-12-31 00:00:00+00:00 49835
2017-12-31 00:00:00+00:00 38084
2018-12-31 00:00:00+00:00 40318
2019-12-31 00:00:00+00:00 54470
2020-12-31 00:00:00+00:00 11300
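As a standalone illustration of how resample works (toy data, not the BuzzFeed set), grouping a datetime-indexed Series by "A" buckets the rows into year-end periods and counting gives the rows per year:

```python
import pandas as pd

# Three rows across two years, indexed by date like the sitemap data
s = pd.Series(['a', 'b', 'c'],
              index=pd.to_datetime(['2019-01-15', '2019-06-01', '2020-02-10']))

# Two rows fall in 2019, one in 2020
print(s.resample('A').count())
```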
In [10]:
from IPython.display import HTML
fig = go.Figure()
fig.add_bar(x=articles_per_year.index, y=articles_per_year.values)
fig.layout.title = 'BuzzFeed Articles per Year (excluding video)'
fig.layout.yaxis.title = 'Number of articles'
fig.layout.paper_bgcolor = '#E5ECF6'