Analyzing Search Engine Results Pages on a Large Scale

If you do SEO, a big part of your job is tracking your rankings on SERPs, as well as your competitors'.
I'm going to share a way to get SERP data into a DataFrame (a table you can save as a CSV or Excel sheet) for analysis, at a large scale and in an automated way.

I will be using the programming language Python, so there will be some coding involved. If you don't know any programming, you can simply skip the code snippets below; you don't need to understand them to follow along.

So how exactly are we going to get the data, and what are we going to do with it?

Importing the data

Google's Custom Search Engine is a service that allows you to create your own customized search engine, where you can specify certain sites to search and set your own relevancy rules. You can also pull the data programmatically through their API, which is what we are going to do. If you don't specify any rules, your custom search engine will essentially be searching the whole web. In addition, you can set many parameters for your search queries: the location of the user, the language of the site, image search, and many more.

Here are the steps to set up your account to import data (skip these if you don't want to run the code yourself); a minimal request sketch follows the list.

  1. Create a custom search engine. At first you might be asked to enter a site to search. Enter any domain, then go to the control panel and remove it. Make sure you enable "Search the entire web" and image search. You will also need to get your search engine ID, which you can find on the control panel page.
  2. Enable the custom search API. You will need to create a project for this first.
  3. Create credentials for this project so you can get your key.
  4. Enable billing for your project if you want to run more than 100 queries per day. The first 100 queries each day are free; after that, you pay $5 per 1,000 queries.
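
Before wiring anything into a full workflow, it can help to see what the raw API returns. Below is a minimal sketch of a single request to the Custom Search JSON API using the requests library; the cx and key placeholders are the ID and key from the steps above, and the query and gl values are just examples. The rest of this post uses a wrapper that handles this for you.

import requests

# Credentials from the setup steps above (placeholders)
cx = 'YOUR_GOOGLE_CUSTOM_SEARCH_ENGINE_ID'
key = 'YOUR_GOOGLE_DEVELOPER_KEY'

# One request: up to ten results for one query, geolocated to the US.
# Other parameters (hl, num, searchType='image', ...) go in the same dict.
response = requests.get(
    'https://www.googleapis.com/customsearch/v1',
    params={'key': key, 'cx': cx, 'q': 'flights to hong kong', 'gl': 'us'}
)
for item in response.json().get('items', []):
    print(item['link'])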

Handling the data

We will be using three Python packages for our work:

  • advertools: To connect to the Google CSE API and receive SERPs in a table format.
  • pandas: For data manipulation, reshaping, merging, sorting, etc.
  • matplotlib: For data visualization.

To give you a quick idea, here is a sample from the SERP response that we will be working with:

In [1]:
import pandas as pd
serp_flights = pd.read_csv('serp_flights.csv')
serp_flights.head()
Out[1]:
searchTerms rank title snippet displayLink link queryTime totalResults cacheId formattedUrl ... count startIndex inputEncoding outputEncoding safe cx gl searchTime formattedSearchTime formattedTotalResults
0 flights to hong kong 1 Cheap Flights to Hong Kong (HKG) from $401 - K... Find flights to Hong Kong on XiamenAir, Hong K... www.kayak.com https://www.kayak.com/flight-routes/United-Sta... 2018-11-21 03:03:21.596540+00:00 12700000 W42baDpas_gJ https://www.kayak.com/flight-routes/United.../... ... 10 1 utf8 utf8 off 012859022920491477448:pubdbfjmmec us 0.428278 0.43 12,700,000
1 flights to hong kong 2 $458 Flights to Hong Kong, China (HKG) - TripA... Cheap Flights to Hong Kong: Enter your dates o... www.tripadvisor.com https://www.tripadvisor.com/Flights-g294217-Ho... 2018-11-21 03:03:21.596540+00:00 12700000 5jsjhmxgxvkJ https://www.tripadvisor.com/Flights-g294217-Ho... ... 10 1 utf8 utf8 off 012859022920491477448:pubdbfjmmec us 0.428278 0.43 12,700,000
2 flights to hong kong 3 Cheap Flights to Hong Kong SAR, Asia $307.69 i... Expedia Add-On Advantage: Book a flight & unlo... www.expedia.com https://www.expedia.com/Cheap-Flights-To-Hong-... 2018-11-21 03:03:21.596540+00:00 12700000 iEmRbX0P8CAJ https://www.expedia.com/Cheap-Flights-To-Hong-... ... 10 1 utf8 utf8 off 012859022920491477448:pubdbfjmmec us 0.428278 0.43 12,700,000
3 flights to hong kong 4 Cheap Flights to Hong Kong International from ... The Magnificent City of Hong Kong. Check for c... www.skyscanner.com https://www.skyscanner.com/flights-to/hkg/chea... 2018-11-21 03:03:21.596540+00:00 12700000 EI78TI4Q1qYJ https://www.skyscanner.com/flights-to/hkg/chea... ... 10 1 utf8 utf8 off 012859022920491477448:pubdbfjmmec us 0.428278 0.43 12,700,000
4 flights to hong kong 5 Flights to Hong Kong (HKG) | Hong Kong Flight ... Find & book great deals on Hong Kong (HKG) Fli... flights.cathaypacific.com https://flights.cathaypacific.com/en-us/flight... 2018-11-21 03:03:21.596540+00:00 12700000 q6kszaD7mggJ https://flights.cathaypacific.com/en-us/flight... ... 10 1 utf8 utf8 off 012859022920491477448:pubdbfjmmec us 0.428278 0.43 12,700,000

5 rows × 26 columns

A few notes on the different columns available:

"queryTime" is the time that the query was run (when I made the request). This is different from "searchTime" which is the amount of time it took Google to run the query (usually less than one second). Most of the main columns will always be there, but if you pass different parameters you will have more or less columns. For example, you would have columns describing the images, in case you specify the type of search to be "image".

The dataset

We are going to take a look at the airline tickets industry. Here are the details:

  • Destinations: I got the top 100 destinations from Wikipedia and used them as the basis for the queries.
  • Keywords: each destination was prepended with one of two phrases, so we will be looking at "flights to destination" and "tickets to destination".
  • Countries: each of those keywords was requested for two English-speaking countries: the United States and the United Kingdom.
  • SERPs: each results page contains ten links, together with their metadata.

As a result we have 100 destinations x 2 variations x 2 countries x 10 results = 4,000 rows of data.

We begin by importing the packages that we will use, and defining our Google CSE ID and key:

In [2]:
%config InlineBackend.figure_format = 'retina'
import matplotlib.pyplot as plt
import advertools as adv
import pandas as pd
pd.set_option('display.max_columns', None)
cx = 'YOUR_GOOGLE_CUSTOM_SEARCH_ENGINE_ID'
key = 'YOUR_GOOGLE_DEVELOPER_KEY'

Now we can import the Wikipedia table, and here are the top destinations with some additional data:

In [3]:
# top_dest = pd.read_html('https://en.wikipedia.org/wiki/List_of_cities_by_international_visitors', 
# header=0)[0]
# top_dest.to_csv('top_destinations.csv', index=False)
top_dest = pd.read_csv('top_destinations.csv')
top_dest.head().style.format({'Arrivals 2016Euromonitor': '{:,}'})
Out[3]:
RankEuromonitor RankMastercard City Country Arrivals 2016Euromonitor Arrivals 2016Mastercard Growthin arrivalsEuromonitor Income(billions $)Mastercard
0 1 11 Hong Kong Hong Kong 25,695,800.0 8.37e+06 −3.1 % 6.84
1 2 1 Bangkok Thailand 23,270,600.0 2.147e+07 9.5 % 14.84
2 3 2 London United Kingdom 19,842,800.0 1.988e+07 3.4 % 19.76
3 4 6 Singapore Singapore 17,681,800.0 1.211e+07 6.1 % 12.54
4 5 nan Macau Macau 16,299,100.0 nan 5.9 % nan

Create the keywords by concatenating the two variations mentioned above:

In [4]:
cities = top_dest['City'].tolist()
queries = ['flights to ' + c.lower() for c in cities] + ['tickets to ' + c.lower() for c in cities]
queries[:3] + queries[-3:] + ['etc...']
Out[4]:
['flights to hong kong',
 'flights to bangkok',
 'flights to london',
 'tickets to qingdao',
 'tickets to philadelphia',
 'tickets to lagos',
 'etc...']

With the main parameters defined, we can now send the requests to Google as follows:

In [5]:
# serp_flights = adv.serp_goog(cx=cx, key=key, q=queries, gl=['us', 'uk'])  # imports data
In [6]:
serp_flights = pd.read_csv('serp_flights.csv', parse_dates=['queryTime'])
serp_us = serp_flights[serp_flights['gl'] == 'us'].copy() # create a subset for US
serp_uk = serp_flights[serp_flights['gl'] == 'uk'].copy() # create a subset for UK

Let's now take a quick look at the top domains.

In [7]:
print('Domain Summary - Overall')
(serp_flights
 .pivot_table('rank', 'displayLink', 
              aggfunc=['count', 'mean'])
 .sort_values([('count', 'rank'), ('mean', 'rank')], 
              ascending=[False, True])
 .assign(coverage=lambda df: df[('count', 'rank')] / len(serp_flights)*10)
 .head(10).style.format({("coverage", ''): "{:.1%}", 
                         ('mean', 'rank'): '{:.2f}'}))
Domain Summary - Overall
Out[7]:
count mean coverage
rank rank
displayLink
www.skyscanner.net 364 3.32 91.0%
www.tripadvisor.com 273 4.01 68.3%
www.expedia.com 243 3.06 60.7%
www.kayak.com 196 2.57 49.0%
www.cheapflights.com 180 5.68 45.0%
www.kayak.co.uk 175 4.91 43.8%
www.cheapflights.co.uk 174 4.51 43.5%
www.skyscanner.com 167 4.87 41.8%
www.expedia.co.uk 165 5.25 41.2%
www.tripadvisor.co.uk 135 6.07 33.8%

Since we are mainly interested in how the domains rank, the results are summarized by three main metrics (an equivalent groupby computation is sketched after this list):

  1. Count: the number of times the domain appeared across the searches we made.
  2. Mean: the average rank of the domain across those appearances.
  3. Coverage: the count divided by the number of queries.
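
If the multi-level pivot table feels opaque, the same three metrics can be computed with a plain groupby. This is just an equivalent sketch of what the pivot table above does, with coverage as a fraction rather than a formatted percentage:

# 400 queries in total: 100 destinations x 2 variations x 2 countries
num_queries = serp_flights['searchTerms'].nunique() * serp_flights['gl'].nunique()

domain_summary = (serp_flights
                  .groupby('displayLink')['rank']
                  .agg(count='count', mean='mean')
                  .assign(coverage=lambda df: df['count'] / num_queries)
                  .sort_values(['count', 'mean'], ascending=[False, True]))
domain_summary.head(10)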

The pivot table above covers all the results and gives a quick overview. I think it's more meaningful to split the data into two pivot tables, one for each country:

In [8]:
print('Domain Summary - US')
(serp_flights[serp_flights['gl']=='us']
 .pivot_table('rank', 'displayLink', 
              aggfunc=['count', 'mean'])
 .sort_values([('count', 'rank'), ('mean', 'rank')], 
              ascending=[False, True])
 .assign(coverage=lambda df: df[('count', 'rank')] / len(serp_flights)*10 * 2)
 .head(10).style.format({("coverage", ''): "{:.1%}", 
                         ('mean', 'rank'): '{:.2f}'}))
Domain Summary - US
Out[8]:
count mean coverage
rank rank
displayLink
www.expedia.com 195 2.28 97.5%
www.tripadvisor.com 195 3.14 97.5%
www.kayak.com 176 2.07 88.0%
www.skyscanner.com 158 4.73 79.0%
www.skyscanner.net 148 6.05 74.0%
www.cheapflights.com 147 5.68 73.5%
www.cheapoair.com 74 6.84 37.0%
www.orbitz.com 60 6.70 30.0%
www.makemytrip.com 45 5.80 22.5%
www.lufthansa.com 37 6.92 18.5%

For coverage, I divided by 400 in the first table, but for the per-country tables I'm dividing by 200, because here we only care about the queries for that country. An interesting point: kayak.com has lower coverage than tripadvisor.com, but a better mean rank (2.07 vs. 3.14). At the top of the page, the difference in value between positions two and three is considerable, so depending on your case you might value one metric or the other.
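
If you'd rather not hard-code the divisors, they can be derived from the data itself. A small sketch, using the serp_us subset defined earlier:

# Number of unique queries per country (should be 200 for each of 'uk' and 'us')
queries_per_country = serp_flights.groupby('gl')['searchTerms'].nunique()
print(queries_per_country)

# Coverage for the US subset, using that count as the divisor
us_coverage = (serp_us.groupby('displayLink')['rank'].count()
               .div(queries_per_country['us'])
               .sort_values(ascending=False))
print(us_coverage.head(10))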

In [9]:
print('Domain Summary - UK')
(serp_flights[serp_flights['gl']=='uk']
 .pivot_table('rank', 'displayLink', 
              aggfunc=['count', 'mean'])
 .sort_values([('count', 'rank'), ('mean', 'rank')], 
              ascending=[False, True])
 .assign(coverage=lambda df: df[('count', 'rank')] / len(serp_flights)*10*2)
 .head(10).style.format({("coverage", ''): "{:.1%}", 
                         ('mean', 'rank'): '{:.2f}'}))
Domain Summary - UK
Out[9]:
count mean coverage
rank rank
displayLink
www.skyscanner.net 216 1.45 108.0%
www.cheapflights.co.uk 161 4.25 80.5%
www.kayak.co.uk 159 4.75 79.5%
www.expedia.co.uk 150 5.14 75.0%
www.tripadvisor.co.uk 128 6.10 64.0%
www.lastminute.com 112 5.97 56.0%
www.opodo.co.uk 109 5.96 54.5%
www.travelsupermarket.com 83 6.16 41.5%
www.tripadvisor.com 78 6.18 39.0%
www.britishairways.com 69 6.22 34.5%

A coverage of 108% means that skyscanner.net appeared on every search, and in some cases more than once on the same SERP.
Note that their mean rank is 1.45, far ahead of the second domain.
No joking with Skyscanner!
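
To confirm that the 108% comes from duplicate appearances rather than a calculation error, you can count how many times skyscanner.net shows up within each UK SERP. A quick check along these lines:

# Appearances of skyscanner.net per UK query
skyscanner_per_serp = (serp_uk[serp_uk['displayLink'] == 'www.skyscanner.net']
                       .groupby('searchTerms')['rank']
                       .count())

# Distribution of appearances per query: mostly once, sometimes twice
print(skyscanner_per_serp.value_counts())
print('Queries with at least one appearance:', skyscanner_per_serp.ge(1).sum())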

Now that we have an idea of how often these domains appear and their average ranks, it's also useful to visualize the data to see how it is distributed.
We first get the top ten domains for each country, define new DataFrames (tables) containing only the filtered data, and then visualize:

In [10]:
top10_domains = serp_flights.displayLink.value_counts()[:10].index
top10_df = serp_flights[serp_flights['displayLink'].isin(top10_domains)]

top10_domains_us = serp_us.displayLink.value_counts()[:10].index
top10_df_us = serp_us[serp_us['displayLink'].isin(top10_domains_us)]  # US rows only

top10_domains_uk = serp_uk.displayLink.value_counts()[:10].index
top10_df_uk = serp_uk[serp_uk['displayLink'].isin(top10_domains_uk)]  # UK rows only
In [11]:
fig, ax = plt.subplots(facecolor='#ebebeb')
fig.set_size_inches(15, 9)
ax.set_frame_on(False)
ax.scatter(top10_df['displayLink'].str.replace('www.', ''), 
           top10_df['rank'], s=850, alpha=0.02, edgecolor='k', lw=2)
ax.grid(alpha=0.25)
ax.invert_yaxis()
ax.yaxis.set_ticks(range(1, 11))
ax.tick_params(labelsize=15, rotation=9, labeltop=True,
               labelbottom=False)
ax.set_ylabel('Search engine results page rank', fontsize=16)
ax.set_title('Top 10 Tickets and Flights Domains', pad=75, fontsize=24)
ax.text(4.5, -0.5, 'Organic Search Rankings for 200 Keywords in US & UK', 
        ha='center', fontsize=15)
fig.savefig(ax.get_title() + '.png', 
            facecolor='#eeeeee', dpi=150, bbox_inches='tight')
plt.show()

For each appearance on a SERP we plot a very light circle at the position where that domain appeared (from one to ten). The more frequently a domain appears at a position, the darker the circle there.
For example, kayak.com, expedia.com, and skyscanner.net have solid blue circles at position one, as well as lighter ones at other positions.

A minor issue with this analysis so far is that it treats all keywords equally. Visitor numbers across the top one hundred destinations vary between roughly two and twenty-six million, so the keywords are clearly not equal. Also, for your specific case you might have your own set of "top 100" based on the site you are working on. But since we are exploring the industry and trying to understand the positions of the different players, I don't think it's a bad assumption. Just keep this in mind when doing a similar analysis for a specific case; a sketch of one possible weighting follows below.
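
To illustrate, here is a sketch of one possible weighting, using the arrivals column from the Wikipedia table loaded earlier. The column name 'Arrivals 2016Euromonitor' and the assumption that the lowercased city names match the query strings exactly come from the data shown above; adjust both for your own dataset.

# Weight each appearance by the destination's 2016 visitor arrivals
top_dest['query_city'] = top_dest['City'].str.lower()
serp_flights['query_city'] = (serp_flights['searchTerms']
                              .str.replace('flights to ', '', regex=False)
                              .str.replace('tickets to ', '', regex=False))

weighted = serp_flights.merge(top_dest[['query_city', 'Arrivals 2016Euromonitor']],
                              on='query_city', how='left')

# Arrivals-weighted average rank per domain: ranking well for the biggest
# destinations now counts for more
weighted_rank = (weighted
                 .groupby('displayLink')
                 .apply(lambda df: (df['rank'] * df['Arrivals 2016Euromonitor']).sum()
                        / df['Arrivals 2016Euromonitor'].sum())
                 .sort_values())
weighted_rank.head(10)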

The chart above was for the overall data; below is the same visualization split by country:

In [12]:
top10_dfs = [top10_df_us, top10_df_uk]
colors = ['darkred', 'olive']
suffixes = [' - US', ' - UK']
fig, ax = plt.subplots(2, 1, facecolor='#ebebeb')
fig.set_size_inches(15, 18)
for i in range(2):
    ax[i].set_frame_on(False)
    ax[i].scatter(top10_dfs[i]['displayLink'].str.replace('www.', ''), 
                  top10_dfs[i]['rank'], s=850, alpha=0.02, 
                  edgecolor='k', lw=2, color='darkred')
    ax[i].grid(alpha=0.25)
    ax[i].invert_yaxis()
    ax[i].yaxis.set_ticks(range(1, 11))
    ax[i].tick_params(labelsize=15, rotation=12, labeltop=True,
                   labelbottom=False)
    ax[i].set_ylabel('Search engine results page rank', fontsize=16)
    ax[i].set_title('Top 10 Tickets and Flights Domains' + suffixes[i], 
                    pad=75, fontsize=24)
    ax[i].text(4.5, -0.5, 'Organic Search Rankings for 200 Keywords', 
            ha='center', fontsize=15)
plt.tight_layout()
fig.savefig(ax[1].get_title() + '.png', 
            facecolor='#eeeeee', dpi=150, bbox_inches='tight')
plt.show()