IP anonymization and its impact on visitor localization in Google Analytics

Tools used

pandas and its Google Analytics connector to fetch and wrangle the data,
bokeh to visualize it.

Data sources

The traffic data comes from Google Analytics and concerns blog.liip.ch. Two properties have tracked pageviews on that 'tech' blog since September 2015, one of them with IP anonymization enabled.
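As background, Google documents IP anonymization as zeroing the last octet of an IPv4 address (and the last 80 bits of an IPv6 address) before the geographic lookup. The sketch below only illustrates that masking with the standard library; `anonymize_ip` is a hypothetical helper name, not Google's code.

```python
import ipaddress

def anonymize_ip(address):
    """Zero the low-order bits of an address (hypothetical helper, illustration only)."""
    ip = ipaddress.ip_address(address)
    # Keep the first 24 bits of IPv4 (drop the last octet),
    # or the first 48 bits of IPv6 (drop the last 80 bits).
    prefix = 24 if ip.version == 4 else 48
    network = ipaddress.ip_network("{}/{}".format(address, prefix), strict=False)
    return str(network.network_address)

print(anonymize_ip("203.0.113.77"))
print(anonymize_ip("2001:db8:85a3:8d3:1319:8a2e:370:7348"))
```

All 256 IPv4 addresses sharing the first three octets collapse to the same value, which is why coarse dimensions such as country can be expected to hold up better than fine ones such as city.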

Exploration 1: What's the impact of IP anonymizing on the user country dimension?

In [221]:

import pandas as pd
import pandas.io.ga as ga
import numpy as np

%matplotlib inline

In [220]:

import sys
print ("PYTHON ", sys.version)
print ("PANDAS ", pd.__version__)

PYTHON  3.4.3 |Anaconda 2.4.0 (x86_64)| (default, Oct 20 2015, 14:27:51) 
[GCC 4.2.1 (Apple Inc. build 5577)]
PANDAS  0.17.0

Fetching data

Now that our tools are loaded, let us fetch the data from the two Google Analytics properties.

We fetch the sessions by country and city for both properties (with and without anonymization), from September 1st, 2015 onwards.

In [4]:

sources = {
    'full' : {
        'property_id': "UA-424540-4",
        'profile_id': "5334921",
    },
    'anon' : {
        'property_id': "UA-424540-11",
        'profile_id': "107030134",
    },
}

In [5]:

source_data = {}

for key, ids in sources.items():
    # one query per property: sessions by country and city since 2015-09-01
    source_data[key] = ga.read_ga(
        property_id=ids['property_id'],
        profile_id=ids['profile_id'],
        metrics="sessions",
        dimensions=['country', 'city'],
        start_date="2015-09-01",
        index_col=['country', 'city'],
    )

    print(source_data[key]['sessions'].sum())

49376
49161

The two volumes differ by less than 0.5% (215 sessions). Since the two trackers do not necessarily fire simultaneously, a small gap was expected. Let's have a look at one of the two datasets.
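The size of that gap can be checked with a couple of lines. This is a small sketch using the two totals printed above; the labels `first` and `second` are placeholders, since the print order of the two properties is not shown.

```python
# Session totals as printed by the fetch loop (labels are placeholders)
totals = {'first': 49376, 'second': 49161}

diff = abs(totals['first'] - totals['second'])
rel_diff = diff / max(totals.values())

print(diff, round(rel_diff, 4))
```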

In [6]:

source_data['full'].head()

Out[6]:

		sessions
country	city
(not set)	(not set)	126
Afghanistan	(not set)	2
Albania	(not set)	1
Albania	Tirana	12
Algeria	(not set)	15

Let us now join those two dataframes based on their country/city index:

In [222]:

data = pd.concat(source_data, axis=1, join='outer')

# rename homonymous columns
data.columns=['full_ip_sessions', 'anon_ip_sessions']

data.head()

Out[222]:

		full_ip_sessions	anon_ip_sessions
country	city
(not set)	(not set)	130	126
Afghanistan	(not set)	2	2
Albania	(not set)	1	1
Albania	Tirana	12	12
Algeria	(not set)	14	15
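Renaming the columns by position relies on the order in which `pd.concat` lays out the dictionary keys, and some pandas versions sort those keys alphabetically. The `full` frame in Out[6] shows 126 sessions for "(not set)", a value that appears under `anon_ip_sessions` in Out[222], so the positional labels are worth double-checking. A minimal sketch of a concat that fixes the order explicitly, using toy frames (`toy_full`, `toy_anon` and their values are illustrative, not the fetched data):

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('Albania', 'Tirana'), ('Algeria', '(not set)')],
    names=['country', 'city'],
)
# Toy stand-ins for the two fetched frames (values are illustrative only)
toy_full = pd.DataFrame({'sessions': [12, 15]}, index=index)
toy_anon = pd.DataFrame({'sessions': [12, 14]}, index=index)

# Pass a list plus explicit keys so the column order never depends on dict ordering
toy = pd.concat([toy_full, toy_anon], axis=1, join='outer', keys=['full', 'anon'])

# Flatten the ('full', 'sessions') style column tuples into single labels
toy.columns = ['{}_ip_sessions'.format(source) for source, _ in toy.columns]
print(toy)
```

Deriving the labels from the keys, rather than assigning them by position, keeps each column tied to the frame it came from.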

Let's list the countries where the biggest proportional losses and gains happen.

In [225]:

# group by level 0 of the index (i.e. countries) and sum columns for groups
country_data = data.groupby(level=0).sum()

# compute delta and its proportion
country_data['delta'] = country_data.anon_ip_sessions - country_data.full_ip_sessions
country_data['dprop'] = country_data.delta / country_data.full_ip_sessions

# sort by prop. delta, ascending
country_data.sort_values(by='dprop', inplace=True)

Countries with proportionally large losses:

In [229]:

country_data.query('full_ip_sessions > 200').head(10)

Out[229]:

	full_ip_sessions	anon_ip_sessions	delta	dprop
country
Austria	387	356	-31	-0.080103
China	341	326	-15	-0.043988
Japan	443	429	-14	-0.031603
United States	7705	7465	-240	-0.031149
Denmark	242	238	-4	-0.016529
Bulgaria	207	206	-1	-0.004831
Finland	275	274	-1	-0.003636
Mexico	332	332	0	0.000000
Taiwan	278	278	0	0.000000
Czech Republic	420	420	0	0.000000

Countries with proportionally large gains:

In [231]:

country_data.query('full_ip_sessions > 200').tail(10)

Out[231]:

	full_ip_sessions	anon_ip_sessions	delta	dprop
country
Philippines	266	272	6	0.022556
Lithuania	217	222	5	0.023041
India	4142	4243	101	0.024384
Indonesia	430	441	11	0.025581
Portugal	230	236	6	0.026087
Malaysia	233	242	9	0.038627
Ukraine	1100	1144	44	0.040000
Ireland	202	211	9	0.044554
Sweden	478	506	28	0.058577
Singapore	311	331	20	0.064309

What share of all sessions do these fluctuations represent?

In [232]:

country_data.delta.map(abs).sum()/country_data.full_ip_sessions.sum()

Out[232]:

0.018429242692378105

Among countries with more than 200 sessions, the proportional deltas stay below 10% in either direction (Austria is the largest at about -8%). Overall, the absolute deltas add up to less than 2% of all sessions.

Country attribution is therefore largely insensitive to IP anonymization.
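The mismatch ratio used above is the sum of absolute per-country deltas divided by the total of full-IP sessions. A small sketch on three rows copied from Out[229] and Out[231] shows the arithmetic; the `sample` frame is only an excerpt, so its ratio differs from the global figure.

```python
import pandas as pd

# Three rows copied from Out[229] and Out[231]
sample = pd.DataFrame(
    {'full_ip_sessions': [387, 332, 311],
     'anon_ip_sessions': [356, 332, 331]},
    index=pd.Index(['Austria', 'Mexico', 'Singapore'], name='country'),
)
sample['delta'] = sample.anon_ip_sessions - sample.full_ip_sessions

# Sum of absolute deltas as a share of full-IP sessions
mismatch = sample.delta.abs().sum() / sample.full_ip_sessions.sum()
print(round(mismatch, 4))
```

Note that a session attributed to a different country shows up in two deltas (one loss, one gain), so this ratio can overstate the share of sessions that actually moved.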

Exploration 2: What does it mean locally in Switzerland?

Let us dive one level deeper: at city level. We will focus on Switzerland since we have enough traffic from it.

In [233]:

country_data.query('country == "Switzerland"')

Out[233]:

	full_ip_sessions	anon_ip_sessions	delta	dprop
country
Switzerland	4141	4156	15	0.003622

At country level, Switzerland differs by less than 0.4% (15 sessions), so it is rather stable. But what is happening at city level?

In [234]:

# create a clean subset
swiss_data = data.query('country == "Switzerland"').copy()
swiss_data.sum()

Out[234]:

full_ip_sessions    4141
anon_ip_sessions    4156
dtype: float64

In [235]:

swiss_data['delta'] = swiss_data.anon_ip_sessions - swiss_data.full_ip_sessions
swiss_data['dprop'] = swiss_data.delta / swiss_data.full_ip_sessions

swiss_data.sort_values(by='dprop', inplace=True)

In [237]:

swiss_data.query('full_ip_sessions > 50')

Out[237]:

		full_ip_sessions	anon_ip_sessions	delta	dprop
country	city
Switzerland	Porrentruy	138	1	-137	-0.992754
	Ebikon	81	1	-80	-0.987654
	Basel	162	111	-51	-0.314815
	Lugano	54	42	-12	-0.222222
	Lucerne	71	58	-13	-0.183099
	Lausanne	341	345	4	0.011730
	Zurich	1476	1504	28	0.018970
	Bern	206	232	26	0.126214
	Saint Gallen	86	104	18	0.209302
	Winterthur	77	97	20	0.259740
	Geneva	118	189	71	0.601695
	Fribourg	105	300	195	1.857143

In [238]:

# sum of absolute deltas, as a share of full-IP sessions
swiss_data.delta.map(abs).sum()/swiss_data.full_ip_sessions.sum()

Out[238]:

0.2711905336875151

Quite some turmoil at city level! For example, Fribourg gains 186% of attributions while Basel loses 31%, and Porrentruy and Ebikon each collapse to a single session (from 138 and 81).

Overall, the city attribution mismatch for Switzerland is about 27%.
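The city-level figures can be re-derived from the rows shown in Out[237]. The sketch below copies three of them into a small frame (`cities` is just an excerpt) and expresses the proportional delta as a percentage.

```python
import pandas as pd

# Rows copied from Out[237]
cities = pd.DataFrame(
    {'full_ip_sessions': [138, 162, 105],
     'anon_ip_sessions': [1, 111, 300]},
    index=pd.Index(['Porrentruy', 'Basel', 'Fribourg'], name='city'),
)
cities['delta'] = cities.anon_ip_sessions - cities.full_ip_sessions

# Proportional delta relative to the full-IP count, in percent
cities['dprop_pct'] = (100 * cities.delta / cities.full_ip_sessions).round(1)
print(cities)
```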