IP anonymization and its impact on visitor localization in Google Analytics

Tools used

Data sources

  • The traffic data comes from Google Analytics and concerns blog.liip.ch. Two properties track pageviews on that 'tech' blog, on with IP anonymizing enabled, since september 2015.

Exploration 1: What's the impact of IP anonymizing on the user country dimension?

In [221]:
import pandas as pd
import pandas.io.ga as ga
import numpy as np

%matplotlib inline
In [220]:
import sys
print ("PYTHON ", sys.version)
print ("PANDAS ", pd.__version__)
PYTHON  3.4.3 |Anaconda 2.4.0 (x86_64)| (default, Oct 20 2015, 14:27:51) 
[GCC 4.2.1 (Apple Inc. build 5577)]
PANDAS  0.17.0

Fetching data

Now that our tools are loaded, let us fetch the data from the two Google Analytics properties.

The traffic data without anonymization since September 1st, 2015.

In [4]:
sources = {
    'full' : {
        'property_id': "UA-424540-4",
        'profile_id': "5334921",
    },
    'anon' : {
        'property_id': "UA-424540-11",
        'profile_id': "107030134",
    },
}
In [5]:
source_data = {}

for key in sources:    
    source_data[key] = ga.read_ga(
        property_id = sources[key]['property_id'],
        profile_id = sources[key]['profile_id'],
        metrics     = "sessions",
        dimensions  = ['country','city'],
        start_date  = "2015-09-01",
        index_col   = ['country','city'],
    )

    print(source_data[key]['sessions'].sum())
49376
49161

Less than 1% difference in volumes. Since the tracking is not forcefully simultaneous, that was expected. Let's just have a look at one of them.

In [6]:
source_data['full'].head()
Out[6]:
sessions
country city
(not set) (not set) 126
Afghanistan (not set) 2
Albania (not set) 1
Tirana 12
Algeria (not set) 15

Let us now join those two dataframes based on their country/city index:

In [222]:
data = pd.concat(source_data, axis=1, join='outer')

# rename homonymous columns
data.columns=['full_ip_sessions', 'anon_ip_sessions']

data.head()
Out[222]:
full_ip_sessions anon_ip_sessions
country city
(not set) (not set) 130 126
Afghanistan (not set) 2 2
Albania (not set) 1 1
Tirana 12 12
Algeria (not set) 14 15

Let's list the countries where the biggest proportional losses & wins happen.

In [225]:
# group by level 0 of the index (i.e. countries) and sum columns for groups
country_data = data.groupby(level=0).sum()

# compute delta and its proportion
country_data['delta'] = country_data.anon_ip_sessions - country_data.full_ip_sessions
country_data['dprop'] = country_data.delta / country_data.full_ip_sessions

# sort by prop. delta, ascending
country_data.sort_values(by='dprop', inplace=True)

Countries with proportionally large losses:

In [229]:
country_data.query('full_ip_sessions > 200').head(10)
Out[229]:
full_ip_sessions anon_ip_sessions delta dprop
country
Austria 387 356 -31 -0.080103
China 341 326 -15 -0.043988
Japan 443 429 -14 -0.031603
United States 7705 7465 -240 -0.031149
Denmark 242 238 -4 -0.016529
Bulgaria 207 206 -1 -0.004831
Finland 275 274 -1 -0.003636
Mexico 332 332 0 0.000000
Taiwan 278 278 0 0.000000
Czech Republic 420 420 0 0.000000

Countries with proportionally large gains:

In [231]:
country_data.query('full_ip_sessions > 200').tail(10)
Out[231]:
full_ip_sessions anon_ip_sessions delta dprop
country
Philippines 266 272 6 0.022556
Lithuania 217 222 5 0.023041
India 4142 4243 101 0.024384
Indonesia 430 441 11 0.025581
Portugal 230 236 6 0.026087
Malaysia 233 242 9 0.038627
Ukraine 1100 1144 44 0.040000
Ireland 202 211 9 0.044554
Sweden 478 506 28 0.058577
Singapore 311 331 20 0.064309

What's the proportion of the fluctuation?

In [232]:
country_data.delta.map(abs).sum()/country_data.full_ip_sessions.sum()
Out[232]:
0.018429242692378105

The deltas are below 10%, either positive or negative. And there's globally less than 2% of country attribution mismatch.

One can then say that Country attribution is largely insensitive to IP anonymization.

Exploration 2 : what does it mean locally in Switzerland?

Let us dive one level deeper: at city level. We will focus on Switzerland since we have enough traffic from it.

In [233]:
country_data.query('country == "Switzerland"')
Out[233]:
full_ip_sessions anon_ip_sessions delta dprop
country
Switzerland 4141 4156 15 0.003622

Less than 1 percent loss at country level for Switzerland, rather stable. But what's happening at city level?

In [234]:
# create a clean subset
swiss_data = data.query('country == "Switzerland"').copy()
swiss_data.sum()
Out[234]:
full_ip_sessions    4141
anon_ip_sessions    4156
dtype: float64
In [235]:
swiss_data['delta'] = swiss_data.anon_ip_sessions - swiss_data.full_ip_sessions
swiss_data['dprop'] = swiss_data.delta / swiss_data.full_ip_sessions

swiss_data.sort_values(by='dprop', inplace=True)
In [237]:
swiss_data.query('full_ip_sessions > 50')
Out[237]:
full_ip_sessions anon_ip_sessions delta dprop
country city
Switzerland Porrentruy 138 1 -137 -0.992754
Ebikon 81 1 -80 -0.987654
Basel 162 111 -51 -0.314815
Lugano 54 42 -12 -0.222222
Lucerne 71 58 -13 -0.183099
Lausanne 341 345 4 0.011730
Zurich 1476 1504 28 0.018970
Bern 206 232 26 0.126214
Saint Gallen 86 104 18 0.209302
Winterthur 77 97 20 0.259740
Geneva 118 189 71 0.601695
Fribourg 105 300 195 1.857143
In [238]:
# absolute sum of delta 
swiss_data.delta.map(abs).sum()/swiss_data.full_ip_sessions.sum()
Out[238]:
0.2711905336875151

Quite some turmoil at city level! For example, Fribourg gains 185% of attributions while Basel loses 30%, Something wild going on in Porrentruy and Ebikon, ...

Overall, we see more than 25% mismatch in city attribution for Switzerland.