# Plotting imports
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode()
When building the tracker maps that you see on popular site profiles on whotracks.me, sankey diagrams seemed like a good fit to map categories of tracking to companies that own the trackers. Each link would be a tracker, going from a category to a company.
Given we had decided to use plotly.offline to generate the interactive images, I wanted to use the sankey diagram supported in plotly. The function itself is pretty straightforward, as you can see in sankey_diagram(), but figuring out the structure of the input data took a bit. Hopefully the following example will make it easier for those reading this post, should they ever decide to try sankey diagrams.
The goal here is to show a very small dataset, structured in a way that the plotly diagram (and other plotting solutions, e.g. d3.js) understands. We will be mapping cities to the countries they are part of. The value of each link will be the city population (in millions).
city_data = dict(
    nodes = dict(
        label=["Germany", "Berlin", "Munich", "Cologne", "France", "Paris", "Lyon", "Bordeaux"],
        color=["beige", "black", "red", "yellow", "beige", "blue", "white", "red"]
    ),
    links = dict(
        source=[0, 0, 0, 4, 4, 4],
        target=[1, 2, 3, 5, 6, 7],
        value= [3.5, 1.5, 1, 2.2, 0.5, 0.2],
        label=["capital", "city", "city", "capital", "city", "city"],
        color=["black", "red", "yellow", "blue", "whitesmoke", "red"]
    )
)
Note how there are two keys in the dictionary, nodes and links, and each has some attributes. Let's go over them. Each node has a label (e.g. Germany) and a corresponding color (in this case beige). Note that labels and colors are stored in lists of equal length, and the pairing is done based on the index.
Links contain information about how to link nodes. Each has a source, target, value, label and color. Source contains the index of the source node in the list of nodes, whereas target contains the index of the target node. Value determines how thick the link should be (in our case, the population of the city the link points to). Label and color, as the names suggest, specify the label and color of the link. Links, too, are paired based on index.
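To make the index-based pairing concrete, here is a minimal sketch that resolves each link's source and target indices back to node labels. The helper name describe_links is hypothetical (it is not part of plotly), and the small dictionary is a trimmed copy of the city data so that the snippet runs on its own.

```python
def describe_links(sndata):
    """Resolve source/target indices to node labels (hypothetical helper)."""
    labels = sndata["nodes"]["label"]
    links = sndata["links"]

    # Link attributes are paired by index, so the lists must be equally long.
    n = len(links["source"])
    if any(len(links[key]) != n for key in ("target", "value", "label")):
        raise ValueError("link attribute lists must have equal length")

    resolved = []
    for src, tgt, val, lab in zip(links["source"], links["target"],
                                  links["value"], links["label"]):
        # source and target are positions in the node label list
        resolved.append((labels[src], labels[tgt], val, lab))
    return resolved


# Trimmed sample in the same layout as city_data
sample = dict(
    nodes=dict(label=["Germany", "Berlin", "Munich"],
               color=["beige", "black", "red"]),
    links=dict(source=[0, 0], target=[1, 2],
               value=[3.5, 1.5], label=["capital", "city"]),
)

for source, target, value, label in describe_links(sample):
    print(f"{source} -> {target}: {value} ({label})")
```

Reading the links back this way is a cheap sanity check before plotting: an off-by-one index shows up as a city attached to the wrong country.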
Now let's write a simple function to plot these data nicely. Most of the work has already been done, given we're feeding the data in a format that's easy to parse.
def sankey_diagram(sndata, title):
    # First part of a plotly plot is the `trace`
    data_trace = dict(
        type='sankey',
        node=dict(
            pad=10,
            thickness=30,
            # label could easily be equal to sndata['nodes']['label']. The following is just cosmetics
            label=list(map(lambda x: x.replace("_", " ").capitalize(), sndata['nodes']['label'])),
            color=sndata['nodes']['color']
        ),
        link=sndata["links"],
        # configuration options for the diagram
        domain=dict(
            x=[0, 1],
            y=[0, 1]
        ),
        hoverinfo="none",
        orientation="h"
    )
    # Second part of a plotly plot is the `layout`
    layout = dict(
        title=title,
        font=dict(
            size=12
        )
    )
    fig = dict(data=[data_trace], layout=layout)
    return iplot(fig)
All that is left now is feeding city_data to the sankey_diagram function, and we're done.
sankey_diagram(city_data, "A few European Cities")
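In the rendered diagram, each country node should appear about as tall as the combined value of its city links, since node size follows the flow passing through it. A quick sketch for checking those totals without plotting, assuming the same dictionary layout; the function name outgoing_totals is made up for illustration.

```python
from collections import defaultdict


def outgoing_totals(sndata):
    """Sum link values per source node label (hypothetical helper)."""
    labels = sndata["nodes"]["label"]
    totals = defaultdict(float)
    for src, val in zip(sndata["links"]["source"], sndata["links"]["value"]):
        totals[labels[src]] += val
    return dict(totals)


# Same labels and link values as city_data, without the colors
sample = dict(
    nodes=dict(label=["Germany", "Berlin", "Munich", "Cologne",
                      "France", "Paris", "Lyon", "Bordeaux"]),
    links=dict(source=[0, 0, 0, 4, 4, 4],
               target=[1, 2, 3, 5, 6, 7],
               value=[3.5, 1.5, 1, 2.2, 0.5, 0.2]),
)

totals = outgoing_totals(sample)
for country, population in totals.items():
    # round() hides floating-point noise in the sums
    print(country, round(population, 1))
```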
Doing Sankey diagrams for cities may have been fun. I am not sure the result of doing the same for trackers on your favorite sites will be equally fun. In fact, it may be terrifying. We'll be using public data from whotracks.me to map tracker categories to the companies present on a particular site. Each link will be a tracker the company owns. This gives immediate visual insight into who's watching you and why.
whotracksme
from whotracksme.data.loader import DataSource
from whotracksme.website.plotting.colors import tracker_category_colors, cliqz_colors
DataSource is a class that provides access to trackers, websites and companies. The functionality of DataSource is something we'll be constantly trying to improve and expand. Online tracking is messy enough to analyze, so the tooling should not be.
DATA = DataSource()
These entities are loaded into DataSource, and an API is provided for some common operations on each of them. For more details, have a look at whotracksme.data.loader. As far as we're concerned, we can load them like this:
trackers = DATA.trackers
sites = DATA.sites
companies = DATA.companies
Most people know what reddit is. For those who don't, check it out: there are some great communities there. Now we'll look at the tracking landscape on reddit. To do that, we only need to know the reddit site_id, which is reddit.com. Each site has a site_id, most often its URL.
reddit_id = "reddit.com"
reddit_data = DATA.sites.get_site(reddit_id)
# reddit_data is a dictionary. And a site object has the following keys:
reddit_data.keys()
# apps refers to trackers. Naming is hard, but it'll soon be changed to trackers.
dict_keys(['apps', 'category', 'history', 'name', 'overview', 'subdomains'])
Here we will be mapping the trackers on reddit to the category they belong to (on the left) and to the companies that own them (on the right). This means each link is a tracker, nodes on the left are categories, and nodes on the right are companies.
def sankey_data(site_id, data=DATA):
    nodes = []
    link_source = []
    link_target = []
    link_value = []
    link_label = []
    for (tracker, category, company) in data.sites.trackers_on_site(site_id, data.trackers, data.companies):
        # index of this category in nodes
        if category in nodes:
            cat_idx = nodes.index(category)
        else:
            nodes.append(category)
            cat_idx = len(nodes) - 1
        # index of this company in nodes
        if company in nodes:
            com_idx = nodes.index(company)
        else:
            nodes.append(company)
            com_idx = len(nodes) - 1
        link_source.append(cat_idx)
        link_target.append(com_idx)
        link_label.append(tracker["name"])
        link_value.append(100.0 * tracker["frequency"])
    label_colors = [tracker_category_colors[l] if l in tracker_category_colors else cliqz_colors["purple"] for l in nodes]
    return dict(
        nodes = dict(
            label=nodes,
            color=label_colors
        ),
        links = dict(
            source=link_source,
            target=link_target,
            value=link_value,
            label=link_label,
            color=["#dedede"] * len(link_label)
        )
    )
input_data = sankey_data(reddit_id, data=DATA)
sankey_diagram(input_data, reddit_id)
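The nodes.index(...) lookups in sankey_data scan the node list on every iteration. That is fine for a single site, but a dictionary lookup is a common alternative when there are many nodes. Below is a sketch of the same node-indexing idea. The (tracker name, category, company) tuples are made-up sample values standing in for what trackers_on_site yields, and build_links is a hypothetical name, not part of whotracksme.

```python
def build_links(triples):
    """Assign each distinct label a node index and build link lists."""
    index = {}   # label -> position in nodes
    nodes = []   # labels in first-seen order

    def node_id(label):
        # Register the label on first sight, then reuse its index.
        if label not in index:
            index[label] = len(nodes)
            nodes.append(label)
        return index[label]

    source, target, labels = [], [], []
    for name, category, company in triples:
        source.append(node_id(category))
        target.append(node_id(company))
        labels.append(name)
    return nodes, source, target, labels


# Made-up sample data, for illustration only
sample_triples = [
    ("tracker_a", "advertising", "Company X"),
    ("tracker_b", "site_analytics", "Company X"),
    ("tracker_c", "advertising", "Company Y"),
]

nodes, source, target, labels = build_links(sample_triples)
print(nodes)
print(list(zip(source, target, labels)))
```

As in sankey_data, a category and a company that happen to share a name would end up as the same node, so the two name spaces are assumed not to overlap.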
Don't forget to check out the article on https://whotracks.me/blog/trackers_in_your_favorite_site.html