Risky Domains

As one of our first notebooks we're going to keep it fairly simple and just focus on TLDs. We'll revisit Risky Domains in another notebook where we use all the parts of the domain and cover more advanced modeling and machine learning techniques.

This notebook explores the modeling of risky domains, where we use the term risky to mean both 'not common' and 'associated with bad'.

Domain blacklists are great but they only go so far. In this notebook we explore and analyze domain blacklists. The approach will be to pin down indicators or patterns and then use those to flag domains. We're trying to differentiate the common vs. uncommon, or more specifically the common vs. blacklist. Our intention is to cast a wider net than the blacklist. In general we're trying to achieve the following benefits:

  • We don't have to exactly match the blacklist (which is probably already out of date).
  • We might identify common patterns that capture a family or larger set of malicious domains.

In this notebook we're going to use data from MalwareDomains, Malwarebytes and CyberCrime Tracker. We're going to analyze those domains with a statistical technique called G-Test. We'll use the statistical results to evaluate and score new domains streaming in from BroIDS.

Data Used



Shout Outs:

In [1]:
import os
import bat
from bat.utils import file_utils
print('bat: {:s}'.format(bat.__version__))
import pandas as pd
print('Pandas: {:s}'.format(pd.__version__))
import numpy as np
print('Numpy: {:s}'.format(np.__version__))
from sklearn.externals import joblib
import sklearn.ensemble
from sklearn.feature_extraction.text import CountVectorizer
print('Scikit Learn Version:', sklearn.__version__)
bat: 0.1.5
Pandas: 0.19.2
Numpy: 1.12.1
Scikit Learn Version: 0.18.1
In [2]:
# Grab all the datasets
notebook_path = %pwd
data_path = os.path.join(notebook_path, 'data')
block_file = os.path.join(data_path, 'mal_dom_block.txt')
cyber_file = os.path.join(data_path, 'cybercrime.txt')
emd_file = os.path.join(data_path, 'emd.txt')
alexa_file = os.path.join(data_path, 'alexa_1m.csv')
umbrella_file = os.path.join(data_path, 'umbrella_1m.csv')
with open(block_file) as bfp:
    block_domains = [row.strip() for row in bfp.readlines()]
with open(cyber_file) as bfp:
    cyber_domains = [row.strip() for row in bfp.readlines()]
with open(emd_file) as bfp:
    emd_domains = [row.split('\t')[1].strip() for row in bfp.readlines() if '#' not in row]
with open(alexa_file) as afp:
    alexa_domains = [row.split(',')[1].strip() for row in afp.readlines()]
with open(umbrella_file) as afp:
    umbrella_domains = [row.split(',')[1].strip() for row in afp.readlines()]

Always look at the data

When you pull in data, always make sure to visually inspect it before going any further. In my experience about 75% of the time you aren't getting what you think on the first try.

In [3]:
# Look at the Cisco Umbrella domains
In [4]:
# Look at the Alexa domains
In [5]:
# Look at all the known bad domains
print('Malware Domain Blocklist: {:d}'.format(len(block_domains)))
print('\nCyberCrime Tracker: {:d}'.format(len(cyber_domains)))
print('\nMalwarebytes(hpHosts EMD): {:d}'.format(len(emd_domains)))
Malware Domain Blocklist: 18383
['amazon.co.uk.security-check.ga', 'autosegurancabrasil.com', 'christianmensfellowshipsoftball.org', 'dadossolicitado-antendimento.sad879.mobi', 'hitnrun.com.my']

CyberCrime Tracker: 10111
['fpbqrouphaiti.com/sales!11-04/admin.php', 'jensonsintrenational.com/class/fat/cp.php?m=login', 'cboy.sytes.net/mypage/admin.php', '', 'frankweb.club/temple/admin.php']

Malwarebytes(hpHosts EMD): 156698
['-sso.anbtr.com', '0.gvt0.com', '0.yourspchasbeenblockedcalltollfree18704554415.yourspchasbeenblockedcalltollfree18704554415.yourspchasbeenblockedcalltollfree18704554415.yourspchasbeenblockedcalltollfree18704554415.yourspchasbeenblockedcalltollfre18704554415.error2212.in', '000-101.org', '00005ik.rcomhost.com', '0000663c.tslocosumo.us', '0000a-fast-proxy.de', '0000pv6.rxportalhosting.com', '000my001.eu', '000my002.eu']

We see we need to do some cleanup/normalization

Data cleanup or data normalization is always part of data analysis, and we often spend quite a bit of time on it. Thankfully, in this case the tldextract Python module does the hard work for us. For the purposes of this notebook we'll use the following terminology for the parts of a fully qualified domain name: subdomain.domain.tld. For example:

  • www.google.com: www=subdomain, google=domain, com=tld
In [6]:
import tldextract
def clean_domains(domain_list):
    for domain in domain_list:
        ext = tldextract.extract(domain)
        if ext.suffix:  # No suffix means an IP address or a local name ('local', 'home', 'lan', etc.), so skip it
            yield ext.subdomain, ext.domain, ext.suffix
In [7]:
# Clean up and combine bad domains
bad_domains = [domain for domain in clean_domains(block_domains)]
bad_domains += [domain for domain in clean_domains(cyber_domains)]
bad_domains += [domain for domain in clean_domains(emd_domains)]
[('amazon.co.uk', 'security-check', 'ga'),
 ('', 'autosegurancabrasil', 'com'),
 ('', 'christianmensfellowshipsoftball', 'org'),
 ('dadossolicitado-antendimento', 'sad879', 'mobi'),
 ('', 'hitnrun', 'com.my')]
In [8]:
# Remove exact duplicate (subdomain, domain, tld) records
# Note: Hosts that differ only in their subdomain are kept as separate records
print('Before duplication removal: {:d}'.format(len(bad_domains)))
bad_domains = list(set(bad_domains))
print('After duplication removal: {:d}'.format(len(bad_domains)))
Before duplication removal: 183405
After duplication removal: 179029

Alexa or Cisco Umbrella?

The Alexa dataset does not contain subdomain information so although all these sites are extremely popular:

  • www.google.com, accounts.google.com, apis.google.com, play.google.com, mtalk.google.com, mail.google.com

All of these will simply get rolled up into 'google.com' in the Alexa set. Since we're interested in the subdomains (in a later notebook) we're going to use the Umbrella dataset.
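To make the roll-up concrete, here is a small sketch of what happens when subdomain information is dropped. The `roll_up` helper and the sample records are made up for illustration; they aren't part of bat or of either dataset.

```python
def roll_up(records):
    """Collapse (subdomain, domain, tld) tuples into unique 'domain.tld' strings."""
    return sorted({'{}.{}'.format(domain, tld) for _, domain, tld in records})

# Illustrative records in the same (subdomain, domain, tld) layout used in this notebook
records = [
    ('www', 'google', 'com'),
    ('accounts', 'google', 'com'),
    ('mail', 'google', 'com'),
    ('', 'facebook', 'com'),
]

rolled = roll_up(records)
print(rolled)
```

Three distinct google.com hosts collapse into a single entry, which is the information we'd lose by using a list without subdomains.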

NOTE: The benefit Alexa has is that it covers more domains, so one could argue that our statistics below are 'wrong' because Umbrella doesn't cover ALL one million domains. We recognize and understand this. The stats below are for common vs. known bad and Umbrella is certainly covering the common domains (even if it's not one million total).

Process the common domains (common not 'good')

Notice here that we're using the term common instead of good. It's well known that Alexa/Umbrella lists contain some malicious/hacked domains.

In [9]:
# Clean up and append common domains
common_domains = [domain for domain in clean_domains(umbrella_domains)]
[('', 'google', 'com'),
 ('www', 'google', 'com'),
 ('', 'facebook', 'com'),
 ('', 'microsoft', 'com'),
 ('', 'doubleclick', 'net')]
In [10]:
# Remove any duplicates
print('Before duplication removal: {:d}'.format(len(common_domains)))
common_domains = list(set(common_domains))
print('After duplication removal: {:d}'.format(len(common_domains)))
Before duplication removal: 994946
After duplication removal: 994946
In [11]:
# Speaking of common instead of 'good', let's look at some
# of the domains that intersect the blacklists
bad_common = set(common_domains).intersection(bad_domains)
print('Number of overlaps: {:d}'.format(len(bad_common)))
Number of overlaps: 1468
[('', 'ddth', 'com'),
 ('', 'cjb', 'net'),
 ('', 'ludashi', 'com'),
 ('xpi', 'searchtabnew', 'com'),
 ('dnspod-free', 'mydnspod', 'net'),
 ('', 'rol', 'ru'),
 ('dl', 'pconline', 'com.cn'),
 ('', 'webshieldonline', 'com'),
 ('rep', 'ytdownloader', 'com'),
 ('start', 'funmoods', 'com')]

Removing blacklisted domains from common list

So now we're going to remove any blacklisted domains from the common list (Cisco Umbrella list)

In [12]:
print('Original common: {:d}'.format(len(common_domains)))
common_domains = list(set(common_domains).difference(bad_domains))
print('Blacklisted domains removed: {:d}'.format(len(common_domains)))
Original common: 994946
Blacklisted domains removed: 993478

Create Pandas DataFrames

  • DataFrames are used in both R and Python. Pandas has an excellent implementation that really helps when doing any kind of processing, statistics or machine learning work.
In [13]:
# Create dataframes
df_bad = pd.DataFrame.from_records(bad_domains, columns=['subdomain', 'domain', 'tld'])
df_bad['label'] = 'bad'
df_common = pd.DataFrame.from_records(common_domains, columns=['subdomain', 'domain', 'tld'])
df_common['label'] = 'common'
In [14]:
print('Bad Domains: {:d}'.format(len(df_bad)))
Bad Domains: 179029
   subdomain   domain                                             tld     label
0  www         qdlhprdtwhvgxuzklovisrdbkhptpfarrbcmtrxbzlvhyg...  com     bad
1              advancecomputers                                   online  bad
2              perseepona                                         com     bad
3  www         downloadfriend                                     info    bad
4  2o9jkm6yfj  centade                                            com     bad
In [15]:
print('Common Domains: {:d}'.format(len(df_common)))
Common Domains: 993478
   subdomain  domain                tld    label
0             msgapp                com    common
1  online     citibank              co.in  common
2             emhapfokdlyxtfgucmjm  cx     common
3  asset      affectv               com    common
4             gamerswithjobs        com    common
In [16]:
# Now that the records have labels on them (bad/common) we can combine them into one DataFrame
df_all = pd.concat([df_common, df_bad], ignore_index=True)
   subdomain  domain                tld    label
0             msgapp                com    common
1  online     citibank              co.in  common
2             emhapfokdlyxtfgucmjm  cx     common
3  asset      affectv               com    common
4             gamerswithjobs        com    common

Just the TLDs (for now)

We're going to do some statistics on the TLDs to see how they're distributed between the bad and common domains. The bat python package provides a nice set of functionality for statistics on Pandas DataFrames (https://github.com/Kitware/bat).
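Before running the statistics, it may help to see what a contingency table of TLD vs. label actually holds. The following is a minimal standard-library sketch on toy data; the `count_by_label` helper and the records are hypothetical and are not how bat builds its table.

```python
from collections import Counter

def count_by_label(records):
    """Count (tld, label) pairs; returns {tld: {'bad': n, 'common': n, 'All': n}}."""
    table = {}
    for (tld, label), n in Counter(records).items():
        row = table.setdefault(tld, {'bad': 0, 'common': 0, 'All': 0})
        row[label] += n   # per-class count
        row['All'] += n   # row total (margin)
    return table

# Toy (tld, label) records, not taken from the real datasets
toy = [('com', 'common'), ('com', 'common'), ('com', 'bad'),
       ('tk', 'bad'), ('tk', 'bad'), ('org', 'common')]

table = count_by_label(toy)
for tld in sorted(table):
    print(tld, table[tld])
```

Each row answers one question: how many times did this TLD show up in the bad set, in the common set, and overall.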

In [17]:
# Run a bunch of statistics from the bat python package
from bat.utils import df_stats
In [18]:
# Print out the contingency_table
print('\nContingency Table')
cont_table = df_stats.contingency_table(df_all, 'tld', 'label')
Contingency Table
label bad common All
ab.ca 0.0 24.0 24.0
abbott 0.0 2.0 2.0
abruzzo.it 0.0 1.0 1.0
ac 3.0 440.0 443.0
ac.ae 0.0 12.0 12.0
In [19]:
# Print out the expected_counts
print('\nExpected Counts Table')
expect_counts = df_stats.expected_counts(df_all, 'tld', 'label')
Expected Counts Table
label bad common All
ab.ca 3.664538 20.335462 24.0
abbott 0.305378 1.694622 2.0
abruzzo.it 0.152689 0.847311 1.0
ac 67.641257 375.358743 443.0
ac.ae 1.832269 10.167731 12.0
In [20]:
# Print out the g_test scores
print('\nG-Test Scores')
g_scores = df_stats.g_test_scores(df_all, 'tld', 'label')
G-Test Scores
label bad common
ab.ca -7 7
abbott 0 0
abruzzo.it 0 0
ac -121 121
ac.ae -3 3

Sort the GTest Scores

For a formal interpretation of these scores please see https://en.wikipedia.org/wiki/G-test. Informally, the higher the score, the more that item's observed counts deviate from the counts we would expect under the null hypothesis that each TLD is equally likely to occur in both classes.
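As a rough check on how to read these scores, here is a sketch of the standard G statistic, G = 2 * sum(O * ln(O / E)), applied to a single TLD row and signed by which class is over-represented. The function names and the signing convention are assumptions made for this example. The result lines up with the 'ac' row in the G-Test output above, but that doesn't mean bat computes it exactly this way.

```python
import math

def g_score(observed, expected):
    """G statistic for one row: 2 * sum(O * ln(O / E)), skipping zero counts."""
    return 2.0 * sum(o * math.log(o / e)
                     for o, e in zip(observed, expected) if o > 0)

def signed_scores(obs_bad, obs_common, exp_bad, exp_common):
    """Return (bad, common) scores; positive for the over-represented class."""
    g = g_score([obs_bad, obs_common], [exp_bad, exp_common])
    sign = 1 if obs_bad > exp_bad else -1
    return sign * g, -sign * g

# 'ac' row from the tables above: observed (3, 440), expected (67.641257, 375.358743)
bad_score, common_score = signed_scores(3, 440, 67.641257, 375.358743)
print('bad: {:.1f}  common: {:.1f}'.format(bad_score, common_score))
```

The 'ac' TLD shows up far less often in the bad set than expected, so its 'bad' score is negative and its 'common' score is positive.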


The tk TLD occurred about 6,200 times across both datasets. Because we have roughly 5.5x more common domains than bad domains, if all else were equal we should see it about 940 times in the bad set and about 5,220 times in the common set. The actual observation is that we see it 6071 times in the bad set and only 94 times in the common set. So a tk domain passing through your IDS would definitely be a good thing to put on your 'short list'.
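The arithmetic behind those expected counts can be sketched as follows, using the dataset sizes and tk counts reported in this notebook. The `expected_count` helper is a hypothetical name; it applies the usual row total * column total / grand total rule for a contingency table.

```python
def expected_count(row_total, col_total, grand_total):
    """Expected cell count if the TLD were independent of the label."""
    return row_total * col_total / grand_total

n_bad, n_common = 179029, 993478   # sizes of the bad and common sets
tk_total = 6071 + 94               # tk observations across both sets
grand_total = n_bad + n_common

exp_bad = expected_count(tk_total, n_bad, grand_total)
exp_common = expected_count(tk_total, n_common, grand_total)
print('Expected tk in bad:    {:.1f}'.format(exp_bad))
print('Expected tk in common: {:.1f}'.format(exp_common))
```

Comparing these values with the observed 6071 and 94 shows how far tk departs from what the class sizes alone would predict.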

See Expected Counts and Actual Counts Below

In [21]:
# Sort the GTest scores    
g_scores.sort_values('bad', ascending=False).head(15)
label bad common
info 35476 -35476
tk 21877 -21877
xyz 20444 -20444
online 6701 -6701
club 3252 -3252
ru 3124 -3124
website 1923 -1923
in 1069 -1069
ws 752 -752
top 693 -693
site 614 -614
work 576 -576
biz 556 -556
name 516 -516
tech 478 -478
In [22]:
# Let's look at some of the TLDs Expected Counts vs. Actual Counts
interesting_tlds = ['info', 'tk', 'xyz', 'online', 'club']
print('Expected Counts:') 
Expected Counts:
label bad common All
club 310.569562 1723.430438 2034.0
info 2732.523545 15163.476455 17896.0
online 392.258213 2176.741787 2569.0
tk 941.328099 5223.671901 6165.0
xyz 1216.015730 6747.984270 7964.0
In [23]:
print('Actual Counts:')
Actual Counts:
label bad common All
club 1459.0 575.0 2034.0
info 14053.0 3843.0 17896.0
online 2259.0 310.0 2569.0
tk 6071.0 94.0 6165.0
xyz 6958.0 1006.0 7964.0

Phase1 Complete

We can see from the sorted GTest Score table that TLDs like info, tk, xyz, club, ... occur much more often in the blacklists than they do in the common lists (Umbrella/Alexa). So even with this small insight we could set up a Bro Script (or a bat Python script) to mark domains with those TLDs as risky.
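A minimal sketch of that kind of TLD check is shown below. The `RISKY_TLDS` set (the top five rows of the sorted table), the `is_risky` helper, and the sample hostnames are all assumptions for illustration. Looking only at the final label is enough for these single-label TLDs, but multi-part suffixes such as com.my would need something like tldextract.

```python
# Hypothetical short list taken from the top of the sorted G-Test table
RISKY_TLDS = {'info', 'tk', 'xyz', 'online', 'club'}

def is_risky(hostname, risky_tlds=RISKY_TLDS):
    """Flag a hostname whose final label is in the risky TLD set."""
    host = hostname.strip().rstrip('.').lower()   # normalize case and trailing dot
    return host.rsplit('.', 1)[-1] in risky_tlds

# Made-up query names standing in for a stream of DNS lookups
queries = ['www.google.com', 'free-prizes.tk', 'login.example.XYZ.']
flagged = [q for q in queries if is_risky(q)]
print(flagged)
```

A flag here only means 'worth a closer look', since plenty of legitimate sites use these TLDs.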

In Phase 2 of this notebook we'll dive into the domains and subdomains, using NGram extraction and our G-Test statistics on all the extracted features to do feature selection for a sparse-data machine learning model. We'll leverage the fantastic set of models available in the Python scikit-learn module and we'll show how to use bat to deploy that model so that new domains coming from Bro can be evaluated and scored in realtime.

Deployment with bat

Now that we know which TLDs are 'risky' we can take action with bat. See the example risky_domains.py that uses these results to flag the realtime DNS logs coming from Bro and makes a Virus Total query on any flagged domains. If the Virus Total query returns positives then we report the observation.

Although this sounds simplistic, it's actually quite effective. The number of VT queries we make is extremely small compared to the total volume of DNS queries, and given the statistical results the probability of a 'hit' is reasonably high. And of course we're casting a wider net than the original blacklist.

Try it Out

If you liked this notebook please visit the bat project for more notebooks and examples. You can run all the examples with a simple $pip install bat (and a running Bro instance, of course).