Risky Domains

As one of our first notebooks we're going to keep it fairly simple and just focus on TLDs. We'll revisit Risky Domains in another notebook where we use all the parts of the domain and cover more advanced modeling and machine learning techniques.

This notebook explores the modeling of risky domains where the usage of the term risky has the dual characteristics of being both 'not common' and 'associated with bad'.

Domain blacklists are great but they only go so far. In this notebook we explore and analyze domain blacklists. The approach will be to pin down indicators or patterns and then use those to flag domains. We're trying to differentiate the common vs. uncommon or, more specifically, the common vs. blacklisted. Our intention is to cast a wider net than the blacklist. In general we are trying to achieve the following benefits:

We don't have to exactly match the blacklist (which is probably already out of date).
We might identify common patterns that capture a family or larger set of malicious domains.

In this notebook we're going to use data from MalwareDomains, Malwarebytes and CyberCrime Tracker. We're going to analyze those domains with a statistical technique called G-Test. We'll use the statistical results to evaluate and score new domains streaming in from Zeek IDS.

Data Used

Malware Domain Blocklist: http://www.malwaredomains.com
Malwarebytes(hpHosts EMD): https://hosts-file.net/emd.txt
CyberCrime Tracker: http://cybercrime-tracker.net/
Cisco Umbrella: http://s3-us-west-1.amazonaws.com/umbrella-static/top-1m.csv.zip
Alexa: http://s3.amazonaws.com/alexa-static/top-1m.csv.zip

Software

zat: https://github.com/SuperCowPowers/zat
Pandas: https://github.com/pandas-dev/pandas
TLDExtract: https://github.com/john-kurkowski/tldextract

Techniques

G-Test: https://en.wikipedia.org/wiki/G-test

Shout Outs:

Netresec (Alexa vs. Umbrella Blog): http://netres.ec/?b=1743FAE
Netresec (Threat Hunting Rinse-Repeat): http://netres.ec/?b=1582D1D

In [1]:

import os
import zat
from zat.utils import file_utils
print('zat: {:s}'.format(zat.__version__))
import pandas as pd
print('Pandas: {:s}'.format(pd.__version__))
import numpy as np
print('Numpy: {:s}'.format(np.__version__))
from sklearn.externals import joblib
import sklearn.ensemble
from sklearn.feature_extraction.text import CountVectorizer
print('Scikit Learn Version:', sklearn.__version__)

zat: 0.1.5
Pandas: 0.19.2
Numpy: 1.12.1
Scikit Learn Version: 0.18.1

In [2]:

# Grab all the datasets
notebook_path = os.getcwd()  # Current working directory of the notebook
data_path = os.path.join(notebook_path, 'data')
block_file = os.path.join(data_path, 'mal_dom_block.txt')
cyber_file = os.path.join(data_path, 'cybercrime.txt')
emd_file = os.path.join(data_path, 'emd.txt')
alexa_file = os.path.join(data_path, 'alexa_1m.csv')
umbrella_file = os.path.join(data_path, 'umbrella_1m.csv')
with open(block_file) as bfp:
    block_domains = [row.strip() for row in bfp.readlines()]
with open(cyber_file) as bfp:
    cyber_domains = [row.strip() for row in bfp.readlines()]
with open(emd_file) as bfp:
    emd_domains = [row.split('\t')[1].strip() for row in bfp.readlines() if '#' not in row]
with open(alexa_file) as afp:
    alexa_domains = [row.split(',')[1].strip() for row in afp.readlines()]
with open(umbrella_file) as afp:
    umbrella_domains = [row.split(',')[1].strip() for row in afp.readlines()]

## Always look at the data

When you pull in data, always make sure to visually inspect it before going any further. In my experience about 75% of the time you aren't getting what you think on the first try.
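A visual check can be backed up with a few quick counts. The sketch below is an illustration rather than part of the original notebook: `summarize_entries` is a hypothetical helper, and the sample list is a handful of illustrative entries in the style of the lists used here.

```python
import ipaddress

def summarize_entries(entries):
    """Count entry shapes that usually signal a need for cleanup."""
    summary = {'total': 0, 'empty': 0, 'has_path': 0, 'ip_address': 0}
    for entry in entries:
        summary['total'] += 1
        if not entry.strip():
            summary['empty'] += 1
            continue
        # Everything before the first '/' is treated as the host part
        host = entry.strip().split('/')[0]
        if '/' in entry:
            summary['has_path'] += 1
        try:
            ipaddress.ip_address(host)
            summary['ip_address'] += 1
        except ValueError:
            pass  # Not an IP address, most likely a hostname
    return summary

# Illustrative sample (not the full datasets)
sample = ['google.com', '46.183.223.114/igere/3/admin.php',
          'frankweb.club/temple/admin.php', '']
print(summarize_entries(sample))
```

Counts like these make it easier to spot when a list contains URLs or IP addresses rather than plain domains.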

In [3]:

# Look at the Cisco Umbrella domains
print(len(umbrella_domains))
umbrella_domains[:10]

Out[3]:

['google.com',
 'www.google.com',
 'facebook.com',
 'microsoft.com',
 'doubleclick.net',
 'g.doubleclick.net',
 'clients4.google.com',
 'googleads.g.doubleclick.net',
 'google-analytics.com',
 'apple.com']

In [4]:

# Look at the Alexa domains
print(len(alexa_domains))
alexa_domains[:10]

Out[4]:

['google.com',
 'youtube.com',
 'facebook.com',
 'baidu.com',
 'wikipedia.org',
 'yahoo.com',
 'reddit.com',
 'google.co.in',
 'qq.com',
 'twitter.com']

In [5]:

# Look at all the known bad domains
print('Malware Domain Blocklist: {:d}'.format(len(block_domains)))
print(block_domains[:5])
print('\nCyberCrime Tracker: {:d}'.format(len(cyber_domains)))
print(cyber_domains[:5])
print('\nMalwarebytes(hpHosts EMD): {:d}'.format(len(emd_domains)))
print(emd_domains[:10])

Malware Domain Blocklist: 18383
['amazon.co.uk.security-check.ga', 'autosegurancabrasil.com', 'christianmensfellowshipsoftball.org', 'dadossolicitado-antendimento.sad879.mobi', 'hitnrun.com.my']

CyberCrime Tracker: 10111
['fpbqrouphaiti.com/sales!11-04/admin.php', 'jensonsintrenational.com/class/fat/cp.php?m=login', 'cboy.sytes.net/mypage/admin.php', '46.183.223.114/igere/3/admin.php', 'frankweb.club/temple/admin.php']

Malwarebytes(hpHosts EMD): 156698
['-sso.anbtr.com', '0.gvt0.com', '0.yourspchasbeenblockedcalltollfree18704554415.yourspchasbeenblockedcalltollfree18704554415.yourspchasbeenblockedcalltollfree18704554415.yourspchasbeenblockedcalltollfree18704554415.yourspchasbeenblockedcalltollfre18704554415.error2212.in', '000-101.org', '00005ik.rcomhost.com', '0000663c.tslocosumo.us', '0000a-fast-proxy.de', '0000pv6.rxportalhosting.com', '000my001.eu', '000my002.eu']

## We see we need to do some cleanup/normalization

Data cleanup or data normalization is always part of doing data analysis and we often spend quite a bit of time on it. Thankfully in this case the **tldextract** Python module does the hard work for us. For the purposes of this notebook we'll use the following terminology for the parts of the fully qualified domain name (**subdomain.domain.tld**). For example:

- www.google.com: **www**=subdomain, **google**=domain, **com**=tld

In [6]:

import tldextract
def clean_domains(domain_list):
    for domain in domain_list:
        ext = tldextract.extract(domain)
        if ext.suffix:  # No suffix means an IP address or a local name ('local', 'home', 'lan', etc.)
            yield ext.subdomain, ext.domain, ext.suffix

In [7]:

# Clean up and combine bad domains
bad_domains = [domain for domain in clean_domains(block_domains)]
bad_domains += [domain for domain in clean_domains(cyber_domains)]
bad_domains += [domain for domain in clean_domains(emd_domains)]
bad_domains[:5]

Out[7]:

[('amazon.co.uk', 'security-check', 'ga'),
 ('', 'autosegurancabrasil', 'com'),
 ('', 'christianmensfellowshipsoftball', 'org'),
 ('dadossolicitado-antendimento', 'sad879', 'mobi'),
 ('', 'hitnrun', 'com.my')]

In [8]:

# Remove exact (subdomain, domain, tld) duplicates
# Note: Entries that differ only by subdomain are kept
print('Before duplication removal: {:d}'.format(len(bad_domains)))
bad_domains = list(set(bad_domains))
print('After duplication removal: {:d}'.format(len(bad_domains)))

Before duplication removal: 183405
After duplication removal: 179029

## Alexa or Cisco Umbrella?

The Alexa dataset does not contain subdomain information, so although all of these sites are extremely popular:

- **www.google.com, accounts.google.com, apis.google.com, play.google.com, mtalk.google.com, mail.google.com**

they simply get rolled up into 'google.com' in the Alexa set. Since we're interested in the subdomains (in a later notebook) we're going to use the Umbrella dataset.
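The roll-up can be sketched with the (subdomain, domain, tld) tuples used in this notebook. This is a minimal illustration on a hand-written sample; `roll_up` is a hypothetical helper, not something from zat or tldextract.

```python
def roll_up(domain_tuples):
    """Collapse (subdomain, domain, tld) tuples to unique 'domain.tld' strings."""
    return sorted({'{}.{}'.format(domain, tld) for _, domain, tld in domain_tuples})

# Hand-written sample in the same tuple layout as clean_domains() yields
sample = [('www', 'google', 'com'), ('accounts', 'google', 'com'),
          ('mail', 'google', 'com'), ('', 'facebook', 'com')]
print(len(set(sample)))   # Unique entries when subdomains are kept
print(roll_up(sample))    # Unique entries once subdomains are dropped
```

Keeping the subdomain preserves the distinction between the google.com entries, which is the information a later subdomain analysis would need.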

NOTE: The benefit Alexa has is that it covers more domains, so one could argue that our statistics below are 'wrong' because Umbrella doesn't cover ALL one million domains. We recognize and understand this. The stats below are for common vs. known bad and Umbrella is certainly covering the common domains (even if it's not one million total).

Process the common domains (common not 'good')

Notice here that we're using the term common instead of good. It's well known that Alexa/Umbrella lists contain some malicious/hacked domains.

See Netresec blog http://netres.ec/?b=1743FAE for more info

In [9]:

# Clean up and append common domains
common_domains = [domain for domain in clean_domains(umbrella_domains)]
common_domains[:5]

Out[9]:

[('', 'google', 'com'),
 ('www', 'google', 'com'),
 ('', 'facebook', 'com'),
 ('', 'microsoft', 'com'),
 ('', 'doubleclick', 'net')]

In [10]:

# Remove any duplicates
print('Before duplication removal: {:d}'.format(len(common_domains)))
common_domains = list(set(common_domains))
print('After duplication removal: {:d}'.format(len(common_domains)))

Before duplication removal: 994946
After duplication removal: 994946

In [11]:

# Speaking of common instead of 'good' let's look at some
# of the domains that intersect the blacklists
bad_common = set(common_domains).intersection(bad_domains)
print('Number of overlaps: {:d}'.format(len(bad_common)))
list(bad_common)[:10]

Number of overlaps: 1468

Out[11]:

[('', 'ddth', 'com'),
 ('', 'cjb', 'net'),
 ('', 'ludashi', 'com'),
 ('xpi', 'searchtabnew', 'com'),
 ('dnspod-free', 'mydnspod', 'net'),
 ('', 'rol', 'ru'),
 ('dl', 'pconline', 'com.cn'),
 ('', 'webshieldonline', 'com'),
 ('rep', 'ytdownloader', 'com'),
 ('start', 'funmoods', 'com')]

Removing blacklisted domains from common list

So now we're going to remove any blacklisted domains from the common list (Cisco Umbrella list).

In [12]:

print('Original common: {:d}'.format(len(common_domains)))
common_domains = list(set(common_domains).difference(bad_domains))
print('Blacklisted domains removed: {:d}'.format(len(common_domains)))

Original common: 994946
Blacklisted domains removed: 993478

Create Pandas DataFrames

DataFrames are used in both R and Python. Pandas has an excellent implementation that really helps when doing any kind of processing, statistics or machine learning work.

In [13]:

# Create dataframes
df_bad = pd.DataFrame.from_records(bad_domains, columns=['subdomain', 'domain', 'tld'])
df_bad['label'] = 'bad'
df_common = pd.DataFrame.from_records(common_domains, columns=['subdomain', 'domain', 'tld'])
df_common['label'] = 'common'

In [14]:

print('Bad Domains: {:d}'.format(len(df_bad)))
df_bad.head()

Bad Domains: 179029

Out[14]:

	subdomain	domain	tld	label
0	www	qdlhprdtwhvgxuzklovisrdbkhptpfarrbcmtrxbzlvhyg...	com	bad
1		advancecomputers	online	bad
2		perseepona	com	bad
3	www	downloadfriend	info	bad
4	2o9jkm6yfj	centade	com	bad

In [15]:

print('Common Domains: {:d}'.format(len(df_common)))
df_common.head()

Common Domains: 993478

Out[15]:

	subdomain	domain	tld	label
0		msgapp	com	common
1	online	citibank	co.in	common
2		emhapfokdlyxtfgucmjm	cx	common
3	asset	affectv	com	common
4		gamerswithjobs	com	common

In [16]:

# Now that the records have labels on them (bad/common) we can combine them into one DataFrame
df_all = pd.concat([df_common, df_bad], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
df_all.head()

Out[16]:

	subdomain	domain	tld	label
0		msgapp	com	common
1	online	citibank	co.in	common
2		emhapfokdlyxtfgucmjm	cx	common
3	asset	affectv	com	common
4		gamerswithjobs	com	common

Just the TLDs (for now)

We're going to do some statistics on the TLDs to see how they're distributed between the bad and common domains. The zat python package provides a nice set of functionality for statistics on Pandas DataFrames (https://github.com/SuperCowPowers/zat).
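As a rough picture of what those statistics involve, the sketch below builds a contingency table and the matching expected counts using only the standard library. It is not the zat implementation; the function names and the toy rows are assumptions made for illustration.

```python
from collections import Counter

def contingency(rows):
    """Count (tld, label) pairs; rows is an iterable of (tld, label) tuples."""
    return Counter(rows)

def expected_counts(table):
    """Expected count per cell = row total * column total / grand total."""
    row_totals, col_totals = Counter(), Counter()
    for (tld, label), count in table.items():
        row_totals[tld] += count
        col_totals[label] += count
    grand_total = sum(row_totals.values())
    return {(tld, label): row_totals[tld] * col_totals[label] / grand_total
            for tld in row_totals for label in col_totals}

# Toy data: 3 'bad' rows and 5 'common' rows
rows = [('tk', 'bad'), ('tk', 'bad'), ('com', 'bad'),
        ('com', 'common'), ('com', 'common'), ('com', 'common'),
        ('com', 'common'), ('tk', 'common')]
table = contingency(rows)
expected = expected_counts(table)
print(table[('tk', 'bad')], expected[('tk', 'bad')])
```

The expected count is what each cell would hold if a TLD's share of bad vs. common matched the overall split between the two classes.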

In [17]:

# Run a bunch of statistics from the zat python package
import zat.dataframe_stats as df_stats

In [18]:

# Print out the contingency_table
print('\nContingency Table')
cont_table = df_stats.contingency_table(df_all, 'tld', 'label')
cont_table.head()

Contingency Table

Out[18]:

label	bad	common	All
tld
ab.ca	0.0	24.0	24.0
abbott	0.0	2.0	2.0
abruzzo.it	0.0	1.0	1.0
ac	3.0	440.0	443.0
ac.ae	0.0	12.0	12.0

In [19]:

# Print out the expected_counts
print('\nExpected Counts Table')
expect_counts = df_stats.expected_counts(df_all, 'tld', 'label')
expect_counts.head()

Expected Counts Table

Out[19]:

label	bad	common	All
tld
ab.ca	3.664538	20.335462	24.0
abbott	0.305378	1.694622	2.0
abruzzo.it	0.152689	0.847311	1.0
ac	67.641257	375.358743	443.0
ac.ae	1.832269	10.167731	12.0

In [20]:

# Print out the g_test scores
print('\nG-Test Scores')
g_scores = df_stats.g_test_scores(df_all, 'tld', 'label')
g_scores.head()

G-Test Scores

Out[20]:

label	bad	common
tld
ab.ca	-7	7
abbott	0	0
abruzzo.it	0	0
ac	-121	121
ac.ae	-3	3

Sort the GTest Scores

For a formal interpretation of these scores please see https://en.wikipedia.org/wiki/G-test. Informally, the higher the score, the more the observed count for that item departs from the expected count under the null hypothesis that each TLD is equally likely to occur in both classes.

Example:

The tk TLD occurred about 6200 times across both datasets. Because we have roughly 5x more common domains than bad domains, if all else were equal we should see it about 950 times in the bad set and about 5250 times in the common set. The actual observation is 6071 times in the bad set and only 94 times in the common set. So a tk domain passing through your IDS is definitely something to put on your 'short list'.
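The arithmetic in this example can be checked by hand. The sketch below assumes the score is the standard G statistic, 2 * sum(O * ln(O / E)), summed over the bad and common cells for one TLD. That formula is an assumption drawn from the G-test definition, not a description of the zat internals, and the sketch reproduces only the magnitude of the score. The counts are the ones shown in this notebook.

```python
import math

def g_term(observed, expected):
    """One term of the G statistic: 2 * O * ln(O / E), with 0 for an empty cell."""
    if observed == 0:
        return 0.0
    return 2.0 * observed * math.log(observed / expected)

# Counts taken from this notebook
total_bad, total_common = 179029, 993478
tk_bad, tk_common = 6071, 94

# Expected counts if tk were split like the overall bad/common ratio
tk_total = tk_bad + tk_common
bad_fraction = total_bad / (total_bad + total_common)
expected_bad = tk_total * bad_fraction
expected_common = tk_total - expected_bad

g_tk = g_term(tk_bad, expected_bad) + g_term(tk_common, expected_common)
print('Expected bad: {:.1f}, expected common: {:.1f}'.format(expected_bad, expected_common))
print('G score for tk: {:.0f}'.format(g_tk))
```

Under these assumptions the result should land close to the tk score in the sorted G-Test table.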

See Expected Counts and Actual Counts Below

In [21]:

# Sort the GTest scores    
g_scores.sort_values('bad', ascending=False).head(15)

Out[21]:

label	bad	common
tld
info	35476	-35476
tk	21877	-21877
xyz	20444	-20444
online	6701	-6701
club	3252	-3252
ru	3124	-3124
website	1923	-1923
in	1069	-1069
ws	752	-752
top	693	-693
site	614	-614
work	576	-576
biz	556	-556
name	516	-516
tech	478	-478

In [22]:

# Let's look at some of the TLDs Expected Counts vs. Actual Counts
interesting_tlds = ['info', 'tk', 'xyz', 'online', 'club']
print('Expected Counts:') 
expect_counts[expect_counts.index.isin(interesting_tlds)]

Expected Counts:

Out[22]:

label	bad	common	All
tld
club	310.569562	1723.430438	2034.0
info	2732.523545	15163.476455	17896.0
online	392.258213	2176.741787	2569.0
tk	941.328099	5223.671901	6165.0
xyz	1216.015730	6747.984270	7964.0

In [23]:

print('Actual Counts:')
cont_table[cont_table.index.isin(interesting_tlds)]

Actual Counts:

Out[23]:

label	bad	common	All
tld
club	1459.0	575.0	2034.0
info	14053.0	3843.0	17896.0
online	2259.0	310.0	2569.0
tk	6071.0	94.0	6165.0
xyz	6958.0	1006.0	7964.0

Phase 1 Complete

We can see from the sorted G-Test score table that TLDs like info, tk, xyz, club, ... occur much more often in the blacklists than they do in the common lists (Umbrella/Alexa). So even with this small insight we could set up a Zeek script (or a zat Python script) to mark domains with those TLDs as risky.
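A minimal version of that marking step might look like the following. It is a sketch, not the zat example script: `RISKY_TLDS` and `is_risky` are hypothetical names, the TLD set is just the top five from the sorted table, and the TLD is taken naively as the last dot-separated label (a real script would more likely use tldextract, as earlier in this notebook).

```python
# Top five TLDs by G-Test score in this notebook (an illustrative cut-off)
RISKY_TLDS = {'info', 'tk', 'xyz', 'online', 'club'}

def is_risky(query):
    """Return True if the last label of a DNS query name is in RISKY_TLDS."""
    name = query.strip().rstrip('.').lower()
    if '.' not in name:
        return False  # Bare hostnames such as 'localhost' have no TLD
    return name.rsplit('.', 1)[1] in RISKY_TLDS

for query in ['www.google.com', 'downloadfriend.info', 'EXAMPLE.TK.']:
    print(query, is_risky(query))
```

The naive split works here only because every TLD in the set is a single label; multi-part suffixes such as com.my would need a public suffix aware parser.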

In Phase 2 of this notebook we'll dive into the domains and subdomains, using NGram extraction and our G-Test statistics on all the extracted features to do feature selection for a sparse data machine learning model. We'll leverage the fantastic set of models available in the Python scikit-learn module and we'll show how to use zat to deploy that model so that new domains coming from Zeek can be evaluated and scored in realtime.

Deployment with zat

Now that we know which TLDs are 'risky' we can take action with zat. See the example risky_domains.py, which uses these results to flag the realtime DNS logs coming from Zeek and makes a VirusTotal query on any flagged domains. If the VirusTotal query returns positives then we report the observation.

Although this sounds simplistic it's actually quite effective. The number of VT queries we make is extremely small compared to the total volume of DNS queries, and given the statistical results the probability of a 'hit' is reasonably high. And of course we're casting a wider net than the original blacklist.
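The flag-then-query flow can be sketched as below to show why the lookup volume stays small. This is not the risky_domains.py example: `report_risky` is a hypothetical function, `fake_lookup` is a stand-in for a real VirusTotal query, and the `'positives'` key in its result is an assumed result format.

```python
def report_risky(queries, risky_tlds, lookup):
    """Look up only queries with a risky TLD; return (reports, lookup_count)."""
    reports, lookup_count = [], 0
    for query in queries:
        name = query.strip().rstrip('.').lower()
        tld = name.rsplit('.', 1)[-1]   # Naive TLD: last dot-separated label
        if tld not in risky_tlds:
            continue                    # Common TLDs never trigger a lookup
        lookup_count += 1
        result = lookup(name)
        if result.get('positives', 0) > 0:
            reports.append((name, result))
    return reports, lookup_count

def fake_lookup(domain):
    # Stand-in for a real VirusTotal query; returns a made-up result
    return {'positives': 2 if domain == 'evil.tk' else 0}

# Made-up DNS queries for illustration
dns_queries = ['www.google.com', 'mail.google.com', 'evil.tk',
               'apple.com', 'shop.online', 'facebook.com']
reports, lookup_count = report_risky(dns_queries, {'tk', 'online'}, fake_lookup)
print('{:d} lookups for {:d} DNS queries'.format(lookup_count, len(dns_queries)))
print(reports)
```

Passing the lookup in as a callable keeps the sketch independent of any particular VirusTotal client.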

Try it Out

If you liked this notebook please visit the zat project for more notebooks and examples. You can run all the examples with a simple `pip install zat` (and a running Zeek instance of course).
