As one of our first notebooks we're going to keep it fairly simple and just focus on TLDs. We'll revisit Risky Domains in another notebook where we use all the parts of the domain and cover more advanced modeling and machine learning techniques.
This notebook explores the modeling of risky domains where the usage of the term risky has the dual characteristics of being both 'not common' and 'associated with bad'.
Domain blacklists are great but they only go so far. In this notebook we explore and analyze domain blacklists. The approach will be to pin down indicators or patterns and then use those to flag domains. We're trying to differentiate the common vs. the uncommon, or more specifically the common vs. the blacklisted. Our intention is to cast a wider net than the blacklist. In general we're trying to achieve the following benefits:
In this notebook we're going to use data from MalwareDomains, Malwarebytes and CyberCrime Tracker. We're going to analyze those domains with a statistical technique called G-Test. We'll use the statistical results to evaluate and score new domains streaming in from Zeek IDS.
Data Used
Malware Domain Blocklist: http://www.malwaredomains.com
Malwarebytes(hpHosts EMD): https://hosts-file.net/emd.txt
CyberCrime Tracker: http://cybercrime-tracker.net/
Cisco Umbrella: http://s3-us-west-1.amazonaws.com/umbrella-static/top-1m.csv.zip
Software
Techniques
Shout Outs:
import os
import zat
from zat.utils import file_utils
print('zat: {:s}'.format(zat.__version__))
import pandas as pd
print('Pandas: {:s}'.format(pd.__version__))
import numpy as np
print('Numpy: {:s}'.format(np.__version__))
from sklearn.externals import joblib
import sklearn.ensemble
from sklearn.feature_extraction.text import CountVectorizer
print('Scikit Learn Version:', sklearn.__version__)
zat: 0.1.5
Pandas: 0.19.2
Numpy: 1.12.1
Scikit Learn Version: 0.18.1
# Grab all the datasets
notebook_path = %pwd
data_path = os.path.join(notebook_path, 'data')
block_file = os.path.join(data_path, 'mal_dom_block.txt')
cyber_file = os.path.join(data_path, 'cybercrime.txt')
emd_file = os.path.join(data_path, 'emd.txt')
alexa_file = os.path.join(data_path, 'alexa_1m.csv')
umbrella_file = os.path.join(data_path, 'umbrella_1m.csv')
with open(block_file) as bfp:
    block_domains = [row.strip() for row in bfp.readlines()]
with open(cyber_file) as bfp:
    cyber_domains = [row.strip() for row in bfp.readlines()]
with open(emd_file) as bfp:
    emd_domains = [row.split('\t')[1].strip() for row in bfp.readlines() if '#' not in row]
with open(alexa_file) as afp:
    alexa_domains = [row.split(',')[1].strip() for row in afp.readlines()]
with open(umbrella_file) as afp:
    umbrella_domains = [row.split(',')[1].strip() for row in afp.readlines()]
# Look at the Cisco Umbrella domains
print(len(umbrella_domains))
umbrella_domains[:10]
1000000
['google.com', 'www.google.com', 'facebook.com', 'microsoft.com', 'doubleclick.net', 'g.doubleclick.net', 'clients4.google.com', 'googleads.g.doubleclick.net', 'google-analytics.com', 'apple.com']
# Look at the Alexa domains
print(len(alexa_domains))
alexa_domains[:10]
1000000
['google.com', 'youtube.com', 'facebook.com', 'baidu.com', 'wikipedia.org', 'yahoo.com', 'reddit.com', 'google.co.in', 'qq.com', 'twitter.com']
# Look at all the known bad domains
print('Malware Domain Blocklist: {:d}'.format(len(block_domains)))
print(block_domains[:5])
print('\nCyberCrime Tracker: {:d}'.format(len(cyber_domains)))
print(cyber_domains[:5])
print('\nMalwarebytes(hpHosts EMD): {:d}'.format(len(emd_domains)))
print(emd_domains[:10])
Malware Domain Blocklist: 18383
['amazon.co.uk.security-check.ga', 'autosegurancabrasil.com', 'christianmensfellowshipsoftball.org', 'dadossolicitado-antendimento.sad879.mobi', 'hitnrun.com.my']

CyberCrime Tracker: 10111
['fpbqrouphaiti.com/sales!11-04/admin.php', 'jensonsintrenational.com/class/fat/cp.php?m=login', 'cboy.sytes.net/mypage/admin.php', '46.183.223.114/igere/3/admin.php', 'frankweb.club/temple/admin.php']

Malwarebytes(hpHosts EMD): 156698
['-sso.anbtr.com', '0.gvt0.com', '0.yourspchasbeenblockedcalltollfree18704554415.yourspchasbeenblockedcalltollfree18704554415.yourspchasbeenblockedcalltollfree18704554415.yourspchasbeenblockedcalltollfree18704554415.yourspchasbeenblockedcalltollfre18704554415.error2212.in', '000-101.org', '00005ik.rcomhost.com', '0000663c.tslocosumo.us', '0000a-fast-proxy.de', '0000pv6.rxportalhosting.com', '000my001.eu', '000my002.eu']
import tldextract
def clean_domains(domain_list):
    """Yield (subdomain, domain, suffix) tuples for entries with a valid suffix."""
    for domain in domain_list:
        ext = tldextract.extract(domain)
        # No suffix means the entry is an IP address or a name like 'local/home/lan/etc', so skip it
        if ext.suffix:
            yield ext.subdomain, ext.domain, ext.suffix
# Clean up and combine bad domains
bad_domains = [domain for domain in clean_domains(block_domains)]
bad_domains += [domain for domain in clean_domains(cyber_domains)]
bad_domains += [domain for domain in clean_domains(emd_domains)]
bad_domains[:5]
[('amazon.co.uk', 'security-check', 'ga'), ('', 'autosegurancabrasil', 'com'), ('', 'christianmensfellowshipsoftball', 'org'), ('dadossolicitado-antendimento', 'sad879', 'mobi'), ('', 'hitnrun', 'com.my')]
# Remove duplicate (subdomain, domain, tld) entries
# Note: Entries that differ only by subdomain are kept, and these lists often have many subdomains
print('Before duplication removal: {:d}'.format(len(bad_domains)))
bad_domains = list(set(bad_domains))
print('After duplication removal: {:d}'.format(len(bad_domains)))
Before duplication removal: 183405 After duplication removal: 179029
Subdomain entries in the Umbrella list such as 'www.google.com' and 'clients4.google.com' simply get rolled up into 'google.com' in the Alexa set. Since we're interested in the subdomains (in a later notebook) we're going to use the Umbrella dataset.
NOTE: The benefit Alexa has is that it covers more domains, so one could argue that our statistics below are 'wrong' because Umbrella doesn't cover ALL one million domains. We recognize and understand this. The stats below are for common vs. known bad and Umbrella is certainly covering the common domains (even if it's not one million total).
Notice here that we're using the term common instead of good. It's well known that Alexa/Umbrella lists contain some malicious/hacked domains.
# Clean up and append common domains
common_domains = [domain for domain in clean_domains(umbrella_domains)]
common_domains[:5]
[('', 'google', 'com'), ('www', 'google', 'com'), ('', 'facebook', 'com'), ('', 'microsoft', 'com'), ('', 'doubleclick', 'net')]
# Remove any duplicates
print('Before duplication removal: {:d}'.format(len(common_domains)))
common_domains = list(set(common_domains))
print('After duplication removal: {:d}'.format(len(common_domains)))
Before duplication removal: 994946 After duplication removal: 994946
# Speaking of common instead of 'good', let's look at some
# of the domains that intersect the blacklists
bad_common = set(common_domains).intersection(bad_domains)
print('Number of overlaps: {:d}'.format(len(bad_common)))
list(bad_common)[:10]
Number of overlaps: 1468
[('', 'ddth', 'com'), ('', 'cjb', 'net'), ('', 'ludashi', 'com'), ('xpi', 'searchtabnew', 'com'), ('dnspod-free', 'mydnspod', 'net'), ('', 'rol', 'ru'), ('dl', 'pconline', 'com.cn'), ('', 'webshieldonline', 'com'), ('rep', 'ytdownloader', 'com'), ('start', 'funmoods', 'com')]
So now we're going to remove any blacklisted domains from the common list (the Cisco Umbrella list).
print('Original common: {:d}'.format(len(common_domains)))
common_domains = list(set(common_domains).difference(bad_domains))
print('Blacklisted domains removed: {:d}'.format(len(common_domains)))
Original common: 994946 Blacklisted domains removed: 993478
# Create dataframes
df_bad = pd.DataFrame.from_records(bad_domains, columns=['subdomain', 'domain', 'tld'])
df_bad['label'] = 'bad'
df_common = pd.DataFrame.from_records(common_domains, columns=['subdomain', 'domain', 'tld'])
df_common['label'] = 'common'
print('Bad Domains: {:d}'.format(len(df_bad)))
df_bad.head()
Bad Domains: 179029
| | subdomain | domain | tld | label |
|---|---|---|---|---|
| 0 | www | qdlhprdtwhvgxuzklovisrdbkhptpfarrbcmtrxbzlvhyg... | com | bad |
| 1 | | advancecomputers | online | bad |
| 2 | | perseepona | com | bad |
| 3 | www | downloadfriend | info | bad |
| 4 | 2o9jkm6yfj | centade | com | bad |
print('Common Domains: {:d}'.format(len(df_common)))
df_common.head()
Common Domains: 993478
| | subdomain | domain | tld | label |
|---|---|---|---|---|
| 0 | | msgapp | com | common |
| 1 | online | citibank | co.in | common |
| 2 | | emhapfokdlyxtfgucmjm | cx | common |
| 3 | asset | affectv | com | common |
| 4 | | gamerswithjobs | com | common |
# Now that the records have labels on them (bad/common) we can combine them into one DataFrame
df_all = pd.concat([df_common, df_bad], ignore_index=True)
df_all.head()
| | subdomain | domain | tld | label |
|---|---|---|---|---|
| 0 | | msgapp | com | common |
| 1 | online | citibank | co.in | common |
| 2 | | emhapfokdlyxtfgucmjm | cx | common |
| 3 | asset | affectv | com | common |
| 4 | | gamerswithjobs | com | common |
We're going to do some statistics on the TLDs to see how they're distributed between the bad and common domains. The zat python package provides a nice set of functionality for statistics on Pandas DataFrames (https://github.com/SuperCowPowers/zat).
# Run a bunch of statistics from the zat python package
import zat.dataframe_stats as df_stats
# Print out the contingency_table
print('\nContingency Table')
cont_table = df_stats.contingency_table(df_all, 'tld', 'label')
cont_table.head()
Contingency Table
label | bad | common | All |
---|---|---|---|
tld | |||
ab.ca | 0.0 | 24.0 | 24.0 |
abbott | 0.0 | 2.0 | 2.0 |
abruzzo.it | 0.0 | 1.0 | 1.0 |
ac | 3.0 | 440.0 | 443.0 |
ac.ae | 0.0 | 12.0 | 12.0 |
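A contingency table like the one above is simply a cross-tabulation of tld against label. As a rough illustration, here is a standard-library sketch on a handful of made-up rows. It is not zat's implementation, and the function name and the toy rows are hypothetical.

```python
from collections import Counter

# Hypothetical toy rows of (tld, label); the notebook itself uses df_all
rows = [
    ('com', 'common'), ('com', 'common'), ('com', 'bad'),
    ('tk', 'bad'), ('tk', 'bad'), ('org', 'common'),
]

def toy_contingency_table(rows):
    """Count (tld, label) pairs and add an 'All' total per TLD."""
    counts = Counter(rows)
    labels = sorted({label for _, label in rows})
    table = {}
    for tld in sorted({tld for tld, _ in rows}):
        # Counter returns 0 for pairs that never occur
        row = {label: counts[(tld, label)] for label in labels}
        row['All'] = sum(row.values())
        table[tld] = row
    return table

table = toy_contingency_table(rows)
for tld, row in table.items():
    print(tld, row)
```

Each row holds the per-label counts for one TLD plus the row total, which is the same shape as the bad / common / All columns shown in the output.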
# Print out the expected_counts
print('\nExpected Counts Table')
expect_counts = df_stats.expected_counts(df_all, 'tld', 'label')
expect_counts.head()
Expected Counts Table
label | bad | common | All |
---|---|---|---|
tld | |||
ab.ca | 3.664538 | 20.335462 | 24.0 |
abbott | 0.305378 | 1.694622 | 2.0 |
abruzzo.it | 0.152689 | 0.847311 | 1.0 |
ac | 67.641257 | 375.358743 | 443.0 |
ac.ae | 1.832269 | 10.167731 | 12.0 |
# Print out the g_test scores
print('\nG-Test Scores')
g_scores = df_stats.g_test_scores(df_all, 'tld', 'label')
g_scores.head()
G-Test Scores
label | bad | common |
---|---|---|
tld | ||
ab.ca | -7 | 7 |
abbott | 0 | 0 |
abruzzo.it | 0 | 0 |
ac | -121 | 121 |
ac.ae | -3 | 3 |
For a formal interpretation of these scores please see https://en.wikipedia.org/wiki/G-test. Informally, the larger the magnitude of the score, the more the observed count for that item deviates from the count we would expect under the null hypothesis that a TLD is spread across both classes in proportion to the class sizes. A positive score means the TLD shows up more often than expected in that class; a negative score means it shows up less often.
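To make the score concrete, the sketch below recomputes the 'ac' row from the tables above using the textbook G statistic, G = 2 * sum(O * ln(O / E)). The sign convention (positive where a class is over-represented) is an assumption inferred from the printed tables, and the function name is hypothetical; this is not necessarily how zat computes it internally.

```python
import math

def signed_g_scores(observed, expected):
    """Return one signed G score per class for a single TLD row.

    G = 2 * sum(O * ln(O / E)); cells with O == 0 contribute nothing.
    Assumed convention: positive where O > E, negative otherwise.
    """
    g = 2.0 * sum(o * math.log(o / e)
                  for o, e in zip(observed, expected) if o > 0)
    return [g if o > e else -g for o, e in zip(observed, expected)]

# Values for the 'ac' TLD from the tables above, ordered (bad, common)
ac_observed = [3.0, 440.0]
ac_expected = [67.641257, 375.358743]
ac_scores = signed_g_scores(ac_observed, ac_expected)
print([round(s) for s in ac_scores])
```

Under these assumptions the result rounds to -121 for bad and 121 for common, which agrees with the 'ac' row of the G-Test Scores table.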
Example:
The tk TLD occurred about 6200 times across both datasets. Because we have roughly 5.5x more common domains than bad domains, if all else were equal we should see it about 940 times in the bad set and about 5220 times in the common set. The actual observation is that we see it 6071 times in the bad set and only 94 times in the common set. So a tk domain passing through your IDS is definitely a good thing to put on your 'short list'.
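The expected counts in this example follow from splitting a TLD's total occurrences in proportion to the class sizes. A minimal sketch, using the dataframe sizes and the tk total reported in this notebook's outputs (the helper name is hypothetical):

```python
# Totals taken from this notebook's outputs
n_bad = 179029      # rows in df_bad
n_common = 993478   # rows in df_common
tk_total = 6165     # tk occurrences across both sets

def proportional_expected(row_total, class_totals):
    """Split row_total across classes in proportion to class size."""
    grand_total = sum(class_totals)
    return [row_total * ct / grand_total for ct in class_totals]

tk_expected_bad, tk_expected_common = proportional_expected(
    tk_total, [n_bad, n_common])
print(round(tk_expected_bad, 1), round(tk_expected_common, 1))
```

This gives roughly 941 expected in the bad set and 5224 in the common set, against observed counts of 6071 and 94.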
See Expected Counts and Actual Counts Below
# Sort the GTest scores
g_scores.sort_values('bad', ascending=False).head(15)
label | bad | common |
---|---|---|
tld | ||
info | 35476 | -35476 |
tk | 21877 | -21877 |
xyz | 20444 | -20444 |
online | 6701 | -6701 |
club | 3252 | -3252 |
ru | 3124 | -3124 |
website | 1923 | -1923 |
in | 1069 | -1069 |
ws | 752 | -752 |
top | 693 | -693 |
site | 614 | -614 |
work | 576 | -576 |
biz | 556 | -556 |
name | 516 | -516 |
tech | 478 | -478 |
# Let's look at some of the TLDs' Expected Counts vs. Actual Counts
interesting_tlds = ['info', 'tk', 'xyz', 'online', 'club']
print('Expected Counts:')
expect_counts[expect_counts.index.isin(interesting_tlds)]
Expected Counts:
label | bad | common | All |
---|---|---|---|
tld | |||
club | 310.569562 | 1723.430438 | 2034.0 |
info | 2732.523545 | 15163.476455 | 17896.0 |
online | 392.258213 | 2176.741787 | 2569.0 |
tk | 941.328099 | 5223.671901 | 6165.0 |
xyz | 1216.015730 | 6747.984270 | 7964.0 |
print('Actual Counts:')
cont_table[cont_table.index.isin(interesting_tlds)]
Actual Counts:
label | bad | common | All |
---|---|---|---|
tld | |||
club | 1459.0 | 575.0 | 2034.0 |
info | 14053.0 | 3843.0 | 17896.0 |
online | 2259.0 | 310.0 | 2569.0 |
tk | 6071.0 | 94.0 | 6165.0 |
xyz | 6958.0 | 1006.0 | 7964.0 |
We can see from the sorted G-Test score table that TLDs like info, tk, xyz, club, ... occur much more often in the blacklists than they do in the common lists (Umbrella/Alexa). So even with this small insight we could set up a Zeek script (or a zat Python script) to mark domains with those TLDs as risky.
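As a rough idea of what such a check might look like in plain Python, here is a small sketch. The set of TLDs is taken from the top of the sorted score table; the function name and sample queries are hypothetical, and this is not the zat or Zeek implementation.

```python
# TLDs with the highest 'bad' G-test scores in the sorted table
RISKY_TLDS = {'info', 'tk', 'xyz', 'online', 'club'}

def is_risky(query, risky_tlds=RISKY_TLDS):
    """Flag a DNS query whose last label is in the risky TLD set.

    Simplification: only the final dot-separated label is used, so
    multi-part suffixes such as 'co.uk' are treated as 'uk'.
    """
    tld = query.rstrip('.').rsplit('.', 1)[-1].lower()
    return tld in risky_tlds

for q in ['free-prizes.tk', 'www.google.com', 'login.example.XYZ.']:
    print(q, is_risky(q))
```

A real deployment would more likely reuse tldextract, as in clean_domains above, so that multi-part suffixes are handled consistently with the statistics.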
In Phase 2 of this notebook we'll dive into the domains and subdomains. We'll use NGram extraction and our G-Test statistics on all the extracted features to do feature selection for a sparse data machine learning model. We'll leverage the fantastic set of models available in the Python scikit-learn module and we'll show how to use zat to deploy that model so that new domains coming from Zeek can be evaluated and scored in realtime.
Now that we know which TLDs are 'risky' we can take action with zat. See the example risky_domains.py that uses these results to flag the realtime DNS logs coming from Zeek and makes a VirusTotal query on any flagged domains. If the VirusTotal query returns positives then we report the observation.
Although this sounds simplistic it's actually quite effective. The number of VT queries we make is extremely small compared to the total volume of DNS queries, and given the statistical results the probability of a 'hit' is reasonably high. And of course we're casting a wider net than the original blacklist.
If you liked this notebook please visit the zat project for more notebooks and examples. You can run all the examples with a simple $ pip install zat (and a running Zeek instance, of course).