Exercise to detect Algorithmically Generated Domain Names.

In this notebook we're going to use some great Python modules to explore, understand and classify domains as being 'legit' or having a high probability of being generated by a DGA (Domain Generation Algorithm). We have 'legit' in quotes because we're using the domains in Alexa as the 'legit' set. The primary motivation is to explore the nexus of IPython, Pandas and scikit-learn, with DGA classification as a vehicle for that exploration. The exercise intentionally shows common missteps, warts in the data, paths that didn't work out that well, and results that could definitely be improved upon. In general, capturing what worked and what didn't is not only more realistic but often much more informative. :)

Python Modules Used: IPython, Pandas, scikit-learn, NumPy, Matplotlib and tldextract.

Suggestions/Comments: Please send suggestions or bugs (I'm sure) to clicklabs at clicksecurity.com. Also if you have some datasets or would like to explore alternative approaches please touch base.

In [42]:
import sklearn.feature_extraction
sklearn.__version__
Out[42]:
'0.14.1'
In [43]:
import pandas as pd
pd.__version__
Out[43]:
'0.12.0'
In [44]:
# Set default pylab stuff
# (assumes the notebook was started with the %pylab inline magic, which
#  provides the pylab, np and plt names used throughout)
pylab.rcParams['figure.figsize'] = (14.0, 5.0)
pylab.rcParams['axes.grid'] = True
In [45]:
# Version 0.12.0 of Pandas emits a DeprecationWarning (the 'Height' one) that I'm ignoring
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
In [46]:
# This is the Alexa 100k domain list; we're using it instead of the full 1 Million purely
# for speed. Results for the Alexa 1M are given at the bottom of the notebook.
alexa_dataframe = pd.read_csv('data/alexa_100k.csv', names=['rank','uri'], header=None, encoding='utf-8')
alexa_dataframe.head()
Out[46]:
rank uri
0 1 facebook.com
1 2 google.com
2 3 youtube.com
3 4 yahoo.com
4 5 baidu.com
In [47]:
# Okay for this exercise we need the 2LD (second-level domain) and nothing else
import tldextract

def domain_extract(uri):
    ext = tldextract.extract(uri)
    if (not ext.suffix):
        return np.nan
    else:
        return ext.domain

alexa_dataframe['domain'] = [ domain_extract(uri) for uri in alexa_dataframe['uri']]
del alexa_dataframe['rank']
del alexa_dataframe['uri']
alexa_dataframe.head()
Out[47]:
domain
0 facebook
1 google
2 youtube
3 yahoo
4 baidu
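Quick aside (ours, not in the original run): tldextract knows about multi-part public suffixes, which is exactly why we use it instead of just splitting on dots:

import tldextract
ext = tldextract.extract('http://forums.news.bbc.co.uk/')
print ext.subdomain, ext.domain, ext.suffix
# forums.news bbc co.uk   <- 'bbc' is the 2LD even with a two-part suffix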
In [48]:
alexa_dataframe.tail()
Out[48]:
domain
99995 rhbabyandchild
99996 rm
99997 sat1
99998 nahimunkar
99999 musi
In [49]:
# It's possible we have NaNs from blank lines or whatever
alexa_dataframe = alexa_dataframe.dropna()
alexa_dataframe = alexa_dataframe.drop_duplicates()

# Set the class
alexa_dataframe['class'] = 'legit'

# Shuffle the data (important for training/testing)
alexa_dataframe = alexa_dataframe.reindex(np.random.permutation(alexa_dataframe.index))
alexa_total = alexa_dataframe.shape[0]
print 'Total Alexa domains %d' % alexa_total

# Hold out 10%
hold_out_alexa = alexa_dataframe[int(alexa_total*.9):]
alexa_dataframe = alexa_dataframe[:int(alexa_total*.9)]

print 'Number of Alexa domains: %d' % alexa_dataframe.shape[0]
Total Alexa domains 91712
Number of Alexa domains: 82540
In [50]:
alexa_dataframe.head()
Out[50]:
domain class
20904 transworld legit
82690 lkfun legit
85167 islam2all legit
62859 pulitzer legit
85573 sge legit
In [51]:
# Read in the DGA domains
dga_dataframe = pd.read_csv('data/dga_domains.txt', names=['raw_domain'], header=None, encoding='utf-8')

# We noticed that the blacklist values just differ by capitalization or .com/.org/.info
dga_dataframe['domain'] = dga_dataframe['raw_domain'].map(lambda x: x.split('.')[0].strip().lower())
del dga_dataframe['raw_domain']

# It's possible we have NaNs from blank lines or whatever
dga_dataframe = dga_dataframe.dropna()
dga_dataframe = dga_dataframe.drop_duplicates()
dga_total = dga_dataframe.shape[0]
print 'Total DGA domains %d' % dga_total

# Set the class
dga_dataframe['class'] = 'dga'

# Hold out 10%
hold_out_dga = dga_dataframe[int(dga_total*.9):]
dga_dataframe = dga_dataframe[:int(dga_total*.9)]

print 'Number of DGA domains: %d' % dga_dataframe.shape[0]
Total DGA domains 2664
Number of DGA domains: 2397
In [52]:
dga_dataframe.head()
Out[52]:
domain class
0 04055051be412eea5a61b7da8438be3d dga
1 1cb8a5f36f dga
2 30acd347397c34fc273e996b22951002 dga
3 336c986a284e2b3bc0f69f949cb437cb dga
5 40a43e61e56a5c218cf6c22aca27f7ee dga
In [53]:
# Concatenate the domains in a big pile!
all_domains = pd.concat([alexa_dataframe, dga_dataframe], ignore_index=True)
In [54]:
# Add a length field for the domain
all_domains['length'] = [len(x) for x in all_domains['domain']]

# Okay since we're trying to detect dynamically generated domains and short
# domains (length <=6) are crazy random even for 'legit' domains we're going
# to punt on short domains (perhaps just white/black list for short domains?)
all_domains = all_domains[all_domains['length'] > 6]
In [55]:
# Grabbed this from Rosetta Code (rosettacode.org)
import math
from collections import Counter
 
def entropy(s):
    p, lns = Counter(s), float(len(s))
    return -sum( count/lns * math.log(count/lns, 2) for count in p.values())
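This is just the Shannon entropy of the character distribution: H(s) = -sum_c p(c) * log2 p(c), measured in bits. A quick sanity check of ours (not an original notebook cell):

print entropy('aaaa')        # 0.0  -> one repeated character, zero surprise
print entropy('abcd')        # 2.0  -> four equally likely characters = 2 bits
print entropy('1cb8a5f36f')  # ~3.1 -> close to uniform over many characters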
In [56]:
# Add an entropy field for the domain
all_domains['entropy'] = [entropy(x) for x in all_domains['domain']]
In [57]:
all_domains.head()
Out[57]:
domain class length entropy
0 transworld legit 10 3.121928
2 islam2all legit 9 2.419382
3 pulitzer legit 8 3.000000
6 danarimedia legit 11 2.663533
7 heartbreakers legit 13 2.815072
In [58]:
all_domains.tail()
Out[58]:
domain class length entropy
84932 ulxxqduryvv dga 11 2.913977
84933 ummvzhin dga 8 2.750000
84934 umsgnwgc dga 8 2.750000
84935 umzsbhpkrgo dga 11 3.459432
84936 umzuyjrfwyf dga 11 2.913977

Let's plot some stuff!

In [59]:
# Boxplots show you the distribution of the data (spread).
# http://en.wikipedia.org/wiki/Box_plot

# Plot the length and entropy of domains
all_domains.boxplot('length','class')
pylab.ylabel('Domain Length')
all_domains.boxplot('entropy','class')
pylab.ylabel('Domain Entropy')
Out[59]:
<matplotlib.text.Text at 0x10ce00b50>
In [60]:
# Split the classes up so we can set colors, size, labels
cond = all_domains['class'] == 'dga'
dga = all_domains[cond]
alexa = all_domains[~cond]
plt.scatter(alexa['length'], alexa['entropy'], s=140, c='#aaaaff', label='Alexa', alpha=.2)
plt.scatter(dga['length'], dga['entropy'], s=40, c='r', label='DGA', alpha=.3)
plt.legend()
pylab.xlabel('Domain Length')
pylab.ylabel('Domain Entropy')

# Below you can see that our DGA domains do tend to have higher entropy than Alexa on average.
Out[60]:
<matplotlib.text.Text at 0x10680ff50>
In [61]:
# Let's look at the types of domains that have entropy higher than 4
high_entropy_domains = all_domains[all_domains['entropy'] > 4]
print 'Num Domains above 4 entropy: %.2f%% %d (out of %d)' % \
            (100.0*high_entropy_domains.shape[0]/all_domains.shape[0],high_entropy_domains.shape[0],all_domains.shape[0])
print "Num high entropy legit: %d" % high_entropy_domains[high_entropy_domains['class']=='legit'].shape[0]
print "Num high entropy DGA: %d" % high_entropy_domains[high_entropy_domains['class']=='dga'].shape[0]
high_entropy_domains[high_entropy_domains['class']=='legit'].head()

# Looking at the results below, we do see that there are more domains
# in the DGA group that are high entropy but only a small percentage
# of the domains are in that high entropy range...
Num Domains above 4 entropy: 0.57% 361 (out of 63294)
Num high entropy legit: 3
Num high entropy DGA: 358
Out[61]:
domain class length entropy
29392 theukwebdesigncompany legit 21 4.070656
37378 texaswithlove1982-amomentlikethis legit 33 4.051822
55073 congresomundialjjrperu2009 legit 26 4.056021
In [62]:
high_entropy_domains[high_entropy_domains['class']=='dga'].head()
Out[62]:
domain class length entropy
82558 a17btkyb38gxe41pwd50nxmzjxiwjwdwfrp52 dga 37 4.540402
82559 a17c49l68ntkqnuhvkrmyb28fubvn30e31g43dq dga 39 4.631305
82560 a17d60gtnxk47gskti15izhvlviyksh64nqkz dga 37 4.270132
82561 a17erpzfzh64c69csi35bqgvp52drita67jzmy dga 38 4.629249
82562 a17fro51oyk67b18ksfzoti55j36p32o11fvc29cr dga 41 4.305859
In [63]:
# In preparation for using scikit learn we're just going to use
# some handles that help take us from pandas land to scikit land

# List of feature vectors (scikit learn uses 'X' for the matrix of feature vectors)
X = all_domains.as_matrix(['length', 'entropy'])

# Labels (scikit learn uses 'y' for classification labels)
y = np.array(all_domains['class'].tolist()) # Yes, this is weird but it needs 
                                            # to be an np.array of strings
In [64]:
# Random Forest is a popular ensemble machine learning classifier.
# http://scikit-learn.org/dev/modules/generated/sklearn.ensemble.RandomForestClassifier.html
#
import sklearn.ensemble
clf = sklearn.ensemble.RandomForestClassifier(n_estimators=20, compute_importances=True)  # 20 trees in our forest
In [65]:
# Now we can use scikit learn's cross validation to assess predictive performance.
import sklearn.cross_validation
scores = sklearn.cross_validation.cross_val_score(clf, X, y, cv=5, n_jobs=4)
print scores
[ 0.9688759   0.96784896  0.96729599  0.96753298  0.96887344]
In [66]:
# Wow 96% accurate! At this point we could claim success and we'd be gigantic morons...
# Recall that we have ~61k 'legit' domains and only ~2.4k DGA domains in the training set,
# so a classifier that marked everything as legit would already be about
# 96% accurate....

# So we dive in a bit and look at the predictive performance more deeply.

# Train on a 80/20 split
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
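To make the class-imbalance point concrete, scikit-learn's DummyClassifier can produce the 'mark everything as the majority class' baseline directly. A sanity check we'd suggest here (not part of the original run):

import sklearn.dummy

# Always predicts the most frequent class in the training data ('legit')
baseline = sklearn.dummy.DummyClassifier(strategy='most_frequent')
baseline.fit(X_train, y_train)
print 'Majority-class baseline accuracy: %.4f' % baseline.score(X_test, y_test)
# ~0.96, which is why the cross validation scores above are so unimpressive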
In [67]:
# Now plot the results of the 80/20 split in a confusion matrix
from sklearn.metrics import confusion_matrix
labels = ['legit', 'dga']
cm = confusion_matrix(y_test, y_pred, labels)

def plot_cm(cm, labels):
    
    # Compute percentages (row-normalize the confusion matrix)
    percent = (cm*100.0)/np.array(np.matrix(cm.sum(axis=1)).T)  # or more simply: cm*100.0/cm.sum(axis=1)[:, np.newaxis]
    
    print 'Confusion Matrix Stats'
    for i, label_i in enumerate(labels):
        for j, label_j in enumerate(labels):
            print "%s/%s: %.2f%% (%d/%d)" % (label_i, label_j, (percent[i][j]), cm[i][j], cm[i].sum())

    # Show confusion matrix
    # Thanks kermit666 from stackoverflow :)
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.grid(b=False)
    cax = ax.matshow(percent, cmap='coolwarm')
    pylab.title('Confusion matrix of the classifier')
    fig.colorbar(cax)
    ax.set_xticklabels([''] + labels)
    ax.set_yticklabels([''] + labels)
    pylab.xlabel('Predicted')
    pylab.ylabel('True')
    pylab.show()

plot_cm(cm, labels)

# We can see below that our suspicions were correct and the classifier is
# marking almost everything as Alexa. We FAIL.. science is hard... let's go drinking....
Confusion Matrix Stats
legit/legit: 99.89% (12152/12165)
legit/dga: 0.11% (13/12165)
dga/legit: 80.16% (396/494)
dga/dga: 19.84% (98/494)
In [68]:
# Well our Mom told us we were still cool.. so with that encouragement we're
# going to compute NGrams for every Alexa domain and see if we can use the
# NGrams to help us better differentiate and mark DGA domains...

# Scikit learn has a nice NGram generator that can generate either char NGrams or word NGrams (we're using char).
# Parameters: 
#       - ngram_range=(3,5)  # Give me all ngrams of length 3, 4, and 5
#       - min_df=1e-4        # Minimum document frequency. At 1e-4 we're saying give us NGrams that 
#                            # happen in at least 0.01% of the domains (so for 100k... at least 10 domains)
alexa_vc = sklearn.feature_extraction.text.CountVectorizer(analyzer='char', ngram_range=(3,5), min_df=1e-4, max_df=1.0)
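To see what the vectorizer actually generates, here's a tiny illustration of ours on a single domain (default min_df, so nothing gets thresholded away):

demo_vc = sklearn.feature_extraction.text.CountVectorizer(analyzer='char', ngram_range=(3,5))
demo_vc.fit(['facebook'])
print demo_vc.get_feature_names()
# [u'ace', u'aceb', u'acebo', u'boo', u'book', u'ceb', u'cebo', u'ceboo', ...]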
In [69]:
# I'm SURE there's a better way to store all the counts but not sure...
# At least the min_df parameter has already done some thresholding
counts_matrix = alexa_vc.fit_transform(alexa_dataframe['domain'])
alexa_counts = np.log10(counts_matrix.sum(axis=0).getA1())  # log10 the raw counts so hugely common ngrams don't dominate the score
ngrams_list = alexa_vc.get_feature_names()
In [70]:
# For fun sort it and show it
import operator
_sorted_ngrams = sorted(zip(ngrams_list, alexa_counts), key=operator.itemgetter(1), reverse=True)
print 'Alexa NGrams: %d' % len(_sorted_ngrams)
for ngram, count in _sorted_ngrams[:10]:
    print ngram, count
Alexa NGrams: 27012
ing 3.40001963507
lin 3.3818368
ine 3.35295391171
tor 3.22349594096
ter 3.21827285357
ion 3.20411998266
ent 3.18184358794
por 3.1562461904
the 3.15228834438
ree 3.11693964655
In [71]:
# We're also going to throw in a bunch of dictionary words
word_dataframe = pd.read_csv('data/words.txt', names=['word'], header=None, dtype={'word': np.str}, encoding='utf-8')

# Cleanup words from dictionary
word_dataframe = word_dataframe[word_dataframe['word'].map(lambda x: str(x).isalpha())]
word_dataframe = word_dataframe.applymap(lambda x: str(x).strip().lower())
word_dataframe = word_dataframe.dropna()
word_dataframe = word_dataframe.drop_duplicates()
word_dataframe.head(10)
Out[71]:
word
37 a
48 aa
51 aaa
53 aaaa
54 aaaaaa
55 aaal
56 aaas
57 aaberg
58 aachen
59 aae
In [72]:
# Now compute NGrams on the dictionary words
# Same logic as above...
dict_vc = sklearn.feature_extraction.text.CountVectorizer(analyzer='char', ngram_range=(3,5), min_df=1e-5, max_df=1.0)
counts_matrix = dict_vc.fit_transform(word_dataframe['word'])
dict_counts = np.log10(counts_matrix.sum(axis=0).getA1())
ngrams_list = dict_vc.get_feature_names()
In [73]:
# For fun sort it and show it
import operator
_sorted_ngrams = sorted(zip(ngrams_list, dict_counts), key=operator.itemgetter(1), reverse=True)
print 'Word NGrams: %d' % len(_sorted_ngrams)
for ngram, count in _sorted_ngrams[:10]:
    print ngram, count
Word NGrams: 142275
ing 4.38730082245
ess 4.20487933376
ati 4.19334725639
ion 4.16503647999
ter 4.16241503611
nes 4.11250445877
tio 4.07682242334
ate 4.07236020396
ent 4.06963110262
tion 4.04960561259
In [74]:
# We use the transform method of the CountVectorizer to form a vector
# of ngrams contained in the domain; that vector is then multiplied
# by the counts vector (which is a column sum of the count matrix).
def ngram_count(domain):
    alexa_match = alexa_counts * alexa_vc.transform([domain]).T  # Woot vector multiply and transpose Woo Hoo!
    dict_match = dict_counts * dict_vc.transform([domain]).T
    print '%s Alexa match:%d Dict match: %d' % (domain, alexa_match, dict_match)

# Examples:
ngram_count('google')
ngram_count('facebook')
ngram_count('1cb8a5f36f')
ngram_count('pterodactylfarts')
ngram_count('ptes9dro-dwacty2lfa5rrts')
ngram_count('beyonce')
ngram_count('bey666on4ce')
google Alexa match:17 Dict match: 14
facebook Alexa match:30 Dict match: 27
1cb8a5f36f Alexa match:0 Dict match: 0
pterodactylfarts Alexa match:34 Dict match: 77
ptes9dro-dwacty2lfa5rrts Alexa match:19 Dict match: 28
beyonce Alexa match:15 Dict match: 16
bey666on4ce Alexa match:2 Dict match: 1
In [75]:
# Compute NGram matches for all the domains and add to our dataframe
all_domains['alexa_grams']= alexa_counts * alexa_vc.transform(all_domains['domain']).T 
all_domains['word_grams']= dict_counts * dict_vc.transform(all_domains['domain']).T 
all_domains.head()
Out[75]:
domain class length entropy alexa_grams word_grams
0 transworld legit 10 3.121928 39.051439 44.033642
2 islam2all legit 9 2.419382 15.475215 17.367964
3 pulitzer legit 8 3.000000 14.458222 28.441721
6 danarimedia legit 11 2.663533 40.189599 54.829856
7 heartbreakers legit 13 2.815072 45.354321 69.734483
In [76]:
all_domains.tail()
Out[76]:
domain class length entropy alexa_grams word_grams
84932 ulxxqduryvv dga 11 2.913977 3.745231 6.464859
84933 ummvzhin dga 8 2.750000 6.183945 7.180022
84934 umsgnwgc dga 8 2.750000 3.272306 3.847079
84935 umzsbhpkrgo dga 11 3.459432 1.653213 2.546543
84936 umzuyjrfwyf dga 11 2.913977 0.000000 0.000000
In [77]:
# Use the vectorized operations of the dataframe to investigate differences
# between the alexa and word grams
all_domains['diff'] = all_domains['alexa_grams'] - all_domains['word_grams']
all_domains.sort(['diff'], ascending=True).head(10)

# The table below shows those domain names that are more 'dictionary' and less 'web'
Out[77]:
domain class length entropy alexa_grams word_grams diff
63819 bipolardisorderdepressionanxiety legit 32 3.616729 115.885999 193.844156 -77.958157
34524 stirringtroubleinternationally legit 30 3.481728 131.209086 207.204729 -75.995643
63954 americansforresponsiblesolutions legit 32 3.667838 145.071369 218.363956 -73.292587
49070 channel4embarrassingillnesses legit 29 3.440070 98.201709 169.721499 -71.519790
5902 pragmatismopolitico legit 19 3.326360 59.877723 121.536223 -61.658500
49210 egaliteetreconciliation legit 23 3.186393 92.257111 152.125325 -59.868214
74130 interoperabilitybridges legit 23 3.588354 93.803640 153.626312 -59.822673
36976 foreclosurephilippines legit 22 3.447402 72.844280 132.514638 -59.670358
47055 corazonindomablecapitulos legit 25 3.813661 74.706878 133.762750 -59.055872
70113 annamalicesissyselfhypnosis legit 27 3.429908 68.066490 126.667692 -58.601201
In [78]:
all_domains.sort(['diff'], ascending=False).head(50)

# The table below shows those domain names that are more 'web' and less 'dictionary'
# Good ol' web....
Out[78]:
domain class length entropy alexa_grams word_grams diff
22647 gay-sex-pics-porn-pictures-gay-sex-porn-gay-se... legit 56 3.661056 160.035734 85.124184 74.911550
44091 article-directory-free-submission-free-content legit 46 3.786816 233.518879 188.230453 45.288426
63865 stream-free-movies-online legit 25 3.509275 118.944026 74.496915 44.447110
38570 top-bookmarking-site-list legit 25 3.723074 117.162056 74.126061 43.035995
79963 best-online-shopping-site legit 25 3.452879 122.152194 79.596640 42.555554
12532 watch-free-movie-online legit 23 3.708132 101.010995 58.943451 42.067543
30198 free-online-directory legit 21 3.403989 122.359797 80.735030 41.624767
40859 free-links-articles-directory legit 29 3.702472 152.063809 110.955361 41.108448
30875 online-web-directory legit 20 3.584184 114.439863 74.082948 40.356915
79001 web-directory-online legit 20 3.584184 114.313583 74.082948 40.230634
78947 movie-news-online legit 17 3.175123 81.036910 41.705735 39.331174
51532 xxx-porno-sexvideos legit 19 3.260828 73.025165 35.176549 37.848617
42200 free-tv-video-online legit 20 3.284184 83.341214 45.662984 37.678230
40771 freegamesforyourwebsite legit 23 3.551191 114.291735 78.515881 35.775855
58275 free-web-mobile-themes legit 22 3.356492 88.503556 54.149725 34.353831
70724 seowebdirectoryonline legit 21 3.499228 126.111921 91.819498 34.292423
28283 download-free-games legit 19 3.576618 84.492962 50.661490 33.831472
18894 web-link-directory-site legit 23 3.729446 102.993078 69.367186 33.625893
4838 the-web-directory legit 17 3.454822 87.520339 54.697986 32.822353
65871 social-bookmarking-site legit 23 3.762267 116.664791 84.545021 32.119769
21743 free-links-directory legit 20 3.646439 104.050046 71.956644 32.093402
74449 money-news-online legit 17 3.101881 77.587799 45.775375 31.812424
48456 free-sexvideosfc2 legit 17 3.381580 63.659477 31.878432 31.781045
57427 your-new-directory-site legit 23 3.555533 99.130671 67.468067 31.662605
49041 addsiteurlfreewebdirectory legit 26 3.609496 134.446230 103.178748 31.267482
34821 own-free-website legit 16 3.250000 59.564153 28.839294 30.724859
10080 web-directory-plus legit 18 3.836592 89.030979 58.484138 30.546841
43762 web-directory-sites legit 19 3.471354 98.528255 68.088416 30.439839
34811 free-sex-for-you legit 16 3.030639 46.653059 16.670504 29.982555
21390 online-deal-coupon legit 18 3.308271 77.862004 47.886115 29.975889
48204 acme-people-search-forum legit 24 3.553509 87.829242 57.898987 29.930255
73304 free-webdirectory legit 17 3.337175 93.606205 63.858372 29.747833
44221 good-web-directory legit 18 3.461320 88.201881 58.629789 29.572091
50633 all-free-download legit 17 3.219528 69.337916 39.909696 29.428220
57095 free-link-directory legit 19 3.536887 95.869062 66.507042 29.362020
58652 global-web-directory legit 20 3.721928 100.465474 71.293587 29.171887
74259 online-games-zone legit 17 3.292770 74.987811 45.881826 29.105985
77290 us-web-directory legit 16 3.625000 80.044863 50.969551 29.075312
72128 bookmarking-sites-lists legit 23 3.621176 115.664939 86.595393 29.069546
64948 web-marketing-directory legit 23 3.849224 125.587313 96.714227 28.873086
79557 freewebdirectory101 legit 19 3.471354 100.131488 71.474824 28.656664
72737 free-seo-news legit 13 2.777363 45.267539 17.089020 28.178520
53449 website-traffic-hog legit 19 3.721612 77.199578 49.156126 28.043452
50837 myonlinewebdirectory legit 20 3.584184 121.155376 93.276322 27.879054
29303 business-web-directorys legit 23 3.621176 125.854338 98.160126 27.694212
41310 free-online-submission legit 22 3.413088 113.459411 85.792712 27.666699
76645 linkdirectoryonline legit 19 3.326360 116.879367 89.392747 27.486621
30430 online-deal-site legit 16 3.202820 68.103656 40.887484 27.216172
27227 free-site-submit legit 16 3.202820 64.158023 37.127294 27.030729
62951 mybusiness-web-directory legit 24 3.772055 124.553982 97.538670 27.015312
In [79]:
# Let's plot some stuff!
# Here we want to see whether our new 'alexa_grams' feature can help us differentiate between Legit/DGA
cond = all_domains['class'] == 'dga'
dga = all_domains[cond]
legit = all_domains[~cond]
plt.scatter(legit['length'], legit['alexa_grams'], s=120, c='#aaaaff', label='Alexa', alpha=.1)
plt.scatter(dga['length'], dga['alexa_grams'], s=40, c='r', label='DGA', alpha=.3)
plt.legend()
pylab.xlabel('Domain Length')
pylab.ylabel('Alexa NGram Matches')
Out[79]:
<matplotlib.text.Text at 0x110c87210>
In [80]:
# Let's plot some stuff!
# Here we want to see whether our new 'alexa_grams' feature can help us differentiate between Legit/DGA
cond = all_domains['class'] == 'dga'
dga = all_domains[cond]
legit = all_domains[~cond]
plt.scatter(legit['entropy'], legit['alexa_grams'],  s=120, c='#aaaaff', label='Alexa', alpha=.2)
plt.scatter(dga['entropy'], dga['alexa_grams'], s=40, c='r', label='DGA', alpha=.3)
plt.legend()
pylab.xlabel('Domain Entropy')
pylab.ylabel('Alexa Gram Matches')
Out[80]:
<matplotlib.text.Text at 0x11058e490>
In [81]:
# Let's plot some stuff!
# Here we want to see whether our new 'word_grams' feature can help us differentiate between Legit/DGA
# Note: It doesn't look quite as good as the alexa_grams but it might generalize better (less overfitting).
cond = all_domains['class'] == 'dga'
dga = all_domains[cond]
legit = all_domains[~cond]
plt.scatter(legit['length'], legit['word_grams'],  s=120, c='#aaaaff', label='Alexa', alpha=.2)
plt.scatter(dga['length'], dga['word_grams'], s=40, c='r', label='DGA', alpha=.3)
plt.legend()
pylab.xlabel('Domain Length')
pylab.ylabel('Dictionary NGram Matches')
Out[81]:
<matplotlib.text.Text at 0x10ef8f750>
In [82]:
# Let's look at which legit domains are scoring low on the word gram count
all_domains[(all_domains['word_grams']==0)].head()
Out[82]:
domain class length entropy alexa_grams word_grams diff
3429 dftc777 legit 7 2.128085 2.707570 0 2.707570
3715 5221766 legit 7 2.235926 0.000000 0 0.000000
4144 28365365 legit 8 2.250000 4.050612 0 4.050612
4235 mm-mm-mm legit 8 0.811278 4.260668 0 4.260668
4297 fzzfgjj legit 7 1.950212 0.954243 0 0.954243
In [83]:
# Okay these look kinda weird, so let's use some nice Pandas functionality
# to look at some statistics around our new features.
all_domains[all_domains['class']=='legit'].describe()
Out[83]:
length entropy alexa_grams word_grams diff
count 60897.000000 60897.000000 60897.000000 60897.000000 60897.000000
mean 10.873032 2.930306 33.083440 40.901852 -7.818413
std 3.393407 0.347134 19.233994 23.302539 9.388916
min 7.000000 -0.000000 0.000000 0.000000 -77.958157
25% 8.000000 2.725481 19.136340 24.056214 -12.938013
50% 10.000000 2.947703 28.703813 36.259089 -7.108820
75% 13.000000 3.169925 42.400101 53.036218 -1.995136
max 56.000000 4.070656 233.518879 233.648571 74.911550
In [84]:
# Let's look at how many domains are low in both word_grams and alexa_grams (plotting the max of either)
legit = all_domains[(all_domains['class']=='legit')]
max_grams = np.maximum(legit['alexa_grams'],legit['word_grams'])
ax = max_grams.hist(bins=80)
ax.figure.suptitle('Histogram of the Max NGram Score for Domains')
pylab.xlabel('Maximum NGram Score')
pylab.ylabel('Number of Domains')
Out[84]:
<matplotlib.text.Text at 0x114ee5450>
In [85]:
# Let's look at which legit domains score low on both the alexa and word gram counts
weird_cond = (all_domains['class']=='legit') & (all_domains['word_grams']<3) & (all_domains['alexa_grams']<2)
weird = all_domains[weird_cond]
print weird.shape[0]
weird.head(30)
79
Out[85]:
domain class length entropy alexa_grams word_grams diff
85 9to5lol legit 7 2.235926 1.991226 2.359835 -0.368609
2611 akb48mt legit 7 2.807355 1.301030 1.041393 0.259637
3715 5221766 legit 7 2.235926 0.000000 0.000000 0.000000
4297 fzzfgjj legit 7 1.950212 0.954243 0.000000 0.954243
6045 crx7601 legit 7 2.807355 0.000000 0.000000 0.000000
8531 mw7zrv2 legit 7 2.807355 0.000000 0.000000 0.000000
10802 jmm1818 legit 7 1.950212 0.903090 0.000000 0.903090
11961 qq66699 legit 7 1.556657 1.322219 0.000000 1.322219
13200 twcczhu legit 7 2.521641 1.724276 0.000000 1.724276
13756 hljdns4 legit 7 2.807355 1.724276 0.000000 1.724276
14763 6470355 legit 7 2.521641 0.000000 0.000000 0.000000
17322 d20pfsrd legit 8 2.750000 0.000000 0.000000 0.000000
20591 lgcct27 legit 7 2.521641 1.176091 0.845098 0.330993
23458 jdoqocy legit 7 2.521641 0.000000 2.813581 -2.813581
24661 95178114 legit 8 2.405639 1.591065 0.000000 1.591065
24720 ggmmxxoo legit 8 2.000000 1.113943 0.602060 0.511883
26454 ggmm777 legit 7 1.556657 1.477121 0.602060 0.875061
27222 rkg1866 legit 7 2.521641 0.954243 0.000000 0.954243
27676 1616bbs legit 7 1.950212 1.806180 1.322219 0.483961
29142 5278bbs legit 7 2.521641 1.806180 1.322219 0.483961
29551 05tz2e9 legit 7 2.807355 0.000000 0.000000 0.000000
29858 1532777 legit 7 2.128085 1.477121 0.000000 1.477121
30119 5311314 legit 7 1.842371 1.000000 0.000000 1.000000
30290 zzgcjyzx legit 8 2.405639 0.000000 0.000000 0.000000
30739 xn--g5t518j legit 11 3.095795 1.000000 0.000000 1.000000
31465 7210578 legit 7 2.521641 0.903090 0.000000 0.903090
31951 fj96336 legit 7 2.235926 0.000000 0.000000 0.000000
34455 xn--42cgk1gc8crdb1htg3d legit 23 3.849224 1.255273 2.411620 -1.156347
35554 720pmkv legit 7 2.807355 0.000000 0.000000 0.000000
36166 d4ffr55 legit 7 2.235926 1.079181 2.260071 -1.180890
In [86]:
# Epiphany... Alexa really may not be the best 'exemplar' set...  
#             (probably a no-shit moment for everyone else :)
#
# Discussion: If you're using these as exemplars of NOT DGA, then you're probably
#             making things very hard on your machine learning algorithm.
#             Perhaps we should have two categories of Alexa domains, 'legit'
#             and 'weird', based on some definition of weird.
#             Looking at the entries above... we have approx 80 domains
#             that we're going to mark as 'weird'.
#
all_domains.loc[weird_cond, 'class'] = 'weird'
print all_domains['class'].value_counts()
all_domains[all_domains['class'] == 'weird'].head()
legit    60818
dga       2397
weird       79
dtype: int64
Out[86]:
domain class length entropy alexa_grams word_grams diff
85 9to5lol weird 7 2.235926 1.991226 2.359835 -0.368609
2611 akb48mt weird 7 2.807355 1.301030 1.041393 0.259637
3715 5221766 weird 7 2.235926 0.000000 0.000000 0.000000
4297 fzzfgjj weird 7 1.950212 0.954243 0.000000 0.954243
6045 crx7601 weird 7 2.807355 0.000000 0.000000 0.000000
In [87]:
# Now we try our machine learning algorithm again with the new features
# Alexa and Dictionary NGrams and the exclusion of the bad exemplars.
X = all_domains.as_matrix(['length', 'entropy', 'alexa_grams', 'word_grams'])

# Labels (scikit learn uses 'y' for classification labels)
y = np.array(all_domains['class'].tolist())

# Train on a 80/20 split
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
In [88]:
# Now plot the results of the 80/20 split in a confusion matrix
labels = ['legit', 'weird', 'dga']
cm = confusion_matrix(y_test, y_pred, labels)
plot_cm(cm, labels)
Confusion Matrix Stats
legit/legit: 99.58% (12140/12191)
legit/weird: 0.01% (1/12191)
legit/dga: 0.41% (50/12191)
weird/legit: 0.00% (0/10)
weird/weird: 30.00% (3/10)
weird/dga: 70.00% (7/10)
dga/legit: 14.63% (67/458)
dga/weird: 0.22% (1/458)
dga/dga: 85.15% (390/458)
In [89]:
# Huh, well that seemed to work 'ok', but you don't really want a classifier
# that outputs 3 classes; you'd like a classifier that flags domains as DGA or not.
# This was a path that seemed like a good idea until it wasn't....
In [90]:
# Perhaps we will just exclude the weird class from our ML training
not_weird = all_domains[all_domains['class'] != 'weird']
X = not_weird.as_matrix(['length', 'entropy', 'alexa_grams', 'word_grams'])

# Labels (scikit learn uses 'y' for classification labels)
y = np.array(not_weird['class'].tolist())

# Train on a 80/20 split
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
In [91]:
# Now plot the results of the 80/20 split in a confusion matrix
labels = ['legit', 'dga']
cm = confusion_matrix(y_test, y_pred, labels)
plot_cm(cm, labels) 
Confusion Matrix Stats
legit/legit: 99.56% (12111/12165)
legit/dga: 0.44% (54/12165)
dga/legit: 17.99% (86/478)
dga/dga: 82.01% (392/478)
In [92]:
# Well it's definitely better.. but haven't we just cheated by removing
# the weird domains?  Well perhaps, but on some level we're removing
# outliers that are bad exemplars. So to validate that the model is still
# doing the right thing let's try our new model prediction on our hold out sets.

# First train on the whole thing before looking at prediction performance
clf.fit(X, y)

# Pull together our hold out set
hold_out_domains = pd.concat([hold_out_alexa, hold_out_dga], ignore_index=True)

# Add a length field for the domain
hold_out_domains['length'] = [len(x) for x in hold_out_domains['domain']]
hold_out_domains = hold_out_domains[hold_out_domains['length'] > 6]

# Add an entropy field for the domain
hold_out_domains['entropy'] = [entropy(x) for x in hold_out_domains['domain']]

# Compute NGram matches for all the domains and add to our dataframe
hold_out_domains['alexa_grams']= alexa_counts * alexa_vc.transform(hold_out_domains['domain']).T
hold_out_domains['word_grams']= dict_counts * dict_vc.transform(hold_out_domains['domain']).T

hold_out_domains.head()
Out[92]:
domain class length entropy alexa_grams word_grams
0 alcatelonetouch legit 15 3.106891 49.001768 79.015001
1 optumhealthfinancial legit 20 3.584184 68.667084 87.158661
4 elderscrollsonline legit 18 3.016876 76.441834 94.462092
5 mobango legit 7 2.521641 18.020832 22.072036
6 costaud legit 7 2.807355 16.037393 25.008755
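Aside: the feature engineering above is a copy of what we did to the training set. A small helper of ours (not in the original notebook) would keep the two code paths from drifting apart:

def add_features(df):
    # Same feature engineering as the training set: length filter, entropy and ngram scores
    df = df[df['domain'].map(len) > 6].copy()
    df['length'] = [len(x) for x in df['domain']]
    df['entropy'] = [entropy(x) for x in df['domain']]
    df['alexa_grams'] = alexa_counts * alexa_vc.transform(df['domain']).T
    df['word_grams'] = dict_counts * dict_vc.transform(df['domain']).T
    return df

# Usage: hold_out_domains = add_features(pd.concat([hold_out_alexa, hold_out_dga], ignore_index=True))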
In [93]:
# List of feature vectors (scikit learn uses 'X' for the matrix of feature vectors)
hold_X = hold_out_domains.as_matrix(['length', 'entropy', 'alexa_grams', 'word_grams'])

# Labels (scikit learn uses 'y' for classification labels)
hold_y = np.array(hold_out_domains['class'].tolist())

# Now run through the predictive model
hold_y_pred = clf.predict(hold_X)

# Add the prediction array to the dataframe
hold_out_domains['pred'] = hold_y_pred

# Now plot the results
labels = ['legit', 'dga']
cm = confusion_matrix(hold_y, hold_y_pred, labels)
plot_cm(cm, labels) 
Confusion Matrix Stats
legit/legit: 99.51% (6713/6746)
legit/dga: 0.49% (33/6746)
dga/legit: 15.73% (42/267)
dga/dga: 84.27% (225/267)
In [94]:
# Okay so on our 10% hold out set (~7k domains after the length filter) 75 domains were misclassified.
# At this point we've made some good progress so we're going to claim success :)
#       - Out of ~7k domains, 75 were mismarked (33 false positives + 42 missed DGAs)
#       - false positives (Alexa marked as DGA) = ~0.5%
#       - about 84% of the DGA are getting marked

# Note: Alexa 1M results on the 10% hold out (100k domains) were in the same ballpark 
#       - Out of 100k domains 432 were mismarked
#       - false positives (Alexa marked as DGA) = 0.4%
#       - about 70% of the DGA are getting marked

# Now we're going to do some post analysis on how the ML algorithm performed.

# Let's look at a couple of plots to see which domains were misclassified.
# Looking at Length vs. Alexa NGrams
fp_cond = ((hold_out_domains['class'] == 'legit') & (hold_out_domains['pred']=='dga'))
fp = hold_out_domains[fp_cond]
fn_cond = ((hold_out_domains['class'] == 'dga') & (hold_out_domains['pred']=='legit'))
fn = hold_out_domains[fn_cond]
okay = hold_out_domains[hold_out_domains['class'] == hold_out_domains['pred']]
plt.scatter(okay['length'], okay['alexa_grams'], s=100,  c='#aaaaff', label='Okay', alpha=.1)
plt.scatter(fp['length'], fp['alexa_grams'], s=40, c='r', label='False Positive', alpha=.5)
plt.legend()
pylab.xlabel('Domain Length')
pylab.ylabel('Alexa NGram Matches')
Out[94]:
<matplotlib.text.Text at 0x115ea7390>
In [95]:
# Looking at Length vs. Dictionary NGrams
cond = (hold_out_domains['class'] != hold_out_domains['pred'])
misclassified = hold_out_domains[cond]
okay = hold_out_domains[~cond]
plt.scatter(okay['length'], okay['word_grams'], s=100,  c='#aaaaff', label='Okay', alpha=.2)
plt.scatter(misclassified['length'], misclassified['word_grams'], s=40, c='r', label='Misclassified', alpha=.3)
plt.legend()
pylab.xlabel('Domain Length')
pylab.ylabel('Dictionary NGram Matches')
Out[95]:
<matplotlib.text.Text at 0x115e9b990>
In [96]:
misclassified.head()
Out[96]:
domain class length entropy alexa_grams word_grams pred
896 dom2-fan legit 8 3.000000 6.568955 5.656685 dga
1296 mm8mm8-6642 legit 11 2.368523 0.000000 0.000000 dga
1378 4390208 legit 7 2.521641 0.000000 0.000000 dga
1514 sqrt121 legit 7 2.521641 0.000000 0.000000 dga
1687 02022222222 legit 11 0.684038 0.903090 0.000000 dga
In [97]:
misclassified[misclassified['class'] == 'dga'].head()
Out[97]:
domain class length entropy alexa_grams word_grams pred
9184 usbiezgac dga 9 3.169925 7.825928 9.172547 legit
9185 ushcnewo dga 8 3.000000 12.265642 13.904812 legit
9187 usnspdph dga 8 2.500000 5.182278 6.556287 legit
9190 utamehz dga 7 2.807355 10.741352 14.733893 legit
9192 utfowept dga 8 2.750000 7.095911 17.416355 legit
In [98]:
# We can also look at what features the learning algorithm thought were the most important
importances = zip(['length', 'entropy', 'alexa_grams', 'word_grams'], clf.feature_importances_)
importances

# From the list below we see our feature importance scores. There's a lot of feature selection,
# sensitivity study, etc. that you could do at this point (see the sketch after the output below).
Out[98]:
[('length', 0.13110737655160343),
 ('entropy', 0.15589784074688856),
 ('alexa_grams', 0.48657282029928439),
 ('word_grams', 0.22642196240222362)]
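For instance, a crude leave-one-feature-out sensitivity check (a sketch of ours, not an original cell) would look like:

features = ['length', 'entropy', 'alexa_grams', 'word_grams']
for dropped in features:
    keep = [f for f in features if f != dropped]
    scores = sklearn.cross_validation.cross_val_score(clf, not_weird.as_matrix(keep), y, cv=5, n_jobs=4)
    print 'Without %-11s: mean CV accuracy %.4f' % (dropped, scores.mean())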
In [99]:
# Discussion for how to use the resulting models.
# Typically Machine Learning comes in two phases
#    - Training of the Model
#    - Evaluation of new observations against the Model
# This notebook is about exploration of the data and training the model.
# After you have a model that you are satisfied with, just 'pickle' it
# at the end of your training script and then in a separate
# evaluation script 'unpickle' it and evaluate/score new observations
# coming in (through a file, or ZeroMQ, or whatever...)
#
# In this case we'd have to pickle the RandomForest classifier
# and the two vectorizing transforms (alexa_grams and word_grams).
# See 'test_it' below for how to use them in evaluation mode.


# test_it shows how to do evaluation, also fun for manual testing below :)
def test_it(domain):
    
    _alexa_match = alexa_counts * alexa_vc.transform([domain]).T  # Woot matrix multiply and transpose Woo Hoo!
    _dict_match = dict_counts * dict_vc.transform([domain]).T
    _X = [len(domain), entropy(domain), _alexa_match, _dict_match]
    print '%s : %s' % (domain, clf.predict(_X)[0])
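Here's a minimal sketch of that pickle/unpickle round trip (the file name 'dga_model.pkl' is our own choosing); the classifier and the ngram machinery get saved together so an evaluation script has everything test_it needs:

import pickle

# Training script: persist the model plus the vectorizers/count vectors
model = {'clf': clf, 'alexa_vc': alexa_vc, 'alexa_counts': alexa_counts,
         'dict_vc': dict_vc, 'dict_counts': dict_counts}
pickle.dump(model, open('dga_model.pkl', 'wb'))

# Evaluation script: load it back and score new observations as they arrive
model = pickle.load(open('dga_model.pkl', 'rb'))
print model['clf']  # the same trained RandomForest, ready for .predict()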
In [100]:
# Examples (feel free to change these and see the results!)
test_it('google')
test_it('google88')
test_it('facebook')
test_it('1cb8a5f36f')
test_it('pterodactylfarts')
test_it('ptes9dro-dwacty2lfa5rrts')
test_it('beyonce')
test_it('bey666on4ce')
test_it('supersexy')
test_it('yourmomissohotinthesummertime')
test_it('35-sdf-09jq43r')
test_it('clicksecurity')
google : legit
google88 : legit
facebook : legit
1cb8a5f36f : dga
pterodactylfarts : legit
ptes9dro-dwacty2lfa5rrts : dga
beyonce : legit
bey666on4ce : dga
supersexy : legit
yourmomissohotinthesummertime : legit
35-sdf-09jq43r : dga
clicksecurity : legit

Conclusions:

The combination of IPython, Pandas and Scikit Learn let us pull in some junky data, clean it up, plot it, understand it and slap it with some machine learning!

Clearly a lot more formality could be used, plotting learning curves, adjusting for overfitting, feature selection, on and on... there are some really great machine learning resources that cover this deeper material. In particular we highly recommend the work and presentations of Olivier Grisel at INRIA Saclay. http://ogrisel.com/

Some papers on detecting DGA domains:

  • S. Yadav, A. K. K. Reddy, A. L. N. Reddy, and S. Ranjan, “Detecting algorithmically generated malicious domain names,” in Proceedings of the 10th ACM SIGCOMM Internet Measurement Conference (IMC), New York, New York, USA, 2010, pp. 48–61. [http://conferences.sigcomm.org/imc/2010/papers/p48.pdf]
  • S. Yadav, A. K. K. Reddy, A. L. N. Reddy, and S. Ranjan, “Detecting algorithmically generated domain-flux attacks with DNS traffic analysis,” IEEE/ACM Transactions on Networking (TON), vol. 20, no. 5, Oct. 2012.
  • A. Reddy, “Detecting Networks Employing Algorithmically Generated Domain Names,” 2010.
  • Z. Wei-wei and G. Qian, “Detecting Machine Generated Domain Names Based on Morpheme Features,” 2013.
  • P. Barthakur, M. Dahal, and M. K. Ghose, “An Efficient Machine Learning Based Classification Scheme for Detecting Distributed Command & Control Traffic of P2P Botnets,” International Journal of Modern …, 2013.