Java Class File Analysis

In this notebook we're going to explore, understand, and classify Java class files as either 'benign' or 'malicious'. We will explore the data, apply machine learning algorithms, add new features, and apply machine learning again. Then we will test our classifier on a large number of files to measure its effectiveness.

References

  • http://docs.oracle.com/javase/specs/jvms/se7/html/jvms-4.html

Python Modules Used:

Imports and plot defaults

In [117]:
import pandas as pd
print 'pandas version is', pd.__version__
import numpy as np
print 'numpy version is', np.__version__
import sklearn
print 'scikit-learn version is', sklearn.__version__
import matplotlib
print 'matplotlib version is', matplotlib.__version__
import matplotlib.pyplot as plt
pandas version is 0.13.0
numpy version is 1.7.1
scikit-learn version is 0.14.1
matplotlib version is 1.4.1

Plotting defaults and helper functions

In [118]:
%matplotlib inline
plt.rcParams['font.size'] = 18.0
plt.rcParams['figure.figsize'] = 16.0, 5.0
In [119]:
def plot_cm(cm, labels):
    # Compute percentages
    percent = (cm*100.0)/np.array(np.matrix(cm.sum(axis=1)).T)
    print 'Confusion Matrix Stats'
    for i, label_i in enumerate(labels):
        for j, label_j in enumerate(labels):
            print "%s/%s: %.2f%% (%d/%d)" % (label_i, label_j, (percent[i][j]), cm[i][j], cm[i].sum())

    # Show confusion matrix
    # Thanks to kermit666 from stackoverflow
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.grid(b=False)
    cax = ax.matshow(percent, cmap='coolwarm',vmin=0,vmax=100)
    plt.title('')
    fig.colorbar(cax)
    ax.set_xticklabels([''] + labels)
    ax.set_yticklabels([''] + labels)
    plt.xlabel('Predicted')
    plt.ylabel('True')
    plt.show()
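
A quick hypothetical call, just to show the expected inputs: a 2x2 matrix of counts (rows = true label, columns = predicted) and the label order used for both axes.

plot_cm(np.array([[90, 10], [5, 95]]), ['benign', 'malicious'])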
In [120]:
def extract_character_info(string):
    # Return the lengths of consecutive runs of lowercase letters,
    # uppercase letters, and digits in the given string.
    def run_lengths(flags):
        runs = []
        current_length = 0
        for flag in flags:
            if flag:
                current_length += 1
            elif current_length:
                runs.append(current_length)
                current_length = 0
        if current_length:
            runs.append(current_length)
        return runs

    s = str(string)
    lowercase_runs = run_lengths(map(str.islower, s))
    uppercase_runs = run_lengths(map(str.isupper, s))
    digit_runs = run_lengths(map(str.isdigit, s))

    return lowercase_runs, uppercase_runs, digit_runs

Function to extract the features we are interested in from the JSON blob

In [121]:
def extract_features(data):
    features = {}
    try:
        features['sha256'] = data['metadata']['sha256']
        features['size'] = data['metadata']['file_size']
        features['entropy'] = data['metadata']['entropy']
        if 'sourcefile' in data['characteristics']['java']:
            features['source file'] = data['characteristics']['java']['sourcefile']
        else:
            features['source file'] = 'No Source File'
        if 'access_permissions' in data['characteristics']['java']:
            features['ap_count'] = len(data['characteristics']['java']['access_permissions'])
            for ap in data['characteristics']['java']['access_permissions']:
                features[str.lower(str(ap).replace(" ", "_"))] = 1
        features['class name'] = data['characteristics']['java']['class_name']
        features['class_name_slash_count'] = features['class name'].count('/')
        features['class_name_length'] = len(features['class name'])
        cn_lowercase_runs, cn_uppercase_runs, cn_digit_runs = extract_character_info(features['class name'])
        cn_lowercase_run_longest = 0
        cn_lowercase_run_average = 0
        cn_uppercase_run_longest = 0
        cn_uppercase_run_average = 0
        cn_digit_run_longest = 0
        cn_digit_run_average = 0
        if cn_lowercase_runs:
            cn_lowercase_run_longest = np.max(cn_lowercase_runs)
            cn_lowercase_run_average = np.mean(cn_lowercase_runs)
        features['class_name_lowercase_run_longest'] = cn_lowercase_run_longest
        features['class_name_lowercase_run_avg'] = cn_lowercase_run_average
                    
        if cn_uppercase_runs:
            cn_uppercase_run_longest = np.max(cn_uppercase_runs)
            cn_uppercase_run_average = np.mean(cn_uppercase_runs)
        features['class_name_uppercase_run_longest'] = cn_uppercase_run_longest
        features['class_name_uppercase_run_avg'] = cn_uppercase_run_average

        if cn_digit_runs:
            cn_digit_run_longest = np.max(cn_digit_runs)
            cn_digit_run_average = np.mean(cn_digit_runs)
        features['class_name_digit_run_longest'] = cn_digit_run_longest
        features['class_name_digit_run_avg'] = cn_digit_run_average
        
        features['major version'] = data['characteristics']['java']['major_version']
        features['minor version'] = data['characteristics']['java']['minor_version']
        if 'method_names' in data['characteristics']['java']:
            features['method names'] = data['characteristics']['java']['method_names']
        else:
            features['method names'] = []
        features['methods_count'] = len(features['method names'])
        lowercase_run_longest = 0
        lowercase_run_average = 0 
        lowercase_runs = []
        uppercase_run_longest = 0
        uppercase_run_average = 0 
        uppercase_runs = []
        digit_run_longest = 0
        digit_run_average = 0
        digit_runs = []

        for method in features['method names']:
            lc, uc, d = extract_character_info(method)
            lowercase_runs.extend(lc)
            uppercase_runs.extend(uc)
            digit_runs.extend(d)
            
        if lowercase_runs:
            lowercase_run_longest = np.max(lowercase_runs)
            lowercase_run_average = np.mean(lowercase_runs)
        features['method_name_lowercase_run_longest'] = lowercase_run_longest
        features['method_name_lowercase_run_avg'] = lowercase_run_average
                    
        if uppercase_runs:
            uppercase_run_longest = np.max(uppercase_runs)
            uppercase_run_average = np.mean(uppercase_runs)
        features['method_name_uppercase_run_longest'] = uppercase_run_longest
        features['method_name_uppercase_run_avg'] = uppercase_run_average

        if digit_runs:
            digit_run_longest = np.max(digit_runs)
            digit_run_average = np.mean(digit_runs)
        features['method_name_digit_run_longest'] = digit_run_longest
        features['method_name_digit_run_avg'] = digit_run_average
        
        if 'interfaces' in data['characteristics']['java']:
            features['interfaces'] = data['characteristics']['java']['interfaces']
        else:
            features['interfaces'] = []
        features['interface_count'] = len(features['interfaces'])
        features['constant_pool_count'] = data['characteristics']['java']['const_pool_count']

    except KeyError as ke:
        print 'ERROR:', ke, data['metadata']['sha256']
    return features
In [122]:
def load_files(file_list):
    import json
    features_list = []
    for filename in file_list:
        with open(filename,'rb') as f:
            features = extract_features(json.loads(f.read()))
            features_list.append(features)
    return features_list
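
For reference, here is a minimal sketch of the JSON structure extract_features expects, inferred from the key lookups above; the values are made up, not taken from the dataset.

example_blob = {
    'metadata': {'sha256': '0' * 64, 'file_size': 4096, 'entropy': 5.1},
    'characteristics': {'java': {
        'sourcefile': 'Example.java',
        'access_permissions': ['ACC PUBLIC', 'ACC SUPER'],
        'class_name': 'com/example/Example',
        'major_version': 51,
        'minor_version': 0,
        'method_names': ['<init>', 'run'],
        'interfaces': ['java/lang/Runnable'],
        'const_pool_count': 42}}}
print extract_features(example_blob)['class_name_slash_count']   # 2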

We read in the benign and malicious files

In [123]:
# Good files
import glob

good_list = glob.glob('data/clean/*.results')
good_features = load_files(good_list)
print "Files:", len(good_list)
Files: 500
In [124]:
# Bad files
bad_list = glob.glob('data/malicious/*.results')
bad_features = load_files(bad_list)
print "Files:", len(bad_list)
Files: 520
In [125]:
df_good = pd.DataFrame.from_records(good_features)
df_good.fillna(0, inplace=True)
df_good['label'] = 'benign'
df_good.head()
Out[125]:
acc_abstract acc_annotation acc_enum acc_final acc_interface acc_public acc_super acc_synthetic ap_count class name class_name_digit_run_avg class_name_digit_run_longest class_name_length class_name_lowercase_run_avg class_name_lowercase_run_longest class_name_slash_count class_name_uppercase_run_avg class_name_uppercase_run_longest constant_pool_count entropy
0 1 0 0 0 0 1 1 0 3 com/google/common/collect/ForwardingConcurrentMap 0 0 49 6.000000 9 4 1.0 1 54 4.990507 ...
1 0 0 0 0 0 0 1 0 1 org/apache/hadoop/io/compress/GzipCodec$GzipOu... 0 0 82 4.846154 8 5 1.5 5 39 5.205063 ...
2 0 0 0 1 0 0 1 0 2 com/google/common/collect/Multisets$Unmodifiab... 0 0 62 6.625000 11 4 1.0 1 131 4.996721 ...
3 0 0 0 0 0 0 1 0 1 hu/openig/mechanics/StaticDefensePlanner$1 1 1 42 5.666667 9 3 1.0 1 56 5.282413 ...
4 0 0 0 0 0 1 1 0 2 org/apache/commons/io/LineIterator 0 0 34 4.666667 7 4 1.0 1 95 5.285082 ...

5 rows × 36 columns

In [126]:
df_bad = pd.DataFrame.from_records(bad_features)
df_bad.fillna(0, inplace=True)
df_bad['label'] = 'malicious'
df_bad.head()
Out[126]:
acc_final acc_public acc_super ap_count class name class_name_digit_run_avg class_name_digit_run_longest class_name_length class_name_lowercase_run_avg class_name_lowercase_run_longest class_name_slash_count class_name_uppercase_run_avg class_name_uppercase_run_longest constant_pool_count entropy interface_count interfaces major version method names method_name_digit_run_avg
0 0 1 1 2 Main 0 0 4 3.0 3 0 1.000000 1 86 6.114522 0 [] 48 [<init>, init] 0 ...
1 0 0 1 1 YdCdHX/VcZaXVjyy 0 0 16 1.4 3 1 1.333333 2 52 5.539514 0 [] 49 [<init>, ktCgxlqo, <clinit>] 0 ...
2 0 1 1 2 aOcMSp 0 0 6 1.0 1 0 1.500000 2 159 5.953528 0 [] 49 [<init>, gvuNr, <clinit>] 0 ...
3 0 0 1 1 a/zylasqwjlpbqyrwrr 0 0 19 9.0 17 1 0.000000 0 478 6.348531 0 [] 49 [<init>, eiaxyercdfvbgscpbv, yginlmcynkyuohnfh... 0 ...
4 0 1 1 2 tljpjunbjwtqlywm/sdnrybknlf 0 0 27 13.0 16 1 0.000000 0 122 5.376762 0 [] 49 [<init>, dvvwse, <clinit>] 0 ...

5 rows × 31 columns

In [127]:
df = pd.concat([df_bad, df_good], ignore_index=True)
df.fillna(0, inplace=True)

Let's explore the data.

We start by comparing the file sizes.

In [129]:
df.boxplot(column='size', by='label')
plt.ylabel('File Size')
plt.xlabel('')
plt.title('')
plt.suptitle('')
Out[129]:
<matplotlib.text.Text at 0x12b873d90>

Let's zoom in. We can see some separation between the files here.

In [130]:
df.boxplot(column='size', by='label')
plt.ylabel('File Size')
plt.xlabel('')
plt.title('')
plt.suptitle('')
plt.ylim(0, 15000)
Out[130]:
(0, 15000)

Next we compare the entropy.

In [131]:
df.boxplot('entropy', 'label')
plt.ylabel('Entropy')
plt.xlabel('')
plt.title('')
plt.suptitle('')
Out[131]:
<matplotlib.text.Text at 0x12b8ab5d0>

Next we compare the number of items in the constant pool

In [132]:
df.boxplot(column='constant_pool_count', by='label')
plt.ylabel('Constant Pool Count')
plt.xlabel('')
plt.title('')
plt.suptitle('')
Out[132]:
<matplotlib.text.Text at 0x12b8b8650>

It's hard to see the detail, so we zoom in

In [133]:
df.boxplot(column='constant_pool_count', by='label')
plt.xlabel('')
plt.ylabel('Constant Pool Count')
plt.title('')
plt.suptitle('')
plt.ylim(0, 1000)
Out[133]:
(0, 1000)

Here we see the number of methods the class file has.

In [134]:
df.boxplot(column='methods_count', by='label')
plt.ylabel('Number of Methods')
plt.xlabel('')
plt.title('')
plt.suptitle('')
Out[134]:
<matplotlib.text.Text at 0x12b8b1210>

Finally, we check out the number of interfaces the class file has.

In [135]:
df.boxplot(column='interface_count', by='label')
plt.ylabel('Number of Interfaces')
plt.xlabel('')
plt.title('')
plt.suptitle('')
Out[135]:
<matplotlib.text.Text at 0x12b68ced0>

First we try classifying binaries just based on some simple features we already have

In [136]:
my_seed = 1022
my_tsize = .2
In [137]:
import sklearn.ensemble
import sklearn.cross_validation
clf_simple = sklearn.ensemble.RandomForestClassifier(n_estimators=50)
simple_features = ['acc_abstract', 'acc_annotation', 'acc_enum', 'acc_final', 'acc_interface',
                    'acc_public', 'acc_super', 'acc_synthetic', 'ap_count', 'constant_pool_count',
                    'entropy', 'size', 'interface_count', 'major version', 'methods_count', 'minor version']

X = df.as_matrix(simple_features)
y = np.array(df['label'].tolist())

scores = sklearn.cross_validation.cross_val_score(clf_simple, X, y, cv=10)
print("Accuracy: %0.3f (+/- %0.3f)" % (scores.mean(), scores.std() * 2))
Accuracy: 0.927 (+/- 0.044)

Not terribly great, but we're just getting started.

In [138]:
import sklearn.ensemble
from sklearn.metrics import confusion_matrix
from sklearn.cross_validation import train_test_split

# 80/20 Split for predictive test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf_simple.fit(X_train, y_train)
y_pred = clf_simple.predict(X_test)
labels = ['benign', 'malicious']
cm = confusion_matrix(y_test, y_pred, labels)
plot_cm(cm, labels)
Confusion Matrix Stats
benign/benign: 93.07% (94/101)
benign/malicious: 6.93% (7/101)
malicious/benign: 10.68% (11/103)
malicious/malicious: 89.32% (92/103)

Here we have the classifier tell us which features were most important and how important they were. The entropy and constant pool count are the two most important features by far.

In [139]:
importances = zip(simple_features, clf_simple.feature_importances_)
importances.sort(key=lambda k:k[1], reverse=True)
for idx, im in enumerate(importances):
    print (str(idx+1) + ':').ljust(4), im[0].ljust(20), round(im[1], 5)
1:   entropy              0.29813
2:   constant_pool_count  0.24814
3:   size                 0.14489
4:   methods_count        0.08492
5:   interface_count      0.08152
6:   major version        0.05055
7:   ap_count             0.02908
8:   acc_public           0.02446
9:   acc_final            0.01284
10:  acc_abstract         0.00982
11:  minor version        0.00623
12:  acc_super            0.00541
13:  acc_interface        0.00275
14:  acc_enum             0.00125
15:  acc_synthetic        2e-05
16:  acc_annotation       -0.0

What about the method names?

Next we will explore the names of the methods

In [145]:
bad = []
good = []
for strings, label in zip(df['method names'], df['label']):
    for name in strings:
        d = {'method name': name}
        if label == 'malicious' and d not in bad:
            bad.append(d)
        elif label == 'benign' and d not in good:
            good.append(d)

df_method_names_bad = pd.DataFrame.from_records(bad)
df_method_names_good = pd.DataFrame.from_records(good)
In [146]:
df_method_names_bad.head(50)
Out[146]:
method name
0 <init>
1 init
2 ktCgxlqo
3 <clinit>
4 gvuNr
5 eiaxyercdfvbgscpbv
6 yginlmcynkyuohnfhe
7 mtyvzetsjhvnbyz
8 fxxhgjttqfavlooxcb
9 wyjgamzmowywjihkuuf
10 kgthsnqdqutacivcptong
11 qgasjqrogibkblyzourtq
12 glfouhczfxzyskaystx
13 mikczoanebdkwpyb
14 bwssduenvebnvgix
15 wafrcwijizypmitodmb
16 bfznyeevclzzxxqbw
17 jmzisxwtxhekbkl
18 szivddjiptybevduli
19 forwnxmgnutbtdwvptj
20 mwwmrvljafpkwzdiy
21 vvpbdzrhvvnzaieyi
22 qkkxoygluwwlnwbxu
23 dvvwse
24 c
25 k
26 main
27 writeEmbeddedFile
28 bootstrap
29 getJreExecutable
30 addExtension
31 findInDir
32 normalize
33 dissect
34 class$
35 tgznSIAR
36 kWfVWtw
37 BodFzDax
38 xXVBwx
39 VdJiGyZfj
40 taddhnwrkj
41 C
42 ALLATORI_DEMO
43 jvsamhqyvgekftsj
44 knjkb
45 B
46 cmjnkr
47 jmdpes
48 tqffjybms
49 vtvtmh

50 rows × 1 columns

In [147]:
df_method_names_good.head(50)
Out[147]:
method name
0 <init>
1 delegate
2 putIfAbsent
3 remove
4 replace
5 resetState
6 comparator
7 createElementSet
8 elementSet
9 descendingMultiset
10 firstEntry
11 lastEntry
12 pollFirstEntry
13 pollLastEntry
14 headMultiset
15 subMultiset
16 tailMultiset
17 invoke
18 hasNext
19 isValidLine
20 next
21 nextLine
22 close
23 closeQuietly
24 exec
25 getInitial
26 getIntermed
27 getFinal
28 max
29 outputSchema
30 estimateLength
31 appendTo
32 getXPath
33 run
34 secToHMS
35 contribute
36 onBeforeRender
37 setCloseEvent
38 setSelectEvent
39 setChangeEvent
40 setSource
41 statement
42 setDocumentLocator
43 startDocument
44 endDocument
45 startPrefixMapping
46 endPrefixMapping
47 startElement
48 endElement
49 characters

50 rows × 1 columns

When we look at the method names, we see that the benign ones look "normal," and while some of the malicious ones also look "normal," many look like gibberish. To try to capture that, we extract the longest run of consecutive lowercase letters, uppercase letters, and digits from each method name. We also calculate the average length of these runs.
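
To make that concrete, here is what the run extraction returns for one "normal" name and one gibberish-looking name from the tables above (a small illustrative check; each call returns the lowercase, uppercase, and digit run lengths).

print extract_character_info('setDocumentLocator')   # ([3, 7, 6], [1, 1], [])
print extract_character_info('ktCgxlqo')             # ([2, 5], [1], [])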

In [148]:
df.boxplot('method_name_lowercase_run_longest', 'label')
plt.ylabel('Max length of lower case letters')
plt.xlabel('')
plt.title('')
plt.suptitle('')
Out[148]:
<matplotlib.text.Text at 0x112560510>
In [149]:
df.boxplot('method_name_lowercase_run_avg', 'label')
plt.ylabel('Avg length of lower case letters')
plt.xlabel('')
plt.title('')
plt.suptitle('')
Out[149]:
<matplotlib.text.Text at 0x113767090>
In [150]:
df.boxplot('method_name_uppercase_run_longest', 'label')
plt.ylabel('Max length of upper case letters')
plt.xlabel('')
plt.title('')
plt.suptitle('')
Out[150]:
<matplotlib.text.Text at 0x1136dd650>
In [151]:
df.boxplot('method_name_uppercase_run_avg', 'label')
plt.ylabel('Avg length of upper case letters')
plt.xlabel('')
plt.title('')
plt.suptitle('')
Out[151]:
<matplotlib.text.Text at 0x10ea06190>
In [152]:
df.boxplot('method_name_digit_run_longest', 'label')
plt.ylabel('Max length of digits')
plt.xlabel('')
plt.title('')
plt.suptitle('')
Out[152]:
<matplotlib.text.Text at 0x10ea53810>
In [153]:
df.boxplot('method_name_digit_run_avg', 'label')
plt.ylabel('Avg length of digits')
plt.xlabel('')
plt.title('')
plt.suptitle('')
Out[153]:
<matplotlib.text.Text at 0x10eabb2d0>

We train the classifier using these new features.

In [154]:
import sklearn.ensemble
clf_methods = sklearn.ensemble.RandomForestClassifier(n_estimators=50)
method_name_features = ['acc_abstract', 'acc_annotation', 'acc_enum', 'acc_final', 'acc_interface',
                    'acc_public', 'acc_super', 'acc_synthetic', 'ap_count', 'constant_pool_count',
                    'entropy', 'size', 'interface_count', 'major version', 'methods_count',
                    'minor version',
                    'method_name_digit_run_avg', 'method_name_digit_run_longest',
                    'method_name_lowercase_run_avg', 'method_name_lowercase_run_longest',
                    'method_name_uppercase_run_avg', 'method_name_uppercase_run_longest']

X = df.as_matrix(method_name_features)
y = np.array(df['label'].tolist())

scores = sklearn.cross_validation.cross_val_score(clf_methods, X, y, cv=10)
print("Accuracy: %0.3f (+/- %0.3f)" % (scores.mean(), scores.std() * 2))
Accuracy: 0.950 (+/- 0.037)

We do see improvement in the classifier, which is encouraging, but there is definitely still work to be done.

In [155]:
import sklearn.ensemble
from sklearn.metrics import confusion_matrix
from sklearn.cross_validation import train_test_split

# 80/20 Split for predictive test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf_methods.fit(X_train, y_train)
y_pred = clf_methods.predict(X_test)
labels = ['benign', 'malicious']
cm = confusion_matrix(y_test, y_pred, labels)
plot_cm(cm, labels)
Confusion Matrix Stats
benign/benign: 94.90% (93/98)
benign/malicious: 5.10% (5/98)
malicious/benign: 10.38% (11/106)
malicious/malicious: 89.62% (95/106)

We check the importances from the classifier again. The method name features are not the most important, but they rank relatively high in importance, which explains the only slight improvement in the classifier.

In [156]:
importances = zip(method_name_features, clf_methods.feature_importances_)
importances.sort(key=lambda k:k[1], reverse=True)
for idx, im in enumerate(importances[0:15]):
    print (str(idx+1) + ':').ljust(4), im[0].ljust(35), round(im[1], 5)

What about the class names?

In [157]:
for idx, gcn in enumerate(df_good['class name']):
    print gcn
    if idx == 19:
        break
com/google/common/collect/ForwardingConcurrentMap
org/apache/hadoop/io/compress/GzipCodec$GzipOutputStream$ResetableGZIPOutputStream
com/google/common/collect/Multisets$UnmodifiableSortedMultiset
hu/openig/mechanics/StaticDefensePlanner$1
org/apache/commons/io/LineIterator
org/apache/pig/builtin/IntMax
org/apache/commons/lang/time/FastDateFormat$StringLiteral
hu/openig/screen/items/ResearchProductionScreen$15
org/dom4j/XPathException
threadWordlistExec
org/odlabs/wiquery/ui/autocomplete/AbstractAutocompleteComponent$InnerAutocomplete
org/xml/sax/ContentHandler
org/apache/commons/httpclient/protocol/SSLProtocolSocketFactory
org/apache/pig/PigException
org/junit/runners/Enclosed
org/jets3t/service/io/ProgressMonitoredInputStream
org/apache/pig/impl/util/CastUtils
com/google/common/base/Joiner$2
com/google/common/io/CharStreams
org/apache/commons/compress/archivers/cpio/CpioArchiveOutputStream
In [158]:
for idx, gcn in enumerate(df_bad['class name']):
    print gcn
    if idx == 19:
        break
Main
YdCdHX/VcZaXVjyy
aOcMSp
a/zylasqwjlpbqyrwrr
tljpjunbjwtqlywm/sdnrybknlf
Mainer
Main
mNIJnGIOkm/Payload
aHrMCrboe/chspSxY
Main
OSrAfQWThe/SHLeanN
hhIji/XQDODV
a/dwrwbjyhllzu
H
enudwwlhl/wsshvntsenuwajehdujlchpms
Main
iACVKaBQCV/HhtBSGn
GondadGondadExp
Main
enudwwlhl/yhfwcgjacjjauyvut

Much like the method names, the malicious class names look like gibberish. They also tend to be shorter and contain fewer slashes.

In [159]:
df.boxplot('class_name_length', 'label')
plt.ylabel('Class Name Length')
plt.xlabel('')
plt.title('')
plt.suptitle('')
Out[159]:
<matplotlib.text.Text at 0x12b7d9590>
In [160]:
df.boxplot('class_name_slash_count', 'label')
plt.ylabel('Class Name Slash Count')
plt.xlabel('')
plt.title('')
plt.suptitle('')
Out[160]:
<matplotlib.text.Text at 0x12b64ee10>
In [161]:
df.boxplot('class_name_lowercase_run_longest', 'label')
plt.ylabel('Max Run of Lower Case Letters')
plt.xlabel('')
plt.title('')
plt.suptitle('')
Out[161]:
<matplotlib.text.Text at 0x11278b750>
In [162]:
df.boxplot('class_name_lowercase_run_avg', 'label')
plt.ylabel('Avg Run of Lower Case Letters')
plt.xlabel('')
plt.title('')
plt.suptitle('')
Out[162]:
<matplotlib.text.Text at 0x1127ce7d0>
In [163]:
import sklearn.ensemble
from sklearn.metrics import confusion_matrix
from sklearn.cross_validation import train_test_split

clf_all = sklearn.ensemble.RandomForestClassifier(n_estimators=75)
all_features = ['acc_abstract', 'acc_annotation', 'acc_enum', 'acc_final', 'acc_interface',
                    'acc_public', 'acc_super', 'acc_synthetic', 'ap_count', 'constant_pool_count',
                    'entropy', 'interface_count', 'major version', 'methods_count',
                    'size', 'minor version',
                    'method_name_digit_run_avg', 'method_name_digit_run_longest',
                    'method_name_lowercase_run_avg', 'method_name_lowercase_run_longest',
                    'method_name_uppercase_run_avg', 'method_name_uppercase_run_longest',
                    'class_name_digit_run_avg', 'class_name_digit_run_longest',
                    'class_name_length', 'class_name_lowercase_run_avg',
                    'class_name_lowercase_run_longest', 'class_name_slash_count',
                    'class_name_uppercase_run_avg', 'class_name_uppercase_run_longest']                

X = df.as_matrix(all_features)
y = np.array(df['label'].tolist())
labels = ['benign', 'malicious']

scores = sklearn.cross_validation.cross_val_score(clf_all, X, y, cv=10)
print("Accuracy: %0.3f (+/- %0.3f)" % (scores.mean(), scores.std() * 2))
Accuracy: 0.993 (+/- 0.018)

Finally, some significant improvement.

In [172]:
import sklearn.ensemble
from sklearn.metrics import confusion_matrix
from sklearn.cross_validation import train_test_split

# 80/20 Split for predictive test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf_all.fit(X_train, y_train)
y_pred = clf_all.predict(X_test)
labels = ['benign', 'malicious']
cm = confusion_matrix(y_test, y_pred, labels)
plot_cm(cm, labels)
Confusion Matrix Stats
benign/benign: 97.06% (99/102)
benign/malicious: 2.94% (3/102)
malicious/benign: 0.00% (0/102)
malicious/malicious: 100.00% (102/102)

One of the features of the classifier is the ability to give the probability that a feature set belongs to a specific class. In the code below, we ask the classifier for the probability that each feature set (file) belongs to the malicious class, and only mark a file as malicious if more than 80% of the trees declare it malicious.

In [173]:
y_probs = clf_all.predict_proba(X_test)[:,1]
thres = .80 # This can be set to whatever you'd like
y_pred[y_probs>thres] = 'malicious'
y_pred[y_probs<=thres] = 'benign'
cm = confusion_matrix(y_test, y_pred, labels)
plot_cm(cm, labels)
Confusion Matrix Stats
benign/benign: 99.02% (101/102)
benign/malicious: 0.98% (1/102)
malicious/benign: 2.94% (3/102)
malicious/malicious: 97.06% (99/102)
We do the same, but set the threshold lower, to only 20%.
In [174]:
y_probs = clf_all.predict_proba(X_test)[:,1]
thres = .20 # This can be set to whatever you'd like
y_pred[y_probs>thres] = 'malicious'
y_pred[y_probs<=thres] = 'benign'
cm = confusion_matrix(y_test, y_pred, labels)
plot_cm(cm, labels)
Confusion Matrix Stats
benign/benign: 95.10% (97/102)
benign/malicious: 4.90% (5/102)
malicious/benign: 0.00% (0/102)
malicious/malicious: 100.00% (102/102)

You can see that we still get good results even when modifying the threshold level. When we plot the scores, we see that most of the scores are either near 0 or 1. This indicates that the classifier is confident about most of the labels.

In [175]:
scores = clf_all.predict_proba(X_test)[:,1]
plt.hist(scores, bins=20)
plt.grid(True)
plt.show()

We ask the classifier for the feature importances again, and we see that the features extracted from the class name now totally dominate the classifier. On one hand, we got great results with this classifier; on the other, we see that it is almost completely reliant on one major aspect of the file.

In [176]:
importances = zip(all_features, clf_all.feature_importances_)
importances.sort(key=lambda k:k[1], reverse=True)
sum = 0
for idx, im in enumerate(importances):
    sum += round(im[1], 5)
    print (str(idx+1) + ':').ljust(4), im[0].ljust(35), round(im[1], 5), sum
1:   class_name_slash_count              0.31448 0.31448
2:   class_name_length                   0.28798 0.60246
3:   class_name_lowercase_run_longest    0.08185 0.68431
4:   entropy                             0.06346 0.74777
5:   class_name_lowercase_run_avg        0.06043 0.8082
6:   constant_pool_count                 0.03532 0.84352
7:   size                                0.02862 0.87214
8:   class_name_uppercase_run_longest    0.02839 0.90053
9:   class_name_uppercase_run_avg        0.02537 0.9259
10:  method_name_lowercase_run_avg       0.01137 0.93727
11:  interface_count                     0.01073 0.948
12:  class_name_digit_run_avg            0.00947 0.95747
13:  method_name_lowercase_run_longest   0.00751 0.96498
14:  acc_public                          0.00604 0.97102
15:  methods_count                       0.00509 0.97611
16:  method_name_uppercase_run_longest   0.00459 0.9807
17:  ap_count                            0.0039 0.9846
18:  class_name_digit_run_longest        0.00336 0.98796
19:  major version                       0.00286 0.99082
20:  method_name_uppercase_run_avg       0.00277 0.99359
21:  acc_abstract                        0.00181 0.9954
22:  method_name_digit_run_avg           0.00148 0.99688
23:  acc_final                           0.00105 0.99793
24:  method_name_digit_run_longest       0.00103 0.99896
25:  minor version                       0.00054 0.9995
26:  acc_interface                       0.0005 1.0
27:  acc_super                           1e-05 1.00001
28:  acc_annotation                      0.0 1.00001
29:  acc_enum                            0.0 1.00001
30:  acc_synthetic                       0.0 1.00001
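
One way to quantify how reliant the model is on the class name is to re-run cross-validation without those features. A quick sketch, reusing the variables defined above:

# Re-run cross-validation with all class_name_* features removed.
non_class_name_features = [f for f in all_features if not f.startswith('class_name')]
X_nc = df.as_matrix(non_class_name_features)
scores = sklearn.cross_validation.cross_val_score(clf_all, X_nc, y, cv=10)
print("Accuracy without class name features: %0.3f (+/- %0.3f)" % (scores.mean(), scores.std() * 2))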

Using a different classifier

Let's try a different classifier: Extra Trees Classifier (like RandomForest, but even more random).
http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html

In [179]:
clf_er = sklearn.ensemble.ExtraTreesClassifier(n_estimators=50)
X_er = df.as_matrix(all_features)
y_er = np.array(df['label'].tolist())
labels = ['benign', 'malicious']

scores = sklearn.cross_validation.cross_val_score(clf_er, X_er, y_er, cv=10)
print("Accuracy: %0.3f (+/- %0.3f)" % (scores.mean(), scores.std() * 2))
Accuracy: 0.997 (+/- 0.013)
In [180]:
# 80/20 Split for predictive test
X_train, X_test, y_train, y_test = train_test_split(X_er, y_er, test_size=my_tsize, random_state=my_seed)
clf_er.fit(X_train, y_train)
y_pred = clf_er.predict(X_test)
cm = confusion_matrix(y_test, y_pred, labels)
plot_cm(cm, labels)
Confusion Matrix Stats
benign/benign: 100.00% (90/90)
benign/malicious: 0.00% (0/90)
malicious/benign: 0.00% (0/114)
malicious/malicious: 100.00% (114/114)
In [181]:
import sklearn.svm
import sklearn.preprocessing
clf_svc = sklearn.svm.SVC()
X_svc = df.as_matrix(all_features)
X_svc = sklearn.preprocessing.scale(X_svc)
y_svc = np.array(df['label'].tolist())
labels = ['benign', 'malicious']

scores = sklearn.cross_validation.cross_val_score(clf_svc, X_svc, y_svc, cv=10)
print("Accuracy: %0.3f (+/- %0.3f)" % (scores.mean(), scores.std() * 2))
Accuracy: 0.993 (+/- 0.015)
In [182]:
# 80/20 Split for predictive test
X_train, X_test, y_train, y_test = train_test_split(X_svc, y_svc, test_size=my_tsize, random_state=my_seed)
clf_svc.fit(X_train, y_train)
y_pred = clf_svc.predict(X_test)
cm = confusion_matrix(y_test, y_pred, labels)
plot_cm(cm, labels)
Confusion Matrix Stats
benign/benign: 98.89% (89/90)
benign/malicious: 1.11% (1/90)
malicious/benign: 0.00% (0/114)
malicious/malicious: 100.00% (114/114)
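
Note that the scaling above was fit on the full feature matrix before the train/test split, which leaks a little information about the test set. A sketch of the same evaluation with a Pipeline, so the scaler is fit inside each cross-validation fold:

import sklearn.pipeline
clf_svc_pipe = sklearn.pipeline.Pipeline([('scale', sklearn.preprocessing.StandardScaler()),
                                          ('svc', sklearn.svm.SVC())])
scores = sklearn.cross_validation.cross_val_score(clf_svc_pipe, df.as_matrix(all_features), y_svc, cv=10)
print("Accuracy: %0.3f (+/- %0.3f)" % (scores.mean(), scores.std() * 2))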
In [183]:
# Now we can use scikit learn's cross validation to assess predictive performance.
scores = sklearn.cross_validation.cross_val_score(clf_all, X, y, cv=20)
print("Accuracy: %0.3f (+/- %0.3f)" % (scores.mean(), scores.std() * 2))
Accuracy: 0.992 (+/- 0.029)
In [184]:
# Now we can use scikit learn's cross validation to assess predictive performance.
scores = sklearn.cross_validation.cross_val_score(clf_er, X_er, y_er, cv=20)
print("Accuracy: %0.3f (+/- %0.3f)" % (scores.mean(), scores.std() * 2))
Accuracy: 0.995 (+/- 0.017)

Testing on a large corpus of files

In the next couple of steps, we train a classifier on all the data and then test it on approximately 366K files. This data is not labeled, but I would expect almost all of the files to be benign, with perhaps a few malicious ones.

In [185]:
clf_everything = sklearn.ensemble.RandomForestClassifier(n_estimators=50)
all_features = ['acc_abstract', 'acc_annotation', 'acc_enum', 'acc_final', 'acc_interface',
                    'acc_public', 'acc_super', 'acc_synthetic', 'ap_count', 'constant_pool_count',
                    'entropy', 'size', 'interface_count', 'major version', 'methods_count', 'minor version',
                    'method_name_digit_run_avg', 'method_name_digit_run_longest',
                    'method_name_lowercase_run_avg', 'method_name_lowercase_run_longest',
                    'method_name_uppercase_run_avg', 'method_name_uppercase_run_longest',
                    'class_name_digit_run_avg', 'class_name_digit_run_longest',
                    'class_name_length', 'class_name_lowercase_run_avg',
                    'class_name_lowercase_run_longest', 'class_name_slash_count',
                    'class_name_uppercase_run_avg', 'class_name_uppercase_run_longest'] 

X_all = df.as_matrix(all_features)
y_all = np.array(df['label'].tolist())

clf_everything.fit(X_all, y_all)
Out[185]:
RandomForestClassifier(bootstrap=True, compute_importances=None,
            criterion='gini', max_depth=None, max_features='auto',
            min_density=None, min_samples_leaf=1, min_samples_split=2,
            n_estimators=50, n_jobs=1, oob_score=False, random_state=None,
            verbose=0)
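
Since this model gets reused against a large corpus, it could also be persisted and reloaded later. A sketch using joblib, which ships with scikit-learn; the filename here is made up:

from sklearn.externals import joblib
joblib.dump(clf_everything, 'clf_everything.joblib')
# later / in another session:
# clf_everything = joblib.load('clf_everything.joblib')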
In [190]:
java_big_pile_df = pd.read_hdf('data/java_clean_df.hd5', 'table')
/opt/visiblerisk/lib/python2.7/site-packages/pandas/io/pytables.py:520: DeprecationWarning: openFile() is pending deprecation, use open_file() instead. You may use the pt2to3 tool to update your source code.
  self._handle = tables.openFile(self._path, self._mode, **kwargs)
/opt/visiblerisk/lib/python2.7/site-packages/pandas/io/pytables.py:1017: DeprecationWarning: getNode() is pending deprecation, use get_node() instead. You may use the pt2to3 tool to update your source code.
  return self._handle.getNode(self.root, key)

Again, one of the features of the classifier is the ability to give the probability that a feature set belongs to a specific class. In the code below, we ask the classifier for the probability that each feature set (file) belongs to the malicious class. We interpret the results as follows: anything less than 0.5 is benign, 0.5 to 0.8 is a gray area, and 0.8 and above is malicious.

In [191]:
clean = 0
gray = 0
bad = 0
for x in java_big_pile_df.as_matrix(all_features):
    try:
        score = clf_everything.predict_proba(x)[:,1][0]
        if score < 0.5:
            clean += 1
        elif score < 0.8:
            gray += 1
        else:
            bad += 1
    except:
        print "Sad"
        print x
        break

print java_big_pile_df.shape
print clean
print gray
print bad
(366341, 35)
339771
10971
15599
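
The same bucketing can be computed in one vectorized call instead of a per-row loop; a sketch using the same thresholds as above:

# Score every row at once and count the three buckets.
probs = clf_everything.predict_proba(java_big_pile_df.as_matrix(all_features))[:, 1]
print (probs < 0.5).sum(), ((probs >= 0.5) & (probs < 0.8)).sum(), (probs >= 0.8).sum()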

Wow. That looks mostly horrible. We did only start with about 500 files of each class, so maybe more training data will help. We expand to about 2,000 malicious files and 2,000 benign files, randomly selecting 2,000 files from the large corpus and labeling them as benign.

In [219]:
java_more_bad_df = pd.read_hdf('data/java_malicious_df.hd5', 'table')
/opt/visiblerisk/lib/python2.7/site-packages/pandas/io/pytables.py:520: DeprecationWarning: openFile() is pending deprecation, use open_file() instead. You may use the pt2to3 tool to update your source code.
  self._handle = tables.openFile(self._path, self._mode, **kwargs)
/opt/visiblerisk/lib/python2.7/site-packages/pandas/io/pytables.py:1017: DeprecationWarning: getNode() is pending deprecation, use get_node() instead. You may use the pt2to3 tool to update your source code.
  return self._handle.getNode(self.root, key)
In [220]:
java_big_pile_df.head()
Out[220]:
acc_abstract acc_annotation acc_enum acc_final acc_interface acc_public acc_super acc_synthetic ap_count attributes count class name class_name_digit_run_avg class_name_digit_run_longest class_name_length class_name_lowercase_run_avg class_name_lowercase_run_longest class_name_slash_count class_name_uppercase_run_avg class_name_uppercase_run_longest constant_pool_count
0 0 0 0 0 0 1 1 0 2 1 com/jidesoft/combobox/DateChooserPanel 0 0 38 5.333333 8 3 1 1 1037 ...
1 1 0 0 0 1 1 0 0 3 0 org/jmol/modelset/BondIterator 0 0 30 5.000000 8 3 1 1 11 ...
2 0 0 0 0 0 1 1 0 2 2 org/hibernate/engine/query/ParameterParser 0 0 42 6.000000 9 4 1 1 152 ...
3 0 0 0 0 0 1 1 0 2 1 com/intellij/updater/Utils 0 0 26 5.500000 8 3 1 1 330 ...
4 0 0 0 0 0 1 1 0 2 1 com/kiwisoft/db/driver/SybaseDriver 0 0 35 4.833333 8 4 1 1 151 ...

5 rows × 35 columns

In [221]:
java_big_pile_df['class_name_length'].describe()
Out[221]:
count    366341.000000
mean         48.081181
std          19.812234
min           1.000000
25%          36.000000
50%          47.000000
75%          61.000000
max         161.000000
Name: class_name_length, dtype: float64
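
For comparison, a quick sketch of the same statistic on the labeled training set:

df.groupby('label')['class_name_length'].describe()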

Randomize list

In [239]:
java_random_df = java_big_pile_df.reindex(np.random.permutation(java_big_pile_df.index))
java_random_2k_df = java_random_df[0:2000]
java_random_the_rest_df = java_random_df[2000:]
In [240]:
java_random_2k_df['label'] = 'benign'
In [241]:
java_more_bad_df['label'] = 'malicious'
In [242]:
java_4k_df = pd.concat([java_more_bad_df, java_random_2k_df], ignore_index=True)
java_4k_df.fillna(0, inplace=True)
In [243]:
clf_4k = sklearn.ensemble.RandomForestClassifier(n_estimators=75)
all_features = ['acc_abstract', 'acc_annotation', 'acc_enum', 'acc_final', 'acc_interface',
                'acc_public', 'acc_super', 'acc_synthetic', 'ap_count',
                'class_name_digit_run_avg', 'class_name_digit_run_longest',
                'class_name_length', 'class_name_lowercase_run_avg',
                'class_name_lowercase_run_longest', 'class_name_slash_count',
                'class_name_uppercase_run_avg', 'class_name_uppercase_run_longest',
                'constant_pool_count', 'entropy', 'interface_count', 'major version',
                'method_name_digit_run_avg', 'method_name_digit_run_longest',
                'method_name_lowercase_run_avg', 'method_name_lowercase_run_longest',
                'method_name_uppercase_run_avg', 'method_name_uppercase_run_longest',
                'methods_count', 'minor version', 'size']

X = java_4k_df.as_matrix(all_features)
y = np.array(java_4k_df['label'].tolist())

scores = sklearn.cross_validation.cross_val_score(clf_4k, X, y, cv=10)
print("Accuracy: %0.3f (+/- %0.3f)" % (scores.mean(), scores.std() * 2))
Accuracy: 0.989 (+/- 0.008)

Testing out this new classifier still shows us good results.

In [244]:
import sklearn.ensemble
from sklearn.metrics import confusion_matrix
from sklearn.cross_validation import train_test_split

# 80/20 Split for predictive test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf_4k.fit(X_train, y_train)
y_pred = clf_4k.predict(X_test)
labels = ['benign', 'malicious']
cm = confusion_matrix(y_test, y_pred, labels)
plot_cm(cm, labels)
Confusion Matrix Stats
benign/benign: 99.00% (398/402)
benign/malicious: 1.00% (4/402)
malicious/benign: 0.79% (3/382)
malicious/malicious: 99.21% (379/382)
In [245]:
# Feature Selection
# Which features best differentiated the two classes?
# Here we're going to grab the feature_importances from the classifier itself, 
importances = zip(all_features, clf_4k.feature_importances_)
importances.sort(key=lambda k:k[1], reverse=True)
sum = 0
for idx, im in enumerate(importances):
    sum += round(im[1], 5)
    print (str(idx+1) + ':').ljust(4), im[0].ljust(35), round(im[1], 5), sum
1:   class_name_slash_count              0.25082 0.25082
2:   class_name_length                   0.22822 0.47904
3:   entropy                             0.0733 0.55234
4:   constant_pool_count                 0.06575 0.61809
5:   class_name_uppercase_run_avg        0.06179 0.67988
6:   size                                0.05436 0.73424
7:   class_name_lowercase_run_longest    0.05158 0.78582
8:   class_name_uppercase_run_longest    0.04584 0.83166
9:   method_name_lowercase_run_longest   0.03077 0.86243
10:  method_name_lowercase_run_avg       0.02461 0.88704
11:  class_name_lowercase_run_avg        0.02118 0.90822
12:  interface_count                     0.01524 0.92346
13:  major version                       0.01305 0.93651
14:  method_name_uppercase_run_longest   0.01267 0.94918
15:  method_name_uppercase_run_avg       0.01198 0.96116
16:  methods_count                       0.01062 0.97178
17:  ap_count                            0.00603 0.97781
18:  class_name_digit_run_avg            0.00561 0.98342
19:  minor version                       0.00547 0.98889
20:  acc_public                          0.00262 0.99151
21:  acc_abstract                        0.00221 0.99372
22:  class_name_digit_run_longest        0.00133 0.99505
23:  acc_super                           0.00132 0.99637
24:  acc_final                           0.00128 0.99765
25:  method_name_digit_run_avg           0.00089 0.99854
26:  method_name_digit_run_longest       0.0008 0.99934
27:  acc_interface                       0.00065 0.99999
28:  acc_annotation                      0.0 0.99999
29:  acc_enum                            0.0 0.99999
30:  acc_synthetic                       0.0 0.99999

Let's test on the large corpus again.

In [246]:
clf_everything_4k = sklearn.ensemble.RandomForestClassifier(n_estimators=50)

X_all = java_4k_df.as_matrix(all_features)
y_all = np.array(java_4k_df['label'].tolist())

clf_everything_4k.fit(X_all, y_all)
Out[246]:
RandomForestClassifier(bootstrap=True, compute_importances=None,
            criterion='gini', max_depth=None, max_features='auto',
            min_density=None, min_samples_leaf=1, min_samples_split=2,
            n_estimators=50, n_jobs=1, oob_score=False, random_state=None,
            verbose=0)
In [247]:
clean = 0
gray = 0
bad = 0
X_rest = java_random_the_rest_df.as_matrix(all_features)
for x in X_rest:
    score = clf_everything_4k.predict_proba(x)[:,1][0]
    if score < 0.5:
        clean += 1
    elif score < 0.8:
        gray += 1
    else:
        bad += 1

print java_random_the_rest_df.shape[0]
print clean
print gray
print bad
364341
359198
3766
1377

That's much better, but there is still a larger number of files in the gray area or marked as malicious than I would expect. This means we either need more training data or better features.
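
One way to decide between the two is to look at what is actually being flagged. A sketch that pulls out the flagged rows for manual review; the columns shown are just examples:

# Rows from the large corpus that the classifier scores as malicious.
probs_rest = clf_everything_4k.predict_proba(X_rest)[:, 1]
flagged = java_random_the_rest_df[probs_rest >= 0.8]
print flagged[['class name', 'class_name_slash_count', 'entropy']].head(10)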
