Approach:
import sklearn.feature_extraction
sklearn.__version__
'0.14.1'
import pandas as pd
pd.__version__
'0.13.1'
# Plotting defaults
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['font.size'] = 18.0
plt.rcParams['figure.figsize'] = 12.0, 5.0
# A plotting helper method
def plot_it(df,label_x,label_y):
    """Bar-plot the transposed frame on a styled axis and return the axis.

    df: DataFrame whose transpose is drawn as a log-x bar chart.
    label_x, label_y: axis label strings.
    Returns the matplotlib Axes so callers can tweak limits afterwards.
    """
    style = {'axisbg': '#EEEEE5'}  # beige background (pre-2.0 matplotlib kwarg)
    fig, ax = plt.subplots(subplot_kw=style)
    ax.grid(color='grey', linestyle='solid')
    df.T.plot(kind='bar', logx=True, rot=0, ax=ax, colormap='PiYG')
    ax.legend(loc=0, prop={'size':14})
    plt.xlabel(label_x)
    plt.ylabel(label_y)
    # Stretch the axes to fill the entire figure canvas.
    plt.subplots_adjust(left=0, right=1, bottom=0, top=1, wspace=0, hspace=0)
    return ax
# Read in a set of SQL statements from various sources
import os

# Each file under data/ holds one SQL statement per line; the filename stem
# ('legit.*' vs anything else) supplies the class label.
basedir = 'data'
filelist = os.listdir(basedir)
df_list = []
for file in filelist:  # NOTE(review): 'file' shadows the Python builtin
    # sep='|||' is a multi-char separator chosen so each whole line lands in
    # one 'raw_sql' column (pandas falls back to the python parser engine).
    df = pd.read_csv(os.path.join(basedir,file), sep='|||', names=['raw_sql'], header=None)
    # Files named 'legit.*' are benign samples; everything else is malicious.
    df['type'] = 'legit' if file.split('.')[0] == 'legit' else 'malicious'
    df_list.append(df)
# Combine all sources into one frame and drop unparseable (NaN) rows.
dataframe = pd.concat(df_list, ignore_index=True)
dataframe.dropna(inplace=True)
print dataframe['type'].value_counts()
dataframe.head()
malicious 12892 legit 1003 dtype: int64
raw_sql | type | |
---|---|---|
0 | '; exec master..xp_cmdshell 'ping 10.10.1.2'-- | malicious |
1 | create user name identified by 'pass123' | malicious |
2 | create user name identified by pass123 tempora... | malicious |
3 | exec sp_addlogin 'name' , 'password' | malicious |
4 | exec sp_addsrvrolemember 'name' , 'sysadmin' | malicious |
5 rows × 2 columns
!which python
/usr/bin/python
# Use the SQLParse module: sqlparse is a non-validating SQL parser module for Python
# https://github.com/andialbrecht/sqlparse
import sqlparse
def parse_it(raw_sql):
parsed = sqlparse.parse(unicode(raw_sql,'utf-8'))
return [token._get_repr_name() for parse in parsed for token in parse.tokens if token._get_repr_name() != 'Whitespace']
dataframe['parsed_sql'] = dataframe['raw_sql'].map(lambda x: parse_it(x))
dataframe.head()
raw_sql | type | parsed_sql | |
---|---|---|---|
0 | '; exec master..xp_cmdshell 'ping 10.10.1.2'-- | malicious | [Single, Identifier, Float, Float, Float, Erro... |
1 | create user name identified by 'pass123' | malicious | [DDL, Keyword, Identifier, Keyword, Single] |
2 | create user name identified by pass123 tempora... | malicious | [DDL, Keyword, Identifier, Keyword, Identifier... |
3 | exec sp_addlogin 'name' , 'password' | malicious | [Keyword, Identifier, IdentifierList] |
4 | exec sp_addsrvrolemember 'name' , 'sysadmin' | malicious | [Keyword, Identifier, IdentifierList] |
5 rows × 3 columns
# Looking at the SQL tokens is 'kinda' interesting but sequences of tokens and transitions
# between tokens seems more meaningful so we're also going to compute sequences by
# computing NGrams for every SQL statement...
def ngrams(lst, N):
    """Return every n-gram of lst for n = 1..N, as stringified tuples.

    e.g. ngrams(['a','b'], 2) -> ["('a',)", "('b',)", "('a', 'b')"]
    An N larger than len(lst) simply contributes no extra grams.

    Fixes vs. original: 'xrange' was Python-2-only ('range' is equivalent
    here), and the locals 'ngrams'/'tuple' shadowed the function itself and
    the builtin.
    """
    grams = []
    for size in range(1, N + 1):
        # zip over `size` staggered slices yields every run of that length.
        grams.extend(zip(*(lst[i:] for i in range(size))))
    # Stringify so the grams are printable/hashable feature names.
    return [str(gram) for gram in grams]
dataframe['sequences'] = dataframe['parsed_sql'].map(lambda x: ngrams(x, 3))
# Helper method
def token_expansion(series, types):
    """Flatten per-statement token lists into parallel (token, type) Series.

    series: iterable of token lists (one list per SQL statement).
    types: iterable of class labels aligned with `series`.
    Returns two pandas Series: every token, and the label of the statement
    it came from (label repeated once per token).

    Fix vs. original: empty input made zip(*[]) unpack raise ValueError;
    now returns a pair of empty Series instead.
    """
    pairs = [(token, token_type)
             for t_list, token_type in zip(series, types)
             for token in t_list]
    if not pairs:
        return pd.Series([], dtype=object), pd.Series([], dtype=object)
    _tokens, _types = zip(*pairs)
    return pd.Series(_tokens), pd.Series(_types)
dataframe['sequences']
0 [('Single',), ('Identifier',), ('Float',), ('F... 1 [('DDL',), ('Keyword',), ('Identifier',), ('Ke... 2 [('DDL',), ('Keyword',), ('Identifier',), ('Ke... 3 [('Keyword',), ('Identifier',), ('IdentifierLi... 4 [('Keyword',), ('Identifier',), ('IdentifierLi... 5 [('DML',), ('Keyword',), ('Identifier',), ('Ke... 6 [('Keyword',), ('Keyword',), ('Keyword',), ('I... 7 [('DML',), ('Keyword',), ('Function',), ('Keyw... 8 [('Integer',), ('Error',), ('Integer',), ('Int... 9 [('Integer',), ('Keyword',), ('Comparison',), ... 10 [('Integer',), ('Single',), ('Integer',), ('Si... 11 [('Integer',), ('Keyword',), ('Function',), ('... 12 [('Error',), ('Error',), ('Punctuation',), ('O... 13 [('Integer',), ('Error',), ('Error',), ('Integ... 14 [('Integer',), ('Single',), ('Integer',), ('In... ... 13881 [('DML',), ('Identifier',), ('Keyword',), ('Co... 13882 [('DML',), ('Identifier',), ('Keyword',), ('Co... 13883 [('DML',), ('Identifier',), ('Keyword',), ('Co... 13884 [('DML',), ('Identifier',), ('Keyword',), ('Co... 13885 [('DML',), ('Identifier',), ('Keyword',), ('Id... 13886 [('DML',), ('Identifier',), ('Keyword',), ('Co... 13887 [('Identifier',), ('Builtin',), ('Identifier',... 13888 [('Identifier',), ('Function',), ('Identifier'... 13889 [('Identifier',), ('Function',), ('Identifier'... 13890 [('Identifier',), ('Function',), ('Identifier'... 13891 [('Identifier',), ('Function',), ('Identifier'... 13892 [('Identifier',), ('Function',), ('Identifier'... 13893 [('Identifier',), ('Function',), ('Identifier'... 13894 [('Keyword',), ('Identifier',), ('Function',),... 13895 [('Keyword',), ('IdentifierList',), ('DML',), ... Name: sequences, Length: 13895, dtype: object
# The data hacking repository has a simple stats module we're going to use
import data_hacking.simple_stats as ss
# Spin up our g_test class
g_test = ss.GTest()
# Here we'd like to see how various sql tokens and transitions are related.
# Is there an association with particular token sets and malicious SQL statements.
tokens, types = token_expansion(dataframe['sequences'], dataframe['type'])
df_ct, df_cd, df_stats = g_test.highest_gtest_scores(tokens, types, matches=0, N=0)
df_stats.sort('malicious_g', ascending=0).head(10)
# The table below shows raw counts, conditional distributions, expected counts, and g-test score.
legit | malicious | legit_cd | malicious_cd | total_cd | legit_exp | legit_g | malicious_exp | malicious_g | |
---|---|---|---|---|---|---|---|---|---|
('Single',) | 7 | 10984 | 0.000637 | 0.999363 | 10991 | 1121.927951 | -71.076512 | 9869.072049 | 2351.319327 |
('Single', 'Identifier') | 0 | 8309 | 0.000000 | 1.000000 | 8309 | 848.157524 | 0.000000 | 7460.842476 | 1789.275422 |
('Punctuation',) | 152 | 7707 | 0.019341 | 0.980659 | 7859 | 802.222889 | -505.705813 | 7056.777111 | 1358.598582 |
('Identifier',) | 1284 | 17011 | 0.070183 | 0.929817 | 18295 | 1867.498123 | -962.022691 | 16427.501877 | 1187.480739 |
('Identifier', 'Single') | 2 | 4222 | 0.000473 | 0.999527 | 4224 | 431.173111 | -21.493450 | 3792.826889 | 905.174233 |
('Single', 'Identifier', 'Single') | 0 | 4170 | 0.000000 | 1.000000 | 4170 | 425.660955 | 0.000000 | 3744.339045 | 897.975510 |
('Identifier', 'Single', 'Identifier') | 0 | 4162 | 0.000000 | 1.000000 | 4162 | 424.844339 | 0.000000 | 3737.155661 | 896.252775 |
('Identifier', 'Identifier') | 4 | 3957 | 0.001010 | 0.998990 | 3961 | 404.326869 | -36.927434 | 3556.673131 | 844.111737 |
('Keyword', 'Keyword', 'DML') | 18 | 3248 | 0.005511 | 0.994489 | 3266 | 333.383376 | -105.081169 | 2932.616624 | 663.529712 |
('Keyword', 'DML', 'IdentifierList') | 28 | 3157 | 0.008791 | 0.991209 | 3185 | 325.115142 | -137.310594 | 2859.884858 | 624.081095 |
10 rows × 9 columns
# Now plot the head() and the tail() of the dataframe to see who's been naughty
# or nice (head = ngrams most associated with malicious SQL, tail = most legit).
sorted_df = df_stats.sort('malicious_g', ascending=0)  # NOTE(review): DataFrame.sort was removed in later pandas (sort_values)
naughty = sorted_df.head(7)
nice = sorted_df.tail(7).sort('malicious_g', ascending=0)
naughty_and_nice = pd.concat([naughty, nice])
ax = plot_it(naughty_and_nice[['malicious_g']],'SQL Command Types','G-Test Scores')
ax.set_xlim(.2, 1.4)
(0.2, 1.4)
# Documentation in sqlparse for the mapping can be found here:
# https://github.com/andialbrecht/sqlparse/blob/master/sqlparse/keywords.py
# or here
# https://github.com/andialbrecht/sqlparse/blob/master/sqlparse/lexer.py
# Here we look at example of the SQL sequence that G-Test has indicated are good
# indicators of SQL injections.
dataframe[dataframe['sequences'].map(lambda x: "('Single', 'Identifier')" in x)].head()
raw_sql | type | parsed_sql | sequences | |
---|---|---|---|---|
0 | '; exec master..xp_cmdshell 'ping 10.10.1.2'-- | malicious | [Single, Identifier, Float, Float, Float, Erro... | [('Single',), ('Identifier',), ('Float',), ('F... |
44 | anything' or 'x'='x | malicious | [Identifier, Single, Identifier, Single, Ident... | [('Identifier',), ('Single',), ('Identifier',)... |
49 | '; exec master..xp_cmdshell 'ping aaa.bbb.ccc.... | malicious | [Single, Identifier, Error, Single] | [('Single',), ('Identifier',), ('Error',), ('S... |
54 | '; if not(select system_user) <> 'sa' waitfor ... | malicious | [Single, Identifier, Single, Integer, Placehol... | [('Single',), ('Identifier',), ('Single',), ('... |
55 | '; if is_srvrolemember('sysadmin') > 0 waitfor... | malicious | [Single, Identifier, Single, Integer, Placehol... | [('Single',), ('Identifier',), ('Single',), ('... |
5 rows × 4 columns
dataframe[dataframe['sequences'].map(lambda x: "('Punctuation',)" in x)].head()
raw_sql | type | parsed_sql | sequences | |
---|---|---|---|---|
2 | create user name identified by pass123 tempora... | malicious | [DDL, Keyword, Identifier, Keyword, Identifier... | [('DDL',), ('Keyword',), ('Identifier',), ('Ke... |
6 | grant connect to name; grant resource to name; | malicious | [Keyword, Keyword, Keyword, Identifier, Punctu... | [('Keyword',), ('Keyword',), ('Keyword',), ('I... |
7 | insert into users(login, password, level) valu... | malicious | [DML, Keyword, Function, Keyword, Punctuation,... | [('DML',), ('Keyword',), ('Function',), ('Keyw... |
12 | \'; desc users; -- | malicious | [Error, Error, Punctuation, Order, Identifier,... | [('Error',), ('Error',), ('Punctuation',), ('O... |
21 | 1' and 1=(select count(*) from tablenames); -- | malicious | [Integer, Error, Keyword, Comparison, Punctuat... | [('Integer',), ('Error',), ('Keyword',), ('Com... |
5 rows × 4 columns
# Generating additional feature dimensions for the machine learning to expand its mind into...
# We're basically building up features to include into our 'feature vector' for ML
import math
from collections import Counter
def entropy(s):
    """Character-level Shannon entropy of the string s, in bits per character."""
    n = float(len(s))
    # Probability of each distinct character in s.
    probs = [count / n for count in Counter(s).values()]
    return -sum(p * math.log(p, 2) for p in probs)
# Feature columns: token count of the parsed statement and character-level
# Shannon entropy of the raw SQL text.
dataframe['length'] = dataframe['parsed_sql'].map(lambda x: len(x))
dataframe['entropy'] = dataframe['raw_sql'].map(lambda x: entropy(x))
dataframe.head()
raw_sql | type | parsed_sql | sequences | length | entropy | |
---|---|---|---|---|---|---|
0 | '; exec master..xp_cmdshell 'ping 10.10.1.2'-- | malicious | [Single, Identifier, Float, Float, Float, Erro... | [('Single',), ('Identifier',), ('Float',), ('F... | 7 | 4.368792 |
1 | create user name identified by 'pass123' | malicious | [DDL, Keyword, Identifier, Keyword, Single] | [('DDL',), ('Keyword',), ('Identifier',), ('Ke... | 5 | 4.037326 |
2 | create user name identified by pass123 tempora... | malicious | [DDL, Keyword, Identifier, Keyword, Identifier... | [('DDL',), ('Keyword',), ('Identifier',), ('Ke... | 11 | 4.028603 |
3 | exec sp_addlogin 'name' , 'password' | malicious | [Keyword, Identifier, IdentifierList] | [('Keyword',), ('Identifier',), ('IdentifierLi... | 3 | 4.030493 |
4 | exec sp_addsrvrolemember 'name' , 'sysadmin' | malicious | [Keyword, Identifier, IdentifierList] | [('Keyword',), ('Identifier',), ('IdentifierLi... | 3 | 4.010013 |
5 rows × 6 columns
# For each SQL statement aggregate the malicious and legit g-test scores as features
import numpy as np
def g_aggregate(sequence, name):
    """Average the g-test score column `name` over every ngram in `sequence`.

    Scores are looked up in the module-level df_stats table built earlier.
    Returns 0 for an empty sequence, or when any ngram is absent from
    df_stats (the KeyError path).

    Fix vs. original: `.ix` is a long-deprecated pandas indexer; `.loc`
    performs the same label-based lookup.
    """
    try:
        g_scores = [df_stats.loc[item, name] for item in sequence]
    except KeyError:
        # NOTE(review): one unseen ngram zeroes the whole statement's score;
        # consider skipping unknown ngrams instead of bailing out.
        return 0
    return sum(g_scores)/len(g_scores) if g_scores else 0 # Average
# Per-statement aggregate g-test scores (average over the statement's ngrams).
dataframe['malicious_g'] = dataframe['sequences'].map(lambda x: g_aggregate(x, 'malicious_g'))
dataframe['legit_g'] = dataframe['sequences'].map(lambda x: g_aggregate(x, 'legit_g'))
dataframe.head()
raw_sql | type | parsed_sql | sequences | length | entropy | malicious_g | legit_g | |
---|---|---|---|---|---|---|---|---|
0 | '; exec master..xp_cmdshell 'ping 10.10.1.2'-- | malicious | [Single, Identifier, Float, Float, Float, Erro... | [('Single',), ('Identifier',), ('Float',), ('F... | 7 | 4.368792 | 449.733570 | -63.831145 |
1 | create user name identified by 'pass123' | malicious | [DDL, Keyword, Identifier, Keyword, Single] | [('DDL',), ('Keyword',), ('Identifier',), ('Ke... | 5 | 4.037326 | -242.191260 | 1210.713063 |
2 | create user name identified by pass123 tempora... | malicious | [DDL, Keyword, Identifier, Keyword, Identifier... | [('DDL',), ('Keyword',), ('Identifier',), ('Ke... | 11 | 4.028603 | -392.742728 | 1489.732587 |
3 | exec sp_addlogin 'name' , 'password' | malicious | [Keyword, Identifier, IdentifierList] | [('Keyword',), ('Identifier',), ('IdentifierLi... | 3 | 4.030493 | -331.875793 | 1069.265013 |
4 | exec sp_addsrvrolemember 'name' , 'sysadmin' | malicious | [Keyword, Identifier, IdentifierList] | [('Keyword',), ('Identifier',), ('IdentifierLi... | 3 | 4.010013 | -331.875793 | 1069.265013 |
5 rows × 8 columns
# Boxplots show you the distribution of the data (spread).
# http://en.wikipedia.org/wiki/Box_plot
# Plot the length and entropy of SQL statements
# Fixme Brian: make these pretty
dataframe.boxplot('length','type')
plt.ylabel('SQL Statement Length')
dataframe.boxplot('entropy','type')
plt.ylabel('SQL Statement Entropy')
<matplotlib.text.Text at 0x6711210>
# Split the classes up so we can set colors, size, labels
fig, ax = plt.subplots(subplot_kw=dict(axisbg='#EEEEE5'))  # NOTE(review): 'axisbg' renamed 'facecolor' in matplotlib >= 2.0
ax.grid(color='grey', linestyle='solid')
cond = dataframe['type'] == 'malicious'
evil = dataframe[cond]
legit = dataframe[~cond]
# Legit points drawn large/blue underneath; injections small/red on top.
plt.scatter(legit['length'], legit['entropy'], s=140, c='#aaaaff', label='Legit', alpha=.7)
plt.scatter(evil['length'], evil['entropy'], s=40, c='r', label='Injections', alpha=.3)
plt.legend()
plt.xlabel('SQL Statement Length')
plt.ylabel('SQL Statement Entropy')
<matplotlib.text.Text at 0x74843d0>
# Split the classes up so we can set colors, size, labels
fig, ax = plt.subplots(subplot_kw=dict(axisbg='#EEEEE5'))  # NOTE(review): 'axisbg' renamed 'facecolor' in matplotlib >= 2.0
ax.grid(color='grey', linestyle='solid')
# Same legit/evil split as the previous cell, now plotted in g-score space.
plt.scatter(legit['malicious_g'], legit['legit_g'], s=140, c='#aaaaff', label='Legit', alpha=.7)
plt.scatter(evil['malicious_g'], evil['legit_g'], s=40, c='r', label='Injections', alpha=.3)
plt.legend()
plt.ylabel('Legit SQL G-Test Score')
plt.xlabel('Malicious SQL G-Test Score')
<matplotlib.text.Text at 0x7490450>
# In preparation for using scikit learn we're just going to use
# some handles that help take us from pandas land to scikit land
# List of feature vectors (scikit learn uses 'X' for the matrix of feature vectors)
X = dataframe.as_matrix(['length', 'entropy','legit_g','malicious_g'])  # NOTE(review): as_matrix removed in modern pandas (use .values)
# Labels (scikit learn uses 'y' for classification labels)
y = np.array(dataframe['type'].tolist()) # Yes, this is weird but it needs
# to be an np.array of strings
# Random Forest is a popular ensemble machine learning classifier.
# http://scikit-learn.org/dev/modules/generated/sklearn.ensemble.RandomForestClassifier.html
#
import sklearn.ensemble
clf = sklearn.ensemble.RandomForestClassifier(n_estimators=20) # Trees in the forest
# Now we can use scikit learn's cross validation to assess predictive performance.
# NOTE(review): sklearn.cross_validation is used below without an explicit
# import -- presumably pulled in transitively by sklearn.ensemble; confirm.
scores = sklearn.cross_validation.cross_val_score(clf, X, y, cv=10, n_jobs=4)
print scores
[ 0.99784173 0.99784173 1. 0.99784173 0.99856115 0.99784017 0.99640029 0.99856012 0.99784017 0.99784017]
# Wow 99% accurate! There is an issue though...
# Recall that we have ~13k 'malicious' SQL statements and
# we only have about 1k 'legit' SQL statements, so we dive
# in a bit and look at the predictive performance more deeply.
# Train on a 80/20 split
from sklearn.cross_validation import train_test_split  # NOTE(review): moved to sklearn.model_selection in modern sklearn
# Splitting dataframe.index alongside X/y lets us map predictions back to rows later.
X_train, X_test, y_train, y_test, index_train, index_test = train_test_split(X, y, dataframe.index, test_size=0.2)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
# Now plot the results of the 80/20 split in a confusion matrix
from sklearn.metrics import confusion_matrix
labels = ['legit', 'malicious']
cm = confusion_matrix(y_test, y_pred, labels)  # rows = true class, cols = predicted
def plot_cm(cm, labels):
# Compute percentanges
percent = (cm*100.0)/np.array(np.matrix(cm.sum(axis=1)).T) # Derp, I'm sure there's a better way
print 'Confusion Matrix Stats'
for i, label_i in enumerate(labels):
for j, label_j in enumerate(labels):
print "%s/%s: %.2f%% (%d/%d)" % (label_i, label_j, (percent[i][j]), cm[i][j], cm[i].sum())
# Show confusion matrix
# Thanks kermit666 from stackoverflow :)
fig = plt.figure()
ax = fig.add_subplot(111)
ax.grid(b=False)
cax = ax.matshow(percent, cmap='coolwarm')
plt.title('Confusion matrix of the classifier')
fig.colorbar(cax)
ax.set_xticklabels([''] + labels)
ax.set_yticklabels([''] + labels)
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()
plot_cm(cm, labels)
Confusion Matrix Stats legit/legit: 96.59% (170/176) legit/malicious: 3.41% (6/176) malicious/legit: 0.08% (2/2603) malicious/malicious: 99.92% (2601/2603)
# Compute the prediction probabilities and use them to minimize our false positives
# Note: This is simply a trade off, it means we'll miss a few of the malicious
# ones but typically false alarms are a death blow to any new 'fancy stuff' so
# we definitely want to minimize the false alarms.
y_probs = clf.predict_proba(X_test)[:,1]  # column 1: presumably P(malicious) since classes sort alphabetically -- verify clf.classes_
thres = .9 # This can be set to whatever you'd like
# Re-label using the raised threshold: only very confident rows stay malicious.
y_pred[y_probs<thres] = 'legit'
y_pred[y_probs>=thres] = 'malicious'
cm = confusion_matrix(y_test, y_pred, labels)
plot_cm(cm, labels)
Confusion Matrix Stats legit/legit: 98.30% (173/176) legit/malicious: 1.70% (3/176) malicious/legit: 0.38% (10/2603) malicious/malicious: 99.62% (2593/2603)
# We can also look at what features the learning algorithm thought were the most important
# (pair each feature name with the forest's impurity-based importance score).
importances = zip(['length', 'entropy', 'legit_g', 'malicious_g'], clf.feature_importances_)
importances
# From the list below we see our feature importance scores. There's a lot of feature selection,
# sensitivity study, etc stuff that you could do if you wanted at this point.
[('length', 0.034947270077838898), ('entropy', 0.081305411355911961), ('legit_g', 0.63638493679534702), ('malicious_g', 0.247362381770902)]
# Now we're going to just do some post analysis on how the ML algorithm performed.
# Lets look at the legit samples that were misclassified as malicious
test_set = dataframe.ix[index_test]  # NOTE(review): .ix is deprecated; use .loc/.iloc in modern pandas
test_set['pred'] = y_pred  # NOTE(review): assigning into a slice may raise SettingWithCopyWarning
misclassified = test_set[(test_set['type']=='legit') & (test_set['pred']=='malicious')]
misclassified.head()
raw_sql | type | parsed_sql | sequences | length | entropy | malicious_g | legit_g | pred | |
---|---|---|---|---|---|---|---|---|---|
13606 | create table Purchase (pid int primary key, pr... | legit | [DDL, Keyword, Function, Punctuation] | [('DDL',), ('Keyword',), ('Function',), ('Punc... | 4 | 4.400948 | -174.930469 | 513.570637 | malicious |
13605 | create table Product (pid int primary key, pna... | legit | [DDL, Keyword, Function, Punctuation] | [('DDL',), ('Keyword',), ('Function',), ('Punc... | 4 | 4.137866 | -174.930469 | 513.570637 | malicious |
13495 | SELECT dept, number, SUBSTR(title, 1, 12) AS s... | legit | [DML, Identifier, Punctuation, Builtin, Punctu... | [('DML',), ('Identifier',), ('Punctuation',), ... | 8 | 4.699688 | -20.133315 | 353.217738 | malicious |
3 rows × 9 columns
# Discussion for how to use the resulting models.
# Typically Machine Learning comes in two phases
# - Training of the Model
# - Evaluation of new observations against the Model
# This notebook is about exploration of the data and training the model.
# After you have a model that you are satisfied with, just 'pickle' it
# at the end of your training script and then in a separate
# evaluation script 'unpickle' it and evaluate/score new observations
# coming in (through a file, or ZeroMQ, or whatever...)
#
# In this case we'd have to pickle the RandomForest classifier.
# See 'test_it' below for how to use them in evaluation mode.
# test_it shows how to do evaluation, also fun for manual testing below :)
def test_it(sql):
parsed_sql = parse_it(sql)
ngram_list = ngrams(parsed_sql, 3)
malicious_g = g_aggregate(ngram_list, 'malicious_g')
legit_g = g_aggregate(ngram_list, 'legit_g')
_X = [len(parsed_sql), entropy(sql), legit_g, malicious_g]
print '%-40s: %s' % (sql, clf.predict(_X)[0])
# A few manual spot checks of the end-to-end evaluation path.
test_it('select * from employees')
test_it("'; exec master..xp_cmdshell")
test_it("'any 'x'='x'")
test_it('from dorseys mom xp_cmdshell biache')
test_it('select * from your_mom')
select * from employees : legit '; exec master..xp_cmdshell : malicious 'any 'x'='x' : malicious from dorseys mom xp_cmdshell biache : malicious select * from your_mom : legit
The combination of IPython, Pandas and Scikit Learn let us pull in some junky SQL data, clean it up, plot it, understand it and slap it with some machine learning!
Clearly a lot more formality could be used, plotting learning curves, adjusting for overfitting, feature selection, on and on... there are some really great machine learning resources that cover this deeper material. In particular we highly recommend the work and presentations of Olivier Grisel at INRIA Saclay. http://ogrisel.com/