version 0.1, May 2016
This notebook is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
Phishing, by definition, is the act of defrauding an online user in order to obtain personal information by posing as a trustworthy institution or entity. Users usually have a hard time telling legitimate and malicious sites apart because the malicious ones are made to look exactly like the originals. Therefore, there is a need to create better tools to combat attackers.
import pandas as pd
import zipfile
with zipfile.ZipFile('../datasets/phishing.csv.zip', 'r') as z:
    f = z.open('phishing.csv')
    data = pd.read_csv(f, index_col=False)
data.head()
| | url | phishing |
---|---|---|
0 | http://www.subalipack.com/contact/images/sampl... | 1 |
1 | http://fasc.maximecapellot-gypsyjazz-ensemble.... | 1 |
2 | http://theotheragency.com/confirmer/confirmer-... | 1 |
3 | http://aaalandscaping.com/components/com_smart... | 1 |
4 | http://paypal.com.confirm-key-21107316126168.s... | 1 |
data.phishing.value_counts()
1    20000
0    20000
Name: phishing, dtype: int64
data.url[data.phishing==1].sample(50, random_state=1).tolist()
['http://dothan.com.co/gold/austspark/index.htm\n', 'http://78.142.63.63/%7Enetsysco/process/fc1d9c7ea4773b7ff90925c2902cb5f2\n', 'http://verify95.5gbfree.com/coverme2010/\n', 'http://www.racom.com/uploads/productscat/bookmark/ii.php?.rand=13vqcr8bp0gud&cbcxt=mai&email=abuse@tradinghouse.ca\n', 'http://www.cleanenergytci.com/components/update.logon.l3an7lofamerica/2342343234532534546347677898765432876543345687656543876/\n', 'http://209.148.89.163/-/santander.co.uk/weblegn/AccountLogin.php\n', 'http://senevi.com/confirmation/\n', 'http://www.hellenkeller.cl/tmp/new/noticias/Modulo_de_Atualizacao_Bradesco/index2.php?id=PSO1AM04L3Q6PSBNVJ82QUCO0L5GBSY2KM2U9BYUEO14HCRDVZEMTRB3DGJO9HPT4ROC4M8HA8LRJD5FCJ27AD0NTSC3A3VDUJQX6XFG519OED4RW6Y8J8VC19EAAAO5UF21CHGHIP7W4AO1GM8ZU4BUBQ6L2UQVARVM\n', 'http://internet-sicherheit.co/de/konflikt/src%3Dde/AZ00276ZZ75/we%3Dhs_0_2/sicherheit/konto_verifizieren/verifizierung.php\n', 'http://alen.co/docs/cleaner\n', 'http://rattanhouse.co/Atualizacao_Bradesco/cadastro2013.php?2MAS2XACUJPI3U8D9ZDDG2G9YJICVABQ3K73KWDKYK0NA0AWWWCOUEDUJRXHRKPNMUYLDV89RA6OCG2MQUS0TAUXX9IOGJUEIXPDS5B0RM18OF1H860UAMJOY6ICUR81VSEKKJFPBYNLYGUXBGJ1HEHKOMLTM01P658M\n', 'http://steamcommunily.co/p.php?login=true\n', 'http://www.nyyg.com/Bradesco/5W9SQ394.html\n', 'http://wp.tipografiacentral.com.co/sparkde/index.html\n', 'http://www.entrerev.com/component/.secure.wpa/.www.paypal.com.returnUrl=/cgi-bin/5RF3S6y0K349/PayPal.co.uk/dispute_centre/sotmks/npsw&st.payment.decline.centre/ipoi/secure-codes.paypal.account4738154login.complete-infrmations.login.accountSecure26/securities/\n', 'http://x.co/SecurCent\n', 'http://dejatequerer.co/united.com/index.html\n', 'http://www.speakeasymovies.com/components/com_wrapper/.amazon.co.uk/\n', 'http://www.culturaespanola.com.br/bt/www.paypal.com/paypal.com.com/index-new.php\n', 'http://www.agroassistance.com/components/com_content/c05354aa285b6a932a57086ba13762a1/\n', 'http://www.estranetsrl.com.ar/bbvacambios.html\n', 
'http://osfsw.cba.pl/content/classic/html/ibpf/bradesco/?UOREEIYGQTERIRVSJTUHMVMZJWWYSVNYQOFSPWVFTEJEEKMJWHFERRYTFRWPSYYWGFIGJUPLZMZLTNSKOGMQQSHSXPLMXILVSM\n', 'http://bitcrush.co/~geetha5/natwest/natwest/ibcarregister-natwst.html\n', 'http://cannot-hide-from-PhishTank.zenith-services.com/controllare/auth/\n', 'http://nova.pymesonline.co/fr.php\n', 'http://comococino.com/wp-content/uploads/2013/01/paypal.com/us/cgi-bin/webscr.htm?\n', 'http://www.fundacionchwinqlal.com.gt/imgs/Notas/img/_New/Agencias_Bradesco/Public_201133.php?KSR6YOU359CY1USIRMSBI8CFJF7TVREFJ6KIUFKZNXXNRP7JBYVU79APNGJI8YYR5I0YXUXLRU0JKF4WEYQL81BUGVDOTBFXUPVSKSEBNNU84X4IWT54UFYABCY5OE3J5XBOQQ1EDVMHTPZPJ4TEJSOU5NZS32B8ZNWQ\n', 'http://flightripe.com/confirmation/update/billing/9a523c6017caa3406af9d5c2c0cb1854/\n', 'http://accademiazerootto.it/templates/zerootto-new/html/com_content/category/bompreco.php\n', 'http://santanderseguranca.zapto.org/Clientesx/\n', 'http://www.muttico.com/components/com_media/p3rs0na4l/53f8b14c76c890e1806b8f9d97f12f80/\n', 'http://us.fxlhtvf.ml/login/en/login.html.asp?refhttp:%2F%2Futddirect.com%2Fcomponents%2Fcom_content%2Fviews%2Fcategories%2Fmenu.html\n', 'http://conferencistainternacional.com.co/urruirrhyttjk/Index.htm\n', 'http://www.creativesovereign.com/components/com_newsfeeds/views/.../perfil/\n', 'http://villamarina.com.co/administrator/servers/BankofAmerica/security-update/SecMeasure/account-overview.cgi/presentation/jskeys/sas/signonScreen.do/\n', 'http://www.vipturismolondres.com/com.br/?atendimento=Cliente&/LgSgkszm64/B8aNzHa8Aj.php\n', 'http://www.enoxia.fr/components/com_content/tamfidelidade01.php\n', 'http://gobbva.com/bb/empresa/index.php?tarjeta=\n', 'http://paypal-com-confim.sharmikelectric.com/s4575234bf5055889415\n', 
'http://paypal.com.au.au.webapps.mpp.homes.konyadosemeciler.com/confirm/login.australia/au/webapps/mpp/home/initthi.php?cmd=SignIn&co_partnerId=2&pUserId=&siteid=0&pageType=&pa1=&i1=&bshowgif=&UsingSSL=&ru=&pp=&pa2=&errmsg=&runame=%5C%5C%5C%5C\n', 'http://www.bbvabancocontinental.ya.st\n', 'http://www.giannielectric.com/company/components/com_poll/assets/a/a5643cded2383f7568719482a943e1a5\n', 'http://cooperativasanjose.com.co/plugins/josetta_ext/k2category/section/first.php\n', 'http://appleid-apple-com-confirm-oyns-uattw6w61x3oka3pq.scientificcollectables.com/3c43e3d92e0b8a48f09f5fbb25d008a9/index1.php?cmd=https://connect.paypal.com/WebObjects/iTunesConnect.woa?login-processing=t&login_access=13409884065d3a174c294a9bf21bf71c23a3\n', 'http://consultoriojuridico.co/pp/www.paypal.com/\n', 'http://lovetodo.in.th/administrator/components/com_content/models/key/\n', 'http://lnk.co/io6u45y45?erydh?mario.Carelli@poste.it\n', 'http://www2.bancobbvacontnental.com/Centroll/informe/03/14/datitarlz/WUJFQ0VSUkFATVVOSVpMQVcuQ09N\n', 'http://lfcintl.com/components/com_user/zzxc/bpd.com.do/app/do/personas/289302294350311363178310441412402464323394411438376403437407/banco.popular.php?Personal\n', 'http://procuraduria.videoteca.com.co/update/apple.com/.cgi-bin/WebObjects/MyAppleIdwoa/wa/sign_in.html?appId=4129.returnURL=DaHR0cDovL3N0b3JlLmFwcGxlLmNvbS91c3wxYW9zZmU4OGZjNWIyNThhYWVhOTM5MzVjZjI2NTk1OGE3MWUwY2Y0MmI2OA%26r%3DSDHCD9JUYKX777H9KT\n']
Create binary features that flag whether the URL contains any of the following keywords:
keywords = ['https', 'login', '.php', '.html', '@', 'sign']
for keyword in keywords:
    data['keyword_' + keyword] = data.url.str.contains(keyword).astype(int)
data['lenght'] = data.url.str.len() - 2
domain = data.url.str.split('/', expand=True).iloc[:, 2]
data['lenght_domain'] = domain.str.len()
domain.head(12)
0     www.subalipack.com
1     fasc.maximecapellot-gypsyjazz-ensemble.nl
2     theotheragency.com
3     aaalandscaping.com
4     paypal.com.confirm-key-21107316126168.securepp...
5     lcthomasdeiriarte.edu.co
6     livetoshare.org
7     www.i-m.co
8     manuelfernando.co
9     www.bladesmithnews.com
10    www.rasbaek.com
11    199.231.190.160
Name: 2, dtype: object
# Flag domains that consist only of digits once the dots are removed (e.g. 199.231.190.160)
data['isIP'] = domain.str.replace('.', '', regex=False).str.isnumeric().astype(int)
data['count_com'] = data.url.str.count('com')
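The `isIP` check above treats any host that is purely numeric after removing the dots as an IP address. As a possible alternative (not what this notebook uses), the standard-library `ipaddress` module can validate the host more strictly. The sketch below is only illustrative; the function names `is_ip_host` and `is_numeric_without_dots` are hypothetical and do not appear in the notebook.

```python
import ipaddress


def is_numeric_without_dots(host):
    """The notebook's heuristic: drop the dots and check that only digits remain."""
    return int(host.replace('.', '').isnumeric())


def is_ip_host(host):
    """Stricter alternative (an assumption, not the notebook's method):
    return 1 only if the host parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return 1
    except ValueError:
        return 0


# Both checks agree on a real IP address and on a regular domain name
print(is_numeric_without_dots('199.231.190.160'), is_ip_host('199.231.190.160'))
print(is_numeric_without_dots('www.subalipack.com'), is_ip_host('www.subalipack.com'))

# They differ on a digits-only host that is not a valid address
print(is_numeric_without_dots('1234'), is_ip_host('1234'))
```

Switching to the stricter check would change the feature values, so the model would need to be retrained with it.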
data.sample(15, random_state=4)
| | url | phishing | keyword_sign | keyword_https | keyword_login | keyword_.php | keyword_.html | keyword_@ | count_com | lenght | lenght_domain | isIP |
---|---|---|---|---|---|---|---|---|---|---|---|---|
28607 | http://pennstatehershey.org/web/ibd/home/event... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 80 | 20 | 0 |
3689 | http://guiadesanborja.com/multiprinter/muestra... | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 81 | 18 | 0 |
6405 | http://paranaibaweb.com/faleconosco/accounting... | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 65 | 16 | 0 |
35355 | http://courts.delaware.gov/Jury%20Services/Hel... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 94 | 19 | 0 |
16520 | http://erpa.co/tmp/getproductrequest.htm\n | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 39 | 7 | 0 |
16196 | http://pulapulapipoca.com/components/com_media... | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 4 | 239 | 18 | 0 |
3810 | http://www.dag.or.kr/zboard/icon/visa/img/Atua... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 62 | 13 | 0 |
3005 | http://www.amazingdressup.com/wp-content/theme... | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 94 | 22 | 0 |
9003 | http://web.indosuksesfutures.com/content_file/... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 80 | 25 | 0 |
34704 | http://www.nutritionaltree.com/subcat.aspx?cid... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 69 | 23 | 0 |
12561 | http://www.formation-continue-loiret.fr/compon... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 122 | 32 | 0 |
10885 | http://191.91.128.205/httpss/bancolombiaa.olb.... | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 2 | 451 | 14 | 1 |
2633 | http://www.sternies-hp.de/components/com_conte... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 85 | 18 | 0 |
22253 | http://www.silive.com/northshore/index.ssf/200... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 85 | 14 | 0 |
4720 | http://www.dineo.co.za/components/com_content/... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 3 | 172 | 15 | 0 |
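The feature-engineering steps above can also be written for a single URL in plain Python, which may help when checking one value by hand. This is a minimal sketch that mirrors the pandas code as closely as possible; the names `url_features` and `KEYWORDS` are hypothetical, and the column names (including the `lenght` spelling) are kept as they appear in the notebook.

```python
import re

# Same keyword list as in the notebook
KEYWORDS = ['https', 'login', '.php', '.html', '@', 'sign']


def url_features(url):
    """Compute the notebook's URL features for one URL and return them as a dict."""
    features = {}
    for keyword in KEYWORDS:
        # pandas str.contains treats the keyword as a regular expression by
        # default, so re.search is used here to mirror that behavior
        features['keyword_' + keyword] = int(re.search(keyword, url) is not None)

    # Same length offset as the notebook
    features['lenght'] = len(url) - 2

    # The domain is the third element after splitting on '/'
    parts = url.split('/')
    domain = parts[2] if len(parts) > 2 else ''
    features['lenght_domain'] = len(domain)

    # A domain made only of digits and dots is flagged as an IP address
    features['isIP'] = int(domain.replace('.', '').isnumeric())

    # Number of non-overlapping occurrences of 'com' in the URL
    features['count_com'] = url.count('com')
    return features


print(url_features('http://191.91.128.205/login.php'))
```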
X = data.drop(['url', 'phishing'], axis=1)
y = data.phishing
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
clf = RandomForestClassifier(n_jobs=-1, n_estimators=100)
cross_val_score(clf, X, y, cv=10)
array([ 0.80625, 0.81175, 0.8085 , 0.79475, 0.8025 , 0.816 , 0.80375, 0.80525, 0.80175, 0.794 ])
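The ten fold accuracies above are easier to compare with other models when reduced to a mean and a spread. The short sketch below does this with the standard library only; the scores are copied from the output above, and the variable names are illustrative.

```python
from statistics import mean, stdev

# Fold accuracies copied from the cross_val_score output above
scores = [0.80625, 0.81175, 0.8085, 0.79475, 0.8025,
          0.816, 0.80375, 0.80525, 0.80175, 0.794]

mean_accuracy = mean(scores)
spread = stdev(scores)  # sample standard deviation across the 10 folds

print('accuracy: %.4f +/- %.4f' % (mean_accuracy, spread))
```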
clf.fit(X, y)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini', max_depth=None, max_features='auto', max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=-1, oob_score=False, random_state=None, verbose=0, warm_start=False)
import joblib
joblib.dump(clf, '22_clf_rf.pkl', compress=3)
['22_clf_rf.pkl']
See m22_model_deployment.py
from m22_model_deployment import predict_proba
predict_proba('http://www.vipturismolondres.com/com.br/?atendimento=Cliente&/LgSgkszm64/B8aNzHa8Aj.php')
0.89000000000000001
Flask is often considered more Pythonic than Django because Flask web application code is, in most cases, more explicit. It is also easy for beginners to get started with, since a simple app needs very little boilerplate code to get up and running.
First we need to install some libraries
pip install flask-restplus
Load Flask
from flask import Flask
from flask_restplus import Api, Resource, fields
import joblib
import pandas as pd
Create api
app = Flask(__name__)
api = Api(
    app,
    version='1.0',
    title='Phishing Prediction API',
    description='Phishing Prediction API')
ns = api.namespace('predict',
    description='Phishing Classifier')
parser = api.parser()
parser.add_argument(
    'URL',
    type=str,
    required=True,
    help='URL to be analyzed',
    location='args')
resource_fields = api.model('Resource', {
    'result': fields.String,
})
Load the model and create a function that predicts a URL
clf = joblib.load('22_clf_rf.pkl')
@ns.route('/')
class PhishingApi(Resource):
    @api.doc(parser=parser)
    @api.marshal_with(resource_fields)
    def get(self):
        args = parser.parse_args()
        result = self.predict_proba(args)
        return result, 200

    def predict_proba(self, args):
        url = args['URL']
        url_ = pd.DataFrame([url], columns=['url'])
        # Create features (same steps and column names as in training)
        keywords = ['https', 'login', '.php', '.html', '@', 'sign']
        for keyword in keywords:
            url_['keyword_' + keyword] = url_.url.str.contains(keyword).astype(int)
        url_['lenght'] = url_.url.str.len() - 2
        domain = url_.url.str.split('/', expand=True).iloc[:, 2]
        url_['lenght_domain'] = domain.str.len()
        # Check the domain, not the full URL, to match the training feature
        url_['isIP'] = domain.str.replace('.', '', regex=False).str.isnumeric().astype(int)
        url_['count_com'] = url_.url.str.count('com')
        # Make prediction: probability of the phishing class
        p1 = clf.predict_proba(url_.drop('url', axis=1))[0, 1]
        print('url=', url, '| p1=', p1)
        return {
            "result": p1
        }
Run API
app.run(debug=True, use_reloader=False, host='0.0.0.0', port=5000)
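Once the server above is running, the endpoint can be queried from another Python process. The following is a minimal client sketch using only the standard library. It assumes the API is reachable at `http://localhost:5000` and that the namespace route resolves to `/predict/`; the names `API_BASE`, `build_request_url`, and `query_api` are hypothetical helpers introduced here for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed address of the running API (host and port taken from app.run above)
API_BASE = 'http://localhost:5000/predict/'


def build_request_url(url_to_check, base=API_BASE):
    """Build the GET request URL, percent-encoding the URL to be analyzed."""
    return base + '?' + urlencode({'URL': url_to_check})


def query_api(url_to_check, base=API_BASE):
    """Send the request and decode the JSON body (requires the server to be running)."""
    with urlopen(build_request_url(url_to_check, base)) as response:
        return json.loads(response.read().decode('utf-8'))


# Only builds the request URL; no network call is made here
print(build_request_url('http://example.com/login.php'))
```

Calling `query_api('http://example.com/login.php')` against a running server should return a dictionary with a `result` key holding the predicted phishing probability as a string, since the response model declares `result` as `fields.String`.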