Azure Sentinel Notebooks and MSTICPy¶

Examples of machine learning techniques in Jupyter notebooks¶

Author: Ian Hellen
Co-Authors: Pete Bryan, Ashwin Patil

Released: 26 Oct 2020

Notebook Setup¶

Please ensure that MSTICPy is installed before continuing with this notebook.

The nbinit module loads required libraries and optionally installs require packages.

In [ ]:
from pathlib import Path
from IPython.display import display, HTML
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

REQ_PYTHON_VER = "3.6"
REQ_MSTICPY_VER = "1.0.0"
REQ_MP_EXTRAS = ["ml"]

update_nbcheck = (
"<p style='color: orange; text-align=left'>"
"<b>Warning: we needed to update '<i>utils/nb_check.py</i>'</b><br>"
"Please restart the kernel and re-run this cell."
"</p>"
)

display(HTML("<h3>Starting Notebook setup...</h3>"))
if Path("./utils/nb_check.py").is_file():
try:
from utils.nb_check import check_versions
except ImportError as err:
%xmode Minimal
!curl https://raw.githubusercontent.com/Azure/Azure-Sentinel-Notebooks/master/utils/nb_check.py > ./utils/nb_check.py 2>/dev/null
display(HTML(update_nbcheck))
if "check_versions" not in globals():
raise ImportError("Old version of nb_check.py detected - see instructions below.")
%xmode Verbose
check_versions(REQ_PYTHON_VER, REQ_MSTICPY_VER, REQ_MP_EXTRAS)

from msticpy.nbtools import nbinit
nbinit.init_notebook(namespace=globals());


Retrieve sample data files¶

In [ ]:
from urllib.request import urlretrieve
from pathlib import Path
from tqdm.auto import tqdm

github_uri = "https://raw.githubusercontent.com/Azure/Azure-Sentinel-Notebooks/master/{file_name}"
github_files = {
"processes_on_host.pkl": "data",
"timeseries.pkl": "data",
"data_queries.yaml": "data",
}

Path("data").mkdir(exist_ok=True)
file_path = Path(path).joinpath(file)
print(file_path, end=", ")
url_path = f"{path}/{file}" if path else file
urlretrieve(
github_uri.format(file_name=url_path),
file_path
)
assert Path(file_path).is_file()


Time Series Analysis¶

Query network data¶

The starting point is ingesting data to analyze.

MSTICpy contains a number of query providers that let you query and return data from several different sources.

Below we are using the LocalData query provider to return data from sample files.

Data is returned in a Pandas DataFrame for easy manipulation and to provide a common interface for other features in MSTICpy.
Here we are getting a summary of our network traffic for the time period we are interested in.

In [ ]:
query_range = nbwidgets.QueryTime(
origin_time=pd.Timestamp("2020-07-13 00:00:00"),
before=1,
units="week"
)
query_range


This query fetches the total number of bytes send outbound on the network, grouped by hour.

The input to the Timeseries analysis needs to be in the form of:

• a datetime index (in a regular interval like an hour or day)
• a scalar value used to determine anomalous values based on periodicity
In [ ]:
# Initialize the data provider and connect to our Splunk instance.
qry_prov = QueryProvider("LocalData", data_paths=["./data"], query_paths=["./data"])
qry_prov.connect()

ob_bytes_per_hour = qry_prov.Network.get_network_summary(query_range)
md("Sample data:", "large")


Using Timeseries decomposition to detect anomalous network activity¶

Below we use MSTICpy's time series analysis machine learning capabilities to identify anomalies in our network traffic for further investigation.
As well as computing anomalies we visualize the data so that we can more easily see where these anomalies present themselves.

In [ ]:
from msticpy.nbtools.timeseries import display_timeseries_anomolies
from msticpy.analysis.timeseries import timeseries_anomalies_stl

# Conduct our timeseries analysis
ts_analysis = timeseries_anomalies_stl(ob_bytes_per_hour)
# Visualize the timeseries and any anomalies
display_timeseries_anomolies(data=ts_analysis, y= 'TotalBytesSent');

md("We can see two clearly anomalous data points representing unusual outbound traffic.<hr>", "bold")


View the summary events marked as anomalous¶

In [ ]:
max_score, min_score = ts_analysis.score.max(), ts_analysis.min()
ts_analysis[ts_analysis["anomalies"] == 1]


Extract the anomaly period¶

We can extract the start and end times of anomalous events and use this more-focused time range to query for unusual activity in this period.

Note: if more than one anomalous period is indicated we can use
msticpy.analysis.timeseries.extract_anomaly_periods() function to isolate time blocks around the anomalous periods.

In [ ]:
# Identify when the anomalies occur so that we can use this timerange
# to scope the next stage of our investigation.
# Add a 1 hour buffer around the anomalies
start = ts_analysis[ts_analysis['anomalies']==1]['TimeGenerated'].min() -  pd.to_timedelta(1, unit='h')
end = ts_analysis[ts_analysis['anomalies']==1]['TimeGenerated'].max() +  pd.to_timedelta(1, unit='h')

# md and md_warn are MSTICpy features to provide simple, and clean output in notebook cells
md("Anomalous network traffic detected between:", "large")
md(f"Start time: <b>{start}</b><br>End time: <b>{end}</b><hr>",)


Time Series Conclusion¶

We would take these start and end times to zero in on which machines were responsible for the anomalous traffic. Once we find them we can use other techniques to analyze what's going on on these hosts.

Other Applications¶

You can use the msticpy query function MultiDataSource.get_timeseries_anomalies on most Azure Sentinel tables to do this summarization directly.

Three examples are shown below.

start = pd.Timestamp("2020-09-01T00:00:00")
end = pd.Timestamp("2020-09-301T00:00:00")

# Sent bytes by hour (default) from Palo Alto devices
time_series_net_bytes = qry_prov.MultiDataSource.get_timeseries_decompose(
start=start,
end=end,
table="CommonSecurityLog",
timestampcolumn="TimeGenerated",
aggregatecolumn="SentBytes",
groupbycolumn="DeviceVendor",
aggregatefunction="sum(SentBytes)",
where_clause='|where DeviceVendor=="Palo Alto Networks"',
)

# Sign-in failure count in AAD
time_series_signin_fail = qry_prov.MultiDataSource.get_timeseries_anomalies(
table="SigninLogs",
start=start,
end=end,
timestampcolumn="TimeGenerated",
aggregatecolumn="AppDisplayName",
groupbycolumn="ResultType",
aggregatefunction="count(AppDisplayName)",
where_clause='| where ResultType in (50126, 50053, 50074, 50076)',
)

# Number of distinct processes by hour
time_series_procs = qry_prov.MultiDataSource.get_timeseries_anomalies(
table="SecurityEvent",
start=start,
end=end,
timestampcolumn="TimeGenerated",
aggregatecolumn="DistinctProcesses",
groupbycolumn="Account",
aggregatefunction="dcount(NewProcessName)",
where_clause="| where Computer='myhost.domain.con'",
)

# Then submit to ts anomalies decomposition
ts_analysis = timeseries_anomalies_stl(time_series_procs)
# Visualize the timeseries and any anomalies
display_timeseries_anomolies(data=ts_analysis, y='Total');


Using Clustering¶

- Example: aggregating similar process patterns to highlight unusual logon sessions¶

Sifting through thousands of events from a host is tedious in the extreme. We want to find a better way of identifying suspicious clusters of activity.

Query the data and do some initial analysis of the results

In [ ]:
print("Getting process events...", end="")
processes_on_host = qry_prov.WindowsSecurity.list_host_processes(
)

md("Initial analysis of data set", "large, bold")
md(f"Total processes in data set <b>{len(processes_on_host)}</b>")
for column in ("Account", "NewProcessName", "CommandLine"):
md(f"Total distinct {column} in data <b>{processes_on_host[column].nunique()}</b>")
md("<hr>")
md("Try grouping by distinct Account, Process, Commandline<br> - we still have 1000s of rows!", "large")
display(
processes_on_host
.groupby(["Account", "NewProcessName", "CommandLine"])
[["TimeGenerated"]]
.count()
.rename(columns={"TimeGenerated": "Count"})
)


Clustering motivation¶

We want to find atypical commands being run and see if they are associated with the same user or time period

It is tedious to do repeated queries grouping on different attributes of events.
Instead we can specify features that we are interested in grouping around and use
clustering, a form of unsupervised learning, to group the data.

A challenge when using simple grouping is that commands (commandlines) may vary slightly but are essentially repetitions of the same thing (e.g. contain dynamically-generated GUIDs or other temporary data).

We can extract features of the commandline rather than using it in its raw form.

Using clustering we can add arbitrarily many features to group on. Here we are using the following features:

• Account name
• Process name
• Command line structure
• Whether the process is a system session or not

Note: A downside to clustering is that text features (usually) need to be transformed from a string
into a numeric representation.

In [ ]:
from msticpy.sectools.eventcluster import dbcluster_events, add_process_features, char_ord_score
from collections import Counter

print(f"Input data: {len(processes_on_host)} events")
print("Extracting features...", end="")

feature_procs["accountNum"] = feature_procs.apply(
lambda x: char_ord_score(x.Account), axis=1
)
print(".", end="")

# you might need to play around with the max_cluster_distance parameter.
# decreasing this gives more clusters.
cluster_columns = ["commandlineTokensFull", "pathScore", "accountNum", "isSystemSession"]
print("done")
print("Clustering...", end="")
(clus_events, dbcluster, x_data) = dbcluster_events(
data=feature_procs,
cluster_columns=cluster_columns,
max_cluster_distance=0.0001,
)
print("done")
print("Number of input events:", len(feature_procs))
print("Number of clustered events:", len(clus_events))

print("Merging with source data and computing rarity...", end="")

# Join the clustered results back to the original process frame
procs_with_cluster = feature_procs.merge(
clus_events[[*cluster_columns, "ClusterSize"]],
on=["commandlineTokensFull", "accountNum", "pathScore", "isSystemSession"],
)

# Compute Process pattern Rarity = inverse of cluster size
procs_with_cluster["Rarity"] = 1 / procs_with_cluster["ClusterSize"]
# count the number of processes for each logon ID
lgn_proc_count = (
pd.concat(
[
processes_on_host.groupby("TargetLogonId")["TargetLogonId"].count(),
processes_on_host.groupby("SubjectLogonId")["SubjectLogonId"].count(),
]
).sum(level=0)
).to_dict()
print("done")
# Display the results
md("<br><hr>Sessions ordered by process rarity", "large, bold")
md("Higher score indicates higher number of unusual processes")
.agg({"Rarity": "mean", "TimeGenerated": "count"})
.rename(columns={"TimeGenerated": "ProcessCount"})
.reset_index())
display(
process_rarity
.sort_values("Rarity", ascending=False)
.style.bar(subset=["Rarity"], color="#d65f5f")
)

In [ ]:
# get the logon ID of the rarest session
rarest_logonid = process_rarity[process_rarity["Rarity"] == process_rarity.Rarity.max()].SubjectLogonId.iloc[0]
# extract processes with this logonID
sample_processes = (
processes_on_host
[processes_on_host["SubjectLogonId"] == rarest_logonid]
[["TimeGenerated", "CommandLine"]]
.sort_values("TimeGenerated")
)[5:25]
# compute duration of session
duration = sample_processes.TimeGenerated.max() - sample_processes.TimeGenerated.min()
md(f"{len(sample_processes)} processes executed in {duration.total_seconds()} sec", "bold")
display(sample_processes)


Clustering conclusion¶

We have narrowed down the task of sifting through > 20,000 processes to a few 10s and have them grouped into sessions ordered by the relative rarity of the process patterns

Other Applications¶

You can use this technique on other datasets where you want to group by multiple features of the data.

The caveat is that you need to transform any non-numeric data field into a numeric form.

msticpy has a few built-in functions to help with this:

from msticpy.sectools import eventcluster

my_df["account_num"] = eventcluster.char_ord_score_df(data=my_df, column="Account")

# This will create a distinct hash for even minor differences in the input.
# This might be useful to detect imperfectly faked UA strings.
my_df["ua_hash"] = eventcluster.crc32_hash_df(data=my_df, column="UserAgent")

# This will return the number of delimiter chars in the string - often a
# a good proxy for the structure of an input while ignoring variable text values
# e.g.
# "https://my.dom.com/path1?u1=163.174.4.23" will produce the same score as
# "https://www.contoso.com/azure?srcdom=moon.base.alpha.com"
# but
# "curl https://www.contoso.com/top/next?u=23 > ~/my_page"
# "curl https://www.contoso.com/top_next?u=2.3  > ~/my_page.sh"
# will produce different values despite the similarity of the strings.
# Note - you can override the default delimiter list of " \-/.,"'|&:;%\\$()]"
my_df["request_struct"] = eventcluster.delim_count_df(my_df, column="Request")


You can use a combination of these and other functions on the same fields to measure different aspects of the data. For example, the following takes a hash of the browser version of the UA (user agent) string and a structural count of the delimiters used.

Use the ua_pref_hash and ua_delims to cluster on identical browser versions that have the same UA string

my_df["ua_prefix"] = data=my_df["UserAgent"].str.split(")")[-1])
my_df["ua_pref_hash"] = eventcluster.crc32_hash_df(data=my_df, column="ua_prefix")
my_df["ua_delims"] = eventcluster.delim_count_df(data=my_df, column="UserAgent")


Detecting anomalous sequences using Markov Chain¶

The anomalous_sequence MSTICPy package uses Markov Chain analysis to predict the probability
that a particular sequence of events will occur given what has happened in the past.

Here we're applying it to Office activity.

Query the data¶

In [ ]:
query = """
| where TimeGenerated >= ago(60d)
| where UserId_s !startswith "NT AUTHORITY"
| where UserId_s !contains "prod.outlook.com"
| extend params = todynamic(strcat('{"', Operation_s, '" : ', tostring(Parameters_s), '}'))
| extend UserId = UserId_s, ClientIP = ClientIP_s, Operation = Operation_s
| project TimeGenerated= Start_Time_t, UserId, ClientIP, Operation, params
| sort by UserId asc, ClientIP asc, TimeGenerated asc
| extend begin = row_window_session(TimeGenerated, 20m, 2m, UserId != prev(UserId) or ClientIP != prev(ClientIP))
| summarize cmds=makelist(Operation), end=max(TimeGenerated), nCmds=count(), nDistinctCmds=dcount(Operation),
params=makelist(params) by UserId, ClientIP, begin
| project UserId, ClientIP, nCmds, nDistinctCmds, begin, end, duration=end-begin, cmds, params
"""
print(f"Number of events {len(exchange_df)}")


Perform Anomalous Sequence analysis on the data¶

The analysis groups events into sessions (time-bounded and linked by a common account). It then
builds a probability model for the types of command (E.g. "SetMailboxProperty")
and the parameters and parameter values used for that command.

I.e. how likely is it that a given user would be running this sequence of commands in a logon session?

Using this probability model, we can highlight sequences that have an extremely low probability, based
on prior behaviour.

In [ ]:
from msticpy.analysis.anomalous_sequence.utils.data_structures import Cmd
from msticpy.analysis.anomalous_sequence import anomalous

def process_exchange_session(session_with_params, include_vals):
new_ses = []
for cmd in session_with_params:
c = list(cmd.keys())[0]
par = list(cmd.values())[0]
new_pars = set()
if include_vals:
new_pars = dict()
for p in par:
if include_vals:
new_pars[p['Name']] = p['Value']
else:
new_ses.append(Cmd(name=c, params=new_pars))
return new_ses

sessions = exchange_df.cmds.values.tolist()
param_sessions = []
param_value_sessions = []

for ses in exchange_df.params.values.tolist():
new_ses_set = process_exchange_session(session_with_params=ses, include_vals=False)
new_ses_dict = process_exchange_session(session_with_params=ses, include_vals=True)
param_sessions.append(new_ses_set)
param_value_sessions.append(new_ses_dict)

data = exchange_df
data['session'] = sessions
data['param_session'] = param_sessions
data['param_value_session'] = param_value_sessions

modelled_df = anomalous.score_sessions(
data=data,
session_column='param_value_session',
window_length=3
)

anomalous.visualise_scored_sessions(
data_with_scores=modelled_df,
time_column='begin',  # this will appear on the x-axis
score_column='rarest_window3_likelihood',  # this will appear on the y axis
window_column='rarest_window3',  # this will represent the session in the tool-tips
source_columns=['UserId', 'ClientIP'],  # specify any additional columns to appear in the tool-tips
)


The events are shown in descending order of likelihood (vertically), so the
events at the bottom of the chart are the ones most interesting to us.

Looking at these rare events, we can see potentially suspicious activity changing role memberships.

In [ ]:
pd.set_option("display.html.table_schema", False)

likelihood_max=modelled_df["rarest_window3_likelihood"].max()
likelihood_min=modelled_df["rarest_window3_likelihood"].min()
slider_step = (likelihood_max - likelihood_min) / 20
start_val = likelihood_min + slider_step
threshold = widgets.FloatSlider(
description="Select likelihood threshold",
max=likelihood_max,
min=likelihood_min,
value=start_val,
step=start_val,
layout=widgets.Layout(width="60%"),
style={"description_width": "200px"},
)

def show_rows(change):
thresh = change["new"]
pd_disp.update(modelled_df[modelled_df["rarest_window3_likelihood"] < thresh])

threshold.observe(show_rows, names="value")
md("Move the slider to see event sessions below the selected <i>likelihood</i> threshold", "bold")
display(HTML("<hr>"))
display(threshold)
display(HTML("<hr>"))
md(f"Range is {likelihood_min:.7f} (min likelihood) to {likelihood_max:.7f} (max likelihood)<br><br><hr>")
pd_disp = display(
modelled_df[modelled_df["rarest_window3_likelihood"] < start_val],
display_id=True
)


Note for many events the output will be long

In [ ]:
import pprint

rarest_events = (
modelled_df[modelled_df["rarest_window3_likelihood"] < threshold.value]
[[
"UserId", "ClientIP", "begin", "end", "param_value_session", "rarest_window3_likelihood"
]]
.rename(columns={"rarest_window3_likelihood": "likelihood"})
.sort_values("likelihood")
)
for idx, (_, rarest_event) in enumerate(rarest_events.iterrows(), 1):
md(f"Event {idx}", "large")
display(pd.DataFrame(rarest_event[["UserId", "ClientIP", "begin", "end", "likelihood"]]))

md("<hr>")
md("Param session details:", "bold")
for cmd in rarest_event.param_value_session:
md(f"Command: {cmd.name}")
md(pprint.pformat(cmd.params))
md("<hr><br>")