<details>
Python Version: Python 3.6 (including Python 3.6 - AzureML)
Required Packages: kqlmagic, msticpy, pandas, numpy, matplotlib, seaborn, ipywidgets, ipython, scikit_learn, folium, maxminddb_geolite2
Platforms Supported:
Data Sources Required:
</details>
Brings together a series of queries and visualizations to help you investigate the security status of an Office 365 subscription and of individual user activities.
This notebook is intended to be illustrative of the types of data available in Office 365 Activity data and how to query and use them. It is not meant to be used as a prescriptive guide to how to navigate through the data.
Feel free to experiment and submit anything interesting you find to the community.
The next cell checks the Python and msticpy versions required by this notebook and performs the initial notebook setup.
This should complete without errors. If you encounter errors or warnings, look at the following two notebooks:
If you are running in the Azure Sentinel Notebooks environment (Azure Notebooks or Azure ML), you can run live versions of these notebooks:
You may also need to do some additional configuration to successfully use functions such as Threat Intelligence service lookup and Geo IP lookup.
There are more details about this in the ConfiguringNotebookEnvironment
notebook and in these documents:
from pathlib import Path
import importlib
import os
import sys
import warnings
from IPython.display import display, HTML, Markdown

REQ_PYTHON_VER = (3, 6)
REQ_MSTICPY_VER = (0, 5, 0)

display(HTML("<h3>Starting Notebook setup...</h3>"))
if Path("./utils/nb_check.py").is_file():
    from utils.nb_check import check_python_ver, check_mp_ver
    check_python_ver(min_py_ver=REQ_PYTHON_VER)
    try:
        check_mp_ver(min_msticpy_ver=REQ_MSTICPY_VER)
    except ImportError:
        !pip install --user --upgrade msticpy
        if "msticpy" in sys.modules:
            importlib.reload(sys.modules["msticpy"])
        else:
            import msticpy
        check_mp_ver(min_msticpy_ver=REQ_MSTICPY_VER)
from msticpy.nbtools import nbinit
extra_imports = [
"dns, reversename",
"dns, resolver",
"ipwhois, IPWhois",
"msticpy.sectools.ip_utils, get_ip_type",
"msticpy.sectools.ip_utils, get_whois_info",
]
nbinit.init_notebook(
namespace=globals(),
extra_imports=extra_imports,
);
WIDGET_DEFAULTS = {
"layout": widgets.Layout(width="95%"),
"style": {"description_width": "initial"},
}
<details>
Use the following syntax if you are authenticating with an Azure Active Directory AppId and secret:
%kql loganalytics://tenant(aad_tenant).workspace(WORKSPACE_ID).clientid(client_id).clientsecret(client_secret)
instead of
%kql loganalytics://code().workspace(WORKSPACE_ID)
Note: you may occasionally see a JavaScript error displayed at the end of the authentication - you can safely ignore this.
On successful authentication you should see a button that displays the popup schema.
To find your workspace ID, go to Log Analytics and look at the workspace properties.
</details>
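For reference, the client-secret connection string can be assembled in Python before handing it to kqlmagic. This is a sketch only — the tenant, workspace, and app values below are hypothetical placeholders; in practice load them from msticpyconfig.yaml or environment variables rather than hard-coding them in a notebook.

```python
# Sketch: build the AAD AppId/secret connection string used by kqlmagic.
# All four values are invented placeholders - substitute your own.
conn_template = (
    "loganalytics://tenant({tenant}).workspace({workspace})"
    ".clientid({client_id}).clientsecret({client_secret})"
)
conn_str = conn_template.format(
    tenant="my-aad-tenant",
    workspace="my-workspace-id",
    client_id="my-app-id",
    client_secret="my-app-secret",
)
# The resulting string is what follows the %kql magic on the command line.
```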
# To list configured workspaces run WorkspaceConfig.list_workspaces()
# WorkspaceConfig.list_workspaces()
# Authentication
ws_config = WorkspaceConfig()
qry_prov = QueryProvider(data_environment="LogAnalytics")
qry_prov.connect(connection_str=ws_config.code_connect_str)
table_index = qry_prov.schema_tables
if 'OfficeActivity' not in table_index:
    display(Markdown('<font color="red"><h2>Warning. Office Data not available.</h2></font><br>'
                     'Either Office 365 data has not been imported into the workspace or'
                     ' the OfficeActivity table is empty.<br>'
                     'This workbook is not usable with the current workspace.'))
# set the origin time to the time of our alert
md("For large O365 user bases, use short time ranges to keep the query times reasonable.")
o365_query_times = nbwidgets.QueryTime(
units='hours', before=6, after=0, max_before=72, max_after=12)
o365_query_times.display()
# Queries
office_ops_query = '''
OfficeActivity
| where TimeGenerated >= datetime({start})
| where TimeGenerated <= datetime({end})
| where UserType == 'Regular'
'''
office_ops_summary_query = '''
OfficeActivity
| where TimeGenerated >= datetime({start})
| where TimeGenerated <= datetime({end})
| where UserType == 'Regular'
| extend RecordOp = strcat(RecordType, '-', Operation)
| summarize OperationCount=count() by RecordType, Operation, UserId, UserAgent, ClientIP, bin(TimeGenerated, 1h)
'''
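The query templates above are plain Python format strings; the `{start}` and `{end}` placeholders are filled from the QueryTime widget before execution. A minimal sketch with hypothetical timestamps standing in for the widget values:

```python
office_ops_query = '''
OfficeActivity
| where TimeGenerated >= datetime({start})
| where TimeGenerated <= datetime({end})
| where UserType == 'Regular'
'''

# Hypothetical ISO-format timestamps standing in for o365_query_times values.
kql = office_ops_query.format(start="2023-01-01T00:00:00",
                              end="2023-01-01T06:00:00")
```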
print('Getting data...', end=' ')
o365_query = office_ops_summary_query.format(start=o365_query_times.start,
                                             end=o365_query_times.end)
%kql -query o365_query
office_ops_summary_df = _kql_raw_result_.to_dataframe()
print('done.')
(office_ops_summary_df
.assign(UserId = lambda x: x.UserId.str.lower())
.groupby(['RecordType', 'Operation'])
.aggregate({'ClientIP': 'nunique',
'UserId': 'nunique',
'OperationCount': 'sum'}))
unique_ip_op_ua = (
office_ops_summary_df.assign(UserId = lambda x: x.UserId.str.lower())
.groupby(['UserId', 'Operation'])
.aggregate({'ClientIP': 'nunique', 'OperationCount': 'sum'})
.reset_index()
.rename(columns={"ClientIP": "ClientIPCount"})
)
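The `nunique` aggregation above is what flags users operating from multiple IP addresses: duplicate IPs collapse to one, so `ClientIPCount > 1` means genuinely distinct addresses. On a tiny synthetic frame (column names mirror the notebook's; the data is invented):

```python
import pandas as pd

df = pd.DataFrame({
    "UserId": ["alice", "alice", "alice", "bob"],
    "Operation": ["FileDownloaded", "FileDownloaded", "FileDownloaded", "MailboxLogin"],
    "ClientIP": ["1.1.1.1", "2.2.2.2", "1.1.1.1", "3.3.3.3"],
    "OperationCount": [2, 1, 4, 1],
})
summary = (df.groupby(["UserId", "Operation"])
             .aggregate({"ClientIP": "nunique", "OperationCount": "sum"})
             .reset_index()
             .rename(columns={"ClientIP": "ClientIPCount"}))
# alice has 2 distinct IPs (1.1.1.1 is counted once) and 7 total operations.
```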
import math
multi_ip_users = unique_ip_op_ua[unique_ip_op_ua["ClientIPCount"] > 1]
if not multi_ip_users.empty:
    height = max(math.log10(len(multi_ip_users.UserId.unique())) * 10, 8)
    aspect = 10 / height
    user_ip_op = sns.catplot(x="ClientIPCount", y="UserId", hue='Operation',
                             data=multi_ip_users, height=height, aspect=aspect)
    md('Variability of IP Address Usage by user')
else:
    md('No IP Addresses')
iplocation = GeoLiteLookup()
restrict_cols = ['RecordType', 'TimeGenerated', 'Operation',
'UserId', 'ClientIP', 'UserAgent']
office_ops_summary = office_ops_summary_df[restrict_cols].assign(UserId = lambda x: x.UserId.str.lower())
office_ops_merged = pd.merge(unique_ip_op_ua.query('ClientIPCount > 1').drop(columns='ClientIPCount'),
office_ops_summary,
on=['UserId', 'Operation'])
if not office_ops_merged.empty:
    client_ips = (
        office_ops_merged
        .query('ClientIP != "<null>" & ClientIP != ""')['ClientIP']
        .drop_duplicates()
        .tolist()
    )
    ip_entities = []
    for ip in client_ips:
        ip_entity = entities.IpAddress(Address=ip)
        iplocation.lookup_ip(ip_entity=ip_entity)
        if ip_entity.Location:
            ip_dict = {'Address': ip_entity.Address}
            ip_dict.update(ip_entity.Location.properties)
            ip_entities.append(pd.Series(ip_dict))
    ip_locs_df = pd.DataFrame(data=ip_entities)
    display(ip_locs_df)
    office_ops_summary_ip_loc = pd.merge(office_ops_merged,
                                         ip_locs_df, left_on='ClientIP',
                                         right_on='Address', how='left')
    display(
        office_ops_summary_ip_loc
        .groupby(['UserId', 'CountryCode', 'City'])
        .aggregate({'ClientIP': 'nunique', 'OperationCount': 'sum'})
        .reset_index()
        .sort_values("ClientIP", ascending=False)
        .query("ClientIP > 1")
    )
else:
    md("No operations with > 1 IP Address")
th_wgt = widgets.IntSlider(value=1, min=1, max=50, step=1, description='Set IP Count Threshold', **WIDGET_DEFAULTS)
th_wgt
print('Getting data...', end=' ')
o365_query = office_ops_query.format(start=o365_query_times.start,
end=o365_query_times.end)
# %kql -query o365_query
# office_ops_df = _kql_raw_result_.to_dataframe()
office_ops_df = qry_prov.exec_query(o365_query)
print('done.')
# Get Locations for distinct IPs
client_ips = office_ops_df.query('ClientIP != "<null>" & ClientIP != ""')['ClientIP'].drop_duplicates().tolist()
ip_entities = []
for ip in client_ips:
    ip_entity = entities.IpAddress(Address=ip)
    iplocation.lookup_ip(ip_entity=ip_entity)
    if ip_entity.Location:
        ip_dict = {'Address': ip_entity.Address}
        ip_dict.update(ip_entity.Location.properties)
        ip_entities.append(pd.Series(ip_dict))
ip_locs_df = pd.DataFrame(data=ip_entities)
# Get rid of unneeded columns
restrict_cols = ['OfficeId', 'RecordType', 'TimeGenerated', 'Operation',
'OrganizationId', 'UserType', 'UserKey', 'OfficeWorkload',
'ResultStatus', 'OfficeObjectId', 'UserId', 'ClientIP',
'ActorIpAddress', 'UserAgent']
office_ops_restr = office_ops_df[restrict_cols]
if client_ips:
    # Merge main DF with IP location data
    office_ops_locs = pd.merge(
        office_ops_restr,
        ip_locs_df,
        how='right',
        left_on='ClientIP',
        right_on='Address',
        indicator=True
    )
    limit_op_types = ['FileDownloaded', 'FileModified', 'FileUploaded',
                      'MailboxLogin']
    office_ops_locs = office_ops_locs[office_ops_locs.Operation.isin(limit_op_types)]
    # Calculate operations grouped by location and operation type
    cm = sns.light_palette("yellow", as_cmap=True)
    country_by_op_count = (office_ops_locs[['Operation', 'RecordType', 'CountryCode', 'City']]
                           .groupby(['CountryCode', 'City', 'Operation'])
                           .count())
    display(country_by_op_count.unstack().fillna(0)
            .rename(columns={'RecordType': 'OperationCount'}))
    # .style.background_gradient(cmap=cm))
    # Group by Client IP, Country, Operation
    clientip_by_op_count = (office_ops_locs[['ClientIP', 'Operation', 'RecordType', 'CountryCode']]
                            .groupby(['ClientIP', 'CountryCode', 'Operation'])
                            .count())
    display(clientip_by_op_count.unstack().fillna(0)
            .rename(columns={'RecordType': 'OperationCount'}))
    # .style.background_gradient(cmap=cm))
else:
    md("No client IPs found")
from msticpy.nbtools.foliummap import FoliumMap
folium_map = FoliumMap(zoom_start=3)
def get_row_ip_loc(row):
    """Look up geolocation for the IP address in a DataFrame row."""
    try:
        _, ip_entity = iplocation.lookup_ip(ip_address=row.ClientIP)
        return ip_entity
    except ValueError:
        return None
off_ip_locs = (office_ops_df[['ClientIP']]
.drop_duplicates()
.apply(get_row_ip_loc, axis=1)
.tolist())
ip_locs = [ip_list[0] for ip_list in off_ip_locs if ip_list]
display(HTML('<h3>External IP Addresses seen in Office Activity</h3>'))
display(HTML('Numbered circles indicate multiple items - click to expand.'))
icon_props = {'color': 'purple'}
folium_map.add_ip_cluster(ip_entities=ip_locs,
**icon_props)
folium_map.center_map()
display(folium_map.folium_map)
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    display(Markdown('### Change in rate of Activity Class (RecordType) and Operation'))
    sns.relplot(data=office_ops_summary_df, x='TimeGenerated', y='OperationCount',
                kind='line', aspect=2, hue='RecordType')
    sns.relplot(data=office_ops_summary_df.query('RecordType == "SharePointFileOperation"'),
                x='TimeGenerated', y='OperationCount', hue='Operation', kind='line', aspect=2)
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    display(Markdown('### Identify Users/IPs with largest operation count'))
    office_ops = (
        office_ops_summary_df
        .query("OperationCount > 5")
        .assign(
            Account=lambda x: (x.UserId.str.extract('([^@]+)@.*', expand=False)).str.lower())
        .sort_values("OperationCount", ascending=False)
    )
    limit_op_types = ['FileDownloaded', 'FileModified', 'FileUploaded',
                      'MailboxLogin']
    office_ops = office_ops[office_ops.Operation.isin(limit_op_types)]
    sns.catplot(data=office_ops, y='Account', x='OperationCount',
                hue='Operation', aspect=2)
    display(office_ops.pivot_table('OperationCount', index=['Account'],
                                   columns='Operation'))  # .style.bar(color='orange', align='mid'))
off_ops_df = office_ops_df[['OfficeId', 'RecordType', 'TimeGenerated', 'Operation',
'OrganizationId', 'UserType', 'UserKey', 'OfficeWorkload',
'ResultStatus', 'OfficeObjectId', 'UserId', 'ClientIP','UserAgent']]
(pd.merge(off_ops_df, ip_locs_df, how='left', left_on='ClientIP', right_on='Address')
[['TimeGenerated', 'Operation', 'RecordType', 'OfficeWorkload',
'ResultStatus', 'UserId', 'ClientIP', 'UserAgent', 'CountryCode',
'CountryName', 'State', 'City', 'Longitude', 'Latitude'
]]
)
# set the origin time to the time of our alert
o365_query_times_user = nbwidgets.QueryTime(units='days',
before=10, after=1, max_before=60, max_after=20, auto_display=True)
distinct_users = office_ops_df[['UserId']].sort_values('UserId')['UserId'].str.lower().drop_duplicates().tolist()
distinct_users
user_select = nbwidgets.SelectString(description='Select User Id', item_list=distinct_users, auto_display=True)
# (items=distinct_users)
# Provides a summary view of a given account's activity
# For use when investigating an account that has been identified as having associated suspect activity or been otherwise compromised.
# All office activity by UserName using UI to set Time range
# Tags: #Persistence, #Discovery, #Lateral Movement, #Collection
user_activity_query = '''
OfficeActivity
| where TimeGenerated >= datetime({start})
| where TimeGenerated <= datetime({end})
| where UserKey has "{user}" or UserId has "{user}"
'''
print('Getting data...', end=' ')
o365_query = user_activity_query.format(start=o365_query_times_user.start,
end=o365_query_times_user.end,
user=user_select.value)
%kql -query o365_query
user_activity_df = _kql_raw_result_.to_dataframe()
print('done.')
user_activity_df
my_df = (user_activity_df[['OfficeId', 'RecordType', 'TimeGenerated', 'Operation',
                           'ResultStatus', 'UserId', 'ClientIP', 'UserAgent']]
         .groupby(['Operation', 'ResultStatus', 'ClientIP'])
         .aggregate({'OfficeId': 'count'})
         .rename(columns={'OfficeId': 'OperationCount'})
         .reset_index())
sns.catplot(x='OperationCount', y="Operation", hue="ClientIP", jitter=False, data=my_df, aspect=2.5);
my_df2 = (user_activity_df[['OfficeId', 'RecordType', 'TimeGenerated', 'Operation',
'ResultStatus', 'UserId', 'ClientIP','UserAgent']]
.groupby(['Operation'])
.aggregate({'OfficeId': 'count', 'ClientIP': 'nunique'})
.rename(columns={'OfficeId': 'OperationCount', 'ClientIP': 'IPCount'})
.reset_index())
sns.barplot(x='IPCount', y="Operation", data=my_df2);
num_ops = user_activity_df["Operation"].nunique()
nbdisplay.display_timeline(data=user_activity_df,
title='Office Operations',
source_columns=['OfficeWorkload', 'Operation', 'ClientIP', 'ResultStatus'],
group_by="Operation",
height=25 * num_ops)
def get_row_ip_loc(row):
    """Look up geolocation for the IP address in a DataFrame row."""
    try:
        _, ip_entity = iplocation.lookup_ip(ip_address=row.ClientIP)
        return ip_entity
    except ValueError:
        return None
from msticpy.nbtools.foliummap import FoliumMap
folium_map = FoliumMap(zoom_start=3)
off_ip_locs = (user_activity_df[['ClientIP']]
.drop_duplicates()
.apply(get_row_ip_loc, axis=1)
.tolist())
ip_locs = [ip_list[0] for ip_list in off_ip_locs if ip_list]
display(HTML('<h3>External IP Addresses seen in Office Activity</h3>'))
display(HTML('Numbered circles indicate multiple items - click to expand.'))
icon_props = {'color': 'purple'}
folium_map.add_ip_cluster(ip_entities=ip_locs,
**icon_props)
folium_map.center_map()
display(folium_map.folium_map)
if 'AzureNetworkAnalytics_CL' not in table_index:
    md("""
    <font color="orange">
    <h2>Warning. Azure network flow data not available.</h2></font><br>
    This section of the notebook is not usable with the current workspace.
    """)
# Build the query parameters
all_user_ips = user_activity_df['ClientIP'].tolist()
all_user_ips = [ip for ip in all_user_ips if ip and ip != '<null>']
# Some Office IPs have dest port appended to address
ipv4_ips = [ip.split(":")[0] for ip in all_user_ips if "." in ip]
ipv6_ips = [ip for ip in all_user_ips if "." not in ip]
all_ips = list(set(ipv4_ips + ipv6_ips))
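The IPv4/IPv6 split above relies on Office logs sometimes appending a destination port to IPv4 addresses (`a.b.c.d:port`), while IPv6 addresses use colons internally and must not be split. The same cleanup logic, isolated as a small sketch with invented sample addresses:

```python
def strip_office_port(ip):
    """Remove a trailing :port from IPv4 addresses; pass IPv6 through intact."""
    # IPv4 addresses contain dots, so any ':' there introduces a port suffix;
    # IPv6 addresses have no dots and their colons are part of the address.
    if "." in ip:
        return ip.split(":")[0]
    return ip

raw_ips = ["203.0.113.10:443", "203.0.113.10", "2001:db8::1", "<null>", ""]
cleaned = sorted({strip_office_port(ip) for ip in raw_ips if ip and ip != "<null>"})
```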
az_net_comms_df = (
qry_prov
.Network
.list_azure_network_flows_by_ip(start=o365_query_times_user.start,
end=o365_query_times_user.end,
ip_address_list=all_ips)
)
net_default_cols = ['FlowStartTime', 'FlowEndTime', 'VMName', 'VMIPAddress',
'PublicIPs', 'SrcIP', 'DestIP', 'L4Protocol', 'L7Protocol',
'DestPort', 'FlowDirection', 'AllowedOutFlows',
'AllowedInFlows']
# %kql -query az_net_query_byip
# az_net_comms_df = _kql_raw_result_.to_dataframe()
if az_net_comms_df.empty:
    md_warn("No network flow data available in AzureNetworkAnalytics_CL table"
            "\nRemainder of cell will not work.")
    raise ValueError("No network flow data available in AzureNetworkAnalytics_CL table")
import warnings

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    az_net_comms_df['TotalAllowedFlows'] = (az_net_comms_df['AllowedOutFlows']
                                            + az_net_comms_df['AllowedInFlows'])
    sns.catplot(x="L7Protocol", y="TotalAllowedFlows", col="FlowDirection",
                data=az_net_comms_df)
    sns.relplot(x="FlowStartTime", y="TotalAllowedFlows",
                col="FlowDirection", kind="line",
                hue="L7Protocol", data=az_net_comms_df).set_xticklabels(rotation=50)
cols = ['VMName', 'VMIPAddress', 'PublicIPs', 'SrcIP', 'DestIP', 'L4Protocol',
'L7Protocol', 'DestPort', 'FlowDirection', 'AllExtIPs', 'TotalAllowedFlows']
flow_index = az_net_comms_df[cols].copy()
def get_source_ip(row):
    if row.FlowDirection == 'O':
        return row.VMIPAddress if row.VMIPAddress else row.SrcIP
    else:
        return row.AllExtIPs if row.AllExtIPs else row.DestIP

def get_dest_ip(row):
    if row.FlowDirection == 'O':
        return row.AllExtIPs if row.AllExtIPs else row.DestIP
    else:
        return row.VMIPAddress if row.VMIPAddress else row.SrcIP
flow_index['source'] = flow_index.apply(get_source_ip, axis=1)
flow_index['target'] = flow_index.apply(get_dest_ip, axis=1)
flow_index['value'] = flow_index['L7Protocol']
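To make the direction handling concrete, here is the same source mapping applied to a tiny synthetic flow table (IPs and values are invented): outbound ('O') flows take the VM as the source and the external IP as the target; inbound flows are the reverse.

```python
import pandas as pd

# Synthetic rows mirroring the AzureNetworkAnalytics_CL columns used above.
flows = pd.DataFrame([
    {"FlowDirection": "O", "VMIPAddress": "10.0.0.4", "SrcIP": "",
     "AllExtIPs": "203.0.113.9", "DestIP": ""},
    {"FlowDirection": "I", "VMIPAddress": "10.0.0.4", "SrcIP": "",
     "AllExtIPs": "198.51.100.7", "DestIP": ""},
])

def src_ip(row):
    # Outbound: the VM originates the flow; inbound: the external IP does.
    if row.FlowDirection == "O":
        return row.VMIPAddress if row.VMIPAddress else row.SrcIP
    return row.AllExtIPs if row.AllExtIPs else row.DestIP

flows["source"] = flows.apply(src_ip, axis=1)
```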
cm = sns.light_palette("green", as_cmap=True)
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    display(flow_index[['source', 'target', 'value', 'L7Protocol',
                        'FlowDirection', 'TotalAllowedFlows']]
            .groupby(['source', 'target', 'value', 'L7Protocol', 'FlowDirection'])
            .sum().unstack().style.background_gradient(cmap=cm))
nbdisplay.display_timeline(data=az_net_comms_df.query('AllowedOutFlows > 0'),
overlay_data=az_net_comms_df.query('AllowedInFlows > 0'),
title='Network Flows (out=blue, in=green)',
time_column='FlowStartTime',
source_columns=['FlowType', 'AllExtIPs', 'L7Protocol', 'FlowDirection'],
height=300)
from msticpy.sectools.eventcluster import (dbcluster_events,
add_process_features,
char_ord_score,
token_count,
delim_count)
restrict_cols = ['OfficeId', 'RecordType', 'TimeGenerated', 'Operation',
'OrganizationId', 'UserType', 'UserKey', 'OfficeWorkload',
'ResultStatus', 'OfficeObjectId', 'UserId', 'ClientIP','UserAgent']
feature_office_ops = office_ops_df[restrict_cols]
feature_office_ops = (pd.merge(feature_office_ops,
                               ip_locs_df, how='left',
                               left_on='ClientIP', right_on='Address')
                      .fillna(''))
# feature_office_ops = office_ops_df.copy()
feature_office_ops['country_num'] = feature_office_ops.apply(lambda x: char_ord_score(x.CountryCode) if x.CountryCode else 0, axis=1)
feature_office_ops['ua_tokens'] = feature_office_ops.apply(lambda x: char_ord_score(x.UserAgent), axis=1)
feature_office_ops['user_num'] = feature_office_ops.apply(lambda x: char_ord_score(x.UserId), axis=1)
feature_office_ops['op_num'] = feature_office_ops.apply(lambda x: char_ord_score(x.Operation), axis=1)
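The clustering features above are simple numeric encodings of categorical strings, so identical strings land at identical coordinates and rare combinations fall into small clusters. The msticpy `char_ord_score` implementation differs in detail, but the idea can be illustrated with a stand-in that sums character ordinals (illustrative only, not the library function):

```python
def char_score_sketch(value):
    """Illustrative stand-in: map a string to a number by summing char ordinals."""
    # Identical strings always map to the same value, so rows sharing
    # CountryCode / Operation / UserAgent cluster at the same coordinates.
    return sum(ord(ch) for ch in value) if value else 0

codes = ["US", "US", "GB", ""]
scores = [char_score_sketch(c) for c in codes]
```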
# you might need to play around with the max_cluster_distance parameter.
# decreasing this gives more clusters.
(clustered_ops, dbcluster, x_data) = dbcluster_events(data=feature_office_ops,
cluster_columns=['country_num',
'op_num',
'ua_tokens'],
time_column='TimeGenerated',
max_cluster_distance=0.0001)
print('Number of input events:', len(feature_office_ops))
print('Number of clustered events:', len(clustered_ops))
display(Markdown('#### Rarest combinations'))
display(clustered_ops[['TimeGenerated', 'RecordType',
'Operation', 'UserId', 'UserAgent', 'ClusterSize',
'OfficeObjectId', 'CountryName']]
.query('ClusterSize <= 2')
.sort_values('ClusterSize', ascending=True))
display(Markdown('#### Most common operations'))
display((clustered_ops[['RecordType', 'Operation', 'ClusterSize']]
.sort_values('ClusterSize', ascending=False)
.head(10)))
msticpyconfig.yaml configuration file
You can configure primary and secondary TI providers and any required parameters in the msticpyconfig.yaml file. This is read from the current directory, or you can set an environment variable (MSTICPYCONFIG) pointing to its location.
To configure this file, see the ConfigureNotebookEnvironment notebook.
print('List of current DataFrames in Notebook')
print('-' * 50)
current_vars = list(locals().keys())
for var_name in current_vars:
    if isinstance(locals()[var_name], pd.DataFrame) and not var_name.startswith('_'):
        print(var_name)
To save the contents of a pandas DataFrame to an Excel spreadsheet, use the following syntax:
writer = pd.ExcelWriter('myWorksheet.xlsx')
my_data_frame.to_excel(writer, 'Sheet1')
writer.save()