Sharing data, creating documents and doing public demonstrations often require that data containing PII or other sensitive material be obfuscated.
MSTICPy contains a simple library to obfuscate data using hashing and random mapping of values. You can use these functions on single data items or on entire DataFrames.
import pandas as pd
from msticpy.common.utility import md
from msticpy.data import data_obfus
netflow_df = pd.read_csv("data/az_net_flows.csv")
# list is imported as string from csv - convert back to list with eval
def str_to_list(val):
    if isinstance(val, str):
        return eval(val)
    # Leave non-string values (e.g. NaN) unchanged instead of returning None
    return val
netflow_df["PublicIPs"] = netflow_df["PublicIPs"].apply(str_to_list)
# Define subset of output columns
out_cols = [
    'TenantId', 'TimeGenerated', 'FlowStartTime',
    'ResourceGroup', 'VMName', 'VMIPAddress', 'PublicIPs',
    'SrcIP', 'DestIP', 'L4Protocol', 'AllExtIPs'
]
netflow_df = netflow_df[out_cols]
Here we're importing individual functions but you can access them with the single import statement above as:
data_obfus.hash_string(...)
etc.
Note: In the next cell we're using a function to output documentation and examples. You can ignore this. The usage of each function is shown in the output of the subsequent cells.
from msticpy.data.data_obfus import (
    hash_dict,
    hash_ip,
    hash_item,
    hash_list,
    hash_sid,
    hash_string,
    replace_guid
)
# Function to automate/format the examples below. You can ignore this
def show_func(func, examples):
    func_name = func.__name__
    if func.__name__.startswith("_"):
        func_name = func_name[1:]
    md(func_name, "bold")
    print(func.__doc__)
    md("Examples", "bold")
    for example in examples:
        if isinstance(example, tuple):
            arg, delim = example
            print(
                f"{func_name}('{arg}', delim='{delim}') =>", func(*example)
            )
        else:
            print(
                f"{func_name}('{example}') =>", func(example)
            )
    md("<br><hr><br>")
md("hash_string", "large, bold")
md("hash_string does a simple hash of the input. If the input is a numeric string it will output a numeric string.")
show_func(hash_string, ["sensitive data", "12345"])
hash_string
hash_string does a simple hash of the input. If the input is a numeric string it will output a numeric string.
hash_string
Hash a simple string.

Parameters
----------
input_str : str
    The input string

Returns
-------
str
    The obfuscated output string

Examples
hash_string('sensitive data') => jdiqcnrqmlidkd
hash_string('12345') => 59944
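To make the idea concrete, here is a minimal sketch of a deterministic string hash in the spirit of hash_string. It is not MSTICPy's implementation: the name sketch_hash_string, the use of SHA-256 and the choice to keep the output the same length as the input are all assumptions made for illustration. Like the output above, it emits only digits for a numeric input and only lowercase letters otherwise.

```python
import hashlib
import string


def sketch_hash_string(input_str: str) -> str:
    """Return a deterministic obfuscation of input_str (illustrative only)."""
    digest = hashlib.sha256(input_str.encode("utf-8")).digest()
    # Numeric strings map to digits, everything else to lowercase letters.
    alphabet = string.digits if input_str.isdigit() else string.ascii_lowercase
    # Reuse digest bytes cyclically so the output length matches the input.
    return "".join(
        alphabet[digest[i % len(digest)] % len(alphabet)]
        for i in range(len(input_str))
    )
```

Because the result depends only on the input, the same value always maps to the same output, which keeps joins and group-by operations meaningful on obfuscated data.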
md("hash_item", "large, bold")
md("hash_item allows specification of delimiters. Useful for preserving the look of domains, emails, etc.")
show_func(hash_item, [("sensitive data", " "), ("most-sensitive-data/here", " /-")])
hash_item
hash_item allows specification of delimiters. Useful for preserving the look of domains, emails, etc.
hash_item
Hash a simple string.

Parameters
----------
input_item : str
    The input string
delim: str, optional
    A string of delimiters to use to split the input string
    prior to hashing.

Returns
-------
str
    The obfuscated output string

Examples
hash_item('sensitive data', delim=' ') => kdneqoiia laoe
hash_item('most-sensitive-data/here', delim=' /-') => kmea-kdneqoiia-laoe/fcec
md("hash_ip", "large, bold")
md("hash_ip will output random mappings of input IP V4 and V6 addresses.")
md("Within a Python session the mapping will remain constant.")
show_func(hash_ip, [
"192.168.3.1",
"2001:0db8:85a3:0000:0000:8a2e:0370:7334",
["192.168.3.1", "192.168.5.2", "192.168.10.2"],
])
hash_ip
hash_ip will output random mappings of input IP V4 and V6 addresses.
Within a Python session the mapping will remain constant.
hash_ip
Hash IP address or list of IP addresses.

Parameters
----------
input_item : Union[List[str], str]
    List of IP addresses or single IP address.

Returns
-------
Union[List[str], str]
    List of hashed addresses or single address.
    (depending on input)

Examples
hash_ip('192.168.3.1') => 155.74.98.168
hash_ip('2001:0db8:85a3:0000:0000:8a2e:0370:7334') => 85d6:7819:9cce:9af1:9af1:24ad:d338:7d03
hash_ip('['192.168.3.1', '192.168.5.2', '192.168.10.2']') => ['155.74.98.168', '155.74.125.39', '155.74.224.39']
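The "constant within a session" behaviour can be pictured as a cache of random replacements. The following is a minimal sketch under that assumption, not the hash_ip implementation: sketch_map_ip and _ip_map are hypothetical names. Unlike the output above, where addresses sharing a prefix appear to keep a shared prefix, this sketch maps every address independently.

```python
import ipaddress
import random
from typing import Dict

# Session-level cache: the same input yields the same output
# for as long as this dictionary lives.
_ip_map: Dict[str, str] = {}


def sketch_map_ip(ip_addr: str) -> str:
    """Map an IPv4 or IPv6 address to a random address of the same version."""
    if ip_addr not in _ip_map:
        parsed = ipaddress.ip_address(ip_addr)  # raises ValueError on bad input
        random_value = random.getrandbits(parsed.max_prefixlen)
        # Build an address of the same family from the random integer.
        _ip_map[ip_addr] = str(type(parsed)(random_value))
    return _ip_map[ip_addr]
```

Since the mapping is random rather than derived from the input, a new Python session produces different outputs for the same addresses.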
md("hash_sid", "large, bold")
md("hash_sid will randomize the domain-specific parts of a SID. It preserves built-in SIDs and well known RIDs (e.g. Admins -500)")
show_func(hash_sid, ["S-1-5-21-1180699209-877415012-3182924384-1004", "S-1-5-18"])
hash_sid
hash_sid will randomize the domain-specific parts of a SID. It preserves built-in SIDs and well known RIDs (e.g. Admins -500)
hash_sid
Hash a SID preserving well-known SIDs and the RID.

Parameters
----------
sid : str
    SID string

Returns
-------
str
    Hashed SID

Examples
hash_sid('S-1-5-21-1180699209-877415012-3182924384-1004') => S-1-5-21-3321821741-636458740-4143214142-1004
hash_sid('S-1-5-18') => S-1-5-18
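As a rough illustration of what "randomize the domain-specific parts" means, the sketch below replaces only the three domain sub-authorities of a domain SID and leaves the S-1-5-21 prefix and the trailing RID alone. This is an assumed approach for illustration; sketch_hash_sid is a hypothetical name and the real hash_sid may derive its values differently.

```python
import hashlib


def sketch_hash_sid(sid: str) -> str:
    """Obfuscate the domain sub-authorities of a SID, keeping prefix and RID."""
    parts = sid.split("-")
    # Domain SIDs look like S-1-5-21-<a>-<b>-<c>-<RID>; leave anything else alone.
    if len(parts) != 8 or parts[:4] != ["S", "1", "5", "21"]:
        return sid
    hashed = [
        # Derive a 32-bit number from each domain sub-authority.
        str(int.from_bytes(hashlib.sha256(part.encode()).digest()[:4], "big"))
        for part in parts[4:7]
    ]
    return "-".join(parts[:4] + hashed + parts[7:])
```

Keeping the RID means that an obfuscated SID still shows whether it refers to, say, the built-in administrator (RID 500) without revealing the domain.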
md("hash_list", "large, bold")
md("hash_list will randomize a list of items preserving the list structure.")
show_func(hash_list, [["S-1-5-21-1180699209-877415012-3182924384-1004", "S-1-5-18"]])
hash_list
hash_list will randomize a list of items preserving the list structure.
hash_list
Hash list of strings.

Parameters
----------
item_list : List[str]
    Input list

Returns
-------
List[str]
    Hashed list

Examples
hash_list('['S-1-5-21-1180699209-877415012-3182924384-1004', 'S-1-5-18']') => ['elkbjiboklpknokdeflikamojqjflqmicqiorqfbqboqe', 'nrllmpbd']
md("hash_dict", "large, bold")
md("hash_dict will randomize a dict of items preserving the structure and the dict keys.")
show_func(hash_dict, [{"SID1": "S-1-5-21-1180699209-877415012-3182924384-1004", "SID2": "S-1-5-18"}])
hash_dict
hash_dict will randomize a dict of items preserving the structure and the dict keys.
hash_dict
Hash dictionary values.

Parameters
----------
item_dict : Dict[str, Union[Dict[str, Any], List[Any], str]]
    Input item can be a Dict of strings, lists or other
    dictionaries.

Returns
-------
Dict[str, Any]
    Dictionary with hashed values.

Examples
hash_dict('{'SID1': 'S-1-5-21-1180699209-877415012-3182924384-1004', 'SID2': 'S-1-5-18'}') => {'SID1': 'elkbjiboklpknokdeflikamojqjflqmicqiorqfbqboqe', 'SID2': 'nrllmpbd'}
md("replace_guid", "large, bold")
md("replace_guid will output a random UUID mapped to the input.")
md("An input GUID will be mapped to the same newly-generated output UUID")
md("You can see that UUID #4 is the same as #1 and mapped to the same output UUID.")
show_func(replace_guid, [
"cf1b0b29-08ae-4528-839a-5f66eca2cce9",
"ed63d29e-6288-4d66-b10d-8847096fc586",
"ac561203-99b2-4067-a525-60d45ea0d7ff",
"cf1b0b29-08ae-4528-839a-5f66eca2cce9",
])
replace_guid
replace_guid will output a random UUID mapped to the input.
An input GUID will be mapped to the same newly-generated output UUID
You can see that UUID #4 is the same as #1 and mapped to the same output UUID.
replace_guid
Replace GUID/UUID with mapped random UUID.

Parameters
----------
guid : str
    Input UUID.

Returns
-------
str
    Mapped UUID

Examples
replace_guid('cf1b0b29-08ae-4528-839a-5f66eca2cce9') => 301b9777-41bd-4f92-8dae-f8da9172aa08
replace_guid('ed63d29e-6288-4d66-b10d-8847096fc586') => 7c3fa4d1-4ad1-4242-bac1-94538b9e0b8e
replace_guid('ac561203-99b2-4067-a525-60d45ea0d7ff') => 477b13cb-20fa-4b95-b794-0495edd89175
replace_guid('cf1b0b29-08ae-4528-839a-5f66eca2cce9') => 301b9777-41bd-4f92-8dae-f8da9172aa08
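A mapping of this kind can be sketched with a dictionary that remembers the replacement generated for each input. This is an illustration of the general technique rather than the replace_guid source; sketch_replace_guid and _guid_map are hypothetical names, and lower-casing the key is an added assumption so that differently-cased copies of one GUID share a replacement.

```python
import uuid
from typing import Dict

# Remembers the replacement generated for each input GUID.
_guid_map: Dict[str, str] = {}


def sketch_replace_guid(guid: str) -> str:
    """Return a random UUID that stays fixed for a given input GUID."""
    key = guid.lower()
    if key not in _guid_map:
        _guid_map[key] = str(uuid.uuid4())
    return _guid_map[key]
```

Calling it twice with the same GUID returns the same replacement, which mirrors the first and fourth examples above.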
We can use the msticpy pandas extension to obfuscate an entire DataFrame.
The obfuscation library contains a mapping for a number of common field names. You can view this list by displaying the attribute:
data_obfus.OBFUS_COL_MAP
In the first example, the TenantId and ResourceGroup columns have been obfuscated.
display(netflow_df.head(3))
netflow_df.head(3).mp_obf.obfuscate()
| | TenantId | TimeGenerated | FlowStartTime | ResourceGroup | VMName | VMIPAddress | PublicIPs | SrcIP | DestIP | L4Protocol | AllExtIPs |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 52b1ab41-869e-4138-9e40-2a4457f09bf0 | 2019-02-12 14:22:40.697 | 2019-02-12 13:00:07.000 | asihuntomsworkspacerg | msticalertswin1 | 10.0.3.5 | [65.55.44.109] | NaN | NaN | T | 65.55.44.109 |
1 | 52b1ab41-869e-4138-9e40-2a4457f09bf0 | 2019-02-12 14:22:40.681 | 2019-02-12 13:00:48.000 | asihuntomsworkspacerg | msticalertswin1 | 10.0.3.5 | [13.71.172.130, 13.71.172.128] | NaN | NaN | T | 13.71.172.128 |
2 | 52b1ab41-869e-4138-9e40-2a4457f09bf0 | 2019-02-12 14:22:40.681 | 2019-02-12 13:00:48.000 | asihuntomsworkspacerg | msticalertswin1 | 10.0.3.5 | [13.71.172.130, 13.71.172.128] | NaN | NaN | T | 13.71.172.130 |
obfuscating columns: TenantId, ResourceGroup, done
| | TenantId | TimeGenerated | FlowStartTime | ResourceGroup | VMName | VMIPAddress | PublicIPs | SrcIP | DestIP | L4Protocol | AllExtIPs |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 56260b2e-9d3f-4ad9-8e65-e4a9230fd5aa | 2019-02-12 14:22:40.697 | 2019-02-12 13:00:07.000 | ibmkajbmepnmiaeilfofa | msticalertswin1 | 10.0.3.5 | [65.55.44.109] | NaN | NaN | T | 65.55.44.109 |
1 | 56260b2e-9d3f-4ad9-8e65-e4a9230fd5aa | 2019-02-12 14:22:40.681 | 2019-02-12 13:00:48.000 | ibmkajbmepnmiaeilfofa | msticalertswin1 | 10.0.3.5 | [13.71.172.130, 13.71.172.128] | NaN | NaN | T | 13.71.172.128 |
2 | 56260b2e-9d3f-4ad9-8e65-e4a9230fd5aa | 2019-02-12 14:22:40.681 | 2019-02-12 13:00:48.000 | ibmkajbmepnmiaeilfofa | msticalertswin1 | 10.0.3.5 | [13.71.172.130, 13.71.172.128] | NaN | NaN | T | 13.71.172.130 |
Note in the previous example that the VMName, VMIPAddress, PublicIPs and AllExtIPs columns were unchanged.
We can add these columns to a custom mapping dictionary and re-run the obfuscation. See the later section on Creating Custom Mappings.
col_map = {
    "VMName": ".",
    "VMIPAddress": "ip",
    "PublicIPs": "ip",
    "AllExtIPs": "ip"
}
netflow_df.head(3).mp_obf.obfuscate(column_map=col_map)
obfuscating columns: TenantId, ResourceGroup, VMName, VMIPAddress, PublicIPs, AllExtIPs, done
| | TenantId | TimeGenerated | FlowStartTime | ResourceGroup | VMName | VMIPAddress | PublicIPs | SrcIP | DestIP | L4Protocol | AllExtIPs |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 56260b2e-9d3f-4ad9-8e65-e4a9230fd5aa | 2019-02-12 14:22:40.697 | 2019-02-12 13:00:07.000 | ibmkajbmepnmiaeilfofa | fmlmbnlpdcbnbnn | 224.21.98.125 | [239.3.143.131] | NaN | NaN | T | 239.3.143.131 |
1 | 56260b2e-9d3f-4ad9-8e65-e4a9230fd5aa | 2019-02-12 14:22:40.681 | 2019-02-12 13:00:48.000 | ibmkajbmepnmiaeilfofa | fmlmbnlpdcbnbnn | 224.21.98.125 | [188.84.223.185, 188.84.223.48] | NaN | NaN | T | 188.84.223.48 |
2 | 56260b2e-9d3f-4ad9-8e65-e4a9230fd5aa | 2019-02-12 14:22:40.681 | 2019-02-12 13:00:48.000 | ibmkajbmepnmiaeilfofa | fmlmbnlpdcbnbnn | 224.21.98.125 | [188.84.223.185, 188.84.223.48] | NaN | NaN | T | 188.84.223.185 |
You can also call the standard function obfuscate_df to perform the same operation on the DataFrame passed as the data parameter.
data_obfus.obfuscate_df(data=netflow_df.head(3), column_map=col_map)
obfuscating columns: TenantId, ResourceGroup, VMName, VMIPAddress, PublicIPs, AllExtIPs, done
| | TenantId | TimeGenerated | FlowStartTime | ResourceGroup | VMName | VMIPAddress | PublicIPs | SrcIP | DestIP | L4Protocol | AllExtIPs |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 56260b2e-9d3f-4ad9-8e65-e4a9230fd5aa | 2019-02-12 14:22:40.697 | 2019-02-12 13:00:07.000 | ibmkajbmepnmiaeilfofa | fmlmbnlpdcbnbnn | 224.21.98.125 | [239.3.143.131] | NaN | NaN | T | 239.3.143.131 |
1 | 56260b2e-9d3f-4ad9-8e65-e4a9230fd5aa | 2019-02-12 14:22:40.681 | 2019-02-12 13:00:48.000 | ibmkajbmepnmiaeilfofa | fmlmbnlpdcbnbnn | 224.21.98.125 | [188.84.223.185, 188.84.223.48] | NaN | NaN | T | 188.84.223.48 |
2 | 56260b2e-9d3f-4ad9-8e65-e4a9230fd5aa | 2019-02-12 14:22:40.681 | 2019-02-12 13:00:48.000 | ibmkajbmepnmiaeilfofa | fmlmbnlpdcbnbnn | 224.21.98.125 | [188.84.223.185, 188.84.223.48] | NaN | NaN | T | 188.84.223.185 |
A custom mapping dictionary has entries in the following form:
"ColumnName": "operation"
The operation defines the type of obfuscation method used for that column. Both the column and the operation code must be quoted.
operation code | obfuscation function |
---|---|
"uuid" | replace_guid |
"ip" | hash_ip |
"str" | hash_string |
"dict" | hash_dict |
"list" | hash_list |
"sid" | hash_sid |
"null" | "null"* |
None | hash_string* |
delims_str | hash_item* |
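*The last three items require some explanation:
null: the "null" operation code means set the value to empty, i.e. delete the value in the output frame.
None: a mapping value of None defaults to hash_string.
delims_str: any other string is treated as a string of delimiter characters and the column is hashed with hash_item, as described in the next section.
NOTE: If you want to only use custom mappings and ignore the built-in mapping table, specify use_default=False as a parameter to either mp_obf.obfuscate() or obfuscate_df.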
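One way to read the operation-code table is as a lookup with two fall-through cases. The sketch below encodes that reading; it is not taken from the MSTICPy source, sketch_resolve_operation is a hypothetical helper, and it returns function names as strings purely so the example stays self-contained.

```python
from typing import Optional, Tuple

# Operation codes that name a specific obfuscation function.
_NAMED_OPERATIONS = {
    "uuid": "replace_guid",
    "ip": "hash_ip",
    "str": "hash_string",
    "dict": "hash_dict",
    "list": "hash_list",
    "sid": "hash_sid",
}


def sketch_resolve_operation(operation: Optional[str]) -> Tuple[str, Optional[str]]:
    """Return (function_name, delimiters) for a custom mapping value."""
    if operation is None:
        return "hash_string", None  # None falls back to the plain string hash
    if operation == "null":
        return "null", None  # "null" means blank out the value
    if operation in _NAMED_OPERATIONS:
        return _NAMED_OPERATIONS[operation], None
    # Anything else is interpreted as a delimiters string for hash_item.
    return "hash_item", operation
```

Under this reading, the "VMName": "." entry used earlier resolves to hash_item with "." as the delimiter, while "VMIPAddress": "ip" resolves to hash_ip.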
Using hash_item with delimiters to preserve the structure/look of the hashed input
Using hash_item with a delimiters string lets you create output that somewhat resembles the input type. The delimiters string is specified as a simple string of delimiter characters, e.g. "@\,-"
The input string is broken into substrings using each of the delimiters in the delims_str. The substrings are individually hashed and the resulting substrings joined together using the original delimiters. The string is split in the order of the characters in the delims string.
This allows you to create hashed values that bear some resemblance to the original structure of the string. This might be useful for email addresses, qualified domain names and other structured text.
For example: ian@mydomain.com
Using the simple hash_string function, the output bears no resemblance to an email address:
hash_string("ian@mydomain.com")
'prqocjmdpbodrafn'
Using hash_item and specifying the expected delimiters, we get something like an email address in the output:
hash_item("ian@mydomain.com", "@.")
'bnm@blbbrfbk.pjb'
You use hash_item in your Custom Mapping dictionary by specifying a delimiters string as the operation.
You should check that you have correctly masked all of the columns needed. There is a function check_obfuscation to do this.
Use silent=False to print out the results. If you use silent=True (the default), it returns two lists: the unchanged columns and the obfuscated columns.
data_obfus.check_obfuscation(
    data: pandas.core.frame.DataFrame,
    orig_data: pandas.core.frame.DataFrame,
    index: int = 0,
    silent=True,
) -> Union[Tuple[List[str], List[str]], NoneType]

Check the obfuscation results for a row.

Parameters
----------
data : pd.DataFrame
    Obfuscated DataFrame
orig_data : pd.DataFrame
    Original DataFrame
index : int, optional
    The row to check, by default 0
silent: bool
    If False the function returns no output and
    returns lists of changed and unchanged columns.
    By default, True

Returns
-------
Optional[Tuple[List[str], List[str]]] :
    If silent is True returns a tuple of unchanged, changed
    items. If False, returns None.
Note: By default this checks only the first row of the data. You can check other rows using the index parameter.
Warning: The two DataFrames should have a matching index and ordering, because the check compares the values in each column and treats values that do not match as obfuscated.
We first test the partially-obfuscated DataFrame from earlier.
partly_obfus_df = netflow_df.head(3).mp_obf.obfuscate()
fully_obfus_df = netflow_df.head(3).mp_obf.obfuscate(column_map=col_map)
data_obfus.check_obfuscation(partly_obfus_df, netflow_df.head(3), silent=False)
obfuscating columns: TenantId, ResourceGroup, done
obfuscating columns: TenantId, ResourceGroup, VMName, VMIPAddress, PublicIPs, AllExtIPs, done
===== Start Check ====
Unchanged columns:
------------------
AllExtIPs: 65.55.44.109
FlowStartTime: 2019-02-12 13:00:07.000
L4Protocol: T
PublicIPs: ['65.55.44.109']
TimeGenerated: 2019-02-12 14:22:40.697
VMIPAddress: 10.0.3.5
VMName: msticalertswin1
Obfuscated columns:
--------------------
DestIP: nan ----> nan
ResourceGroup: asihuntomsworkspacerg ----> ibmkajbmepnmiaeilfofa
SrcIP: nan ----> nan
TenantId: 52b1ab41-869e-4138-9e40-2a4457f09bf0 ----> 56260b2e-9d3f-4ad9-8e65-e4a9230fd5aa
====== End Check =====
Checking the fully-obfuscated data set
data_obfus.check_obfuscation(fully_obfus_df, netflow_df.head(3), silent=False)
===== Start Check ====
Unchanged columns:
------------------
FlowStartTime: 2019-02-12 13:00:07.000
L4Protocol: T
TimeGenerated: 2019-02-12 14:22:40.697
Obfuscated columns:
--------------------
AllExtIPs: 65.55.44.109 ----> 239.3.143.131
DestIP: nan ----> nan
PublicIPs: ['65.55.44.109'] ----> ['239.3.143.131']
ResourceGroup: asihuntomsworkspacerg ----> ibmkajbmepnmiaeilfofa
SrcIP: nan ----> nan
TenantId: 52b1ab41-869e-4138-9e40-2a4457f09bf0 ----> 56260b2e-9d3f-4ad9-8e65-e4a9230fd5aa
VMIPAddress: 10.0.3.5 ----> 224.21.98.125
VMName: msticalertswin1 ----> fmlmbnlpdcbnbnn
====== End Check =====
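In both checks DestIP and SrcIP are listed as obfuscated even though each side shows nan. One plausible explanation, given that the check compares values for equality, is that NaN never compares equal to itself in Python. The sketch below illustrates that effect with plain dictionaries; sketch_check_row is a hypothetical helper and does not reproduce the check_obfuscation implementation.

```python
import math
from typing import Any, Dict, List, Tuple


def sketch_check_row(
    obfuscated: Dict[str, Any], original: Dict[str, Any]
) -> Tuple[List[str], List[str]]:
    """Return (unchanged, changed) column names for a single row."""
    unchanged: List[str] = []
    changed: List[str] = []
    for column in sorted(original):
        # Values that do not compare equal are judged to be obfuscated.
        if obfuscated[column] == original[column]:
            unchanged.append(column)
        else:
            changed.append(column)
    return unchanged, changed


row_before = {"VMName": "msticalertswin1", "L4Protocol": "T", "SrcIP": math.nan}
row_after = {"VMName": "fmlmbnlpdcbnbnn", "L4Protocol": "T", "SrcIP": math.nan}
unchanged_cols, changed_cols = sketch_check_row(row_after, row_before)
print(unchanged_cols)  # ['L4Protocol']
print(changed_cols)    # ['SrcIP', 'VMName']
```

When reading the check output, a column whose before and after values are both nan can therefore be treated as empty rather than as successfully obfuscated.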
import tabulate
print(tabulate.tabulate(netflow_df.head(3), tablefmt="rst", showindex=False, headers="keys"))
==================================== ======================= ======================= ===================== =============== ============= ================================== ======= ======== ============ =============
TenantId                             TimeGenerated           FlowStartTime           ResourceGroup         VMName          VMIPAddress   PublicIPs                          SrcIP   DestIP   L4Protocol   AllExtIPs
==================================== ======================= ======================= ===================== =============== ============= ================================== ======= ======== ============ =============
52b1ab41-869e-4138-9e40-2a4457f09bf0 2019-02-12 14:22:40.697 2019-02-12 13:00:07.000 asihuntomsworkspacerg msticalertswin1 10.0.3.5      ['65.55.44.109']                   nan     nan      T            65.55.44.109
52b1ab41-869e-4138-9e40-2a4457f09bf0 2019-02-12 14:22:40.681 2019-02-12 13:00:48.000 asihuntomsworkspacerg msticalertswin1 10.0.3.5      ['13.71.172.130', '13.71.172.128'] nan     nan      T            13.71.172.128
52b1ab41-869e-4138-9e40-2a4457f09bf0 2019-02-12 14:22:40.681 2019-02-12 13:00:48.000 asihuntomsworkspacerg msticalertswin1 10.0.3.5      ['13.71.172.130', '13.71.172.128'] nan     nan      T            13.71.172.130
==================================== ======================= ======================= ===================== =============== ============= ================================== ======= ======== ============ =============