To find patterns and outliers in CSVs and event data, Graphistry provides the hypergraph transform.
As an example, this notebook examines different malware files reported to a security vendor. It reveals phenomena such as:
import json
import pandas as pd
import graphistry as g
# g.register(key='...')
df = pd.read_csv('barncat.1k.csv', encoding='utf8')
print("# samples", len(df))
# Parse the first row's JSON payload; json.loads avoids calling eval on untrusted data
json.loads(df['value'].iloc[0])
('# samples', 999)
{'Campaign': 'TRANSFORMICE', 'Date': '2015-11-19 14:04:23', 'Domain': 'spynet1.ddns.net', 'InstallDir': 'TEMP', 'InstallFlag': 'True', 'InstallName': 'svchost.exe', 'NetworkSeparator': "|'|'|", 'Origin': 'vt', 'Port': '1177', 'RegistryValue': 'ba4c12bee3027d94da5c81db2d196bfd', 'Version': '0.6.4', 'compile_date': '2015-11-18 21:25:59', 'imphash': 'f34d5f2d4577ed6d9ceec516c1f5a744', 'magic': 'PE32 executable for MS Windows (GUI) Intel 80386 32-bit Mono/.Net assembly', 'md5': '007a8403b3281fd4d48c69f4c96da0b8', 'rat_name': 'njRat', 'section_.RELOC': '7905c1aa858eb5484ad08a2e10b7e50e', 'section_.RSRC': '5b346ed223699f15252c1fdad182859f', 'section_.TEXT': 'f414cace41511d02fb8e278cf36fd2a3', 'sha1': 'd215edec90c5487800d961cc1ac2808e221818fa', 'sha256': '2beb53ca652d9d4f73516ce45365ae824370d2408d6b0d5a809cf3cd177ba694'}
# Avoid double counting: keep only rows whose 'value' holds a JSON object
df3 = df[df['value'].str.contains("{", regex=False)]
df3[:1]
| | uuid | event_id | category | type | value | to_ids | date |
|---|---|---|---|---|---|---|---|
| 0 | 56e1af55-22f4-4b76-881a-50feac1f3af3 | 417 | External analysis | comment | {"InstallFlag": "True", "RegistryValue": "ba4c... | 0 | 20160310 |
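# Unpack the 'value' JSON into columns placed beside the remaining metadata
import json
# axis=1 joins column-wise, so each row keeps its metadata next to its parsed fields
df4 = pd.concat([df3.drop('value', axis=1), df3['value'].apply(json.loads).apply(pd.Series)], axis=1)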
len(df4)
df4[:1]
| | ActivateKeylogger | ActiveXKey | ActiveXStartup | BackupDNSServer | BypassUAC | Campaign | ChangeCreationDate | ClearAccessControl | ClearZoneIdentifier | ConnectDelay | ... | section_.TEXT | section_.TLS | section_BSS | section_CODE | section_DATA | sha1 | sha256 | to_ids | type | uuid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | comment | 56e1af55-22f4-4b76-881a-50feac1f3af3 |
1 rows × 116 columns
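As a minimal sketch of the unpacking step on made-up data, the following parses a JSON column and joins the parsed fields back onto the remaining columns. The two toy rows, the `toy` and `unpacked` names, and values such as 'example.test' are invented for illustration and are not taken from the dataset. The sketch assumes a column-wise join (axis=1) is what is wanted, so that each sample stays on a single row.

```python
import json
import pandas as pd

# Toy stand-in for df3: a metadata column plus a JSON-encoded 'value' column.
# All rows here are invented for illustration.
toy = pd.DataFrame({
    'uuid': ['a-1', 'b-2'],
    'value': ['{"rat_name": "njRat", "Port": "1177"}',
              '{"rat_name": "other", "Domain": "example.test"}'],
})

# Parse each JSON string into a dict, then expand the dicts into columns.
# Keys missing from a row become NaN in that row.
parsed = pd.DataFrame(toy['value'].apply(json.loads).tolist(), index=toy.index)

# axis=1 places the parsed columns beside the metadata, one row per sample.
unpacked = pd.concat([toy.drop('value', axis=1), parsed], axis=1)
print(unpacked.shape)
```

With a row-wise join (the default axis=0), the result would instead have twice as many rows, half holding only metadata and half holding only parsed fields.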
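The hypergraph transform creates:
A node for every row
A node for every unique value in a column
An edge connecting each row to its values
When multiple rows share values, they cluster together around the shared value nodes. When a row has values that no other row has, those values form a ring around that row's node alone.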
g.hypergraph(df4[:50])['graph'].plot()
('# links', 200) ('# event entities', 50) ('# attrib entities', 102)
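The clustering behavior described above can be illustrated with a small plain-Python sketch. This is not Graphistry's implementation: the function name `toy_hypergraph`, the `event::` and `column::value` ID scheme, and the sample rows are all assumptions made for illustration.

```python
def toy_hypergraph(rows):
    """Build row nodes, value nodes, and row-to-value edges from a list of dicts."""
    event_nodes = set()
    attrib_nodes = set()
    edges = []
    for i, row in enumerate(rows):
        event_id = f"event::{i}"          # hypothetical ID scheme for row nodes
        event_nodes.add(event_id)
        for col, val in row.items():
            if val is None:               # missing values create no node or edge
                continue
            attrib_id = f"{col}::{val}"   # one node per unique (column, value) pair
            attrib_nodes.add(attrib_id)
            edges.append((event_id, attrib_id))
    return event_nodes, attrib_nodes, edges


# Invented sample rows: the first two share a family and a port.
rows = [
    {'rat_name': 'njRat', 'Port': '1177', 'md5': 'aaa'},
    {'rat_name': 'njRat', 'Port': '1177', 'md5': 'bbb'},
    {'rat_name': 'other', 'Port': None, 'md5': 'ccc'},
]
events, attribs, edges = toy_hypergraph(rows)
print(len(edges), len(events), len(attribs))  # prints: 8 3 6
```

The first two rows both link to `rat_name::njRat` and `Port::1177`, so a layout would pull them together, while each row's md5 node connects to that row only.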
We clean up the visualization in a few ways:
Categorize the hash columns as one family. This simplifies coloring by the generated 'category' field. When columns in the same category share a value, such as two columns holding the same md5, only one node is created per hash rather than one per column.
Skip attributes that are not useful as nodes, such as numbers and dates.
Running help(g.hypergraph) reveals more options.
g.hypergraph(
    df4,
    opts={
        'CATEGORIES': {
            'hash': ['sha1', 'sha256', 'md5'],
            'section': [x for x in df4.columns if 'section_' in x]
        },
        'SKIP': ['event_id', 'InstallFlag', 'type', 'val', 'Date', 'date', 'Port', 'FTPPort', 'Origin', 'category', 'comment', 'to_ids']
    })['graph'].plot()
('# links', 2350) ('# event entities', 204) ('# attrib entities', 1156)
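To see why the CATEGORIES and SKIP options reduce the node count, here is a plain-Python sketch of the idea. It is an illustration under stated assumptions, not Graphistry's code: the helper names `attrib_id` and `toy_attrib_nodes`, the `category::value` naming, the `parent_md5` column, and the sample rows are all hypothetical.

```python
def attrib_id(col, val, categories):
    """Name a value node by its category if the column belongs to one, else by its column."""
    for category, cols in categories.items():
        if col in cols:
            return f"{category}::{val}"
    return f"{col}::{val}"


def toy_attrib_nodes(rows, categories=None, skip=()):
    """Collect the value nodes for a list of dict rows, honoring categories and skips."""
    categories = categories or {}
    nodes = set()
    for row in rows:
        for col, val in row.items():
            if col in skip or val is None:   # skipped columns produce no nodes
                continue
            nodes.add(attrib_id(col, val, categories))
    return nodes


# Invented rows; 'parent_md5' is a hypothetical second column holding md5 values.
rows = [
    {'md5': 'abc', 'parent_md5': 'abc', 'Port': '1177'},
    {'md5': 'def', 'parent_md5': 'abc', 'Port': '1177'},
]
plain = toy_attrib_nodes(rows)
merged = toy_attrib_nodes(rows, categories={'hash': ['md5', 'parent_md5']}, skip=['Port'])
print(len(plain), len(merged))  # prints: 4 2
```

Without options, the value 'abc' yields one node per column; once both columns belong to the 'hash' category, it yields a single `hash::abc` node, and the skipped 'Port' column contributes nothing.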