Finding Correlations in a CSV of Malware Events via Hypergraph Views

To find patterns and outliers in CSVs and event data, Graphistry provides the hypergraph transform.

As an example, this notebook examines different malware files reported to a security vendor. It reveals phenomena such as:

  • The malware files cluster into several families
  • The nodes central to a cluster reveal attributes specific to a strain of malware
  • The nodes on the border of a cluster reveal attributes that appear throughout a strain but whose values are unique to each instance in that strain
  • Several families have attributes connecting them, suggesting they may share authors

Load CSV

In [1]:
import pandas as pd
import graphistry as g
#g.register(key='...')
In [15]:
import ast
df = pd.read_csv('barncat.1k.csv', encoding="utf8")
print("# samples", len(df))
# Parse the first record's 'value' string as a literal instead of eval()-ing feed data
ast.literal_eval(df['value'].iloc[0])
('# samples', 999)
Out[15]:
{'Campaign': 'TRANSFORMICE',
 'Date': '2015-11-19 14:04:23',
 'Domain': 'spynet1.ddns.net',
 'InstallDir': 'TEMP',
 'InstallFlag': 'True',
 'InstallName': 'svchost.exe',
 'NetworkSeparator': "|'|'|",
 'Origin': 'vt',
 'Port': '1177',
 'RegistryValue': 'ba4c12bee3027d94da5c81db2d196bfd',
 'Version': '0.6.4',
 'compile_date': '2015-11-18 21:25:59',
 'imphash': 'f34d5f2d4577ed6d9ceec516c1f5a744',
 'magic': 'PE32 executable for MS Windows (GUI) Intel 80386 32-bit Mono/.Net assembly',
 'md5': '007a8403b3281fd4d48c69f4c96da0b8',
 'rat_name': 'njRat',
 'section_.RELOC': '7905c1aa858eb5484ad08a2e10b7e50e',
 'section_.RSRC': '5b346ed223699f15252c1fdad182859f',
 'section_.TEXT': 'f414cace41511d02fb8e278cf36fd2a3',
 'sha1': 'd215edec90c5487800d961cc1ac2808e221818fa',
 'sha256': '2beb53ca652d9d4f73516ce45365ae824370d2408d6b0d5a809cf3cd177ba694'}
In [16]:
# Keep only the rows whose 'value' holds a JSON record (avoids double counting)
df3 = df[df['value'].str.contains("{", regex=False)]
df3[:1]
Out[16]:
   uuid                                  event_id  category           type     value                                              to_ids  date
0  56e1af55-22f4-4b76-881a-50feac1f3af3  417       External analysis  comment  {"InstallFlag": "True", "RegistryValue": "ba4c...  0       20160310
In [17]:
#Unpack 'value' json
import json
df4 = pd.concat([df3.drop('value', axis=1), df3.value.apply(json.loads).apply(pd.Series)])
len(df4)
df4[:1]
Out[17]:
ActivateKeylogger ActiveXKey ActiveXStartup BackupDNSServer BypassUAC Campaign ChangeCreationDate ClearAccessControl ClearZoneIdentifier ConnectDelay ... section_.TEXT section_.TLS section_BSS section_CODE section_DATA sha1 sha256 to_ids type uuid
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN 0.0 comment 56e1af55-22f4-4b76-881a-50feac1f3af3

1 rows × 116 columns
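
The row shown above has NaN in every unpacked column, which is what pd.concat produces when it stacks its inputs as rows (its default, axis=0): the metadata and the parsed JSON fields end up in separate rows. For comparison, here is a minimal sketch of the one-record-per-event shape, using plain dicts and made-up records rather than the notebook's DataFrame. The `unpack` helper and the sample records are hypothetical.

```python
import json

# Made-up records mimicking the CSV layout: metadata plus a JSON string in 'value'
records = [
    {"uuid": "u1", "event_id": 417, "value": '{"rat_name": "njRat", "Port": "1177"}'},
    {"uuid": "u2", "event_id": 418, "value": '{"rat_name": "njRat", "Port": "1604"}'},
]

def unpack(record, field="value"):
    """Return one flat dict: the metadata fields plus the keys parsed from `field`."""
    meta = {k: v for k, v in record.items() if k != field}
    return {**meta, **json.loads(record[field])}

# One flat record per event, with metadata and unpacked keys side by side
flat = [unpack(r) for r in records]
print(flat[0])
```

In pandas, passing axis=1 to pd.concat joins two frames side by side on their shared index, which gives this shape. The outputs recorded in this notebook come from the row-wise call as written.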

Default Hypergraph Transform

The hypergraph transform creates:

  • A node for every row
  • A node for every unique value in a column (so a value found in several columns yields several nodes)
  • An edge connecting each row to its values

Rows that share values cluster together. Values that are unique to one row form a ring around that row's node.
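
The three rules above can be sketched in plain Python. This is a toy illustration with made-up rows, not Graphistry's implementation. `hypergraph_sketch` and the `("event", i)` node ids are hypothetical names.

```python
def hypergraph_sketch(rows):
    """Toy transform: `rows` is a list of dicts mapping column -> value."""
    nodes = set()
    edges = []
    for i, row in enumerate(rows):
        event = ("event", i)          # one node per row
        nodes.add(event)
        for col, val in row.items():
            if val is None:           # missing values produce no node
                continue
            attrib = (col, val)       # one node per unique (column, value) pair
            nodes.add(attrib)
            edges.append((event, attrib))  # edge from the row to its value
    return nodes, edges

# Made-up sample rows: the first two share rat_name and Port, so they cluster
rows = [
    {"rat_name": "njRat", "Port": "1177", "md5": "aaa"},
    {"rat_name": "njRat", "Port": "1177", "md5": "bbb"},
    {"rat_name": "otherRat", "Port": "1604", "md5": "ccc"},
]
nodes, edges = hypergraph_sketch(rows)
print(len(nodes), len(edges))
```

Here the shared `("rat_name", "njRat")` and `("Port", "1177")` nodes link the first two rows, while each `md5` node hangs off a single row, which is the ring effect described above.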

In [5]:
g.hypergraph(df4[:50])['graph'].plot()
('# links', 200)
('# event entities', 50)
('# attrib entities', 102)
Out[5]:

Configured Hypergraph Transform

We clean up the visualization in two ways:

  1. Group hash columns into shared categories. This simplifies coloring by the generated 'category' field. When columns in the same category hold the same value, such as two columns that both contain md5 values, the transform also creates only one node per hash instead of one per column.

  2. Skip attributes that are not useful as nodes, such as numbers and dates.

Running help(g.hypergraph) reveals more options.
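
To make the effect of the two options concrete, here is a toy sketch in plain Python. It assumes the behavior described above (categorized columns share one node namespace, skipped columns yield no nodes) and is not Graphistry's implementation. The function name, its parameters, and the sample rows are hypothetical.

```python
def hypergraph_sketch(rows, categories=None, skip=()):
    """Toy transform with category merging and column skipping."""
    # Map each categorized column to its category; other columns keep their own name
    col_to_cat = {}
    for cat, cols in (categories or {}).items():
        for col in cols:
            col_to_cat[col] = cat

    nodes, edges = set(), []
    for i, row in enumerate(rows):
        event = ("event", i)
        nodes.add(event)
        for col, val in row.items():
            if col in skip or val is None:
                continue
            # Values from columns in the same category collapse into one node
            attrib = (col_to_cat.get(col, col), val)
            nodes.add(attrib)
            edges.append((event, attrib))
    return nodes, edges

# Made-up rows: the value "h1" appears under md5 in one row and imphash in the other
rows = [
    {"md5": "h1", "imphash": "h9", "Port": "1177"},
    {"md5": "h2", "imphash": "h1", "Port": "1177"},
]
default_nodes, default_edges = hypergraph_sketch(rows)
nodes, edges = hypergraph_sketch(
    rows, categories={"hash": ["md5", "imphash"]}, skip=["Port"])
print(len(default_nodes), len(nodes))
```

With the options applied, both rows connect to the single `("hash", "h1")` node, and the `Port` values no longer appear as nodes.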

In [14]:
g.hypergraph(
    df4,
    opts={
        'CATEGORIES': {
            'hash': ['sha1', 'sha256', 'md5'],
            'section': [x for x in df4.columns if 'section_' in x]
        },
        'SKIP': ['event_id', 'InstallFlag', 'type', 'val', 'Date', 'date', 'Port', 'FTPPort', 'Origin', 'category', 'comment', 'to_ids']
    })['graph'].plot()
('# links', 2350)
('# event entities', 204)
('# attrib entities', 1156)
Out[14]: