#!/usr/bin/env python
# coding: utf-8

# # Finding Correlations in a CSV of Malware Events via Hypergraph Views
# 
# To find patterns and outliers in CSVs and event data, Graphistry provides the hypergraph transform. 
# 
# As an example, this notebook examines different malware files reported to a security vendor. It reveals phenomena such as:
# 
# * The malware files cluster into several families
# * The nodes central to a cluster reveal attributes specific to a strain of malware
# * The nodes bordering a cluster reveal attributes that recur within a strain but take a different value in each instance of that strain
# * Several families have attributes connecting them, suggesting shared authorship

# ## Load CSV

# In[1]:


import pandas as pd
import graphistry as g
# g.register(key='...')


# In[15]:


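import ast

df = pd.read_csv('barncat.1k.csv', encoding="utf8")
print("# samples", len(df))
# Peek at the first 'value' entry; literal_eval parses literals without executing arbitrary code
ast.literal_eval(df['value'].iloc[0])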


# In[16]:


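# Keep only rows whose 'value' contains a JSON object, to avoid double counting
# regex=False matches "{" literally; na=False drops rows with a missing 'value'
df3 = df[df['value'].str.contains("{", regex=False, na=False)]
df3[:1]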


# In[17]:


# Unpack the 'value' JSON into one column per key, next to the remaining columns
import json
df4 = pd.concat([df3.drop('value', axis=1), df3.value.apply(json.loads).apply(pd.Series)], axis=1)
print(len(df4))
df4[:1]
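

# In[ ]:


# A minimal sketch of the same unpacking pattern on made-up data, so the
# reshaping can be checked without the CSV. The frame `toy_events`, its column
# names, and its JSON keys are hypothetical. `pd.json_normalize` is used here as
# one possible alternative to `.apply(pd.Series)`; note that it also flattens
# nested objects into dotted column names.
import json
import pandas as pd

toy_events = pd.DataFrame({
    'event_id': [1, 2],
    'value': ['{"md5": "aa", "Port": 80}', '{"md5": "bb", "Mutex": "m1"}'],
})

# Parse each JSON string into a dict, then expand the dicts into columns
toy_unpacked = pd.json_normalize(toy_events['value'].map(json.loads).tolist())
# Reuse the original index so the join below lines up row by row
toy_unpacked.index = toy_events.index

# Keys missing from a row's JSON become NaN in that row
toy_wide = toy_events.drop(columns='value').join(toy_unpacked)
print(toy_wide)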


# ## Default Hypergraph Transform
# 
# The hypergraph transform creates:
# * A node for every row
# * A node for every unique value in a column (so a value found in several columns yields one node per column)
# * An edge connecting each row to its values
# 
# When multiple rows share values, they cluster together. When a row has unique values, those values form a ring around only that row's node.
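# 
# The rules above can be approximated in plain pandas. This is a rough sketch for intuition only, not Graphistry's implementation: the helper `sketch_hypergraph_edges`, the frame `toy`, and the `row::` / `column::value` node naming are all hypothetical.

# In[ ]:


import pandas as pd


def sketch_hypergraph_edges(frame):
    """Return one (row node, value node) edge per non-null cell."""
    cells = (
        frame.reset_index(drop=True)
        .rename_axis('row')
        .reset_index()
        .melt(id_vars='row', var_name='column', value_name='value')
        .dropna(subset=['value'])
    )
    # One node per row
    cells['row_node'] = 'row::' + cells['row'].astype(str)
    # One node per unique (column, value) pair
    cells['value_node'] = cells['column'] + '::' + cells['value'].astype(str)
    return cells[['row_node', 'value_node']].reset_index(drop=True)


toy = pd.DataFrame({
    'md5': ['aa', 'bb', 'cc'],
    'family': ['x', 'x', 'y'],
})
toy_edges = sketch_hypergraph_edges(toy)
print(toy_edges)

# Value nodes touched by several rows pull those rows into a cluster;
# value nodes touched by a single row end up in the ring around that row
print(toy_edges['value_node'].value_counts())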

# In[5]:


g.hypergraph(df4[:50])['graph'].plot()


# ## Configured Hypergraph Transform
# We clean up the visualization in two ways:
# 
# 1. Group the hash columns into a single category, and likewise the `section_*` columns. This simplifies coloring by the generated 'category' field. If columns share a value, such as two columns that both hold md5 values, it also creates only one node per hash instead of one per column.
# 
# 2. Skip attributes that are not useful as nodes, such as numbers and dates.
# 
# Running `help(g.hypergraph)` reveals more options.
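# 
# To make the effect of the two options concrete, the sketch below counts value nodes with and without them. It mimics the behavior described above under simplifying assumptions and is not Graphistry's code: the helper `sketch_value_nodes`, the frame `toy_hashes`, and the `parent_md5` column are hypothetical.

# In[ ]:


import pandas as pd


def sketch_value_nodes(frame, categories=None, skip=None):
    """Return the set of value-node ids for a frame."""
    categories = categories or {}
    skip = set(skip or [])
    # Map each categorized column to its category; other columns keep their own name
    column_to_category = {
        col: cat for cat, cols in categories.items() for col in cols
    }
    nodes = set()
    for col in frame.columns:
        if col in skip:
            continue  # skipped columns produce no nodes
        prefix = column_to_category.get(col, col)
        for value in frame[col].dropna().unique():
            nodes.add(f'{prefix}::{value}')
    return nodes


toy_hashes = pd.DataFrame({
    'md5': ['aa', 'bb'],
    'parent_md5': ['bb', 'cc'],
    'Port': [80, 443],
})

plain = sketch_value_nodes(toy_hashes)
configured = sketch_value_nodes(
    toy_hashes,
    categories={'hash': ['md5', 'parent_md5']},
    skip=['Port'],
)
# 'bb' appears in both hash columns: two nodes by default, one once categorized
print(len(plain), sorted(plain))
print(len(configured), sorted(configured))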

# In[14]:


g.hypergraph(
    df4,
    opts={
        'CATEGORIES': {
            'hash': ['sha1', 'sha256', 'md5'],
            'section': [x for x in df4.columns if 'section_' in x]
        },
        'SKIP': ['event_id', 'InstallFlag', 'type', 'val', 'Date', 'date', 'Port', 'FTPPort', 'Origin', 'category', 'comment', 'to_ids']
    })['graph'].plot()

