MITRE ATT&CK PYTHON CLIENT: Data Sources¶

Goals:¶

Access ATT&CK data sources in STIX format via a public TAXII server
Learn to interact with ATT&CK data all at once
Explore and idenfity patterns in the data retrieved
Learn more about ATT&CK data sources

1. ATT&CK Python Client Installation¶

You can install it via PIP: pip install attackcti

2. Import ATT&CK API Client¶

In [1]:

from attackcti import attack_client

3. Import Extra Libraries¶

In [2]:

from pandas import *
import numpy as np

import altair as alt
alt.renderers.enable('notebook')

import itertools

4. Initialize ATT&CK Client Class¶

In [3]:

lift = attack_client()

5. Getting Information About Techniques¶

Getting ALL ATT&CK Techniques

In [4]:

all_techniques = lift.get_techniques(stix_format=False)

Showing the first technique in our list

In [5]:

all_techniques[0]

Out[5]:

{'external_references': [{'source_name': 'mitre-attack',
   'external_id': 'T1059.008',
   'url': 'https://attack.mitre.org/techniques/T1059/008'},
  {'source_name': 'Cisco Synful Knock Evolution',
   'url': 'https://blogs.cisco.com/security/evolution-of-attacks-on-cisco-ios-devices',
   'description': 'Graham Holmes. (2015, October 8). Evolution of attacks on Cisco IOS devices. Retrieved October 19, 2020.'},
  {'source_name': 'Cisco IOS Software Integrity Assurance - Command History',
   'url': 'https://tools.cisco.com/security/center/resources/integrity_assurance.html#23',
   'description': 'Cisco. (n.d.). Cisco IOS Software Integrity Assurance - Command History. Retrieved October 21, 2020.'}],
 'kill_chain_phases': [{'kill_chain_name': 'mitre-attack',
   'phase_name': 'execution'}],
 'x_mitre_is_subtechnique': True,
 'x_mitre_version': '1.0',
 'id': 'attack-pattern--818302b2-d640-477b-bf88-873120ce85c4',
 'technique_description': 'Adversaries may abuse scripting or built-in command line interpreters (CLI) on network devices to execute malicious command and payloads. The CLI is the primary means through which users and administrators interact with the device in order to view system information, modify device operations, or perform diagnostic and administrative functions. CLIs typically contain various permission levels required for different commands. \n\nScripting interpreters automate tasks and extend functionality beyond the command set included in the network OS. The CLI and scripting interpreter are accessible through a direct console connection, or through remote means, such as telnet or secure shell (SSH).\n\nAdversaries can use the network CLI to change how network devices behave and operate. The CLI may be used to manipulate traffic flows to intercept or manipulate data, modify startup configuration parameters to load malicious system software, or to disable security features or logging to avoid detection. (Citation: Cisco Synful Knock Evolution)',
 'technique': 'Network Device CLI',
 'created_by_ref': 'identity--c78cb6e5-0c4b-4611-8297-d1b8b55e40b5',
 'object_marking_refs': ['marking-definition--fa42a846-8d90-4e51-bc29-71d5b4802168'],
 'url': 'https://attack.mitre.org/techniques/T1059/008',
 'matrix': 'mitre-attack',
 'technique_id': 'T1059.008',
 'type': 'attack-pattern',
 'tactic': ['execution'],
 'modified': '2020-10-22T16:43:38.388Z',
 'created': '2020-10-20T00:09:33.072Z',
 'data_sources': ['Network device logs',
  'Network device run-time memory',
  'Network device command history',
  'Network device configuration'],
 'platform': ['Network'],
 'technique_detection': 'Consider reviewing command history in either the console or as part of the running memory to determine if unauthorized or suspicious commands were used to modify device configuration.(Citation: Cisco IOS Software Integrity Assurance - Command History)\n\nConsider comparing a copy of the network device configuration against a known-good version to discover unauthorized changes to the command interpreter. The same process can be accomplished through a comparison of the run-time memory, though this is non-trivial and may require assistance from the vendor.',
 'permissions_required': ['Administrator', 'User']}

Normalizing semi-structured JSON data into a flat table via pandas.io.json.json_normalize

Reference: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.json.json_normalize.html

In [6]:

techniques_normalized = pandas.json_normalize(all_techniques)

In [7]:

techniques_normalized[0:1]

Out[7]:

	external_references	kill_chain_phases	x_mitre_is_subtechnique	x_mitre_version	id	technique_description	technique	created_by_ref	object_marking_refs	url	...	remote_support	impact_type	revoked	x_mitre_deprecated	x_mitre_old_attack_id	difficulty_explanation	difficulty_for_adversary	detectable_explanation	detectable_by_common_defenses	tactic_type
0	[{'source_name': 'mitre-attack', 'external_id'...	[{'kill_chain_name': 'mitre-attack', 'phase_na...	True	1.0	attack-pattern--818302b2-d640-477b-bf88-873120...	Adversaries may abuse scripting or built-in co...	Network Device CLI	identity--c78cb6e5-0c4b-4611-8297-d1b8b55e40b5	[marking-definition--fa42a846-8d90-4e51-bc29-7...	https://attack.mitre.org/techniques/T1059/008	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

1 rows × 37 columns

6. Re-indexing Dataframe¶

In [8]:

techniques = techniques_normalized.reindex(['matrix','platform','tactic','technique','technique_id','data_sources'], axis=1)

In [9]:

techniques.head()

Out[9]:

	matrix	platform	tactic	technique	technique_id	data_sources
0	mitre-attack	[Network]	[execution]	Network Device CLI	T1059.008	[Network device logs, Network device run-time ...
1	mitre-attack	[Network]	[collection]	Network Device Configuration Dump	T1602.002	[Netflow/Enclave netflow, Network protocol ana...
2	mitre-attack	[Network]	[defense-evasion, persistence]	TFTP Boot	T1542.005	[Network device run-time memory, Network devic...
3	mitre-attack	[Network]	[defense-evasion, persistence]	ROMMONkit	T1542.004	[File monitoring, Netflow/Enclave netflow, Net...
4	mitre-attack	[Network]	[collection]	SNMP (MIB Dump)	T1602.001	[Netflow/Enclave netflow, Network protocol ana...

In [10]:

print('A total of ',len(techniques),' techniques')

A total of  1024  techniques

7. Removing Revoked Techniques¶

In [11]:

all_techniques_no_revoked = lift.remove_revoked(all_techniques)

In [12]:

print('A total of ',len(all_techniques_no_revoked),' techniques')

A total of  878  techniques

8. Extractinng Revoked Techniques¶

In [13]:

all_techniques_revoked = lift.extract_revoked(all_techniques)

In [14]:

print('A total of ',len(all_techniques_revoked),' techniques that have been revoked')

A total of  146  techniques that have been revoked

The revoked techniques are the following ones:

In [15]:

for t in all_techniques_revoked:
    print(t['technique'])

Web Session Cookie
Emond
Cloud Instance Metadata API
Revert Cloud Instance
Application Access Token
Elevated Execution with Prompt
Credentials from Web Browsers
PowerShell Profile
Parent PID Spoofing
Compile After Delivery
Systemd Service
Runtime Data Manipulation
Transmitted Data Manipulation
Stored Data Manipulation
Disk Content Wipe
Disk Structure Wipe
Domain Generation Algorithms
Compiled HTML File
Kernel Modules and Extensions
Spearphishing Link
CMSTP
Credentials in Registry
Control Panel Items
Kerberoasting
Spearphishing Attachment
SIP and Trust Provider Hijacking
Spearphishing via Service
Sudo Caching
Time Providers
AppCert DLLs
Dynamic Data Exchange
Multi-hop Proxy
Process Doppelgänging
Extra Window Memory Injection
Domain Fronting
Mshta
Hooking
Image File Execution Options Injection
LSASS Driver
Screensaver
LLMNR/NBT-NS Poisoning and Relay
Password Filter DLL
SSH Hijacking
SID-History Injection
Gatekeeper Bypass
HISTCONTROL
LC_LOAD_DYLIB Addition
Launchctl
Local Job Scheduling
Private Keys
Rc.common
Space after Filename
Application Shimming
AppleScript
Bash History
.bash_profile and .bashrc
Clear Command History
Dylib Hijacking
Hidden Window
Launch Daemon
Hidden Users
Input Prompt
Launch Agent
Login Item
Keychain
Plist Modification
Re-opened Applications
Setuid and Setgid
Hidden Files and Directories
Startup Items
Sudo
Securityd Memory
Trap
Authentication Package
Install Root Certificate
Netsh Helper DLL
Network Share Connection Removal
Component Object Model Hijacking
Regsvcs/Regasm
InstallUtil
Regsvr32
Code Signing
Component Firmware
File Deletion
AppInit DLLs
Security Support Provider
Web Shell
Timestomp
Pass the Ticket
NTFS File Attributes
Custom Command and Control Protocol
Process Hollowing
Disabling Security Tools
Bypass User Account Control
PowerShell
Rundll32
Windows Management Instrumentation Event Subscription
Credentials in Files
Multilayer Encryption
Windows Admin Shares
Remote Desktop Protocol
Pass the Hash
DLL Side-Loading
Bootkit
Indicator Removal from Tools
Uncommonly Used Port
Security Software Discovery
Registry Run Keys / Startup Folder
Service Registry Permissions Weakness
Indicator Blocking
New Service
Software Packing
File System Permissions Weakness
Change Default File Association
DLL Search Order Hijacking
Service Execution
Standard Cryptographic Protocol
Modify Existing Service
Windows Remote Management
Custom Cryptographic Protocol
Shortcut Modification
Data Encrypted
System Firmware
Application Deployment Software
Accessibility Features
Port Monitors
Binary Padding
Winlogon Helper DLL
Data Compressed
Remotely Install Application
Insecure Third-Party Libraries
Fake Developer Accounts
Device Type Discovery
Detect App Analysis Environment
Malicious Software Development Tools
Biometric Spoofing
Device Unlock Code Guessing or Brute Force
Malicious Media Content
URL Scheme Hijacking
Abuse of iOS Enterprise App Signing Key
App Delivered via Web Download
App Delivered via Email Attachment
Malicious or Vulnerable Built-in Device Functionality
Malicious SMS Message
Exploit Baseband Vulnerability
Stolen Developer Credentials or Signing Keys

9. Updating our Dataframe¶

In [16]:

techniques_normalized = pandas.json_normalize(all_techniques_no_revoked)
techniques = techniques_normalized.reindex(['matrix','platform','tactic','technique','technique_id','data_sources'], axis=1)

10. Techniques Per Matrix¶

Using altair python library we can start showing a few charts stacking the number of techniques with or without data sources. Reference: https://altair-viz.github.io/

In [17]:

data = techniques
data_2 = data.groupby(['matrix'])['technique'].count()
data_3 = data_2.to_frame().reset_index()
data_3

Out[17]:

	matrix	technique
0	mitre-attack	536
1	mitre-ics-attack	81
2	mitre-mobile-attack	87
3	mitre-pre-attack	174

In [18]:

alt.Chart(data_3).mark_bar().encode(x='technique', y='matrix', color='matrix').properties(height = 200)

Out[18]:

11. Techniques With and Without Data Sources¶

In [19]:

data_source_distribution = pandas.DataFrame({
    'Techniques': ['Without DS','With DS'],
    'Count of Techniques': [techniques['data_sources'].isna().sum(),techniques['data_sources'].notna().sum()]})
bars = alt.Chart(data_source_distribution).mark_bar().encode(x='Techniques',y='Count of Techniques',color='Techniques').properties(width=200,height=300)
text = bars.mark_text(align='center',baseline='middle',dx=0,dy=-5).encode(text='Count of Techniques')
bars + text

Out[19]:

What is the distribution of techniques based on ATT&CK Matrix?

In [20]:

data = techniques
data['Count_DS'] = data['data_sources'].str.len()
data['Ind_DS'] = np.where(data['Count_DS']>0,'With DS','Without DS')
data_2 = data.groupby(['matrix','Ind_DS'])['technique'].count()
data_3 = data_2.to_frame().reset_index()
data_3

Out[20]:

	matrix	Ind_DS	technique
0	mitre-attack	With DS	474
1	mitre-attack	Without DS	62
2	mitre-ics-attack	With DS	67
3	mitre-ics-attack	Without DS	14
4	mitre-mobile-attack	Without DS	87
5	mitre-pre-attack	Without DS	174

In [21]:

alt.Chart(data_3).mark_bar().encode(x='technique', y='Ind_DS', color='matrix').properties(height = 200)

Out[21]:

What are those mitre-attack techniques without data sources?

In [22]:

data[(data['matrix']=='mitre-attack') & (data['Ind_DS']=='Without DS')]

Out[22]:

	matrix	platform	tactic	technique	technique_id	data_sources	Count_DS	Ind_DS
17	mitre-attack	[PRE]	[resource-development]	Vulnerabilities	T1588.006	NaN	NaN	Without DS
23	mitre-attack	[PRE]	[reconnaissance]	Spearphishing Service	T1598.001	NaN	NaN	Without DS
25	mitre-attack	[PRE]	[reconnaissance]	Purchase Technical Data	T1597.002	NaN	NaN	Without DS
26	mitre-attack	[PRE]	[reconnaissance]	Threat Intel Vendors	T1597.001	NaN	NaN	Without DS
27	mitre-attack	[PRE]	[reconnaissance]	Search Closed Sources	T1597	NaN	NaN	Without DS
...	...	...	...	...	...	...	...	...
90	mitre-attack	[PRE]	[resource-development]	Compromise Infrastructure	T1584	NaN	NaN	Without DS
92	mitre-attack	[PRE]	[resource-development]	Acquire Infrastructure	T1583	NaN	NaN	Without DS
220	mitre-attack	[Linux, macOS, Windows]	[collection]	Archive via Custom Method	T1560.003	NaN	NaN	Without DS
260	mitre-attack	[Linux]	[credential-access]	/etc/passwd and /etc/shadow	T1003.008	NaN	NaN	Without DS
354	mitre-attack	[Linux, macOS, Windows]	[persistence, privilege-escalation]	Boot or Logon Autostart Execution	T1547	NaN	NaN	Without DS

62 rows × 8 columns

Techniques without data sources¶

In [23]:

techniques_without_data_sources=techniques[techniques.data_sources.isnull()].reset_index(drop=True)

In [24]:

techniques_without_data_sources.head()

Out[24]:

	matrix	platform	tactic	technique	technique_id	data_sources	Count_DS	Ind_DS
0	mitre-attack	[PRE]	[resource-development]	Vulnerabilities	T1588.006	NaN	NaN	Without DS
1	mitre-attack	[PRE]	[reconnaissance]	Spearphishing Service	T1598.001	NaN	NaN	Without DS
2	mitre-attack	[PRE]	[reconnaissance]	Purchase Technical Data	T1597.002	NaN	NaN	Without DS
3	mitre-attack	[PRE]	[reconnaissance]	Threat Intel Vendors	T1597.001	NaN	NaN	Without DS
4	mitre-attack	[PRE]	[reconnaissance]	Search Closed Sources	T1597	NaN	NaN	Without DS

In [25]:

print('There are ',techniques['data_sources'].isna().sum(),' techniques without data sources (',"{0:.0%}".format(techniques['data_sources'].isna().sum()/len(techniques)),' of ',len(techniques),' techniques)')

There are  337  techniques without data sources ( 38%  of  878  techniques)

Techniques With Data Sources¶

In [26]:

techniques_with_data_sources=techniques[techniques.data_sources.notnull()].reset_index(drop=True)

In [27]:

techniques_with_data_sources.head()

Out[27]:

	matrix	platform	tactic	technique	technique_id	data_sources	Count_DS	Ind_DS
0	mitre-attack	[Network]	[execution]	Network Device CLI	T1059.008	[Network device logs, Network device run-time ...	4.0	With DS
1	mitre-attack	[Network]	[collection]	Network Device Configuration Dump	T1602.002	[Netflow/Enclave netflow, Network protocol ana...	3.0	With DS
2	mitre-attack	[Network]	[defense-evasion, persistence]	TFTP Boot	T1542.005	[Network device run-time memory, Network devic...	5.0	With DS
3	mitre-attack	[Network]	[defense-evasion, persistence]	ROMMONkit	T1542.004	[File monitoring, Netflow/Enclave netflow, Net...	4.0	With DS
4	mitre-attack	[Network]	[collection]	SNMP (MIB Dump)	T1602.001	[Netflow/Enclave netflow, Network protocol ana...	3.0	With DS

In [28]:

print('There are ',techniques['data_sources'].notna().sum(),' techniques with data sources (',"{0:.0%}".format(techniques['data_sources'].notna().sum()/len(techniques)),' of ',len(techniques),' techniques)')

There are  541  techniques with data sources ( 62%  of  878  techniques)

12. Grouping Techniques With Data Sources By Matrix¶

Let's create a graph to represent the number of techniques per matrix:

In [29]:

matrix_distribution = pandas.DataFrame({
    'Matrix': list(techniques_with_data_sources.groupby(['matrix'])['matrix'].count().keys()),
    'Count of Techniques': techniques_with_data_sources.groupby(['matrix'])['matrix'].count().tolist()})
bars = alt.Chart(matrix_distribution).mark_bar().encode(y='Matrix',x='Count of Techniques').properties(width=300,height=100)
text = bars.mark_text(align='center',baseline='middle',dx=10,dy=0).encode(text='Count of Techniques')
bars + text

Out[29]:

All the techniques belong to mitre-attack matrix which is the main Enterprise matrix. Reference: https://attack.mitre.org/wiki/Main_Page

13. Grouping Techniques With Data Sources by Platform¶

First, we need to split the platform column values because a technique might be mapped to more than one platform

In [30]:

techniques_platform=techniques_with_data_sources

attributes_1 = ['platform'] # In attributes we are going to indicate the name of the columns that we need to split

for a in attributes_1:
    s = techniques_platform.apply(lambda x: pandas.Series(x[a]),axis=1).stack().reset_index(level=1, drop=True)
    # "s" is going to be a column of a frame with every value of the list inside each cell of the column "a"
    s.name = a
    # We name "s" with the same name of "a".
    techniques_platform=techniques_platform.drop(a, axis=1).join(s).reset_index(drop=True)
    # We drop the column "a" from "techniques_platform", and then join "techniques_platform" with "s"

# Let's re-arrange the columns from general to specific
techniques_platform_2=techniques_platform.reindex(['matrix','platform','tactic','technique','technique_id','data_sources'], axis=1)

We can now show techniques with data sources mapped to one platform at the time

In [31]:

techniques_platform_2.head()

Out[31]:

	matrix	platform	tactic	technique	technique_id	data_sources
0	mitre-attack	Network	[execution]	Network Device CLI	T1059.008	[Network device logs, Network device run-time ...
1	mitre-attack	Network	[collection]	Network Device Configuration Dump	T1602.002	[Netflow/Enclave netflow, Network protocol ana...
2	mitre-attack	Network	[defense-evasion, persistence]	TFTP Boot	T1542.005	[Network device run-time memory, Network devic...
3	mitre-attack	Network	[defense-evasion, persistence]	ROMMONkit	T1542.004	[File monitoring, Netflow/Enclave netflow, Net...
4	mitre-attack	Network	[collection]	SNMP (MIB Dump)	T1602.001	[Netflow/Enclave netflow, Network protocol ana...

Let's create a visualization to show the number of techniques grouped by platform:

In [32]:

platform_distribution = pandas.DataFrame({
    'Platform': list(techniques_platform_2.groupby(['platform'])['platform'].count().keys()),
    'Count of Techniques': techniques_platform_2.groupby(['platform'])['platform'].count().tolist()})
bars = alt.Chart(platform_distribution,height=300).mark_bar().encode(x ='Platform',y='Count of Techniques',color='Platform').properties(width=200)
text = bars.mark_text(align='center',baseline='middle',dx=0,dy=-5).encode(text='Count of Techniques')
bars + text

Out[32]:

In the bar chart above we can see that there are more techniques with data sources mapped to the Windows platform.

14. Grouping Techniques With Data Sources by Tactic¶

Again, first we need to split the tactic column values because a technique might be mapped to more than one tactic:

In [33]:

techniques_tactic=techniques_with_data_sources

attributes_2 = ['tactic'] # In attributes we are going to indicate the name of the columns that we need to split

for a in attributes_2:
    s = techniques_tactic.apply(lambda x: pandas.Series(x[a]),axis=1).stack().reset_index(level=1, drop=True)
    # "s" is going to be a column of a frame with every value of the list inside each cell of the column "a"
    s.name = a
    # We name "s" with the same name of "a".
    techniques_tactic = techniques_tactic.drop(a, axis=1).join(s).reset_index(drop=True)
    # We drop the column "a" from "techniques_tactic", and then join "techniques_tactic" with "s"

# Let's re-arrange the columns from general to specific
techniques_tactic_2=techniques_tactic.reindex(['matrix','platform','tactic','technique','technique_id','data_sources'], axis=1)

We can now show techniques with data sources mapped to one tactic at the time

In [34]:

techniques_tactic_2.head()

Out[34]:

	matrix	platform	tactic	technique	technique_id	data_sources
0	mitre-attack	[Network]	execution	Network Device CLI	T1059.008	[Network device logs, Network device run-time ...
1	mitre-attack	[Network]	collection	Network Device Configuration Dump	T1602.002	[Netflow/Enclave netflow, Network protocol ana...
2	mitre-attack	[Network]	defense-evasion	TFTP Boot	T1542.005	[Network device run-time memory, Network devic...
3	mitre-attack	[Network]	persistence	TFTP Boot	T1542.005	[Network device run-time memory, Network devic...
4	mitre-attack	[Network]	defense-evasion	ROMMONkit	T1542.004	[File monitoring, Netflow/Enclave netflow, Net...

Let's create a visualization to show the number of techniques grouped by tactic:

In [35]:

tactic_distribution = pandas.DataFrame({
    'Tactic': list(techniques_tactic_2.groupby(['tactic'])['tactic'].count().keys()),
    'Count of Techniques': techniques_tactic_2.groupby(['tactic'])['tactic'].count().tolist()}).sort_values(by='Count of Techniques',ascending=True)
bars = alt.Chart(tactic_distribution,width=800,height=300).mark_bar().encode(x ='Tactic',y='Count of Techniques',color='Tactic').properties(width=400)
text = bars.mark_text(align='center',baseline='middle',dx=0,dy=-5).encode(text='Count of Techniques')
bars + text

Out[35]:

Defende-evasion and Persistence are tactics with the highest nummber of techniques with data sources

15. Grouping Techniques With Data Sources by Data Source¶

We need to split the data source column values because a technique might be mapped to more than one data source:

In [36]:

techniques_data_source=techniques_with_data_sources

attributes_3 = ['data_sources'] # In attributes we are going to indicate the name of the columns that we need to split

for a in attributes_3:
    s = techniques_data_source.apply(lambda x: pandas.Series(x[a]),axis=1).stack().reset_index(level=1, drop=True)
    # "s" is going to be a column of a frame with every value of the list inside each cell of the column "a"
    s.name = a
    # We name "s" with the same name of "a".
    techniques_data_source = techniques_data_source.drop(a, axis=1).join(s).reset_index(drop=True)
    # We drop the column "a" from "techniques_data_source", and then join "techniques_data_source" with "s"

# Let's re-arrange the columns from general to specific
techniques_data_source_2 = techniques_data_source.reindex(['matrix','platform','tactic','technique','technique_id','data_sources'], axis=1)

# We are going to edit some names inside the dataframe to improve the consistency:
techniques_data_source_3 = techniques_data_source_2.replace(['Process monitoring','Application logs'],['Process Monitoring','Application Logs'])

We can now show techniques with data sources mapped to one data source at the time

In [37]:

techniques_data_source_3.head()

Out[37]:

	matrix	platform	tactic	technique	technique_id	data_sources
0	mitre-attack	[Network]	[execution]	Network Device CLI	T1059.008	Network device logs
1	mitre-attack	[Network]	[execution]	Network Device CLI	T1059.008	Network device run-time memory
2	mitre-attack	[Network]	[execution]	Network Device CLI	T1059.008	Network device command history
3	mitre-attack	[Network]	[execution]	Network Device CLI	T1059.008	Network device configuration
4	mitre-attack	[Network]	[collection]	Network Device Configuration Dump	T1602.002	Netflow/Enclave netflow

Let's create a visualization to show the number of techniques grouped by data sources:

In [38]:

data_source_distribution = pandas.DataFrame({
    'Data Source': list(techniques_data_source_3.groupby(['data_sources'])['data_sources'].count().keys()),
    'Count of Techniques': techniques_data_source_3.groupby(['data_sources'])['data_sources'].count().tolist()})
bars = alt.Chart(data_source_distribution,width=800,height=300).mark_bar().encode(x ='Data Source',y='Count of Techniques',color='Data Source').properties(width=1200)
text = bars.mark_text(align='center',baseline='middle',dx=0,dy=-5).encode(text='Count of Techniques')
bars + text

Out[38]:

A few interesting things from the bar chart above:

Process Monitoring, File Monitoring, and Process Command-line parameters are the Data Sources with the highest number of techniques
There are some data source names that include string references to Windows such as PowerShell, Windows and wmi

16. Most Relevant Groups Of Data Sources Per Technique¶

Number Of Data Sources Per Technique¶

Although identifying the data sources with the highest number of techniques is a good start, they usually do not work alone. You might be collecting Process Monitoring already but you might be still missing a lot of context from a data perspective.

In [39]:

data_source_distribution_2 = pandas.DataFrame({
    'Techniques': list(techniques_data_source_3.groupby(['technique'])['technique'].count().keys()),
    'Count of Data Sources': techniques_data_source_3.groupby(['technique'])['technique'].count().tolist()})

data_source_distribution_3 = pandas.DataFrame({
    'Number of Data Sources': list(data_source_distribution_2.groupby(['Count of Data Sources'])['Count of Data Sources'].count().keys()),
    'Count of Techniques': data_source_distribution_2.groupby(['Count of Data Sources'])['Count of Data Sources'].count().tolist()})

bars = alt.Chart(data_source_distribution_3).mark_bar().encode(x ='Number of Data Sources',y='Count of Techniques').properties(width=500)
text = bars.mark_text(align='center',baseline='middle',dx=0,dy=-5).encode(text='Count of Techniques')
bars + text

Out[39]:

The image above shows you the number data sources needed per techniques according to ATT&CK:

There are 71 techniques that require 3 data sources as enough context to validate the detection of them according to ATT&CK
Only one technique has 12 data sources
One data source only applies to 19 techniques

Let's create subsets of data sources with the data source column defining and using a python function:

In [40]:

# https://stackoverflow.com/questions/26332412/python-recursive-function-to-display-all-subsets-of-given-set
def subs(l):
    res = []
    for i in range(1, len(l) + 1):
        for combo in itertools.combinations(l, i):
            res.append(list(combo))
    return res

Before applying the function, we need to use lowercase data sources names and sort data sources names to improve consistency:

In [41]:

df = techniques_with_data_sources[['data_sources']]

In [42]:

for index, row in df.iterrows():
    row["data_sources"]=[x.lower() for x in row["data_sources"]]
    row["data_sources"].sort()

In [43]:

df.head()

Out[43]:

	data_sources
0	[network device command history, network devic...
1	[netflow/enclave netflow, network protocol ana...
2	[file monitoring, network device command histo...
3	[file monitoring, netflow/enclave netflow, net...
4	[netflow/enclave netflow, network protocol ana...

Let's apply the function and split the subsets column:

In [44]:

df['subsets']=df['data_sources'].apply(subs)

<ipython-input-44-9765a9dc0b2f>:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['subsets']=df['data_sources'].apply(subs)

In [45]:

df.head()

Out[45]:

	data_sources	subsets
0	[network device command history, network devic...	[[network device command history], [network de...
1	[netflow/enclave netflow, network protocol ana...	[[netflow/enclave netflow], [network protocol ...
2	[file monitoring, network device command histo...	[[file monitoring], [network device command hi...
3	[file monitoring, netflow/enclave netflow, net...	[[file monitoring], [netflow/enclave netflow],...
4	[netflow/enclave netflow, network protocol ana...	[[netflow/enclave netflow], [network protocol ...

We need to split the subsets column values:

In [46]:

techniques_with_data_sources_preview = df

In [47]:

attributes_4 = ['subsets']

for a in attributes_4:
    s = techniques_with_data_sources_preview.apply(lambda x: pandas.Series(x[a]),axis=1).stack().reset_index(level=1, drop=True)
    s.name = a
    techniques_with_data_sources_preview = techniques_with_data_sources_preview.drop(a, axis=1).join(s).reset_index(drop=True)
    
techniques_with_data_sources_subsets = techniques_with_data_sources_preview.reindex(['data_sources','subsets'], axis=1)

In [48]:

techniques_with_data_sources_subsets.head()

Out[48]:

	data_sources	subsets
0	[network device command history, network devic...	[network device command history]
1	[network device command history, network devic...	[network device configuration]
2	[network device command history, network devic...	[network device logs]
3	[network device command history, network devic...	[network device run-time memory]
4	[network device command history, network devic...	[network device command history, network devic...

Let's add three columns to analyse the dataframe: subsets_name (Changing Lists to Strings), subsets_number_elements ( Number of data sources per subset) and number_data_sources_per_technique

In [49]:

techniques_with_data_sources_subsets['subsets_name']=techniques_with_data_sources_subsets['subsets'].apply(lambda x: ','.join(map(str, x)))
techniques_with_data_sources_subsets['subsets_number_elements']=techniques_with_data_sources_subsets['subsets'].str.len()
techniques_with_data_sources_subsets['number_data_sources_per_technique']=techniques_with_data_sources_subsets['data_sources'].str.len()

In [50]:

techniques_with_data_sources_subsets.head()

Out[50]:

	data_sources	subsets	subsets_name	subsets_number_elements	number_data_sources_per_technique
0	[network device command history, network devic...	[network device command history]	network device command history	1	4
1	[network device command history, network devic...	[network device configuration]	network device configuration	1	4
2	[network device command history, network devic...	[network device logs]	network device logs	1	4
3	[network device command history, network devic...	[network device run-time memory]	network device run-time memory	1	4
4	[network device command history, network devic...	[network device command history, network devic...	network device command history,network device ...	2	4

As it was described above, we need to find grups pf data sources, so we are going to filter out all the subsets with only one data source:

In [51]:

subsets = techniques_with_data_sources_subsets

subsets_ok=subsets[subsets.subsets_number_elements != 1]

In [52]:

subsets_ok.head()

Out[52]:

	data_sources	subsets	subsets_name	subsets_number_elements	number_data_sources_per_technique
4	[network device command history, network devic...	[network device command history, network devic...	network device command history,network device ...	2	4
5	[network device command history, network devic...	[network device command history, network devic...	network device command history,network device ...	2	4
6	[network device command history, network devic...	[network device command history, network devic...	network device command history,network device ...	2	4
7	[network device command history, network devic...	[network device configuration, network device ...	network device configuration,network device logs	2	4
8	[network device command history, network devic...	[network device configuration, network device ...	network device configuration,network device ru...	2	4

Finally, we calculate the most relevant groups of data sources (Top 15):

In [53]:

subsets_graph = subsets_ok.groupby(['subsets_name'])['subsets_name'].count().to_frame(name='subsets_count').sort_values(by='subsets_count',ascending=False)[0:15]

In [54]:

subsets_graph

Out[54]:

	subsets_count
subsets_name
process command-line parameters,process monitoring	183
file monitoring,process monitoring	144
file monitoring,process command-line parameters	100
file monitoring,process command-line parameters,process monitoring	88
network protocol analysis,packet capture	76
api monitoring,process monitoring	70
process monitoring,process use of network	56
netflow/enclave netflow,packet capture	55
process monitoring,windows registry	50
packet capture,process use of network	45
packet capture,process monitoring	43
process command-line parameters,windows registry	41
netflow/enclave netflow,network protocol analysis	41
network protocol analysis,process use of network	40
netflow/enclave netflow,process monitoring	38

In [55]:

subsets_graph_2 = pandas.DataFrame({
    'Data Sources': list(subsets_graph.index),
    'Count of Techniques': subsets_graph['subsets_count'].tolist()})

bars = alt.Chart(subsets_graph_2).mark_bar().encode(x ='Data Sources', y ='Count of Techniques', color='Data Sources').properties(width=500)
text = bars.mark_text(align='center',baseline='middle',dx= 0,dy=-5).encode(text='Count of Techniques')
bars + text

Out[55]:

Group (Process Monitoring - Process Command-line parameters) is the is the group of data sources with the highest number of techniques. This group of data sources are suggested to hunt 78 techniques

17. Let's Split all the Information About Techniques With Data Sources Defined: Matrix, Platform, Tactic and Data Source¶

Let's split all the relevant columns of the dataframe:

In [56]:

techniques_data = techniques_with_data_sources

attributes = ['platform','tactic','data_sources'] # In attributes we are going to indicate the name of the columns that we need to split

for a in attributes:
    s = techniques_data.apply(lambda x: pandas.Series(x[a]),axis=1).stack().reset_index(level=1, drop=True)
    # "s" is going to be a column of a frame with every value of the list inside each cell of the column "a"
    s.name = a
    # We name "s" with the same name of "a".
    techniques_data=techniques_data.drop(a, axis=1).join(s).reset_index(drop=True)
    # We drop the column "a" from "techniques_data", and then join "techniques_data" with "s"

# Let's re-arrange the columns from general to specific
techniques_data_2=techniques_data.reindex(['matrix','platform','tactic','technique','technique_id','data_sources'], axis=1)

# We are going to edit some names inside the dataframe to improve the consistency:
techniques_data_3 = techniques_data_2.replace(['Process monitoring','Application logs'],['Process Monitoring','Application Logs'])

techniques_data_3.head()

Out[56]:

	matrix	platform	tactic	technique	technique_id	data_sources
0	mitre-attack	Network	execution	Network Device CLI	T1059.008	Network device logs
1	mitre-attack	Network	execution	Network Device CLI	T1059.008	Network device run-time memory
2	mitre-attack	Network	execution	Network Device CLI	T1059.008	Network device command history
3	mitre-attack	Network	execution	Network Device CLI	T1059.008	Network device configuration
4	mitre-attack	Network	collection	Network Device Configuration Dump	T1602.002	Netflow/Enclave netflow

Do you remember data sources names with a reference to Windows? After splitting the dataframe by platforms, tactics and data sources, are there any macOC or linux techniques that consider windows data sources? Let's identify those rows:

In [57]:

# After splitting the rows of the dataframe, there are some values that relate windows data sources with platforms like linux and masOS.
# We need to identify those rows
conditions = [(techniques_data_3['platform']=='Linux')&(techniques_data_3['data_sources'].str.contains('windows',case=False)== True),
             (techniques_data_3['platform']=='macOS')&(techniques_data_3['data_sources'].str.contains('windows',case=False)== True),
             (techniques_data_3['platform']=='Linux')&(techniques_data_3['data_sources'].str.contains('powershell',case=False)== True),
             (techniques_data_3['platform']=='macOS')&(techniques_data_3['data_sources'].str.contains('powershell',case=False)== True),
             (techniques_data_3['platform']=='Linux')&(techniques_data_3['data_sources'].str.contains('wmi',case=False)== True),
             (techniques_data_3['platform']=='macOS')&(techniques_data_3['data_sources'].str.contains('wmi',case=False)== True)]
# In conditions we indicate a logical test

choices = ['NO OK','NO OK','NO OK','NO OK','NO OK','NO OK']
# In choices, we indicate the result when the logical test is true

techniques_data_3['Validation'] = np.select(conditions,choices,default='OK')
# We add a column "Validation" to "techniques_data_3" with the result of the logical test. The default value is going to be "OK"

What is the inconsistent data?

In [58]:

techniques_analysis_data_no_ok = techniques_data_3[techniques_data_3.Validation == 'NO OK']
# Finally, we are filtering all the values with NO OK

techniques_analysis_data_no_ok.head()

Out[58]:

	matrix	platform	tactic	technique	technique_id	data_sources	Validation
162	mitre-attack	Linux	defense-evasion	Run Virtual Instance	T1564.006	Windows Registry	NO OK
168	mitre-attack	macOS	defense-evasion	Run Virtual Instance	T1564.006	Windows Registry	NO OK
179	mitre-attack	Linux	defense-evasion	Hidden File System	T1564.005	Windows Registry	NO OK
181	mitre-attack	macOS	defense-evasion	Hidden File System	T1564.005	Windows Registry	NO OK
794	mitre-attack	macOS	defense-evasion	Hidden Window	T1564.003	PowerShell logs	NO OK

In [59]:

print('There are ',len(techniques_analysis_data_no_ok),' rows with inconsistent data')

There are  136  rows with inconsistent data

What is the impact of this inconsistent data from a platform and data sources perspective?

In [60]:

df = techniques_with_data_sources

attributes = ['platform','data_sources']

for a in attributes:
    s = df.apply(lambda x: pandas.Series(x[a]),axis=1).stack().reset_index(level=1, drop=True)
    s.name = a
    df=df.drop(a, axis=1).join(s).reset_index(drop=True)
    
df_2=df.reindex(['matrix','platform','tactic','technique','technique_id','data_sources'], axis=1)
df_3 = df_2.replace(['Process monitoring','Application logs'],['Process Monitoring','Application Logs'])

conditions = [(df_3['data_sources'].str.contains('windows',case=False)== True),
              (df_3['data_sources'].str.contains('powershell',case=False)== True),
              (df_3['data_sources'].str.contains('wmi',case=False)== True)]

choices = ['Windows','Windows','Windows']

df_3['Validation'] = np.select(conditions,choices,default='Other')
df_3['Num_Tech'] = 1
df_4 = df_3[df_3.Validation == 'Windows']
df_5 = df_4.groupby(['data_sources','platform'])['technique'].nunique()
df_6 = df_5.to_frame().reset_index()

In [61]:

alt.Chart(df_6).mark_bar().encode(x=alt.X('technique', stack="normalize"),    y='data_sources',    color='platform').properties(height=200)

Out[61]:

There are techniques that consider Windows Error Reporting, Windows Registry, and Windows event logs as data sources and they also consider platforms like Linux and masOS. We do not need to consider this rows because those data sources can only be managed at a Windows environment. These are the techniques that we should not consider in our data base:

In [62]:

techniques_analysis_data_no_ok[['technique','data_sources']].drop_duplicates().sort_values(by='data_sources',ascending=True)

Out[62]:

	technique	data_sources
5953	OS Credential Dumping	PowerShell logs
5832	Remote Services	PowerShell logs
2814	Clear Command History	PowerShell logs
2432	Credentials from Password Stores	PowerShell logs
4564	Peripheral Device Discovery	PowerShell logs
2271	Keychain	PowerShell logs
2259	Credentials from Web Browsers	PowerShell logs
2392	GUI Input Capture	PowerShell logs
1831	Impair Command History Logging	PowerShell logs
794	Hidden Window	PowerShell logs
1611	Hide Artifacts	PowerShell logs
5431	Input Capture	PowerShell logs
5402	Command and Scripting Interpreter	PowerShell logs
3206	Event Triggered Execution	WMI Objects
4156	Exploitation of Remote Services	Windows Error Reporting
4206	Exploitation for Defense Evasion	Windows Error Reporting
5361	Exploitation for Privilege Escalation	Windows Error Reporting
4241	Exploitation for Credential Access	Windows Error Reporting
3212	Event Triggered Execution	Windows Registry
5217	Software Deployment Tools	Windows Registry
4038	Service Stop	Windows Registry
4020	Inhibit System Recovery	Windows Registry
5426	Input Capture	Windows Registry
3389	Create or Modify System Process	Windows Registry
5827	Remote Services	Windows Registry
4373	Browser Extensions	Windows Registry
162	Run Virtual Instance	Windows Registry
2414	Keylogging	Windows Registry
1875	Impair Defenses	Windows Registry
2599	Masquerade Task or Service	Windows Registry
1857	Disable or Modify Tools	Windows Registry
2654	Subvert Trust Controls	Windows Registry
1824	Disable or Modify System Firewall	Windows Registry
1204	System Services	Windows Registry
2341	Modify Authentication Process	Windows Registry
2722	Unsecured Credentials	Windows Registry
179	Hidden File System	Windows Registry
2895	Abuse Elevation Control Mechanism	Windows Registry
5278	Indicator Removal on Host	Windows event logs
5775	Obfuscated Files or Information	Windows event logs
5401	Command and Scripting Interpreter	Windows event logs
5828	Remote Services	Windows event logs
5559	Scheduled Task/Job	Windows event logs
5427	Input Capture	Windows event logs
2970	Local Account	Windows event logs
3202	Event Triggered Execution	Windows event logs
4439	Create Account	Windows event logs
2602	Masquerade Task or Service	Windows event logs
2655	Subvert Trust Controls	Windows event logs
4078	File and Directory Permissions Modification	Windows event logs
2720	Unsecured Credentials	Windows event logs
4022	Inhibit System Recovery	Windows event logs
3624	System Shutdown/Reboot	Windows event logs
3605	Account Access Removal	Windows event logs
2962	Domain Account	Windows event logs
4909	Account Manipulation	Windows event logs
3388	Create or Modify System Process	Windows event logs

Without considering this inconsistent data, the final dataframe is:

In [63]:

techniques_analysis_data_ok = techniques_data_3[techniques_data_3.Validation == 'OK']
techniques_analysis_data_ok.head()

Out[63]:

	matrix	platform	tactic	technique	technique_id	data_sources	Validation
0	mitre-attack	Network	execution	Network Device CLI	T1059.008	Network device logs	OK
1	mitre-attack	Network	execution	Network Device CLI	T1059.008	Network device run-time memory	OK
2	mitre-attack	Network	execution	Network Device CLI	T1059.008	Network device command history	OK
3	mitre-attack	Network	execution	Network Device CLI	T1059.008	Network device configuration	OK
4	mitre-attack	Network	collection	Network Device Configuration Dump	T1602.002	Netflow/Enclave netflow	OK

In [64]:

print('There are ',len(techniques_analysis_data_ok),' rows of data that you can play with')

There are  6650  rows of data that you can play with

18. Getting Techniques by Data Sources¶

This function gets techniques' information that includes specific data sources

In [65]:

data_source = 'PROCESS MONITORING'

In [66]:

results = lift.get_techniques_by_datasources(data_source)

In [67]:

len(results)

Out[67]:

In [68]:

type(results)

Out[68]:

list

In [69]:

results2 = lift.get_techniques_by_datasources('pRoceSS MoniTorinG','process commAnd-linE parameters')

In [70]:

len(results2)

Out[70]:

In [71]:

results2[1]

Out[71]:

AttackPattern(type='attack-pattern', id='attack-pattern--2de47683-f398-448f-b947-9abcc3e32fad', created_by_ref='identity--c78cb6e5-0c4b-4611-8297-d1b8b55e40b5', created='2020-10-05T13:24:49.780Z', modified='2020-10-09T16:05:36.344Z', name='Print Processors', description='Adversaries may abuse print processors to run malicious DLLs during system boot for persistence and/or privilege escalation. Print processors are DLLs that are loaded by the print spooler service, spoolsv.exe, during boot. \n\nAdversaries may abuse the print spooler service by adding print processors that load malicious DLLs at startup. A print processor can be installed through the <code>AddPrintProcessor</code> API call with an account that has <code>SeLoadDriverPrivilege</code> enabled. Alternatively, a print processor can be registered to the print spooler service by adding the <code>HKLM\\SYSTEM\\\\[CurrentControlSet or ControlSet001]\\Control\\Print\\Environments\\\\[Windows architecture: e.g., Windows x64]\\Print Processors\\\\[user defined]\\Driver</code> Registry key that points to the DLL. For the print processor to be correctly installed, it must be located in the system print-processor directory that can be found with the <code>GetPrintProcessorDirectory</code> API call.(Citation: Microsoft AddPrintProcessor May 2018) After the print processors are installed, the print spooler service, which starts during boot, must be restarted in order for them to run.(Citation: ESET PipeMon May 2020) The print spooler service runs under SYSTEM level permissions, therefore print processors installed by an adversary may run under elevated privileges.', kill_chain_phases=[KillChainPhase(kill_chain_name='mitre-attack', phase_name='persistence'), KillChainPhase(kill_chain_name='mitre-attack', phase_name='privilege-escalation')], external_references=[ExternalReference(source_name='mitre-attack', url='https://attack.mitre.org/techniques/T1547/012', external_id='T1547.012'), ExternalReference(source_name='Microsoft AddPrintProcessor May 2018', description='Microsoft. (2018, May 31). AddPrintProcessor function. Retrieved October 5, 2020.', url='https://docs.microsoft.com/en-us/windows/win32/printdocs/addprintprocessor'), ExternalReference(source_name='ESET PipeMon May 2020', description='Tartare, M. et al. (2020, May 21). No “Game over” for the Winnti Group. Retrieved August 24, 2020.', url='https://www.welivesecurity.com/2020/05/21/no-game-over-winnti-group/')], object_marking_refs=['marking-definition--fa42a846-8d90-4e51-bc29-71d5b4802168'], x_mitre_contributors=['Mathieu Tartare, ESET'], x_mitre_data_sources=['Process monitoring', 'Windows Registry', 'File monitoring', 'DLL monitoring', 'API monitoring'], x_mitre_detection='Monitor process API calls to <code>AddPrintProcessor</code> and <code>GetPrintProcessorDirectory</code>. New print processor DLLs are written to the print processor directory. Also monitor Registry writes to <code>HKLM\\SYSTEM\\ControlSet001\\Control\\Print\\Environments\\\\[Windows architecture]\\Print Processors\\\\[user defined]\\\\Driver</code> or <code>HKLM\\SYSTEM\\CurrentControlSet\\Control\\Print\\Environments\\\\[Windows architecture]\\Print Processors\\\\[user defined]\\Driver</code> as they pertain to print processor installations.\n\nMonitor for abnormal DLLs that are loaded by spoolsv.exe. Print processors that do not correlate with known good software or patching may be suspicious.', x_mitre_is_subtechnique=True, x_mitre_permissions_required=['Administrator', 'SYSTEM'], x_mitre_platforms=['Windows'], x_mitre_version='1.0')