You can install it via PIP: pip install attackcti
from attackcti import attack_client
from pandas import *
import numpy as np
import altair as alt
alt.renderers.enable('notebook')
import itertools
lift = attack_client()
Getting ALL ATT&CK Techniques
all_techniques = lift.get_techniques(stix_format=False)
Showing the first technique in our list
all_techniques[0]
{'external_references': [{'source_name': 'mitre-attack', 'external_id': 'T1059.008', 'url': 'https://attack.mitre.org/techniques/T1059/008'}, {'source_name': 'Cisco Synful Knock Evolution', 'url': 'https://blogs.cisco.com/security/evolution-of-attacks-on-cisco-ios-devices', 'description': 'Graham Holmes. (2015, October 8). Evolution of attacks on Cisco IOS devices. Retrieved October 19, 2020.'}, {'source_name': 'Cisco IOS Software Integrity Assurance - Command History', 'url': 'https://tools.cisco.com/security/center/resources/integrity_assurance.html#23', 'description': 'Cisco. (n.d.). Cisco IOS Software Integrity Assurance - Command History. Retrieved October 21, 2020.'}], 'kill_chain_phases': [{'kill_chain_name': 'mitre-attack', 'phase_name': 'execution'}], 'x_mitre_is_subtechnique': True, 'x_mitre_version': '1.0', 'id': 'attack-pattern--818302b2-d640-477b-bf88-873120ce85c4', 'technique_description': 'Adversaries may abuse scripting or built-in command line interpreters (CLI) on network devices to execute malicious command and payloads. The CLI is the primary means through which users and administrators interact with the device in order to view system information, modify device operations, or perform diagnostic and administrative functions. CLIs typically contain various permission levels required for different commands. \n\nScripting interpreters automate tasks and extend functionality beyond the command set included in the network OS. The CLI and scripting interpreter are accessible through a direct console connection, or through remote means, such as telnet or secure shell (SSH).\n\nAdversaries can use the network CLI to change how network devices behave and operate. The CLI may be used to manipulate traffic flows to intercept or manipulate data, modify startup configuration parameters to load malicious system software, or to disable security features or logging to avoid detection. (Citation: Cisco Synful Knock Evolution)', 'technique': 'Network Device CLI', 'created_by_ref': 'identity--c78cb6e5-0c4b-4611-8297-d1b8b55e40b5', 'object_marking_refs': ['marking-definition--fa42a846-8d90-4e51-bc29-71d5b4802168'], 'url': 'https://attack.mitre.org/techniques/T1059/008', 'matrix': 'mitre-attack', 'technique_id': 'T1059.008', 'type': 'attack-pattern', 'tactic': ['execution'], 'modified': '2020-10-22T16:43:38.388Z', 'created': '2020-10-20T00:09:33.072Z', 'data_sources': ['Network device logs', 'Network device run-time memory', 'Network device command history', 'Network device configuration'], 'platform': ['Network'], 'technique_detection': 'Consider reviewing command history in either the console or as part of the running memory to determine if unauthorized or suspicious commands were used to modify device configuration.(Citation: Cisco IOS Software Integrity Assurance - Command History)\n\nConsider comparing a copy of the network device configuration against a known-good version to discover unauthorized changes to the command interpreter. The same process can be accomplished through a comparison of the run-time memory, though this is non-trivial and may require assistance from the vendor.', 'permissions_required': ['Administrator', 'User']}
Normalizing semi-structured JSON data into a flat table via pandas.io.json.json_normalize
techniques_normalized = pandas.json_normalize(all_techniques)
techniques_normalized[0:1]
external_references | kill_chain_phases | x_mitre_is_subtechnique | x_mitre_version | id | technique_description | technique | created_by_ref | object_marking_refs | url | ... | remote_support | impact_type | revoked | x_mitre_deprecated | x_mitre_old_attack_id | difficulty_explanation | difficulty_for_adversary | detectable_explanation | detectable_by_common_defenses | tactic_type | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | [{'source_name': 'mitre-attack', 'external_id'... | [{'kill_chain_name': 'mitre-attack', 'phase_na... | True | 1.0 | attack-pattern--818302b2-d640-477b-bf88-873120... | Adversaries may abuse scripting or built-in co... | Network Device CLI | identity--c78cb6e5-0c4b-4611-8297-d1b8b55e40b5 | [marking-definition--fa42a846-8d90-4e51-bc29-7... | https://attack.mitre.org/techniques/T1059/008 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 rows × 37 columns
techniques = techniques_normalized.reindex(['matrix','platform','tactic','technique','technique_id','data_sources'], axis=1)
techniques.head()
matrix | platform | tactic | technique | technique_id | data_sources | |
---|---|---|---|---|---|---|
0 | mitre-attack | [Network] | [execution] | Network Device CLI | T1059.008 | [Network device logs, Network device run-time ... |
1 | mitre-attack | [Network] | [collection] | Network Device Configuration Dump | T1602.002 | [Netflow/Enclave netflow, Network protocol ana... |
2 | mitre-attack | [Network] | [defense-evasion, persistence] | TFTP Boot | T1542.005 | [Network device run-time memory, Network devic... |
3 | mitre-attack | [Network] | [defense-evasion, persistence] | ROMMONkit | T1542.004 | [File monitoring, Netflow/Enclave netflow, Net... |
4 | mitre-attack | [Network] | [collection] | SNMP (MIB Dump) | T1602.001 | [Netflow/Enclave netflow, Network protocol ana... |
print('A total of ',len(techniques),' techniques')
A total of 1024 techniques
all_techniques_no_revoked = lift.remove_revoked(all_techniques)
print('A total of ',len(all_techniques_no_revoked),' techniques')
A total of 878 techniques
all_techniques_revoked = lift.extract_revoked(all_techniques)
print('A total of ',len(all_techniques_revoked),' techniques that have been revoked')
A total of 146 techniques that have been revoked
The revoked techniques are the following ones:
for t in all_techniques_revoked:
print(t['technique'])
Web Session Cookie Emond Cloud Instance Metadata API Revert Cloud Instance Application Access Token Elevated Execution with Prompt Credentials from Web Browsers PowerShell Profile Parent PID Spoofing Compile After Delivery Systemd Service Runtime Data Manipulation Transmitted Data Manipulation Stored Data Manipulation Disk Content Wipe Disk Structure Wipe Domain Generation Algorithms Compiled HTML File Kernel Modules and Extensions Spearphishing Link CMSTP Credentials in Registry Control Panel Items Kerberoasting Spearphishing Attachment SIP and Trust Provider Hijacking Spearphishing via Service Sudo Caching Time Providers AppCert DLLs Dynamic Data Exchange Multi-hop Proxy Process Doppelgänging Extra Window Memory Injection Domain Fronting Mshta Hooking Image File Execution Options Injection LSASS Driver Screensaver LLMNR/NBT-NS Poisoning and Relay Password Filter DLL SSH Hijacking SID-History Injection Gatekeeper Bypass HISTCONTROL LC_LOAD_DYLIB Addition Launchctl Local Job Scheduling Private Keys Rc.common Space after Filename Application Shimming AppleScript Bash History .bash_profile and .bashrc Clear Command History Dylib Hijacking Hidden Window Launch Daemon Hidden Users Input Prompt Launch Agent Login Item Keychain Plist Modification Re-opened Applications Setuid and Setgid Hidden Files and Directories Startup Items Sudo Securityd Memory Trap Authentication Package Install Root Certificate Netsh Helper DLL Network Share Connection Removal Component Object Model Hijacking Regsvcs/Regasm InstallUtil Regsvr32 Code Signing Component Firmware File Deletion AppInit DLLs Security Support Provider Web Shell Timestomp Pass the Ticket NTFS File Attributes Custom Command and Control Protocol Process Hollowing Disabling Security Tools Bypass User Account Control PowerShell Rundll32 Windows Management Instrumentation Event Subscription Credentials in Files Multilayer Encryption Windows Admin Shares Remote Desktop Protocol Pass the Hash DLL Side-Loading Bootkit Indicator Removal from Tools Uncommonly Used Port Security Software Discovery Registry Run Keys / Startup Folder Service Registry Permissions Weakness Indicator Blocking New Service Software Packing File System Permissions Weakness Change Default File Association DLL Search Order Hijacking Service Execution Standard Cryptographic Protocol Modify Existing Service Windows Remote Management Custom Cryptographic Protocol Shortcut Modification Data Encrypted System Firmware Application Deployment Software Accessibility Features Port Monitors Binary Padding Winlogon Helper DLL Data Compressed Remotely Install Application Insecure Third-Party Libraries Fake Developer Accounts Device Type Discovery Detect App Analysis Environment Malicious Software Development Tools Biometric Spoofing Device Unlock Code Guessing or Brute Force Malicious Media Content URL Scheme Hijacking Abuse of iOS Enterprise App Signing Key App Delivered via Web Download App Delivered via Email Attachment Malicious or Vulnerable Built-in Device Functionality Malicious SMS Message Exploit Baseband Vulnerability Stolen Developer Credentials or Signing Keys
techniques_normalized = pandas.json_normalize(all_techniques_no_revoked)
techniques = techniques_normalized.reindex(['matrix','platform','tactic','technique','technique_id','data_sources'], axis=1)
Using altair python library we can start showing a few charts stacking the number of techniques with or without data sources. Reference: https://altair-viz.github.io/
data = techniques
data_2 = data.groupby(['matrix'])['technique'].count()
data_3 = data_2.to_frame().reset_index()
data_3
matrix | technique | |
---|---|---|
0 | mitre-attack | 536 |
1 | mitre-ics-attack | 81 |
2 | mitre-mobile-attack | 87 |
3 | mitre-pre-attack | 174 |
alt.Chart(data_3).mark_bar().encode(x='technique', y='matrix', color='matrix').properties(height = 200)
data_source_distribution = pandas.DataFrame({
'Techniques': ['Without DS','With DS'],
'Count of Techniques': [techniques['data_sources'].isna().sum(),techniques['data_sources'].notna().sum()]})
bars = alt.Chart(data_source_distribution).mark_bar().encode(x='Techniques',y='Count of Techniques',color='Techniques').properties(width=200,height=300)
text = bars.mark_text(align='center',baseline='middle',dx=0,dy=-5).encode(text='Count of Techniques')
bars + text
What is the distribution of techniques based on ATT&CK Matrix?
data = techniques
data['Count_DS'] = data['data_sources'].str.len()
data['Ind_DS'] = np.where(data['Count_DS']>0,'With DS','Without DS')
data_2 = data.groupby(['matrix','Ind_DS'])['technique'].count()
data_3 = data_2.to_frame().reset_index()
data_3
matrix | Ind_DS | technique | |
---|---|---|---|
0 | mitre-attack | With DS | 474 |
1 | mitre-attack | Without DS | 62 |
2 | mitre-ics-attack | With DS | 67 |
3 | mitre-ics-attack | Without DS | 14 |
4 | mitre-mobile-attack | Without DS | 87 |
5 | mitre-pre-attack | Without DS | 174 |
alt.Chart(data_3).mark_bar().encode(x='technique', y='Ind_DS', color='matrix').properties(height = 200)
What are those mitre-attack techniques without data sources?
data[(data['matrix']=='mitre-attack') & (data['Ind_DS']=='Without DS')]
matrix | platform | tactic | technique | technique_id | data_sources | Count_DS | Ind_DS | |
---|---|---|---|---|---|---|---|---|
17 | mitre-attack | [PRE] | [resource-development] | Vulnerabilities | T1588.006 | NaN | NaN | Without DS |
23 | mitre-attack | [PRE] | [reconnaissance] | Spearphishing Service | T1598.001 | NaN | NaN | Without DS |
25 | mitre-attack | [PRE] | [reconnaissance] | Purchase Technical Data | T1597.002 | NaN | NaN | Without DS |
26 | mitre-attack | [PRE] | [reconnaissance] | Threat Intel Vendors | T1597.001 | NaN | NaN | Without DS |
27 | mitre-attack | [PRE] | [reconnaissance] | Search Closed Sources | T1597 | NaN | NaN | Without DS |
... | ... | ... | ... | ... | ... | ... | ... | ... |
90 | mitre-attack | [PRE] | [resource-development] | Compromise Infrastructure | T1584 | NaN | NaN | Without DS |
92 | mitre-attack | [PRE] | [resource-development] | Acquire Infrastructure | T1583 | NaN | NaN | Without DS |
220 | mitre-attack | [Linux, macOS, Windows] | [collection] | Archive via Custom Method | T1560.003 | NaN | NaN | Without DS |
260 | mitre-attack | [Linux] | [credential-access] | /etc/passwd and /etc/shadow | T1003.008 | NaN | NaN | Without DS |
354 | mitre-attack | [Linux, macOS, Windows] | [persistence, privilege-escalation] | Boot or Logon Autostart Execution | T1547 | NaN | NaN | Without DS |
62 rows × 8 columns
techniques_without_data_sources=techniques[techniques.data_sources.isnull()].reset_index(drop=True)
techniques_without_data_sources.head()
matrix | platform | tactic | technique | technique_id | data_sources | Count_DS | Ind_DS | |
---|---|---|---|---|---|---|---|---|
0 | mitre-attack | [PRE] | [resource-development] | Vulnerabilities | T1588.006 | NaN | NaN | Without DS |
1 | mitre-attack | [PRE] | [reconnaissance] | Spearphishing Service | T1598.001 | NaN | NaN | Without DS |
2 | mitre-attack | [PRE] | [reconnaissance] | Purchase Technical Data | T1597.002 | NaN | NaN | Without DS |
3 | mitre-attack | [PRE] | [reconnaissance] | Threat Intel Vendors | T1597.001 | NaN | NaN | Without DS |
4 | mitre-attack | [PRE] | [reconnaissance] | Search Closed Sources | T1597 | NaN | NaN | Without DS |
print('There are ',techniques['data_sources'].isna().sum(),' techniques without data sources (',"{0:.0%}".format(techniques['data_sources'].isna().sum()/len(techniques)),' of ',len(techniques),' techniques)')
There are 337 techniques without data sources ( 38% of 878 techniques)
techniques_with_data_sources=techniques[techniques.data_sources.notnull()].reset_index(drop=True)
techniques_with_data_sources.head()
matrix | platform | tactic | technique | technique_id | data_sources | Count_DS | Ind_DS | |
---|---|---|---|---|---|---|---|---|
0 | mitre-attack | [Network] | [execution] | Network Device CLI | T1059.008 | [Network device logs, Network device run-time ... | 4.0 | With DS |
1 | mitre-attack | [Network] | [collection] | Network Device Configuration Dump | T1602.002 | [Netflow/Enclave netflow, Network protocol ana... | 3.0 | With DS |
2 | mitre-attack | [Network] | [defense-evasion, persistence] | TFTP Boot | T1542.005 | [Network device run-time memory, Network devic... | 5.0 | With DS |
3 | mitre-attack | [Network] | [defense-evasion, persistence] | ROMMONkit | T1542.004 | [File monitoring, Netflow/Enclave netflow, Net... | 4.0 | With DS |
4 | mitre-attack | [Network] | [collection] | SNMP (MIB Dump) | T1602.001 | [Netflow/Enclave netflow, Network protocol ana... | 3.0 | With DS |
print('There are ',techniques['data_sources'].notna().sum(),' techniques with data sources (',"{0:.0%}".format(techniques['data_sources'].notna().sum()/len(techniques)),' of ',len(techniques),' techniques)')
There are 541 techniques with data sources ( 62% of 878 techniques)
Let's create a graph to represent the number of techniques per matrix:
matrix_distribution = pandas.DataFrame({
'Matrix': list(techniques_with_data_sources.groupby(['matrix'])['matrix'].count().keys()),
'Count of Techniques': techniques_with_data_sources.groupby(['matrix'])['matrix'].count().tolist()})
bars = alt.Chart(matrix_distribution).mark_bar().encode(y='Matrix',x='Count of Techniques').properties(width=300,height=100)
text = bars.mark_text(align='center',baseline='middle',dx=10,dy=0).encode(text='Count of Techniques')
bars + text
All the techniques belong to mitre-attack matrix which is the main Enterprise matrix. Reference: https://attack.mitre.org/wiki/Main_Page
First, we need to split the platform column values because a technique might be mapped to more than one platform
techniques_platform=techniques_with_data_sources
attributes_1 = ['platform'] # In attributes we are going to indicate the name of the columns that we need to split
for a in attributes_1:
s = techniques_platform.apply(lambda x: pandas.Series(x[a]),axis=1).stack().reset_index(level=1, drop=True)
# "s" is going to be a column of a frame with every value of the list inside each cell of the column "a"
s.name = a
# We name "s" with the same name of "a".
techniques_platform=techniques_platform.drop(a, axis=1).join(s).reset_index(drop=True)
# We drop the column "a" from "techniques_platform", and then join "techniques_platform" with "s"
# Let's re-arrange the columns from general to specific
techniques_platform_2=techniques_platform.reindex(['matrix','platform','tactic','technique','technique_id','data_sources'], axis=1)
We can now show techniques with data sources mapped to one platform at the time
techniques_platform_2.head()
matrix | platform | tactic | technique | technique_id | data_sources | |
---|---|---|---|---|---|---|
0 | mitre-attack | Network | [execution] | Network Device CLI | T1059.008 | [Network device logs, Network device run-time ... |
1 | mitre-attack | Network | [collection] | Network Device Configuration Dump | T1602.002 | [Netflow/Enclave netflow, Network protocol ana... |
2 | mitre-attack | Network | [defense-evasion, persistence] | TFTP Boot | T1542.005 | [Network device run-time memory, Network devic... |
3 | mitre-attack | Network | [defense-evasion, persistence] | ROMMONkit | T1542.004 | [File monitoring, Netflow/Enclave netflow, Net... |
4 | mitre-attack | Network | [collection] | SNMP (MIB Dump) | T1602.001 | [Netflow/Enclave netflow, Network protocol ana... |
Let's create a visualization to show the number of techniques grouped by platform:
platform_distribution = pandas.DataFrame({
'Platform': list(techniques_platform_2.groupby(['platform'])['platform'].count().keys()),
'Count of Techniques': techniques_platform_2.groupby(['platform'])['platform'].count().tolist()})
bars = alt.Chart(platform_distribution,height=300).mark_bar().encode(x ='Platform',y='Count of Techniques',color='Platform').properties(width=200)
text = bars.mark_text(align='center',baseline='middle',dx=0,dy=-5).encode(text='Count of Techniques')
bars + text
In the bar chart above we can see that there are more techniques with data sources mapped to the Windows platform.
Again, first we need to split the tactic column values because a technique might be mapped to more than one tactic:
techniques_tactic=techniques_with_data_sources
attributes_2 = ['tactic'] # In attributes we are going to indicate the name of the columns that we need to split
for a in attributes_2:
s = techniques_tactic.apply(lambda x: pandas.Series(x[a]),axis=1).stack().reset_index(level=1, drop=True)
# "s" is going to be a column of a frame with every value of the list inside each cell of the column "a"
s.name = a
# We name "s" with the same name of "a".
techniques_tactic = techniques_tactic.drop(a, axis=1).join(s).reset_index(drop=True)
# We drop the column "a" from "techniques_tactic", and then join "techniques_tactic" with "s"
# Let's re-arrange the columns from general to specific
techniques_tactic_2=techniques_tactic.reindex(['matrix','platform','tactic','technique','technique_id','data_sources'], axis=1)
We can now show techniques with data sources mapped to one tactic at the time
techniques_tactic_2.head()
matrix | platform | tactic | technique | technique_id | data_sources | |
---|---|---|---|---|---|---|
0 | mitre-attack | [Network] | execution | Network Device CLI | T1059.008 | [Network device logs, Network device run-time ... |
1 | mitre-attack | [Network] | collection | Network Device Configuration Dump | T1602.002 | [Netflow/Enclave netflow, Network protocol ana... |
2 | mitre-attack | [Network] | defense-evasion | TFTP Boot | T1542.005 | [Network device run-time memory, Network devic... |
3 | mitre-attack | [Network] | persistence | TFTP Boot | T1542.005 | [Network device run-time memory, Network devic... |
4 | mitre-attack | [Network] | defense-evasion | ROMMONkit | T1542.004 | [File monitoring, Netflow/Enclave netflow, Net... |
Let's create a visualization to show the number of techniques grouped by tactic:
tactic_distribution = pandas.DataFrame({
'Tactic': list(techniques_tactic_2.groupby(['tactic'])['tactic'].count().keys()),
'Count of Techniques': techniques_tactic_2.groupby(['tactic'])['tactic'].count().tolist()}).sort_values(by='Count of Techniques',ascending=True)
bars = alt.Chart(tactic_distribution,width=800,height=300).mark_bar().encode(x ='Tactic',y='Count of Techniques',color='Tactic').properties(width=400)
text = bars.mark_text(align='center',baseline='middle',dx=0,dy=-5).encode(text='Count of Techniques')
bars + text
Defende-evasion and Persistence are tactics with the highest nummber of techniques with data sources
We need to split the data source column values because a technique might be mapped to more than one data source:
techniques_data_source=techniques_with_data_sources
attributes_3 = ['data_sources'] # In attributes we are going to indicate the name of the columns that we need to split
for a in attributes_3:
s = techniques_data_source.apply(lambda x: pandas.Series(x[a]),axis=1).stack().reset_index(level=1, drop=True)
# "s" is going to be a column of a frame with every value of the list inside each cell of the column "a"
s.name = a
# We name "s" with the same name of "a".
techniques_data_source = techniques_data_source.drop(a, axis=1).join(s).reset_index(drop=True)
# We drop the column "a" from "techniques_data_source", and then join "techniques_data_source" with "s"
# Let's re-arrange the columns from general to specific
techniques_data_source_2 = techniques_data_source.reindex(['matrix','platform','tactic','technique','technique_id','data_sources'], axis=1)
# We are going to edit some names inside the dataframe to improve the consistency:
techniques_data_source_3 = techniques_data_source_2.replace(['Process monitoring','Application logs'],['Process Monitoring','Application Logs'])
We can now show techniques with data sources mapped to one data source at the time
techniques_data_source_3.head()
matrix | platform | tactic | technique | technique_id | data_sources | |
---|---|---|---|---|---|---|
0 | mitre-attack | [Network] | [execution] | Network Device CLI | T1059.008 | Network device logs |
1 | mitre-attack | [Network] | [execution] | Network Device CLI | T1059.008 | Network device run-time memory |
2 | mitre-attack | [Network] | [execution] | Network Device CLI | T1059.008 | Network device command history |
3 | mitre-attack | [Network] | [execution] | Network Device CLI | T1059.008 | Network device configuration |
4 | mitre-attack | [Network] | [collection] | Network Device Configuration Dump | T1602.002 | Netflow/Enclave netflow |
Let's create a visualization to show the number of techniques grouped by data sources:
data_source_distribution = pandas.DataFrame({
'Data Source': list(techniques_data_source_3.groupby(['data_sources'])['data_sources'].count().keys()),
'Count of Techniques': techniques_data_source_3.groupby(['data_sources'])['data_sources'].count().tolist()})
bars = alt.Chart(data_source_distribution,width=800,height=300).mark_bar().encode(x ='Data Source',y='Count of Techniques',color='Data Source').properties(width=1200)
text = bars.mark_text(align='center',baseline='middle',dx=0,dy=-5).encode(text='Count of Techniques')
bars + text
A few interesting things from the bar chart above:
Although identifying the data sources with the highest number of techniques is a good start, they usually do not work alone. You might be collecting Process Monitoring already but you might be still missing a lot of context from a data perspective.
data_source_distribution_2 = pandas.DataFrame({
'Techniques': list(techniques_data_source_3.groupby(['technique'])['technique'].count().keys()),
'Count of Data Sources': techniques_data_source_3.groupby(['technique'])['technique'].count().tolist()})
data_source_distribution_3 = pandas.DataFrame({
'Number of Data Sources': list(data_source_distribution_2.groupby(['Count of Data Sources'])['Count of Data Sources'].count().keys()),
'Count of Techniques': data_source_distribution_2.groupby(['Count of Data Sources'])['Count of Data Sources'].count().tolist()})
bars = alt.Chart(data_source_distribution_3).mark_bar().encode(x ='Number of Data Sources',y='Count of Techniques').properties(width=500)
text = bars.mark_text(align='center',baseline='middle',dx=0,dy=-5).encode(text='Count of Techniques')
bars + text
The image above shows you the number data sources needed per techniques according to ATT&CK:
Let's create subsets of data sources with the data source column defining and using a python function:
# https://stackoverflow.com/questions/26332412/python-recursive-function-to-display-all-subsets-of-given-set
def subs(l):
res = []
for i in range(1, len(l) + 1):
for combo in itertools.combinations(l, i):
res.append(list(combo))
return res
Before applying the function, we need to use lowercase data sources names and sort data sources names to improve consistency:
df = techniques_with_data_sources[['data_sources']]
for index, row in df.iterrows():
row["data_sources"]=[x.lower() for x in row["data_sources"]]
row["data_sources"].sort()
df.head()
data_sources | |
---|---|
0 | [network device command history, network devic... |
1 | [netflow/enclave netflow, network protocol ana... |
2 | [file monitoring, network device command histo... |
3 | [file monitoring, netflow/enclave netflow, net... |
4 | [netflow/enclave netflow, network protocol ana... |
Let's apply the function and split the subsets column:
df['subsets']=df['data_sources'].apply(subs)
<ipython-input-44-9765a9dc0b2f>:1: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy df['subsets']=df['data_sources'].apply(subs)
df.head()
data_sources | subsets | |
---|---|---|
0 | [network device command history, network devic... | [[network device command history], [network de... |
1 | [netflow/enclave netflow, network protocol ana... | [[netflow/enclave netflow], [network protocol ... |
2 | [file monitoring, network device command histo... | [[file monitoring], [network device command hi... |
3 | [file monitoring, netflow/enclave netflow, net... | [[file monitoring], [netflow/enclave netflow],... |
4 | [netflow/enclave netflow, network protocol ana... | [[netflow/enclave netflow], [network protocol ... |
We need to split the subsets column values:
techniques_with_data_sources_preview = df
attributes_4 = ['subsets']
for a in attributes_4:
s = techniques_with_data_sources_preview.apply(lambda x: pandas.Series(x[a]),axis=1).stack().reset_index(level=1, drop=True)
s.name = a
techniques_with_data_sources_preview = techniques_with_data_sources_preview.drop(a, axis=1).join(s).reset_index(drop=True)
techniques_with_data_sources_subsets = techniques_with_data_sources_preview.reindex(['data_sources','subsets'], axis=1)
techniques_with_data_sources_subsets.head()
data_sources | subsets | |
---|---|---|
0 | [network device command history, network devic... | [network device command history] |
1 | [network device command history, network devic... | [network device configuration] |
2 | [network device command history, network devic... | [network device logs] |
3 | [network device command history, network devic... | [network device run-time memory] |
4 | [network device command history, network devic... | [network device command history, network devic... |
Let's add three columns to analyse the dataframe: subsets_name (Changing Lists to Strings), subsets_number_elements ( Number of data sources per subset) and number_data_sources_per_technique
techniques_with_data_sources_subsets['subsets_name']=techniques_with_data_sources_subsets['subsets'].apply(lambda x: ','.join(map(str, x)))
techniques_with_data_sources_subsets['subsets_number_elements']=techniques_with_data_sources_subsets['subsets'].str.len()
techniques_with_data_sources_subsets['number_data_sources_per_technique']=techniques_with_data_sources_subsets['data_sources'].str.len()
techniques_with_data_sources_subsets.head()
data_sources | subsets | subsets_name | subsets_number_elements | number_data_sources_per_technique | |
---|---|---|---|---|---|
0 | [network device command history, network devic... | [network device command history] | network device command history | 1 | 4 |
1 | [network device command history, network devic... | [network device configuration] | network device configuration | 1 | 4 |
2 | [network device command history, network devic... | [network device logs] | network device logs | 1 | 4 |
3 | [network device command history, network devic... | [network device run-time memory] | network device run-time memory | 1 | 4 |
4 | [network device command history, network devic... | [network device command history, network devic... | network device command history,network device ... | 2 | 4 |
As it was described above, we need to find grups pf data sources, so we are going to filter out all the subsets with only one data source:
subsets = techniques_with_data_sources_subsets
subsets_ok=subsets[subsets.subsets_number_elements != 1]
subsets_ok.head()
data_sources | subsets | subsets_name | subsets_number_elements | number_data_sources_per_technique | |
---|---|---|---|---|---|
4 | [network device command history, network devic... | [network device command history, network devic... | network device command history,network device ... | 2 | 4 |
5 | [network device command history, network devic... | [network device command history, network devic... | network device command history,network device ... | 2 | 4 |
6 | [network device command history, network devic... | [network device command history, network devic... | network device command history,network device ... | 2 | 4 |
7 | [network device command history, network devic... | [network device configuration, network device ... | network device configuration,network device logs | 2 | 4 |
8 | [network device command history, network devic... | [network device configuration, network device ... | network device configuration,network device ru... | 2 | 4 |
Finally, we calculate the most relevant groups of data sources (Top 15):
subsets_graph = subsets_ok.groupby(['subsets_name'])['subsets_name'].count().to_frame(name='subsets_count').sort_values(by='subsets_count',ascending=False)[0:15]
subsets_graph
subsets_count | |
---|---|
subsets_name | |
process command-line parameters,process monitoring | 183 |
file monitoring,process monitoring | 144 |
file monitoring,process command-line parameters | 100 |
file monitoring,process command-line parameters,process monitoring | 88 |
network protocol analysis,packet capture | 76 |
api monitoring,process monitoring | 70 |
process monitoring,process use of network | 56 |
netflow/enclave netflow,packet capture | 55 |
process monitoring,windows registry | 50 |
packet capture,process use of network | 45 |
packet capture,process monitoring | 43 |
process command-line parameters,windows registry | 41 |
netflow/enclave netflow,network protocol analysis | 41 |
network protocol analysis,process use of network | 40 |
netflow/enclave netflow,process monitoring | 38 |
subsets_graph_2 = pandas.DataFrame({
'Data Sources': list(subsets_graph.index),
'Count of Techniques': subsets_graph['subsets_count'].tolist()})
bars = alt.Chart(subsets_graph_2).mark_bar().encode(x ='Data Sources', y ='Count of Techniques', color='Data Sources').properties(width=500)
text = bars.mark_text(align='center',baseline='middle',dx= 0,dy=-5).encode(text='Count of Techniques')
bars + text
Group (Process Monitoring - Process Command-line parameters) is the is the group of data sources with the highest number of techniques. This group of data sources are suggested to hunt 78 techniques
Let's split all the relevant columns of the dataframe:
techniques_data = techniques_with_data_sources
attributes = ['platform','tactic','data_sources'] # In attributes we are going to indicate the name of the columns that we need to split
for a in attributes:
s = techniques_data.apply(lambda x: pandas.Series(x[a]),axis=1).stack().reset_index(level=1, drop=True)
# "s" is going to be a column of a frame with every value of the list inside each cell of the column "a"
s.name = a
# We name "s" with the same name of "a".
techniques_data=techniques_data.drop(a, axis=1).join(s).reset_index(drop=True)
# We drop the column "a" from "techniques_data", and then join "techniques_data" with "s"
# Let's re-arrange the columns from general to specific
techniques_data_2=techniques_data.reindex(['matrix','platform','tactic','technique','technique_id','data_sources'], axis=1)
# We are going to edit some names inside the dataframe to improve the consistency:
techniques_data_3 = techniques_data_2.replace(['Process monitoring','Application logs'],['Process Monitoring','Application Logs'])
techniques_data_3.head()
matrix | platform | tactic | technique | technique_id | data_sources | |
---|---|---|---|---|---|---|
0 | mitre-attack | Network | execution | Network Device CLI | T1059.008 | Network device logs |
1 | mitre-attack | Network | execution | Network Device CLI | T1059.008 | Network device run-time memory |
2 | mitre-attack | Network | execution | Network Device CLI | T1059.008 | Network device command history |
3 | mitre-attack | Network | execution | Network Device CLI | T1059.008 | Network device configuration |
4 | mitre-attack | Network | collection | Network Device Configuration Dump | T1602.002 | Netflow/Enclave netflow |
Do you remember data sources names with a reference to Windows? After splitting the dataframe by platforms, tactics and data sources, are there any macOC or linux techniques that consider windows data sources? Let's identify those rows:
# After splitting the rows of the dataframe, there are some values that relate windows data sources with platforms like linux and masOS.
# We need to identify those rows
conditions = [(techniques_data_3['platform']=='Linux')&(techniques_data_3['data_sources'].str.contains('windows',case=False)== True),
(techniques_data_3['platform']=='macOS')&(techniques_data_3['data_sources'].str.contains('windows',case=False)== True),
(techniques_data_3['platform']=='Linux')&(techniques_data_3['data_sources'].str.contains('powershell',case=False)== True),
(techniques_data_3['platform']=='macOS')&(techniques_data_3['data_sources'].str.contains('powershell',case=False)== True),
(techniques_data_3['platform']=='Linux')&(techniques_data_3['data_sources'].str.contains('wmi',case=False)== True),
(techniques_data_3['platform']=='macOS')&(techniques_data_3['data_sources'].str.contains('wmi',case=False)== True)]
# In conditions we indicate a logical test
choices = ['NO OK','NO OK','NO OK','NO OK','NO OK','NO OK']
# In choices, we indicate the result when the logical test is true
techniques_data_3['Validation'] = np.select(conditions,choices,default='OK')
# We add a column "Validation" to "techniques_data_3" with the result of the logical test. The default value is going to be "OK"
What is the inconsistent data?
techniques_analysis_data_no_ok = techniques_data_3[techniques_data_3.Validation == 'NO OK']
# Finally, we are filtering all the values with NO OK
techniques_analysis_data_no_ok.head()
matrix | platform | tactic | technique | technique_id | data_sources | Validation | |
---|---|---|---|---|---|---|---|
162 | mitre-attack | Linux | defense-evasion | Run Virtual Instance | T1564.006 | Windows Registry | NO OK |
168 | mitre-attack | macOS | defense-evasion | Run Virtual Instance | T1564.006 | Windows Registry | NO OK |
179 | mitre-attack | Linux | defense-evasion | Hidden File System | T1564.005 | Windows Registry | NO OK |
181 | mitre-attack | macOS | defense-evasion | Hidden File System | T1564.005 | Windows Registry | NO OK |
794 | mitre-attack | macOS | defense-evasion | Hidden Window | T1564.003 | PowerShell logs | NO OK |
print('There are ',len(techniques_analysis_data_no_ok),' rows with inconsistent data')
There are 136 rows with inconsistent data
What is the impact of this inconsistent data from a platform and data sources perspective?
df = techniques_with_data_sources
attributes = ['platform','data_sources']
for a in attributes:
s = df.apply(lambda x: pandas.Series(x[a]),axis=1).stack().reset_index(level=1, drop=True)
s.name = a
df=df.drop(a, axis=1).join(s).reset_index(drop=True)
df_2=df.reindex(['matrix','platform','tactic','technique','technique_id','data_sources'], axis=1)
df_3 = df_2.replace(['Process monitoring','Application logs'],['Process Monitoring','Application Logs'])
conditions = [(df_3['data_sources'].str.contains('windows',case=False)== True),
(df_3['data_sources'].str.contains('powershell',case=False)== True),
(df_3['data_sources'].str.contains('wmi',case=False)== True)]
choices = ['Windows','Windows','Windows']
df_3['Validation'] = np.select(conditions,choices,default='Other')
df_3['Num_Tech'] = 1
df_4 = df_3[df_3.Validation == 'Windows']
df_5 = df_4.groupby(['data_sources','platform'])['technique'].nunique()
df_6 = df_5.to_frame().reset_index()
alt.Chart(df_6).mark_bar().encode(x=alt.X('technique', stack="normalize"), y='data_sources', color='platform').properties(height=200)
There are techniques that consider Windows Error Reporting, Windows Registry, and Windows event logs as data sources and they also consider platforms like Linux and masOS. We do not need to consider this rows because those data sources can only be managed at a Windows environment. These are the techniques that we should not consider in our data base:
techniques_analysis_data_no_ok[['technique','data_sources']].drop_duplicates().sort_values(by='data_sources',ascending=True)
technique | data_sources | |
---|---|---|
5953 | OS Credential Dumping | PowerShell logs |
5832 | Remote Services | PowerShell logs |
2814 | Clear Command History | PowerShell logs |
2432 | Credentials from Password Stores | PowerShell logs |
4564 | Peripheral Device Discovery | PowerShell logs |
2271 | Keychain | PowerShell logs |
2259 | Credentials from Web Browsers | PowerShell logs |
2392 | GUI Input Capture | PowerShell logs |
1831 | Impair Command History Logging | PowerShell logs |
794 | Hidden Window | PowerShell logs |
1611 | Hide Artifacts | PowerShell logs |
5431 | Input Capture | PowerShell logs |
5402 | Command and Scripting Interpreter | PowerShell logs |
3206 | Event Triggered Execution | WMI Objects |
4156 | Exploitation of Remote Services | Windows Error Reporting |
4206 | Exploitation for Defense Evasion | Windows Error Reporting |
5361 | Exploitation for Privilege Escalation | Windows Error Reporting |
4241 | Exploitation for Credential Access | Windows Error Reporting |
3212 | Event Triggered Execution | Windows Registry |
5217 | Software Deployment Tools | Windows Registry |
4038 | Service Stop | Windows Registry |
4020 | Inhibit System Recovery | Windows Registry |
5426 | Input Capture | Windows Registry |
3389 | Create or Modify System Process | Windows Registry |
5827 | Remote Services | Windows Registry |
4373 | Browser Extensions | Windows Registry |
162 | Run Virtual Instance | Windows Registry |
2414 | Keylogging | Windows Registry |
1875 | Impair Defenses | Windows Registry |
2599 | Masquerade Task or Service | Windows Registry |
1857 | Disable or Modify Tools | Windows Registry |
2654 | Subvert Trust Controls | Windows Registry |
1824 | Disable or Modify System Firewall | Windows Registry |
1204 | System Services | Windows Registry |
2341 | Modify Authentication Process | Windows Registry |
2722 | Unsecured Credentials | Windows Registry |
179 | Hidden File System | Windows Registry |
2895 | Abuse Elevation Control Mechanism | Windows Registry |
5278 | Indicator Removal on Host | Windows event logs |
5775 | Obfuscated Files or Information | Windows event logs |
5401 | Command and Scripting Interpreter | Windows event logs |
5828 | Remote Services | Windows event logs |
5559 | Scheduled Task/Job | Windows event logs |
5427 | Input Capture | Windows event logs |
2970 | Local Account | Windows event logs |
3202 | Event Triggered Execution | Windows event logs |
4439 | Create Account | Windows event logs |
2602 | Masquerade Task or Service | Windows event logs |
2655 | Subvert Trust Controls | Windows event logs |
4078 | File and Directory Permissions Modification | Windows event logs |
2720 | Unsecured Credentials | Windows event logs |
4022 | Inhibit System Recovery | Windows event logs |
3624 | System Shutdown/Reboot | Windows event logs |
3605 | Account Access Removal | Windows event logs |
2962 | Domain Account | Windows event logs |
4909 | Account Manipulation | Windows event logs |
3388 | Create or Modify System Process | Windows event logs |
Without considering this inconsistent data, the final dataframe is:
techniques_analysis_data_ok = techniques_data_3[techniques_data_3.Validation == 'OK']
techniques_analysis_data_ok.head()
matrix | platform | tactic | technique | technique_id | data_sources | Validation | |
---|---|---|---|---|---|---|---|
0 | mitre-attack | Network | execution | Network Device CLI | T1059.008 | Network device logs | OK |
1 | mitre-attack | Network | execution | Network Device CLI | T1059.008 | Network device run-time memory | OK |
2 | mitre-attack | Network | execution | Network Device CLI | T1059.008 | Network device command history | OK |
3 | mitre-attack | Network | execution | Network Device CLI | T1059.008 | Network device configuration | OK |
4 | mitre-attack | Network | collection | Network Device Configuration Dump | T1602.002 | Netflow/Enclave netflow | OK |
print('There are ',len(techniques_analysis_data_ok),' rows of data that you can play with')
There are 6650 rows of data that you can play with
This function gets techniques' information that includes specific data sources
data_source = 'PROCESS MONITORING'
results = lift.get_techniques_by_datasources(data_source)
len(results)
320
type(results)
list
results2 = lift.get_techniques_by_datasources('pRoceSS MoniTorinG','process commAnd-linE parameters')
len(results2)
336
results2[1]
AttackPattern(type='attack-pattern', id='attack-pattern--2de47683-f398-448f-b947-9abcc3e32fad', created_by_ref='identity--c78cb6e5-0c4b-4611-8297-d1b8b55e40b5', created='2020-10-05T13:24:49.780Z', modified='2020-10-09T16:05:36.344Z', name='Print Processors', description='Adversaries may abuse print processors to run malicious DLLs during system boot for persistence and/or privilege escalation. Print processors are DLLs that are loaded by the print spooler service, spoolsv.exe, during boot. \n\nAdversaries may abuse the print spooler service by adding print processors that load malicious DLLs at startup. A print processor can be installed through the <code>AddPrintProcessor</code> API call with an account that has <code>SeLoadDriverPrivilege</code> enabled. Alternatively, a print processor can be registered to the print spooler service by adding the <code>HKLM\\SYSTEM\\\\[CurrentControlSet or ControlSet001]\\Control\\Print\\Environments\\\\[Windows architecture: e.g., Windows x64]\\Print Processors\\\\[user defined]\\Driver</code> Registry key that points to the DLL. For the print processor to be correctly installed, it must be located in the system print-processor directory that can be found with the <code>GetPrintProcessorDirectory</code> API call.(Citation: Microsoft AddPrintProcessor May 2018) After the print processors are installed, the print spooler service, which starts during boot, must be restarted in order for them to run.(Citation: ESET PipeMon May 2020) The print spooler service runs under SYSTEM level permissions, therefore print processors installed by an adversary may run under elevated privileges.', kill_chain_phases=[KillChainPhase(kill_chain_name='mitre-attack', phase_name='persistence'), KillChainPhase(kill_chain_name='mitre-attack', phase_name='privilege-escalation')], external_references=[ExternalReference(source_name='mitre-attack', url='https://attack.mitre.org/techniques/T1547/012', external_id='T1547.012'), ExternalReference(source_name='Microsoft AddPrintProcessor May 2018', description='Microsoft. (2018, May 31). AddPrintProcessor function. Retrieved October 5, 2020.', url='https://docs.microsoft.com/en-us/windows/win32/printdocs/addprintprocessor'), ExternalReference(source_name='ESET PipeMon May 2020', description='Tartare, M. et al. (2020, May 21). No “Game over” for the Winnti Group. Retrieved August 24, 2020.', url='https://www.welivesecurity.com/2020/05/21/no-game-over-winnti-group/')], object_marking_refs=['marking-definition--fa42a846-8d90-4e51-bc29-71d5b4802168'], x_mitre_contributors=['Mathieu Tartare, ESET'], x_mitre_data_sources=['Process monitoring', 'Windows Registry', 'File monitoring', 'DLL monitoring', 'API monitoring'], x_mitre_detection='Monitor process API calls to <code>AddPrintProcessor</code> and <code>GetPrintProcessorDirectory</code>. New print processor DLLs are written to the print processor directory. Also monitor Registry writes to <code>HKLM\\SYSTEM\\ControlSet001\\Control\\Print\\Environments\\\\[Windows architecture]\\Print Processors\\\\[user defined]\\\\Driver</code> or <code>HKLM\\SYSTEM\\CurrentControlSet\\Control\\Print\\Environments\\\\[Windows architecture]\\Print Processors\\\\[user defined]\\Driver</code> as they pertain to print processor installations.\n\nMonitor for abnormal DLLs that are loaded by spoolsv.exe. Print processors that do not correlate with known good software or patching may be suspicious.', x_mitre_is_subtechnique=True, x_mitre_permissions_required=['Administrator', 'SYSTEM'], x_mitre_platforms=['Windows'], x_mitre_version='1.0')