Security Detection Language Proposal #1073

davidmagnotti · 2024-05-02T19:27:23Z

davidmagnotti
May 2, 2024

Contributors: David Magnotti, Rajas Panat, Tim Vidas, Aldrin DSouza, Mike Artz

Overview

I propose expanding Open Cybersecurity Schema Framework (OCSF) to include a SQL-compliant security detection language stored in a JSON blob. The reason to include this as part of OCSF is to facilitate portable stateless and stateful detections for threat detection purposes.

Design

The rules are stored as JSON to support portability. JSON parsing is built in or readily available in Python, Java, PHP, JavaScript, and many other languages.

The actual detection language logic is proposed to be implemented as SQL queries, further enabling portability. SQL provides language capabilities that support creation of both stateless (singular events) and stateful (multiple events) detection logic.

Rule Schema

Rules are designed to run against OCSF-crafted data models and generate detections (or classifications) for events. Rules adhere to the template method design pattern.

The UML representation of the rule schema is shown below. Per the UML specification, [0..1] indicates that the field is optional.

Rule

+ author: string
+ created_on: string
+ description: string
+ reference: string [0..1]
+ technique: string
+ subtechnique: string
+ annotation: string
+ filter: string
+ tags: list [0..1]

The technique and subtechnique fields are stored as strings. Their values are compliant with the MITRE ATT&CK Framework.
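As a rough illustration of the schema above, a consumer could load a rule from JSON and check that the required fields are present. This is a minimal sketch, not part of the proposal: the names Rule, REQUIRED_FIELDS, and load_rule are hypothetical, and it assumes that tags is a list of strings, which the schema does not specify.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional

# Fields without a [0..1] multiplicity in the UML listing are treated as required.
REQUIRED_FIELDS = ("author", "created_on", "description", "technique",
                   "subtechnique", "annotation", "filter")
OPTIONAL_FIELDS = ("reference", "tags")


@dataclass
class Rule:
    author: str
    created_on: str
    description: str
    technique: str
    subtechnique: str
    annotation: str
    filter: str
    reference: Optional[str] = None
    # Assumption: tags hold strings; the schema only says "list".
    tags: List[str] = field(default_factory=list)


def load_rule(text: str) -> Rule:
    """Parse a JSON rule and reject missing or unknown fields."""
    data = json.loads(text)
    missing = [name for name in REQUIRED_FIELDS if name not in data]
    if missing:
        raise ValueError(f"rule is missing required fields: {missing}")
    unknown = sorted(set(data) - set(REQUIRED_FIELDS) - set(OPTIONAL_FIELDS))
    if unknown:
        raise ValueError(f"rule has unknown fields: {unknown}")
    return Rule(**data)


example = json.dumps({
    "author": "David",
    "created_on": "2023-10-10",
    "description": "Example rule",
    "technique": "Initial Access",
    "subtechnique": "Exploit Public-Facing Application",
    "annotation": "CVE-2023-4966",
    "filter": "SELECT * FROM http_activity",
})
rule = load_rule(example)
print(rule.annotation, rule.reference, rule.tags)
```

Rejecting unknown fields is a design choice made for this sketch; a real consumer might prefer to ignore them for forward compatibility.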

The filter field is a SQL string, compliant with ISO/IEC 9075:2023, that queries an OCSF-compliant schema. Each column that is selected or filtered maps to a field name from OCSF. Stateless rules should not use joins or aggregations, which rules out the keywords JOIN and GROUP BY, and should not use sub-queries.
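The stateless restriction could be checked mechanically before a rule is accepted. The following is a heuristic sketch only: is_stateless and strip_string_literals are hypothetical helpers, and a keyword scan is not a substitute for a real SQL parser.

```python
import re

# Keywords that the proposal reserves for stateful rules.
_STATEFUL_KEYWORDS = re.compile(r"\b(JOIN|GROUP\s+BY)\b", re.IGNORECASE)
_SELECT = re.compile(r"\bSELECT\b", re.IGNORECASE)


def strip_string_literals(sql: str) -> str:
    """Blank out single-quoted literals so their contents are not scanned."""
    return re.sub(r"'(?:[^']|'')*'", "''", sql)


def is_stateless(filter_sql: str) -> bool:
    """Return True if the filter has no JOIN, GROUP BY, or nested SELECT."""
    sql = strip_string_literals(filter_sql)
    if _STATEFUL_KEYWORDS.search(sql):
        return False
    # More than one SELECT is taken to mean that a sub-query is present.
    return len(_SELECT.findall(sql)) <= 1


stateless_filter = ("SELECT * FROM dns_activity "
                    "WHERE query.type='AAAA' and query.hostname LIKE 'www6.%'")
stateful_filter = ("SELECT * FROM process_activity "
                   "JOIN process_activity p1 ON p1.actor.process = process_activity.process.id")
print(is_stateless(stateless_filter), is_stateless(stateful_filter))
```

String literals are blanked first so that a value such as 'join' inside a WHERE clause does not cause a false rejection.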

An example of a stateless rule designed to detect exploitation of CVE-2023-4966 in an HTTP record:

{
    "author": "David",
    "created_on": "2023-10-10",
    "description": "Sensitive information disclosure vulnerability affecting Citrix ADC and NetScaler devices",
    "reference": "https://nvd.nist.gov/vuln/detail/CVE-2023-4966",
    "technique": "Initial Access",
    "subtechnique": "Exploit Public-Facing Application",
    "annotation": "CVE-2023-4966",
    "filter": "SELECT * FROM http_activity WHERE http_request.url.path='/oauth/idp/.well-known/openid-configuration'"
}

Another example of a stateless rule designed to detect CobaltStrike beacon via a DNS record:

{
    "author": "David",
    "created_on": "2022-12-16",
    "description": "Detects CobaltStrike beacon via AAAA record query",
    "reference": "https://blog.sekoia.io/hunting-and-detecting-cobalt-strike/#h-beaconing-network-traffic",
    "technique": "Command and Control",
    "subtechnique": "Application Layer Protocol",
    "annotation": "CobaltStrike DNS beacon",
    "filter": "SELECT * FROM dns_activity WHERE query.type='AAAA' and query.hostname LIKE 'www6.%'"
}
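The filters above address nested OCSF attributes with dotted paths such as query.type. One way an engine might evaluate them is to flatten each event into dotted column names first. The sketch below uses Python's standard sqlite3 module and makes several assumptions that are not part of the proposal: the flatten helper is hypothetical, the hostnames are made-up test values, and the filter is adapted to double-quote the dotted identifiers, since SQLite would otherwise read query.type as table.column.

```python
import sqlite3


def flatten(event, prefix=""):
    """Flatten a nested dict into {'a.b.c': value} form."""
    flat = {}
    for key, value in event.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, path + "."))
        else:
            flat[path] = value
    return flat


# Made-up DNS events, reduced to the two attributes the rule needs.
events = [
    {"query": {"type": "AAAA", "hostname": "www6.example.test"}},
    {"query": {"type": "A", "hostname": "www6.example.test"}},
    {"query": {"type": "AAAA", "hostname": "mail.example.test"}},
]
rows = [flatten(event) for event in events]
columns = sorted(rows[0])  # ['query.hostname', 'query.type']

conn = sqlite3.connect(":memory:")
quoted_columns = ", ".join(f'"{name}"' for name in columns)
conn.execute(f"CREATE TABLE dns_activity ({quoted_columns})")
placeholders = ", ".join("?" for _ in columns)
conn.executemany(
    f"INSERT INTO dns_activity VALUES ({placeholders})",
    [tuple(row[name] for name in columns) for row in rows],
)

# Same logic as the rule's filter, with dotted identifiers quoted for SQLite.
adapted_filter = (
    "SELECT * FROM dns_activity "
    "WHERE \"query.type\"='AAAA' AND \"query.hostname\" LIKE 'www6.%'"
)
matches = conn.execute(adapted_filter).fetchall()
print(matches)
```

Whether rule authors or the engine should be responsible for this identifier quoting is left open here.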

Considerations

Why SQL?

In order to facilitate support for stateful processing, rather than building custom matching logic, we could choose to leverage an existing language such as SQL. Many of the semantics we would need to support, such as aggregations and filtering, are natively supported in SQL. Additionally, support for SQL is highly available across technology stacks and platforms, so parsing and conversion of rules between languages is better supported than alternative languages.

A stateful rule that detects a process chain sequence of nginx followed by wget executing bash:


{
 "author": "David",
 "created_on": "2024-01-05",
 "description": "Process chain sequence for nginx followed by wget and then bash",
 "reference": "https://docs.aws.amazon.com/guardduty/latest/ug/findings-runtime-monitoring.html",
 "technique": "Initial Access",
 "subtechnique": "Exploit Public-Facing Application",
 "annotation": "Potential shell spawned from web application",
 "filter": "SELECT * FROM process_activity JOIN process_activity p1 on p1.actor.process = process_activity.process.id WHERE activity_id=1 AND process_activity.actor.process.file.name='nginx' and process_activity.process.file.name='wget' and p1.process.file.name='bash'"
}
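To make the self-join concrete, here is a simplified sketch of the same idea using Python's standard sqlite3 module. It is not the rule above: the flat column names (actor_pid, actor_name, pid, name) and the sample process IDs are hypothetical stand-ins for the nested OCSF attributes, and activity_id 1 is assumed to mean a process launch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE process_activity (
        activity_id INTEGER,
        actor_pid   INTEGER,  -- process that performed the launch
        actor_name  TEXT,
        pid         INTEGER,  -- process that was launched
        name        TEXT
    )
""")
conn.executemany(
    "INSERT INTO process_activity VALUES (?, ?, ?, ?, ?)",
    [
        (1, 100, "nginx", 200, "wget"),  # nginx launches wget
        (1, 200, "wget", 300, "bash"),   # that wget launches bash
        (1, 100, "nginx", 400, "ls"),    # unrelated child of nginx
        (1, 500, "sshd", 600, "bash"),   # bash from an unrelated parent
    ],
)

# Join each launch event to launches performed by the process it created.
query = """
    SELECT parent.actor_name, parent.name, child.name
    FROM process_activity AS parent
    JOIN process_activity AS child ON child.actor_pid = parent.pid
    WHERE parent.activity_id = 1
      AND child.activity_id = 1
      AND parent.actor_name = 'nginx'
      AND parent.name = 'wget'
      AND child.name = 'bash'
"""
chains = conn.execute(query).fetchall()
print(chains)
```

Only the nginx, wget, bash chain is returned; the bash launched by sshd is excluded because its parent event does not satisfy the join condition.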

We did consider SQL extensions that provide greater capability for manipulating semi-structured data such as JSON, for example Amazon’s SQL-compliant extension, PartiQL. PartiQL provides some advantages over traditional SQL for filtering on JSON data. The disadvantage of these extensions is that they harm portability across technology stacks. Therefore, to optimize for portability, we elected to use a SQL dialect that does not provide custom functions.

Why JSON?

JavaScript Object Notation (JSON) is an efficient and portable notation for representing data. While other notations such as YAML Ain't Markup Language (YAML) provide advantages such as being simpler to write, because keys and values do not need to be wrapped in double quotes, YAML is not supported by the standard library of languages such as Python. Other alternatives such as Extensible Markup Language (XML), while portable, tend to be highly verbose, resulting in rules that are difficult to read and write.

What about Existing languages like Sigma, YARA-L, Roota?

Existing languages such as Sigma, YARA-L, and Roota each provide capabilities for storing security detection rules. However, all of the rules are written to be stateless, only able to operate on singular events. Additionally, each language is written in YAML, which may hinder portability.

Another challenge with Sigma and YARA-L specifically is the lack of support for OCSF. While Roota intends to solve that challenge by combining Sigma with OCSF, it still doesn’t provide a mechanism for creating stateful rules, nor is it as portable as a JSON-based language.

Implementation

How do I use the rules?

There are a variety of approaches for consumption of the rules. For a Python-based implementation, a data science or security analyst user can quickly leverage the rules to evaluate criteria against a pandas DataFrame through the use of the third-party library pandasql.

Using the provided example rule to identify exploitation attempts of CVE-2024-28255:

{
  "author": "Peter",
  "created_on": "2024-04-22",
  "description": "Pre-authentication remote code execution vulnerability in OpenMetadata.",
  "reference": "https://nvd.nist.gov/vuln/detail/CVE-2024-28255",
  "technique": "Initial Access",
  "subtechnique": "Exploit Public-Facing Application",
  "annotation": "CVE-2024-28255",
  "filter": "SELECT * FROM http_activity WHERE path LIKE '/api/v1;v1%'"
}

We can create a script that populates an http_activity pandas DataFrame with test data and evaluates the given rule against it, like so:

import pandas
import pandasql

rule = {
  "author": "Peter",
  "created_on": "2024-04-22",
  "description": "Pre-authentication remote code execution vulnerability in OpenMetadata.",
  "reference": "https://nvd.nist.gov/vuln/detail/CVE-2024-28255",
  "technique": "Initial Access",
  "subtechnique": "Exploit Public-Facing Application",
  "annotation": "CVE-2024-28255",
  "filter": "SELECT * FROM http_activity WHERE path LIKE '/api/v1;v1%'"
}

# Sample request paths; only one of them matches the rule's LIKE pattern.
test_data = [
    "/index.html",
    "/home",
    "/",
    "/index.html",
    "/login",
    "/help",
    "/api/v1;v1%2fusers%2flogin/events/subscriptions/validation/condition/T(java.lang.Runtime).getRuntime().exec(new%20java.lang.String(T(java.util.Base64).getDecoder().decode('ZWNobyAnQ1ZFLTIwMjQtMjgyNTUn')))",
    "/index.html",
    "/",
    "/logout",
]

# The key used below must match the table name in the rule's filter.
http_activity = pandas.DataFrame(test_data, columns=['path'])

# Pass the DataFrame explicitly rather than relying on scope lookup.
result = pandasql.sqldf(rule['filter'], {'http_activity': http_activity})
print(result)
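For environments without pandas, a similar evaluation loop can be sketched with Python's standard sqlite3 module. This is an illustration under stated assumptions, not part of the proposal: run_rules and the shape of the detection record it returns are hypothetical, and the table holds only a single made-up path column.

```python
import sqlite3


def run_rules(conn, rules):
    """Run each rule's filter and tag every matching row with rule metadata."""
    detections = []
    for rule in rules:
        cursor = conn.execute(rule["filter"])
        columns = [col[0] for col in cursor.description]
        for row in cursor.fetchall():
            detections.append({
                "annotation": rule["annotation"],
                "technique": rule["technique"],
                "subtechnique": rule["subtechnique"],
                "event": dict(zip(columns, row)),
            })
    return detections


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE http_activity (path TEXT)")
conn.executemany(
    "INSERT INTO http_activity VALUES (?)",
    [("/index.html",), ("/api/v1;v1%2fusers%2flogin/events",), ("/logout",)],
)

rules = [{
    "technique": "Initial Access",
    "subtechnique": "Exploit Public-Facing Application",
    "annotation": "CVE-2024-28255",
    "filter": "SELECT * FROM http_activity WHERE path LIKE '/api/v1;v1%'",
}]

detections = run_rules(conn, rules)
for detection in detections:
    print(detection["annotation"], detection["event"]["path"])
```

Attaching the annotation and technique to each match is one possible way to turn a filter result into a detection; the proposal itself does not define an output format.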

Do you support expanding OCSF to include this detection schema?

Yes: 11%
Yes (with changes): 0%
No: 88%

9 votes

mavam · 2024-05-03T04:39:22Z

mavam
May 3, 2024
Collaborator

While building a common detection language is a great idea in principle that I find intriguing, it does not fit into the mission of the OCSF.

Therefore, I strongly vote against this proposal, as it would dilute the focus. OCSF is an event taxonomy that defines representation and semantics of data. This mission is big enough already. The whole point is to make it agnostic of downstream analytics, not couple it to one particular one.

(There are numerous reasons for why not SQL and not JSON, but that's a separate topic.)

1 reply

floydtree May 3, 2024
Maintainer

Not to speak on behalf of the author, but a few questions regarding your observations -

--

"...it does not fit into the mission of the OCSF. ...as it would dilute the focus."

Any specific reason why you think so? Diluting focus vs Increasing scope, just two ways to look at it.

"The whole point is to make it agnostic of downstream analytics, not couple it to one particular one."

I wonder why you would conclude that the aim is to couple it to this particular detection lang.

"(There are numerous reasons for why not SQL and not JSON, but that's a separate topic.)"

Would like to hear them. In fact that would be one of the primary topics for discussion from my perspective.

mavam · 2024-05-05T08:58:16Z

mavam
May 5, 2024
Collaborator

"Any specific reason why you think so? Diluting focus vs Increasing scope, just two ways to look at it."

It seems like a case of scope creep. OCSF is a format for encoding data. The mission is to establish it as a layer independent of downstream analytics.

"The whole point is to make it agnostic of downstream analytics, not couple it to one particular one."

"I wonder why you would conclude that the aim is to couple it to this particular detection lang."

I'm not quite sure I understand the point. If you include a detection language in the standard, the two are fundamentally coupled, no? Otherwise there is no point in proposing a language within OCSF.

"Would like to hear them. In fact that would be one of the primary topics for discussion from my perspective."

The TL;DR is that security people are not data people.

SQL: There's a small intersection, but the large majority of SOC analysts are not at the level where they write SQL. The early Splunk was a sweet spot in terms of UX for this audience. See PRQL for a natural extension.
JSON: It's a reasonable exchange format, but not something you want to edit or read as a human. As a detection engineer, I would want both to be convenient.

0 replies

lcostantino · 2024-05-07T14:04:45Z

lcostantino
May 7, 2024

Could we just rely on Sigma and create the proper transformer to OCSF and whatever format is needed based on how data is stored / processed? (e.g., SQL, jq, etc.)

0 replies

davemcatcisco · 2024-05-08T23:30:08Z

davemcatcisco
May 8, 2024
Collaborator

SQL vs GQL?

As somebody who started his programming career in the mid-90s when the RDBMS was king, I almost instinctively tend towards an SQL JOIN-style way of expressing relationships between things. So it almost pains me to say the following. There are good reasons why graph databases came to popularity and why languages like Cypher, GQL, etc. are increasingly used in contexts where SQL might previously have been the obvious choice. It's probably not a good use of anyone's time to start enumerating pros and cons here because I think we're all equally adept at using Google, ChatGPT, etc.

As well as the above "well I wouldn't start with SQL in the first place" argument, I'll also raise a few specific issues which don't appear to have been covered in the proposal as-is.

Many OCSF events and objects have attributes which are arrays. For example, consider a detection rule that uses the actor.process.lineage array to check if the grandparent or great-grandparent is winword.exe. What would this look like in SQL? If the answer is something like WHERE actor.process.lineage[-3] LIKE '%\winword.exe' OR actor.process.lineage[-4] LIKE '%\winword.exe' then I think we would not be in a good place! And lineage is a particularly simple case of an array of strings. What about arrays of objects?
I have to wonder how portable these detection rules would actually be. Consider a rule that looks for use of an inter-process memory write of 526 bytes by doing something like SELECT * FROM memory_activity WHERE size = 526 AND activity_name ='Write'. How will that work in an endpoint product which, for efficiency reasons, doesn't use the optional human-readability text attributes like activity_name and uses only the required numeric attributes like activity_id? Is the detection engine in the endpoint product going to have to temporarily hydrate/enrich the events to include the optional text attributes?

Do we need another standard?

Then there's the whole question as to whether we even need another standard way to share detection rules. As my colleague Leandro has pointed out above, we already have Sigma, i.e. a "standard" way to express detection rules. Sigma doesn't make any assumptions about how a rule is actually evaluated. The rules are abstract and compiled to whatever detection engine, SIEM, etc. one is using. This gives huge flexibility to vendors in terms of how and where rules are evaluated.

In contrast, the SQL-based syntax proposed here appears to assume (or at best hints strongly at) an implementation in which events are modelled as collections of flattened key-value pairs which are then projected into Relational Land using something like SQLite's virtual table mechanism.

Is this what OCSF is about?

At the risk of being a bit blunt, I think this is an overreach and a distraction (albeit an interesting one). We should focus our efforts on getting to a comprehensive schema for the exchange of security information. We are a long way from there right now. If and when that happy day arrives, only then should we perhaps think about a standard way to describe the kind of things we actually do with this security information.

0 replies
