Posted August 22, 2018
One of the biggest challenges threat analysts face today is assessing the validity of threat intelligence feeds. Many of those feeds deliver large volumes of atomic data that are difficult to relate and correlate in meaningful ways, which in turn makes it difficult to determine how relevant a threat is to the organization.
If an analyst were to search the threat intelligence for the IP address of an attacker’s infrastructure, or for a malware identifier, they would typically receive a deluge of atomic data about that item. However, many intelligence sources do not connect those atomic facts into a cohesive picture that would give the analyst an easy understanding of the entire threat landscape associated with the item.
There are ways to address these challenges. Below, we discuss a few data correlation techniques that help analysts understand the connections within threat intelligence in new and meaningful ways.
Technique #1: The Power of Multi-Order Relationships
One technique is to provide a macroscopic view of the data so that analysts can visualize the connections in data that are not normally directly defined.
This can be achieved by correlating the data so that mutual relationships and connections between two or more elements are formed. By applying this correlation technique across multiple orders of relationships, we find connections within the data that would not otherwise be discovered.
The analyst gains more insight from the threat intelligence when transitive relationships are found, including second or even third order relationships discovered through intermediary elements.
The diagram below shows the discovery of multi-order relationships, correlated from five or more distinct threat intelligence data facts where no direct relationships were initially identified as a connection.
In this example, the analyst searches for information on foobar.com as a potentially malicious domain.
Based on the direct relationship defined in the threat intelligence data, the domain resolves to 10.0.0.1. From that fact alone, the analyst may simply conclude that the domain is non-malicious.
However, by traversing second, third, and fourth order relationships across five distinct data facts, we discover that zork.com also resolves to 10.0.0.1 as well as to 192.168.1.1, an address that Zeus malware communicated with directly.
Given this correlation of relationships, the analyst is able to determine a much higher likelihood that 10.0.0.1 is a command and control server for Zeus.
This is not a conclusion that would be discovered through most threat intelligence systems that present at most one or two orders of fixed pre-defined relationships.
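The traversal described above can be sketched as a breadth-first search over a graph built from atomic facts. This is only an illustration, not LookingGlass's implementation: the fact tuples, the relation names, and the helpers `build_graph` and `find_path` are assumptions made for the example, and the fact list is a simplified version of the foobar.com scenario.

```python
from collections import defaultdict, deque

# Hypothetical atomic facts modeled on the example above:
# (subject, relation, object). Relation names are illustrative only.
FACTS = [
    ("foobar.com", "resolves-to", "10.0.0.1"),
    ("zork.com", "resolves-to", "10.0.0.1"),
    ("zork.com", "resolves-to", "192.168.1.1"),
    ("Zeus", "communicates-with", "192.168.1.1"),
]


def build_graph(facts):
    """Treat every fact as an undirected edge between two elements."""
    graph = defaultdict(set)
    for subject, _relation, obj in facts:
        graph[subject].add(obj)
        graph[obj].add(subject)
    return graph


def find_path(graph, start, goal, max_order=4):
    """Breadth-first search that stops expanding beyond max_order hops."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        if len(path) - 1 >= max_order:
            continue  # already at the maximum relationship order
        for neighbor in sorted(graph[node]):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None


graph = build_graph(FACTS)
path = find_path(graph, "foobar.com", "Zeus")
print(" -> ".join(path))
```

With `max_order` limited to one or two hops, the same search returns no path, which mirrors the limitation of systems that only present fixed first or second order relationships.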
If we expand upon this example across distinct attributes within the data, we can start to see how this technique provides insight across many artifacts. For example, below we show a data set containing a single fact record that asserts relationships between a Threat, a Country, an FQDN, three file Hashes, and IPv4 address 18.104.22.168, which is one of Google’s public DNS resolvers.
Searching for the IP address 22.214.171.124 returns 3,544 attribute-value pairs, from all sources. This includes data such as ASN network relationships, malware, filenames, and FQDNs – all somehow related to this IP address.
If you consider this data over longer periods of historical collection (e.g. 90 days), the resulting data set runs to tens of billions of individual data facts, which makes it extremely difficult for analysts to process without some way to view the data in a more connected manner.
Technique #2: Data Normalization Is Key to Finding Connected Data
A common challenge when correlating threat intelligence to identify connected data is that the data facts need to be normalized across multiple feeds and multiple sources.
Determining connected data for a country, Great Britain for example, can be difficult when there are several possible representations that semantically mean the same country:
- United Kingdom
- United Kingdom of Great Britain and Northern Ireland (official name)
- UK (common two-letter acronym)
- Great Britain
- GB (two-letter country code)
- GBR (three-letter country code)
- 826 (numeric identifier)
If we want to relate and correlate on Great Britain, data from all sources referring to this country must be normalized to a single attribute with a single value representation, preferably a record reference rather than an explicit string value. This gives us a reference to an entity record that, in turn, contains all the possible representations of the entity name, such as the United Kingdom.
Using references to a network element, rather than values that represent the element, gives each element a unique identifier. With unique identifiers that are not typed and do not need to be interpreted, we can communicate and store relationships between elements, such as within, contains, or related-to.
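As a rough sketch of this idea, the snippet below maps every representation from the list above to one opaque entity identifier. The identifier `country-0001`, the feed names, and the helper `normalize_country` are hypothetical and exist only for illustration.

```python
# Hypothetical entity record: one opaque identifier that owns every
# known representation of the country (taken from the list above).
COUNTRY_ENTITIES = {
    "country-0001": {
        "United Kingdom",
        "United Kingdom of Great Britain and Northern Ireland",
        "UK",
        "Great Britain",
        "GB",
        "GBR",
        "826",
    },
}

# Reverse index from a case-folded representation to its entity identifier.
ALIAS_INDEX = {
    alias.casefold(): entity_id
    for entity_id, aliases in COUNTRY_ENTITIES.items()
    for alias in aliases
}


def normalize_country(raw_value):
    """Return the entity identifier for a raw feed value, or None if unknown."""
    return ALIAS_INDEX.get(str(raw_value).strip().casefold())


# Three made-up feeds that describe the same country in different ways.
feed_records = [
    {"source": "feed-a", "country": "GB"},
    {"source": "feed-b", "country": "united kingdom"},
    {"source": "feed-c", "country": 826},
]
normalized = [normalize_country(record["country"]) for record in feed_records]
print(normalized)
```

Because every record now carries the same reference, facts from different feeds can be joined on the identifier instead of on a string that each feed spells differently.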
Technique #3: The Importance of Source Attribution and Time
To get the most out of connected data, every data fact needs to:
- Be attributable to a source.
- Contain the following temporal data:
  - When the relationship or attributes were observed-at.
  - When the relationship or attributes were asserted-at.
We make a distinction as to when something was observed-at and asserted-at:
- Observed-at is when something happened. For example, when Conficker showed up on a system.
- Asserted-at is a time at which the data asserted that the observation was still true. For example, if Conficker continues to be seen on a system across ten days, we would have the same observed-at but ten different asserted-at records, each asserting that the fact still held at that time.
By separating observed-at time from asserted-at time, the analyst is able to deduce which data is new and which is duplicate data that keeps recurring. This helps with prioritization and with focusing on new threats.
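One way to picture the distinction is to collapse repeated assertions of the same observation into a single record. The sketch below assumes a simple dictionary layout for facts; the field names, host names, dates, and the helpers `collapse_assertions` and `new_since` are invented for the example.

```python
from collections import defaultdict
from datetime import datetime, timedelta

first_seen = datetime(2018, 8, 1)

# Hypothetical facts: Conficker observed once on host-42 and then
# re-asserted daily for ten days, plus one newer observation.
facts = [
    {
        "subject": "host-42",
        "relation": "infected-by",
        "object": "Conficker",
        "observed_at": first_seen,
        "asserted_at": first_seen + timedelta(days=day),
    }
    for day in range(10)
]
facts.append(
    {
        "subject": "host-7",
        "relation": "infected-by",
        "object": "Zeus",
        "observed_at": datetime(2018, 8, 9),
        "asserted_at": datetime(2018, 8, 9),
    }
)


def collapse_assertions(facts):
    """Group facts by what was observed and when; summarize the assertions."""
    grouped = defaultdict(list)
    for fact in facts:
        key = (fact["subject"], fact["relation"], fact["object"], fact["observed_at"])
        grouped[key].append(fact["asserted_at"])
    return {
        key: {
            "first_asserted": min(times),
            "last_asserted": max(times),
            "assertions": len(times),
        }
        for key, times in grouped.items()
    }


def new_since(collapsed, cutoff):
    """Keep only observations whose observed-at is on or after the cutoff."""
    return [key for key in collapsed if key[3] >= cutoff]


collapsed = collapse_assertions(facts)
recent = new_since(collapsed, datetime(2018, 8, 8))
print(len(collapsed), [key[2] for key in recent])
```

Here eleven raw records reduce to two observations, and only the later one counts as new relative to the cutoff, even though the Conficker fact was still being asserted after that date.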
Where every data fact is attributable to a source and timestamp, the analyst can identify which sources of data may be trusted more than others.
For example, the query below shows feed sources with data about IPv4 126.96.36.199.
If the analyst decides that they trust Emerging Threats over other sources, then the ability to easily filter data by source and time becomes important, as shown below.
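In code, a source-and-time filter of this kind might look like the sketch below. Only the source name Emerging Threats comes from the text; the record layout, the second feed name, the labels, the dates, and the documentation-range address 203.0.113.10 are assumptions for illustration, as is the helper `filter_facts`.

```python
from datetime import datetime

# Hypothetical fact records about one indicator from two sources.
facts = [
    {
        "source": "Emerging Threats",
        "indicator": "203.0.113.10",
        "label": "c2",
        "asserted_at": datetime(2018, 8, 20),
    },
    {
        "source": "Emerging Threats",
        "indicator": "203.0.113.10",
        "label": "scanner",
        "asserted_at": datetime(2018, 5, 1),
    },
    {
        "source": "example-feed",
        "indicator": "203.0.113.10",
        "label": "spam",
        "asserted_at": datetime(2018, 8, 21),
    },
]


def filter_facts(facts, indicator, trusted_sources, since):
    """Keep facts about the indicator from trusted sources asserted since a date."""
    return [
        fact
        for fact in facts
        if fact["indicator"] == indicator
        and fact["source"] in trusted_sources
        and fact["asserted_at"] >= since
    ]


recent_trusted = filter_facts(
    facts,
    indicator="203.0.113.10",
    trusted_sources={"Emerging Threats"},
    since=datetime(2018, 8, 1),
)
print([fact["label"] for fact in recent_trusted])
```

This kind of filter is only possible because every fact carries both a source and a timestamp; without them, the three records would be indistinguishable in terms of trust and freshness.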
Analysts are under pressure to be able to understand what threat intelligence is telling them about a particular threat or risk to their organization. By using the techniques outlined above, analysts will be able to more accurately and quickly determine if something is relevant or a threat to their organization. Gaining more context by leveraging connected data can help analysts have more confidence about prioritizing potential threats.
If you would like to learn more about the techniques LookingGlass uses in its threat processing and analytical approaches, please connect with us.