Cybersecurity with LITL: Unsupervised learning on unlabeled data? Why bother?

active learning, cybersecurity, LITL

Data is not the problem, but once you want to make use of it, labels prove to be one. How can AI help? Can active learning support cybersecurity operations? What does LITL stand for?

Data, in principle, is not the problem.

In the era of widespread digitization, we collect vast amounts of data every day—not to mention the data we don’t record but could.

The challenge comes when we want to use this data in our decision-making processes. To manage such a task, we usually turn to machine learning. However, machine learning models are most effective when they are first shown enough data, with each observation paired with a label appropriate to the problem we are interested in. But where do we get labels for data that does not come with them out of the box?

You can try to get around this problem by replacing supervised models with unsupervised ones. But then you are limited to a small number of use cases, and modern self-supervised learning models usually require a lot of work, research, and computing power to create.

Outsourcing the tagging of entire data sets to experts is not a good idea either. Crowdsourced tagging services work well for simple labels, but when the problem requires specialized knowledge, as in cybersecurity, we need the support of domain experts. And experts, as a rule, are expensive, and their time is limited. This is where active learning comes in.

Active learning is an approach to machine learning that assumes the data starts out without labels. The goal of an active learning system is to select, from the set of all available unlabeled data, the observations whose labeling by experts will benefit the supervised model of interest the most after training. In practice, the most interesting observations are often those where the current model is least certain (e.g., near the decision boundary of the classifier in a classification problem). Additional criteria, such as diversity, are also applied so that the selection does not get stuck in a local minimum.
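To make the idea of selecting observations near the decision boundary concrete, here is a minimal sketch of entropy-based uncertainty sampling. It is only an illustration: the function names and the probability values are made up for this example, and it does not show how LITL itself scores observations.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_most_uncertain(probabilities, k):
    """Return the indices of the k observations with the highest predictive entropy."""
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: entropy(probabilities[i]),
                    reverse=True)
    return ranked[:k]

# Hypothetical classifier outputs, [P(benign), P(threat)], for four unlabeled events.
predicted = [
    [0.98, 0.02],  # confident: benign
    [0.55, 0.45],  # close to the decision boundary
    [0.10, 0.90],  # fairly confident: threat
    [0.50, 0.50],  # exactly on the decision boundary
]
to_label = select_most_uncertain(predicted, k=2)
print(to_label)  # indices of the events to send to experts first
```

Used alone, this criterion tends to pick many near-identical borderline observations, which is why it is usually combined with criteria such as diversity.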

In active learning, we want to label only part of the data while obtaining the best possible model. Otherwise, we would not need intelligent selection of observations; we could feed them to the experts in random order. The resulting model can then be used to make predictions on the observations that interest us, with active learning serving as a mechanism to continually retrain it.

Another use of the model “taken out” of the active learning loop is to make predictions on the entire dataset for which we have no labels, and use predictions for which the model shows low uncertainty as target labels for that data.
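A minimal sketch of this second use, often called pseudo-labeling, might look as follows. The confidence threshold of 0.95, the class names, and the function name are assumptions chosen for illustration only.

```python
def pseudo_label(probabilities, class_names, min_confidence=0.95):
    """Return {index: label} for predictions whose top probability reaches min_confidence."""
    labels = {}
    for i, probs in enumerate(probabilities):
        best = max(range(len(probs)), key=lambda c: probs[c])
        if probs[best] >= min_confidence:
            labels[i] = class_names[best]
    return labels

# Hypothetical model outputs, [P(benign), P(threat)], for three unlabeled events.
predicted = [[0.99, 0.01], [0.60, 0.40], [0.03, 0.97]]
auto_labels = pseudo_label(predicted, class_names=["benign", "threat"])
print(auto_labels)  # the uncertain middle event is left unlabeled
```

Observations that do not reach the threshold stay unlabeled and remain candidates for expert labeling.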

LITL – Labeling in the loop for your data

LITL (Label in the Loop) is our answer to the lack of reliable active learning systems on the market. It is a set of components for running an active learning loop on given data. Among other things, it consists of:

  • a module for estimating the quality of experts,
  • a module for selecting an appropriate expert to label an observation,
  • a module for selecting a batch of observations to label—the heart of the LITL system.

When creating LITL, we assumed that experts are imperfect. In the classical academic approach, the label of an observation is given by an infallible oracle. This is far from reality: real experts differ in knowledge and in psychophysical condition, and sometimes they are just plain wrong.

Our work resulted in expert quality estimation modules and, based on them, a module that selects an expert for each observation. This allows each observation to be assigned to the best possible expert. In addition, our expert modules are robust to changes in the pool of experts (the arrival of new experts or the departure of old ones) as well as to changes in experts’ competencies (e.g., caused by an expert’s training).

Experts interacting with the system use an ergonomic interface that allows them to label multimodal data. While working, in addition to the required labels, an expert can attach additional information to an observation, such as a new label. The system administrator, seeing such a label appear regularly, can then decide whether to add it to the permanent label collection.
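One simple way to picture expert quality estimation is a running score per expert that is updated after every label and forgets old behavior gradually. The sketch below is a hypothetical scheme, not LITL's actual algorithm: the class name, the prior of 0.5, the decay of 0.9, and the use of agreement with a consensus label as feedback are all assumptions.

```python
class ExpertPool:
    """Keeps a running quality estimate per expert (illustrative scheme only)."""

    def __init__(self, prior=0.5, decay=0.9):
        self.prior = prior   # quality assumed for an expert with no history
        self.decay = decay   # weight kept by the old estimate at each update
        self.quality = {}

    def add_expert(self, name):
        # New experts start at the prior; existing estimates are left untouched.
        self.quality.setdefault(name, self.prior)

    def remove_expert(self, name):
        self.quality.pop(name, None)

    def update(self, name, agreed_with_consensus):
        """Move the estimate towards 1 after agreement and towards 0 after disagreement."""
        old = self.quality.get(name, self.prior)
        outcome = 1.0 if agreed_with_consensus else 0.0
        self.quality[name] = self.decay * old + (1 - self.decay) * outcome

    def best_expert(self, available):
        """Pick the available, known expert with the highest current estimate."""
        candidates = [n for n in available if n in self.quality]
        return max(candidates, key=lambda n: self.quality[n])

pool = ExpertPool()
for name in ("alice", "bob"):
    pool.add_expert(name)
for _ in range(5):
    pool.update("alice", agreed_with_consensus=True)
    pool.update("bob", agreed_with_consensus=False)
print(pool.best_expert(["alice", "bob"]))
```

Because old evidence decays exponentially, the estimate follows an expert whose competence changes, and experts can join or leave the pool at any time. A realistic system would likely estimate quality per label or per type of observation rather than with a single number.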

Observations sent to the experts are selected by the batch selection module. It consists of a set of generic criteria aggregated to obtain the score of relevance of an observation from the unlabeled data set. These criteria provide information including:

  • the uncertainty of the model on a given observation,
  • the representativeness of a given observation against similar observations,
  • the diversity of a given observation against previously selected observations.

This ensures that experts are not continuously labeling similar observations. Additionally, the criteria are robust—as is the entire batch selection module—to newly arriving data (including data drift) and the appearance of new targets in the data.
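The interplay of the three criteria can be sketched as a greedy selection over a weighted sum. Everything in this example is an assumption made for illustration (the function name, the equal weights, the similarity formula, and the toy data); it is not the aggregation LITL uses.

```python
import math

def select_batch(features, uncertainty, batch_size, weights=(1.0, 1.0, 1.0)):
    """Greedily pick a batch by a weighted sum of three criteria."""
    w_unc, w_rep, w_div = weights
    n = len(features)
    # Representativeness: high when an observation lies close to many others.
    representativeness = [
        sum(1.0 / (1.0 + math.dist(features[i], features[j]))
            for j in range(n) if j != i) / max(n - 1, 1)
        for i in range(n)
    ]
    selected = []
    while len(selected) < min(batch_size, n):
        best, best_score = None, -math.inf
        for i in range(n):
            if i in selected:
                continue
            # Diversity: distance to the nearest observation already in the batch
            # (0.0 while the batch is still empty).
            diversity = min((math.dist(features[i], features[s]) for s in selected),
                            default=0.0)
            score = (w_unc * uncertainty[i]
                     + w_rep * representativeness[i]
                     + w_div * diversity)
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return selected

# Two near-duplicate uncertain events and one distant, less uncertain event.
features = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
uncertainty = [0.9, 0.85, 0.6]
batch = select_batch(features, uncertainty, batch_size=2)
print(batch)
```

With uncertainty alone, the two near-duplicates would fill the batch; the diversity term replaces the second one with the distant event. In practice the criteria would need to be normalized to comparable scales before being summed, which this sketch skips.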


A way to leverage LITL’s potential in cybersecurity

A security operations center (SOC) is often an integral part of keeping modern IT systems secure. The experts working there are qualified cybersecurity specialists, but despite their considerable expertise they cannot review all events in the logs of their IT systems because of the sheer volume of data. Therefore, they use predefined rules as a filter to separate alarming events from fully typical ones. In more progressive organizations, these rules sometimes take the form of machine learning models. The problem with this second approach is the dynamically changing nature of the data. The changes range from data set drift caused by, say, a software update to new forms of cyberattacks (capturing which is an integral and key part of security work). Below, we address this problem by presenting an example use of LITL in cybersecurity.

The diagram presented below shows the three key components for our application: the anomaly detector (later called the sieve), the classifier, and the SOC.

The sieve
The purpose of this component is to report whether a given event is anomalous or not. It acts as a sieve, separating unusual events, which we will want to analyze further, from those that are completely typical and not worrying. In practice, this component can be implemented with:

  • classical anomaly detection methods (for example, density-based techniques such as k-nearest neighbors, clustering, or fuzzy logic-based outlier detection),
  • expert rules,
  • behavioral analysis methods,
  • other implementations, such as mixed implementations or novel application-specific approaches.

An important feature of this component is time and memory efficiency, since it comes into contact with a large amount of data.
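As one example of a classical method, a sieve based on k-nearest-neighbor distances could be sketched like this. The two-dimensional features, the threshold of 1.0, and the function names are hypothetical, and the brute-force search is for readability only.

```python
import math

def knn_score(event, reference, k=3):
    """Mean distance from an event to its k nearest reference events.

    Assumes that reference is non-empty.
    """
    dists = sorted(math.dist(event, r) for r in reference)
    k = min(k, len(dists))
    return sum(dists[:k]) / k

def sieve(events, reference, threshold, k=3):
    """Return the events whose kNN score exceeds the threshold."""
    return [e for e in events if knn_score(e, reference, k) > threshold]

# Hypothetical 2-D feature vectors of events known to be typical.
typical = [(1.0, 1.0), (1.1, 0.9), (0.9, 1.2), (1.2, 1.1), (1.0, 0.8)]
incoming = [(1.05, 1.0), (6.0, 7.0)]
anomalous = sieve(incoming, typical, threshold=1.0)
print(anomalous)  # only the event far from all typical ones passes the sieve
```

Comparing every event with every reference point does not scale to real log volumes; a production sieve would need a spatial index, approximate nearest-neighbor search, or a cheaper method altogether.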

The classifier
The main part of this component is a supervised machine learning model. In its simplified version, its task is to return a binary prediction of whether a given anomalous event is a threat or not; in a more advanced version, it returns a multiclass (or even multilabel) prediction with a threat label according to the ontology developed by the users of the system. In practice, the model’s predictions can be used in various ways. For example, they can serve as a hint to the expert in the SOC: for each event the expert views, the system shows whether, or to what degree, the model considers it a threat. Alternatively, the model can be queried periodically for predictions, and whenever it predicts a threat with low uncertainty, an alert is sent directly to the expert.
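The two ways of using predictions can be sketched for the binary case as a small routing function. The threshold of 0.95, the event identifiers, and the shape of the returned record are assumptions made for this example.

```python
def triage(event_id, threat_probability, alert_threshold=0.95):
    """Decide how a prediction for one anomalous event is surfaced in the SOC."""
    if threat_probability >= alert_threshold:
        # Confident threat prediction: alert the expert directly.
        return {"event": event_id, "action": "alert", "score": threat_probability}
    # Otherwise the score is only shown as a hint next to the event.
    return {"event": event_id, "action": "hint", "score": threat_probability}

# Hypothetical classifier scores for three events that passed the sieve.
decisions = [triage("evt-1", 0.99), triage("evt-2", 0.70), triage("evt-3", 0.02)]
for decision in decisions:
    print(decision)
```

In a multiclass setting, the same routing could use the probability of the most likely threat class instead of a single score.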

The SOC
In the SOC, experts decide whether an event is a threat and whether to trigger an alert to the appropriate user or device or to trigger other forms of security assurance. When assigning a label, the expert has access to the entire system logs and to other data, such as network traffic data or MISP. They therefore have, on demand, all the information they need to assign a label reliably.

LITL is naturally coupled with the classifier and the SOC. It is responsible for cyclic model training, which, as mentioned above, is an important need when using a machine learning model in cybersecurity applications. In addition, it is robust to data set drift. The batch selection module provides events to be labeled by experts, and the main purpose of these labeled events is to improve and update the model. At the same time, events labeled as a threat during the regular work of experts can also be used to retrain the classifier. Both the frequency of model retraining and the number of events to be labeled by experts as part of model retraining are determined by the system administrator (for example, they may decide that ¼ of the events labeled by experts are events provided by active learning).
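The administrator's choice of how many labeled events come from active learning can be pictured as composing a labeling queue with a fixed share. This is a hypothetical sketch: the function name, the parameter names, and the event identifiers are not part of LITL.

```python
def build_labeling_queue(regular_events, active_learning_events, total, al_fraction=0.25):
    """Compose a queue in which up to al_fraction of the slots go to active learning."""
    # int() rounds down, so the active learning share never exceeds al_fraction.
    n_active = min(int(total * al_fraction), len(active_learning_events))
    n_regular = min(total - n_active, len(regular_events))
    return active_learning_events[:n_active] + regular_events[:n_regular]

# Hypothetical event identifiers.
regular = [f"soc-{i}" for i in range(10)]      # events from the experts' regular work
selected = [f"al-{i}" for i in range(5)]       # events chosen by batch selection
queue = build_labeling_queue(regular, selected, total=8)
print(queue)  # 2 active learning events followed by 6 regular ones
```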

How the expert works with the system can be chosen freely. Some possible scenarios include:

  • the expert labels only events that the classifier has identified as a threat with low uncertainty,
  • the expert labels events that the sieve has passed as anomalous and decides which ones to review (while still having a hint from the classifier).

In all of the scenarios above, experts also label events provided to them by active learning.

This is one application of LITL. And there are many more! Not just in the cybersecurity domain.

LITL works best in cases when there are large volumes of unlabeled data or when you lack specialized datasets for your problem.

Want to know more about how we can label your data?

See Label in the Loop