Our products are used to:
- reduce the uncertainty of machine learning models,
- prioritize enterprise data efforts,
- support experts in the ML loop,
- improve the quality of ML models, especially in multi-class settings with complex ontologies,
- reduce the data footprint and compress ML models for use in Internet of Things applications,
- improve the gaming experience via more challenging and realistic AI in games,
- create intelligent advisory systems from pre-compiled building blocks.
Data is not the problem, yet when you want to make use of it, labels prove to be one. How can AI help? Can implementing active learning support cybersecurity operations? And what does LITL stand for?
Data, in principle, is not the problem.
In the era of widespread digitization, we collect vast amounts of data every day—not to mention the data we don't record but could.
The challenge comes when we want to use this data in our decision-making processes. To manage such a task, we usually turn to machine learning technologies. However, they are most effective when initially shown the right amount of data along with labels appropriate to the problem we are interested in. But where do we get labels for data that doesn't come with them out of the box?
You can try to get around this problem by replacing supervised models with unsupervised ones. But then you are limited to a small number of use cases, and modern self-supervised learning models usually require a lot of work, research, and computing power to create.
Outsourcing the tagging of entire data sets is not a good idea either. While crowdsourced tagging services work well for simple labels, a problem that requires specialized knowledge, such as cybersecurity, calls for the support of domain experts. And experts, as a rule, are expensive, and their time is limited. This is where active learning comes in.
Active learning is an approach to machine learning that assumes the data initially has no labels. The goal of an active learning system is to select, from the set of all available unlabeled data, those observations for expert labeling from which the supervised model of interest will gain the most after training. In practice, the most interesting observations are often those near the hypothesis boundary of our problem (e.g., near the decision boundary of a classifier in a classification problem). Criteria responsible for diversity, among others, are also applied so as not to get stuck in a local minimum.
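As a rough illustration of selecting observations near the decision boundary, here is a minimal sketch of margin-based uncertainty sampling, one common active learning criterion. The toy data, model choice, and batch size are our own illustrative assumptions, not LITL's actual implementation.

```python
# Margin sampling sketch: query the unlabeled observations whose
# top-two class probabilities are closest together, i.e. the points
# the classifier is least sure about.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy setup: a small labeled seed set and a pool of unlabeled data.
X_labeled = rng.normal(size=(20, 2))
y_labeled = (X_labeled[:, 0] + X_labeled[:, 1] > 0).astype(int)
X_pool = rng.normal(size=(200, 2))

model = LogisticRegression().fit(X_labeled, y_labeled)

# Margin = gap between the two most probable classes; small margin
# means the observation lies near the decision boundary.
proba = model.predict_proba(X_pool)
sorted_proba = np.sort(proba, axis=1)
margin = sorted_proba[:, -1] - sorted_proba[:, -2]

batch_size = 5
query_idx = np.argsort(margin)[:batch_size]  # most ambiguous observations
print(query_idx)
```

After experts label the queried points, they are appended to the labeled set and the model is retrained, closing the loop.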
In active learning we are interested in labeling only part of the data while obtaining the best possible model. Otherwise, we would not need intelligent selection of observations; we could feed them in a random order. Such a model can then be used to make predictions on observations that interest us, with active learning serving as a mechanism to continually retrain the model.
Another use of the model "taken out" of the active learning loop is to make predictions on the entire unlabeled dataset and use the predictions for which the model shows low uncertainty as target labels for that data.
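The pseudo-labeling idea above can be sketched as follows. The function name and the 0.95 confidence threshold are illustrative assumptions; in practice the threshold is tuned per problem.

```python
# Pseudo-labeling sketch: keep only the predictions the model is
# confident about and treat them as target labels.
import numpy as np

def pseudo_label(proba: np.ndarray, threshold: float = 0.95):
    """Return (indices, labels) for observations whose top-class
    probability meets the confidence threshold."""
    confidence = proba.max(axis=1)
    idx = np.where(confidence >= threshold)[0]
    return idx, proba[idx].argmax(axis=1)

# Toy predicted probabilities for 4 observations, 2 classes.
proba = np.array([[0.99, 0.01],
                  [0.60, 0.40],
                  [0.03, 0.97],
                  [0.50, 0.50]])
idx, labels = pseudo_label(proba)
print(idx, labels)  # only the confident rows 0 and 2 survive
```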
LITL (Label in the Loop) is our answer to the lack of reliable active learning systems on the market. It is a set of components that enables running an active learning loop on given data; its key modules are described below.
When creating LITL, we assumed that experts are imperfect. In the classical academic approach, the label of an observation is given by an infallible oracle. This is far from reality, where real experts have different knowledge, varying psychophysical conditions, or are simply wrong at times.
Our work resulted in expert quality estimation modules and, based on them, a module that selects the expert for each observation. This allows an observation to be assigned to the best possible expert. In addition, our expert modules are robust to changes in the pool of experts (the arrival of new experts or the departure of old ones) as well as to changes in experts' competencies (e.g., caused by an expert's training).
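One simple way such expert quality tracking could work is a running per-expert accuracy on control observations with known labels. This is a hedged sketch of the general idea only; the class, the Laplace prior, and the assignment rule are our illustrative assumptions, not LITL's actual estimation method.

```python
# Sketch: track each expert's accuracy on control ("gold") observations
# and assign new observations to the currently best-scoring expert.
from collections import defaultdict

class ExpertPool:
    def __init__(self):
        # Laplace prior: unseen experts start at 1/2, which keeps the
        # pool robust to experts arriving or departing.
        self.stats = defaultdict(lambda: {"correct": 1, "total": 2})

    def record(self, expert: str, was_correct: bool):
        self.stats[expert]["total"] += 1
        self.stats[expert]["correct"] += int(was_correct)

    def quality(self, expert: str) -> float:
        s = self.stats[expert]
        return s["correct"] / s["total"]

    def best_expert(self, candidates):
        return max(candidates, key=self.quality)

pool = ExpertPool()
pool.record("alice", True)
pool.record("alice", True)
pool.record("bob", False)
print(pool.best_expert(["alice", "bob", "carol"]))  # "alice"
```

Because quality is re-estimated continuously, an expert whose competence changes (e.g., after training) is naturally re-ranked over time.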
Experts interacting with the system use an ergonomic interface that allows them to label multimodal data. As experts work, in addition to the required labels, they can attach extra information to an observation. The system administrator, seeing such a label appear regularly, can then decide whether to add it to the permanent label collection.
Observations sent to the experts are selected by the batch selection module. It consists of a set of generic criteria that are aggregated into a relevance score for each observation in the unlabeled data set. Among other things, these criteria ensure that experts are not continuously labeling similar observations. Additionally, the criteria, like the entire batch selection module, are robust to newly arriving data (including data drift) and to the appearance of new targets in the data.
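To make the idea of aggregating criteria concrete, here is a minimal sketch combining an uncertainty criterion with a diversity criterion. The specific criteria, the weighted-sum aggregation, and all names are our illustrative assumptions, not LITL's actual scoring.

```python
# Sketch: aggregate two generic criteria into one relevance score.
import numpy as np

def uncertainty_score(proba: np.ndarray) -> np.ndarray:
    # Higher when the model is unsure (entropy of predicted probabilities).
    return -(proba * np.log(proba + 1e-12)).sum(axis=1)

def diversity_score(X_pool: np.ndarray, X_selected: np.ndarray) -> np.ndarray:
    # Higher when far from observations already selected for the batch,
    # which keeps experts from labeling near-duplicates.
    if len(X_selected) == 0:
        return np.ones(len(X_pool))
    dists = np.linalg.norm(X_pool[:, None, :] - X_selected[None, :, :], axis=-1)
    return dists.min(axis=1)

def relevance(proba, X_pool, X_selected, weights=(0.7, 0.3)):
    u = uncertainty_score(proba)
    d = diversity_score(X_pool, X_selected)
    # Normalize each criterion to [0, 1] before the weighted sum.
    norm = lambda v: (v - v.min()) / (v.max() - v.min() + 1e-12)
    return weights[0] * norm(u) + weights[1] * norm(d)
```

A greedy batch builder would repeatedly pick the highest-relevance observation and recompute the diversity term against the growing selection.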
A security operations center (SOC) is often an integral part of keeping modern IT systems secure. The experts working there are qualified cybersecurity specialists, yet despite considerable expertise they are not able to review all events in the logs of their IT systems due to the volume of such data. Therefore, they use predefined rules, which serve as a filter separating alarming events from fully typical ones. In more progressive organizations, these rules sometimes take the form of machine learning models. The problem that arises with this second approach is the dynamically changing nature of the data: from dataset drift caused by a software update to new forms of cyberattacks (catching which is an integral and key part of security work). In what follows, we address this problem by presenting an example use of LITL in cybersecurity.
The diagram presented below shows the three key components for our application: the anomaly detector (later called the sieve), the classifier, and the SOC.
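The pipeline in the diagram can be sketched in code. This is a hedged toy illustration under our own assumptions: random feature vectors stand in for log events, an isolation forest plays the sieve, and a random forest plays the classifier; LITL's actual components are not specified here.

```python
# Sketch of the two-stage pipeline: a "sieve" (anomaly detector)
# filters out fully typical events, and a supervised classifier
# (kept fresh by the active learning loop) judges the remainder.
import numpy as np
from sklearn.ensemble import IsolationForest, RandomForestClassifier

rng = np.random.default_rng(1)
events = rng.normal(size=(1000, 4))      # raw log events as feature vectors

# Stage 1: the sieve flags suspicious events (-1 = anomaly).
sieve = IsolationForest(random_state=1).fit(events)
suspicious = events[sieve.predict(events) == -1]

# Stage 2: classify only the suspicious events; training labels would
# come from SOC experts via the active loop (random here for the toy).
X_train = rng.normal(size=(100, 4))
y_train = rng.integers(0, 2, size=100)
clf = RandomForestClassifier(random_state=1).fit(X_train, y_train)
verdicts = clf.predict(suspicious)
print(len(suspicious), "suspicious events classified")
```

Only the sieve's output reaches the classifier and, through active learning, the SOC experts, which keeps the expensive human attention focused on atypical events.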
The way the expert works with the system is flexible. Some possible scenarios include:
In all scenarios above, experts label events provided to them by active learning.
This is just one application of LITL. And there are many more, not only in the cybersecurity domain!
LITL works best in cases when there are large volumes of unlabeled data or when you lack specialized datasets for your problem.
Want to know more about how we can label your data?