Category Archives: Threat Monitoring

Now That You Have a Machine Learning Model, It’s Time to Evaluate Your Security Classifier

This is the third installment in a three-part series about machine learning. Be sure to read part one and part two for more context on how to choose the right artificial intelligence solution for your security problems.

As we move into this third part, we hope we have helped our readers better identify an artificial intelligence (AI) solution and select the right algorithm to address their organization’s security needs. Now, it’s time to evaluate the effectiveness of the machine learning (ML) model being used. But with so many metrics and systems available to measure security success, where does one begin?

Classification or Regression? Which Can Get Better Insights From Your Data?

By this point, you may have selected an algorithm to use with your machine learning solution. It will generally fall into one of two categories: classification or regression. From a security standpoint, these two types of algorithms tend to solve different problems. For example, a classifier might be used as an anomaly detector, which is often the basis of the new generation of intrusion detection and prevention systems. Meanwhile, a regression algorithm might be better at tasks such as detecting denial-of-service (DoS) attacks, because these problems tend to involve numbers rather than nominal labels.

At first glance, the difference between classification and regression might seem complicated, but it really isn't. It comes down to what type of value our target variable, also called our dependent variable, contains. In that sense, the main difference between the two is that the output variable in regression is numerical, while the output in classification is categorical/discrete.

For our purposes in this blog, we’ll focus on metrics that are used to evaluate algorithms applied to supervised ML. For reference, supervised machine learning is the form of learning where we have complete labels and a ground truth. For example, we know that the data can be divided into class1 and class2, and each of our training, validation, and testing samples is labeled as belonging to class1 or class2.

Classification Algorithms – or Classifiers

To have ML work with data, we can select a security classifier, which is an algorithm whose target variable is a class label rather than a number. We want this algorithm to look at data and sort it into predefined "classes," usually two or more discrete categories of the dependent variable.

For example, we might try to classify something as an attack or not an attack. We would create two labels, one for each of those classes. A classifier then takes the training set and tries to learn a “decision boundary” between the two classes. There could be more than two classes, and in some cases only one class. For example, the Modified National Institute of Standards and Technology (MNIST) database demo tries to classify an image as one of the ten possible digits from hand-written samples. This demo is often used to show the abilities of deep learning, as the deep net can output probabilities for each digit rather than one single decision. Typically, the digit with the highest probability is chosen as the answer.
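The final step of the MNIST demo described above can be sketched in a few lines of Python. The probability values below are made up for illustration; the point is simply that the digit with the highest probability wins:

```python
# Hypothetical output of a digit classifier: one probability per digit 0-9.
probs = [0.01, 0.02, 0.05, 0.02, 0.01, 0.03, 0.70, 0.10, 0.04, 0.02]

# The predicted digit is simply the index with the highest probability.
predicted_digit = max(range(10), key=lambda d: probs[d])
print(predicted_digit)  # 6
```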

A Regression Algorithm – or Regressor

A Regression algorithm, or regressor, is used when the target variable is a number. Think of a function in math: there are numbers that go into the function and there is a number that comes out of it. The task in Regression is to find what this function is. Consider the following example:

y = 3x + 9

We will now find y for various values of x. Therefore:

x = 1 -> y = 12

x = 2 -> y = 15

x = 3 -> y = 18

The regressor's job is to figure out what the function is by relying on the values of x and y. If we give the algorithm enough x and y values, it will hopefully recover the function 3x + 9.
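This toy recovery can be sketched with the closed-form solution for simple linear regression, a minimal Python example using the three sample points above (no ML library assumed):

```python
# The three sample points generated by the function above.
xs = [1, 2, 3]
ys = [12, 15, 18]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Closed-form simple linear regression:
# slope = cov(x, y) / var(x); intercept = mean_y - slope * mean_x
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x

print(slope, intercept)  # 3.0 9.0 -- the function is recovered exactly
```

With noisy real-world data the fit would only approximate the true function, which is why the error metrics discussed later matter.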

We might want to do this in cases where we need to calculate the probability of an event being malicious. Here, we do not want a classification, as the results are not fine-grained enough. Instead, we want a confidence or probability score. So, for example, the algorithm might provide the answer that “there is a 47 percent probability that this sample is malicious.”

In the next section, we will look at the various metrics for classification and regression that can help us determine the efficacy of our chosen ML model and, by extension, our security posture.

Metrics for Classification

Before we dive into common classification metrics, let’s define some key terms:

  • Ground truth is a set of known labels or descriptions of which class or target variable represents the correct solution. In a binary classification problem, for instance, each example in the ground truth is labeled with the correct classification. This mirrors the training set, where we have known labels for each example.
  • Predicted labels represent the classifications that the algorithm believes are correct. That is, the output of the algorithm.

Now let’s take a closer look at some of the most useful metrics against which we can choose to measure the success of our machine learning deployment.

True Positive Rate

This is the ratio of correctly predicted positive examples to the total number of actual positive examples in the ground truth: TP/(TP + FN). In our running example of 100 total examples, 70 belong to the positive class and the model correctly predicts 65 of them as positive, so the true positive rate (TPR) is 65/70, or roughly 93 percent, sometimes written as 0.93.

False Positive Rate

The false positive rate (FPR) is the proportion of actual negative examples that are labeled as positive by the algorithm but are actually negative in the ground truth: FP/(FP + TN). In our running example there are 30 negative examples, and 15 of them are incorrectly predicted as positive, so the false positive rate is 15/30, or 50 percent, sometimes written as 0.5.

True Negative Rate

The true negative rate (TNR), also known as specificity, is the number of correctly predicted negative examples divided by the total number of actual negative examples: TN/(TN + FP). Let us say that in the scenario of 100 examples, another 15 were correctly predicted as negative. Notice that there were 15 false positives and 15 true negatives, making a total of 30 negative examples. Therefore, the true negative rate is 15/30, or 50 percent, also written as 0.5.

False Negative Rate

The false negative rate (FNR) is the proportion of actual positive examples that the algorithm incorrectly predicts as negative: FN/(TP + FN). Continuing with the aforementioned case, out of 100 examples in the ground truth, the algorithm correctly predicted 65 as positive, produced 15 false positives and produced 15 true negatives. This leaves us with 5 examples unaccounted for, the false negatives, so our false negative rate is 5/70, or roughly 7 percent (0.07). The false negative rate is the complement of the true positive rate: the two always sum to 1, because the 70 examples that actually belong to the positive class are split between true positives and false negatives.
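As a sanity check, all four rates can be computed from the counts used in this section (TP = 65, FP = 15, TN = 15, FN = 5). Note that, under the standard definitions, each rate's denominator is the number of actual positives or actual negatives, not the total sample count:

```python
# Counts used throughout this section's running example (100 samples total).
TP, FP, TN, FN = 65, 15, 15, 5

P = TP + FN  # 70 actual positives
N = TN + FP  # 30 actual negatives

tpr = TP / P  # true positive rate (sensitivity/recall): ~0.93
fpr = FP / N  # false positive rate: 0.5
tnr = TN / N  # true negative rate (specificity): 0.5
fnr = FN / P  # false negative rate: ~0.07

# TPR and FNR are complements, as are FPR and TNR.
assert abs((tpr + fnr) - 1.0) < 1e-9
assert abs((fpr + tnr) - 1.0) < 1e-9
```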


Accuracy

Accuracy measures the proportion of correct predictions, both positive and negative, to the total number of examples in the ground truth. This metric can be misleading if, for instance, there is a large proportion of positive examples in the ground truth compared to the number of negative examples. If the model predicts only the positive class correctly, accuracy can still be quite high, yet it gives you no sense of how well the model handles negative examples.

Accuracy = (TP+TN)/(TP+TN+FP+FN)
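Plugging the running example's counts into this formula, as a quick Python check:

```python
# Counts from the running example.
TP, FP, TN, FN = 65, 15, 15, 5

accuracy = (TP + TN) / (TP + TN + FP + FN)
print(accuracy)  # 0.8: 80 of the 100 examples were predicted correctly
```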


Precision

Before we explore the precision metric, it’s important to define a few more terms:

  • TP is the raw number of true positives (in the above example, the TP is 65).
  • FP is the raw number of false positives (15 in the above example).
  • TN is the raw number of true negatives (15 in the above example).
  • FN is the raw number of false negatives (5 in the above example).

Precision, sometimes known as the positive predictive value, is the proportion of true positives predicted by the algorithm over the sum of all examples predicted as positive. That is, precision=TP/(TP+FP).

In our example, there were 65 positives in the ground truth that the algorithm correctly labeled as positive. However, it also labeled 15 examples as positive when they were actually negative.

These false positives go into the denominator of the precision calculation. So, we get 65/(65+15), which yields a precision of 0.81.

What does this mean? In brief, high precision means that the algorithm returned far more true positives than false positives. In other words, it is a qualitative measure. The higher the precision, the better job the algorithm did of predicting true positives while rejecting false positives.


Recall

Recall, also known as sensitivity, is the ratio of true positives to true positives plus false negatives: TP/(TP+FN).

In our example, there were 65 true positives and 5 false negatives, giving us a recall of 65/(65+5) = 0.93. Recall is a quantitative measure; in a classification task, it is a measure of how well the algorithm “memorized” the training data.
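Both metrics follow directly from the counts above, as a minimal Python sketch:

```python
# Counts from the running example.
TP, FP, FN = 65, 15, 5

precision = TP / (TP + FP)  # 65/80 = 0.8125
recall = TP / (TP + FN)     # 65/70 ~= 0.9286

print(round(precision, 4), round(recall, 4))
```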

Note that there is often a trade-off between precision and recall. In other words, it’s possible to optimize one metric at the expense of the other. In a security context, we may often want to optimize recall over precision because there are circumstances where we must predict all the possible positives with a high degree of certainty.

For example, in the world of automotive security, where kinetic harm may occur, it is often heard that false positives are annoying, but false negatives can get you killed. That is a dramatic example, but it can apply to other situations as well. In intrusion prevention, for instance, a false positive on a ransomware sample is a minor nuisance, while a false negative could cause catastrophic data loss.

However, there are cases that call for optimizing precision. If you are constructing a virus encyclopedia, for example, higher precision might be preferred when analyzing one sample since the missing information will presumably be acquired from another sample.


F-Measure

An F-measure (or F1 score) is defined as the harmonic mean of precision and recall. There is a generic F-measure, which includes a variable beta that weights the harmonic mean toward either precision or recall.

Typically, the evaluation of an algorithm is done using the F1 score, meaning that beta is 1 and therefore the harmonic mean of precision and recall is unweighted. The term F-measure is used as a synonym for F1 score unless beta is specified.

The F1 score is a value between 0 and 1 where the ideal score is 1, and is calculated as 2 * Precision * Recall/(Precision+Recall), or the harmonic mean. This metric typically lies between precision and recall. If both are 1, then the F-measure equals 1 as well. The F1 score has no intuitive meaning per se; it is simply a way to represent both precision and recall in one metric.
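A quick check of the formula against the running example; note the algebraically equivalent closed form 2TP/(2TP + FP + FN):

```python
# Counts from the running example.
TP, FP, FN = 65, 15, 5

precision = TP / (TP + FP)
recall = TP / (TP + FN)

f1 = 2 * precision * recall / (precision + recall)
# Equivalent closed form straight from the counts:
assert abs(f1 - 2 * TP / (2 * TP + FP + FN)) < 1e-12

print(round(f1, 4))  # 0.8667, between precision (0.8125) and recall (~0.9286)
```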

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC), sometimes written as Phi, is a representation of all four values — TP, FP, TN and FN. Unlike precision and recall, the MCC takes true negatives into account, which means it handles imbalanced classes better than other metrics. It is defined as:

MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))


If the value is 1, then the classifier and ground truth are in perfect agreement. If the value is 0, then the result of the classifier is no better than random chance. If the result is -1, the classifier and the ground truth are in perfect disagreement. If this coefficient seems low (below 0.5), then you should consider using a different algorithm or fine-tuning your current one.
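Applying the definition to the running example, a minimal Python sketch:

```python
import math

# Counts from the running example.
TP, FP, TN, FN = 65, 15, 15, 5

mcc = (TP * TN - FP * FN) / math.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN)
)
print(round(mcc, 3))  # ~0.491, just under the 0.5 rule of thumb above
```

Interestingly, despite the healthy precision and recall computed earlier, the MCC lands near the 0.5 threshold because it penalizes the weak true negative performance.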

Youden’s Index

Also known as Youden’s J statistic, Youden’s index is the binary case of the more general statistic known as ‘informedness’, which applies to multiclass problems. It is calculated as (sensitivity + specificity − 1) and can be seen as the probability of an informed decision versus a random guess. In other words, it takes all four predictors into account.

Remember from our examples that recall (sensitivity) = TP/(TP+FN) and that specificity, or TNR, is the complement of the FPR. Therefore, the Youden index incorporates all four predictor counts. If the value of Youden’s index is 0, then the probability of the decision actually being informed is no better than random chance. If it is 1, then both false positives and false negatives are 0.
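Using the running example's counts, Youden's index works out as follows:

```python
# Counts from the running example.
TP, FP, TN, FN = 65, 15, 15, 5

sensitivity = TP / (TP + FN)  # recall/TPR: ~0.93
specificity = TN / (TN + FP)  # TNR: 0.5

j = sensitivity + specificity - 1
print(round(j, 3))  # ~0.429
```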

Area Under the Receiver Operator Characteristic Curve

This metric, usually abbreviated as AUC or AUROC, measures the area under the curve plotted with the true positive rate on the Y-axis and the false positive rate on the X-axis. It is useful because it provides a single number that lets you compare models of different types. An AUC value of 0.5 means the result of the test is essentially a coin flip. You want the AUC to be as close to 1 as possible; because it is a single, scale-free number, it also enables researchers to make comparisons across experiments.

Area Under the Precision Recall Curve

Area under the precision recall curve (AUPRC) is a measurement that, like MCC, accounts for imbalanced class distributions. If there are far more negative examples than positive examples, you might want to use AUPRC as your metric and visual plot. The curve plots precision against recall, and the closer the area is to 1, the better. Note that since this metric/plot is most informative when the positive class is the rare one, you might have to invert your labels for testing so that the minority class of interest is labeled positive.

Average Log Loss

Average log loss represents the penalty for a wrong prediction. It measures the difference between the probability distributions of the actual and predicted labels.

In deep learning, this is sometimes known as the cross-entropy loss, which is used when the result of a classifier such as a deep learning model is a probability rather than a binary label. Cross-entropy loss is therefore the divergence of the predicted probability from the actual probability in the ground truth. This is useful in multiclass problems but is also applicable to the simplified case of binary classification.
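A minimal sketch of average log loss for the binary case; the labels and predicted probabilities below are made-up toy data:

```python
import math

# True binary labels and the model's predicted probability of the positive class.
y_true = [1, 0, 1, 1, 0]
y_prob = [0.9, 0.2, 0.6, 0.8, 0.1]

# Average log loss: -mean( y*log(p) + (1-y)*log(1-p) )
loss = -sum(
    y * math.log(p) + (1 - y) * math.log(1 - p)
    for y, p in zip(y_true, y_prob)
) / len(y_true)

print(round(loss, 4))  # ~0.2336; confident, well-calibrated predictions keep this low
```

Confident wrong predictions (e.g., probability 0.99 for a negative example) are penalized heavily, which is exactly the behavior you want when the output is a probability rather than a hard label.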

By using these metrics to evaluate your ML model, and tailoring them to your specific needs, you can fine-tune its output, obtain more reliable results, detect more threats and optimize controls as needed.

Metrics for Regression

For regression, the goal is to measure the amount of error the ML algorithm produces. The model is considered good if the error between the predicted and observed values is small.

Let’s take a closer look at some of the metrics used for evaluating regression models.

Mean Absolute Error

Mean absolute error (MAE) measures how close the predicted results are to the actual results. You can think of it as the average of the differences between the predicted values and the ground truth values. As we proceed through each test example when evaluating against the ground truth, we subtract the actual value reported in the ground truth from the value predicted by the regression algorithm and take the absolute value. We then calculate the arithmetic mean of these values.

While the interpretation of this metric is well-defined, because it is an arithmetic mean, it can be skewed by a few very large or very small differences. Note that this value is scale-dependent, meaning that the error is on the same scale as the data. Because of this, you cannot compare MAE values across datasets.

Root Mean Squared Error

Root mean squared error (RMSE) represents the model’s total prediction error in a single value. It is often the metric that optimization algorithms seek to minimize in regression problems: when an optimization algorithm is tuning so-called hyperparameters, it seeks to make the RMSE as small as possible.

Consider, however, that like MAE, RMSE is sensitive to outliers and is scale-dependent. Therefore, you have to be careful and examine your residuals for outliers, values that lie significantly above or below the rest of the residuals. Also, like MAE, it is improper to compare RMSE across datasets unless the scaling translations have been accounted for, because data scaling, whether by normalization or standardization, is dependent upon the data values.

For example, in standardization, each value is rescaled by subtracting the mean and dividing by the standard deviation, which yields data with zero mean and unit variance. If, on the other hand, the data is normalized, the scaling is done by taking the current value and subtracting the minimum value, then dividing this by the quantity (maximum value − minimum value), which maps the data onto the range 0 to 1. These are completely different scales, and as a result, one cannot compare the RMSE between these two data sets.
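The two scaling schemes can be compared side by side on a toy series, as a minimal Python sketch:

```python
data = [2.0, 4.0, 6.0, 8.0]

# Standardization: subtract the mean, divide by the standard deviation.
mean = sum(data) / len(data)
std = (sum((x - mean) ** 2 for x in data) / len(data)) ** 0.5
standardized = [(x - mean) / std for x in data]

# Min-max normalization: map the values onto [0, 1].
lo, hi = min(data), max(data)
normalized = [(x - lo) / (hi - lo) for x in data]

print(standardized)  # zero mean, unit variance
print(normalized)    # spans exactly [0, 1]
```

The same raw values land on two entirely different scales, which is why an RMSE computed on one cannot be compared with an RMSE computed on the other.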

Relative Absolute Error

Relative absolute error (RAE) is the total absolute error of the predictions divided by the total absolute error of a naive model that always predicts the arithmetic mean of the values in the ground truth. Note that this value can be compared across scales because it has been normalized.

Relative Squared Error

Relative squared error (RSE) is the total squared error of the predicted values divided by the total squared error of the same naive predictor that always outputs the mean of the observed values. This also normalizes the error measurement so that it can be compared across datasets.
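All four regression metrics can be computed on a small made-up example. One common formulation, used here, normalizes RAE and RSE against a naive predictor that always outputs the mean of the observed values, which is what makes them comparable across scales:

```python
import math

# Toy ground truth and predictions (made up for illustration).
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 6.0, 9.5]

n = len(y_true)
mean_true = sum(y_true) / n

mae = sum(abs(p - t) for p, t in zip(y_pred, y_true)) / n
rmse = math.sqrt(sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / n)

# Relative errors: compare against always predicting the mean of y_true.
rae = sum(abs(p - t) for p, t in zip(y_pred, y_true)) / \
      sum(abs(t - mean_true) for t in y_true)
rse = sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / \
      sum((t - mean_true) ** 2 for t in y_true)

print(mae, rmse, rae, rse)  # 0.625, ~0.661, 0.3125, 0.0875
```

RAE and RSE values well below 1 indicate the model is doing much better than the mean-only baseline.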

Machine Learning Can Revolutionize Your Organization’s Security

Machine learning is integral to the enhancement of cybersecurity today and it will only become more critical as the security community embraces cognitive platforms.

In this three-part series, we covered various algorithms and their security context, from cutting-edge technologies such as generative adversarial networks to more traditional algorithms that are still very powerful.

We also explored how to select the appropriate security classifier or regressor for your task, and, finally, how to evaluate the effectiveness of a classifier to help our readers better gauge the impact of optimization. With a better idea about these basics, you’re ready to examine and implement your own algorithms and to move toward revolutionizing your security program with machine learning.

The post Now That You Have a Machine Learning Model, It’s Time to Evaluate Your Security Classifier appeared first on Security Intelligence.

It’s Time to Modernize Traditional Threat Intelligence Models for Cyber Warfare

When a client asked me to help build a cyberthreat intelligence program recently, I jumped at the opportunity to try something new and challenging. To begin, I set about looking for some rudimentary templates with a good outline for building a threat intelligence process, a few solid platforms that are user-friendly, the basic models for cyber intelligence collection and a good website for describing various threats an enterprise might face. This is what I found:

  1. There are a handful of rudimentary templates for building a good cyberthreat intelligence program available for free online. All of these templates leave out key pieces of information that any novice to the cyberthreat intelligence field would be required to know. Most likely, this is done to entice organizations into spending copious amounts of money on a specialist.
  2. The number of companies that specialize in the collection of cyberthreat intelligence is growing at a ludicrous rate, and they all offer something that is different, unique to certain industries, proprietary, automated via artificial intelligence (AI) and machine learning, based on pattern recognition, or equipped with behavioral analytics.
  3. The basis for all threat intelligence is heavily rooted in one of three basic models: Lockheed Martin’s Cyber Kill Chain, MITRE’s ATT&CK knowledge base and The Diamond Model of Intrusion Analysis.
  4. A small number of vendors working on cyberthreat intelligence programs or processes published a complete list of cyberthreats, primary indicators, primary actors, primary targets, typical attack vectors and potential mitigation techniques. Of that small number, very few were honest when there was no useful mitigation or defensive strategy against a particular tactic.
  5. All of the cyberthreat intelligence models in use today have gaps that organizations will need to overcome.
  6. A search within an article content engine for helpful articles with the keyword “threat intelligence” produced more than 3,000 results, and a Google search produced almost a quarter of a million. This is completely ridiculous. Considering how many organizations struggle to find experienced cyberthreat intelligence specialists to join their teams — and that cyberthreats grow by the day while mitigation strategies do not — it is not possible that there are tens of thousands of professionals or experts in this field.

It’s no wonder why organizations of all sizes in a variety of industries are struggling to build a useful cyberthreat intelligence process. For companies that are just beginning their cyberthreat intelligence journey, it can be especially difficult to sort through all these moving parts. So where do they begin, and what can the cybersecurity industry do to adapt traditional threat intelligence models to the cyber battlefield?

How to Think About Thinking

A robust threat intelligence process serves as the basis for any cyberthreat intelligence program. Here is some practical advice to help organizations plan, build and execute their program:

  1. Stop and think about the type(s) of cyberthreat intelligence data the organization needs to collect. For example, if a company manufactures athletic apparel for men and women, it is unnecessary to collect signals, geospatial data or human intelligence.
  2. How much budget is available to collect the necessary cyberthreat intelligence? For example, does the organization have the budget to hire threat hunters and build a cyberthreat intelligence program uniquely its own? What about purchasing threat intelligence as a service? Perhaps the organization should hire threat hunters and purchase a threat intelligence platform for them to use? Each of these options has a very different cost model for short- and long-term costs.
  3. Determine where cyberthreat intelligence data should be stored once it is obtained. Does the organization plan to build a database or data lake? Does it intend to store collected threat intelligence data in the cloud? If that is indeed the intention, pause here and reread step one. Cloud providers have very different ideas about who owns data, and who is ultimately responsible for securing that data. In addition, cloud providers have a wide range of security controls — from the very robust to a complete lack thereof.
  4. How does the organization plan to use collected cyberthreat intelligence data? It can be used for strategic purposes, tactical purposes or both within an organization.
  5. Does the organization intend to share any threat intelligence data with others? If yes, then you can take the old cybersecurity industry adage “trust but verify” and throw it out. The new industry adage should be “verify and then trust.” Never assume that an ally will always be an ally.
  6. Does the organization have enough staff to spread the workload evenly, and does the organization plan to include other teams in the threat intelligence process? Organizations may find it very helpful to include other teams, either as strategic partners, such as vulnerability management, application security, infrastructure and networking, and risk management teams, or as tactical partners, such as red, blue and purple teams.

How Can We Adapt Threat Intelligence Models to the Cyber Battlefield?

As mentioned above, the threat intelligence models in use today were not designed for cyber warfare. They are typically linear models, loosely based on Carl von Clausewitz’s military strategy and tailored for warfare on a physical battlefield. It’s time for the cyberthreat intelligence community to define a new model, perhaps one that is three-dimensional, nonlinear, rooted in elementary number theory and that applies vector calculus.

Much like game theory, The Diamond Model of Intrusion Analysis is sufficient if there are two players (the victim and the adversary), but it tends to fall apart if the adversary is motivated by anything other than sociopolitical or socioeconomic payoff, if there are three or more players (e.g., where collusion, cooperation and defection of classic game theory come into play), or if the adversary is artificially intelligent. In addition, The Diamond Model of Intrusion Analysis attempts to show a stochastic model diagram but none of the complex equations behind the model — probably because that was someone’s 300-page Ph.D. thesis in applied mathematics. This is not much help to the average reader or a newcomer to the threat intelligence field.

Nearly all models published thus far are focused on either external actors or insider threats, as though a threat actor must be one or the other. None of the widely accepted models account for, or include, physical security.

While there are many good articles about reducing alert fatigue in the security operations center (SOC), orchestrating security defenses, optimizing the SOC with behavioral analysis and so on, these articles assume that the reader knows what any of these things mean and what to do about any of it. A veteran in the cyberthreat intelligence field would have doubts that behavioral analysis and pattern recognition are magic bullets for automated threat hunting, for example, since there will always be threat actors that don’t fit the pattern and whose behavior is unpredictable. Those are two of the many reasons why the fields of forensic psychology and criminal profiling were created.

Furthermore, when it comes to the collection of threat intelligence, very few articles provide insight on what exactly constitutes “useful data,” how long to store it and which types of data analysis would provide the best insight.

It would be a good idea to get the major players in the cyberthreat intelligence sector together to develop at least one new model — but preferably more than one. It’s time for industry leaders to develop new ways of classifying threats and threat actors, share what has and has not worked for them, and build more boundary connections than the typical socioeconomic or sociopolitical ones. The sector could also benefit from looking ahead at what might happen if threat actors choose to augment their crimes with algorithms and AI.

The post It’s Time to Modernize Traditional Threat Intelligence Models for Cyber Warfare appeared first on Security Intelligence.

NRSMiner Crypto-Mining Malware Infects Asian Devices With the Help of EternalBlue Exploit

Security researchers report that the newest version of NRSMiner crypto-mining malware is causing problems for companies that haven’t patched the EternalBlue exploit.

Last year, the EternalBlue exploit (CVE-2017-0144) leveraged Server Message Block (SMB) 1.0 flaws to trigger remote code execution and spread the WannaCry ransomware. Now, security research firm F-Secure reports that threat actors are using this exploit to infect unpatched devices in Asia with NRSMiner. While several countries including Japan, China and Taiwan have all been targeted, the bulk of attacks — around 54 percent — have occurred in Vietnam.

According to F-Secure, the newest version of NRSMiner has the capability to leverage both existing infections to update its code on host machines and intranet-connected systems to spread infections to machines that haven’t been patched with Microsoft security update MS17-010.

Eternal Issues Facing Security Professionals

In addition to its crypto-mining activities, the latest version of NRSMiner is also capable of downloading new versions of itself and deleting old files and services to cover its tracks. Using the WUDHostUpgrade[xx].exe module, NRSMiner actively searches for potential targets to infect. If it detects the current NRSMiner version, WUDHostUpgrade deletes itself. If it finds a potential host, the malware deletes multiple system files, extracts its own versions and then installs a service named snmpstorsrv.

Although this crypto-mining malware is currently confined to Asia, its recent uptick serves as a warning to businesses worldwide that haven’t patched their EternalBlue vulnerabilities. While WannaCry infections have largely evaporated, the EternalBlue exploit/DoublePulsar backdoor combination remains an extremely effective way to deploy advanced persistent threats (APTs).

How to Curtail Crypto-Mining Malware Threats

Avoiding NRSMiner starts with security patching: Enterprises must ensure their systems are updated with MS17-010. While this won’t eliminate pre-existing malware infections, it will ensure no new EternalBlue exploits can occur. As noted by security experts, meanwhile, a combination of proactive and continual network monitoring can help identify both emerging threats and infections already present on enterprise systems. Organizations should also develop a comprehensive security framework that includes two-factor authentication (2FA), identity and access management (IAM), web application firewalls and reliable patch management.

EternalBlue exploits continue to cause problems for unpatched systems. Avoid NRSMiner and other crypto-mining malware threats by closing critical gaps, implementing improved monitoring strategies and developing advanced security frameworks.

The post NRSMiner Crypto-Mining Malware Infects Asian Devices With the Help of EternalBlue Exploit appeared first on Security Intelligence.