Author Archives: Moazzam Khan

Machine Learning Algorithms Are Not One-Size-Fits-All

This is the second installment in a three-part series about machine learning. Be sure to read part one for the full story.

When designing machine learning solutions for security, it’s important to decide on a classifier that will perform the best with minimal error. Given the sheer number of choices available, it’s easy to get confused. Let’s explore some tips to help security leaders select the right machine learning algorithm for their needs.

4 Types of Machine Learning Techniques

If you know which category your problem falls into, you can narrow down your choices. Machine learning algorithms are broadly categorized under four types of learning problems.

1. Supervised Learning

Supervised learning trains the algorithm based on example sets of input/output pairs. The goal is to develop new inferences based on patterns inferred from the sample results. Sample data must be available and labeled. For example, designing a spam detector model by learning from samples of labeled spam/nonspam is supervised learning. Another example is a problem such as the Defense Advanced Research Projects Agency (DARPA)’s Knowledge Discovery and Data Mining 1999 challenge (KDD-99), in which contestants competed to design a machine learning-based intrusion detection system (IDS) from a set of 41 features per instance labeled either “attack” or “normal.”

2. Unsupervised Learning

Unsupervised learning uses data that has not been labeled, classified or categorized. The machine is challenged to identify patterns through processes such as clustering, and the outcome is usually unknown. Clustering is a task in which samples are compared with each other in an attempt to find examples that are close to each other, usually by either a measure of density or a distance metric, such as Euclidian distance when projected into a high-dimensional space.

A security problem that falls into this category is network anomaly detection, which is a different method of designing an IDS. In this case, the algorithm doesn’t assume that it knows an attack from a normal input. Instead, the algorithm tries to understand what normal traffic is by watching the network in a (hopefully) clean state so that it can learn the patterns of traffic. Then, anything that falls outside of this “normal” region is a possible attack. Note that there is a great deal of uncertainty with these algorithms because they do not actually know what an attack looks like.

3. Semisupervised Learning

This type of learning uses a combination of labeled and unlabeled data, typically with the majority being unlabeled. It is primarily used to provide some concept of a known classification to unsupervised algorithms. Several techniques, such as label spreading and weakly supervised learning, can also be employed to augment a supervised training set with a small number of samples. This requires a great deal of work because it is an extremely common scenario.

For the challenge of exploit kit identification, for example, we can find some known exploit kits to train our model, but there are many variants and unknown kits that can’t be labeled. Semisupervised learning can help solve this problem. Note that semisupervised and fully supervised learning can often be differentiated by the choice of features to learn against. Depending on the features, you can often label much more data than you can with another selected set of features.

4. Reinforcement Learning

Unlike the other three types of learning problems, reinforcement learning seeks the optimal path to a desired result by rewarding improvement. The problem set is generally small and the training data well-understood. An example is a generative adversarial network (GAN) such as this experiment, in which the distance, measured in correct and incorrect bits, was used as a loss function to encrypt messages between two neural networks.

Another example is PassGAN, where a GAN was trained to guess passwords more efficiently and the reinforcement function, at a very high level, was how many passwords it guessed correctly. Specifically, PassGAN learned rules to take a dictionary and create likely passwords based on an actual set of leaked passwords. It modeled human behavior in guessing the ways that humans transformed passwords into nondictionary character strings.

Problem Type and Training Data

Another, broader categorization of algorithm is based on the type of problem, such as classification, regression, anomaly detection or dimensionality reduction. There are specific machine learning algorithms designed for each type of problem.

Once you have narrowed down your choices based on the above broader categories, you should consider the algorithm’s bias or variance.

Bias is defined in practical effect as the ability of an algorithm to model the training data accurately. Bias represents the proximity of the model to the training data. A high bias results in the model missing relationships between features and the target variable, which leads to what is known as underfitting.

Variance is how accurately the model measures noise in the distribution of training samples. A high variance means that random variables are captured more effectively. Obviously, we do not want this either. Therefore, we seek to minimize both of these values. We can control bias and variance through different parameters. For example, the k-means algorithm with a high value of k gives us high bias and low variance. This makes sense because the value of k means that we have allowed more clusters in space to form, and more of the relationships among the features will be missed as the algorithm fits the data to a larger number of clusters.

A deeper decision tree has more variance because it contains more branches (decision points) and therefore has more false relationships that model noise in the training data. For artificial neural networks (ANNs), variance increases and bias decreases with the increase in the number of layers. Therefore, deep learning has a very low bias. However, this is at the cost of more noise being represented in the deep learning models.

Other Considerations for Selecting Machine Learning Algorithms

Data type also dictates the choice of algorithm because some algorithms work better on certain data types than others. For example, support vector machines (SVMs), linear and logistic regression, and neural networks require the feature vector to be numerical. On the other hand, decision trees can be more flexible to different data types, such as nominal input.

Some algorithms perform poorly when there is correlation between the features — meaning that multiple features demonstrate the same patterns. A feature that is defined as the result of calculations based on other features would be highly correlated with those input features. Linear regression and logistic regression, along with other algorithms, require regularization to avoid numerical instabilities that come from redundancy in data.

The relationship between independent and dependent variables can also help us determine which algorithm to choose. Because naive Bayes, linear regression and logistic regression perform well if each feature has an independent contribution to the output, these algorithms are poor choices for correlated data. Unfortunately, when features are extracted from voluminous data, it is often impossible to know whether the independence assumption holds true without further analysis. If the relationship is more complex, decision trees or neural networks may be a better choice.

Noise in the output values can also inform which algorithm to choose, as well as the parameters for your selected algorithm. However, you are best served by cleaning the data of noise. If you can’t clean the data, you will most likely receive results with poor classification boundaries. An SVM, for example, is very sensitive to noise because it attempts to draw a margin, or boundary, between or among the classes of the data as they are labeled. A decision tree or bagged tree such as a random forest might be a better choice here since these allow for the tree(s) to be pruned. A shallower tree models less noise.

Dimensionality of the input space is critical to your decision as well. In dimensionality, there are two common phenomena that are directly at odds with each other. First, the so-called curse of dimensionality occurs when the data contains a very large number of features. If you think about this in special terms, a two-dimensional figure verses a three-dimensional figure can look very different. A very packed set of points on a plane can become spread apart once we add the third dimension. If we try to cluster these points, the distances between two arbitrary points will dramatically increase.

Alternatively, the blessing of dimensionality means that having more dimensions helps us model the problem more completely. In this sense, the plane may pack completely unrelated points together because the two-dimensional coordinates are close to each other even though they have nothing to do with each other. In three dimensions, these points might be spread farther apart, showing us a more complete separation of the unrelated points.

These two ideas cause us, as domain experts and deep learning designers, to pick our features carefully. We do not want to encounter the sparsity of the curse of dimensionality because there might not be a clear boundary, but we also do not want to pack our unrelated examples too close to each other and create an unusable boundary. Some algorithms work better in higher dimensions. In very high dimensional space, for example, you might choose a boosted random forest or a deep learning model.

Transparency and Functional Complexity

The level of visibility into the model’s decision process is a very important criterion for selecting an algorithm. Algorithms that provide decision trees show clearly how the model reached a decision, whereas a neural network is essentially a black box.

Similarly, you should consider functional complexity and the amount of training data. Functional complexity can be understood in terms of things like speed and memory usage. If you lack sufficient computational power or memory in the machine on which your model will be run, you should choose an algorithm such as naive Bayes or one of the many rules generator-based algorithms. Naive Bayes counts the frequencies of terms, so its model is only as big as the number of features in the data. Rules generators essentially create a series of if/then conditions that the data must satisfy to be classified correctly. These are very good for low-complexity devices. If, however, you have a good deal of power and memory, you might go so far as deep learning, which (in most but not all configurations) requires many more resources.

How Much Training Data Do You Need?

You might also opt for naive Bayes if you have a smaller set of training data available. A small training data size will severely limit you, so it is always best to acquire as much data as you can. A few hundred might be able to construct a shallow decision tree or a naive Bayesian model. A few thousand to tens of thousands is usually enough to create an effective random forest, an algorithm that performs very well in most learning problems. Finally, if you want to try deep learning, you may need hundreds of thousands — or, preferably, millions — of training examples, depending on how many layers you want your deep learning model to have.

Try Several Classifiers and Pick the Best

When all else fails, try several classifiers that are suitable for your problem and then compare the results. Key metrics include accuracy of the model and true and false positive rates; derived metrics such as precision, recall and F1; and even metrics as complex as area under the receiver operating characteristic curve (AUC) and area under the precision-recall curve (AUPRC). Each measurement tells you something different about the model’s performance. We’ll define these metrics in greater detail in the final installment of our three-part series.

The post Machine Learning Algorithms Are Not One-Size-Fits-All appeared first on Security Intelligence.

How to Choose the Right Artificial Intelligence Solution for Your Security Problems

Artificial intelligence (AI) brings a powerful new set of tools to the fight against threat actors, but choosing the right combination of libraries, test suites and trading models when building AI security systems is highly dependent on the situation. If you’re thinking about adopting AI in your security operations center (SOC), the following questions and considerations can help guide your decision-making.

What Problem Are You Trying to Solve?

Spam detection, intrusion detection, malware detection and natural language-based threat hunting are all very different problem sets that require different AI tools. Begin by considering what kind of AI security systems you need.

Understanding the outputs helps you test data. Ask yourself whether you’re solving a classification or regression problem, building a recommendation engine or detecting anomalies. Depending on the answers to those questions, you can apply one of four basic types of machine learning:

  1. Supervised learning trains an algorithm based on example sets of input/output pairs. The goal is to develop new inferences based on patterns inferred from the sample results. Sample data must be available and labeled. For example, designing a spam detection model by learning from samples labeled spam/nonspam is a good application of supervised learning.
  2. Unsupervised learning uses data that has not been labeled, classified or categorized. The machine is challenged to identify patterns through processes such as cluster analysis, and the outcome is usually unknown. Unsupervised machine learning is good at discovering underlying patterns and data, but is a poor choice for a regression or classification problem. Network anomaly detection is a security problem that fits well in this category.
  3. Semisupervised learning uses a combination of labeled and unlabeled data, typically with the majority being unlabeled. It is primarily used to improve the quality of training sets. For exploit kit identification problems, we can find some known exploit kits to train our model, but there are many variants and unknown kits that can’t be labeled. We can use semisupervised learning to address the problem.
  4. Reinforcement learning seeks the optimal path to a desired result by continually rewarding improvement. The problem set is generally small, and the training data well-understood. An example of reinforcement learning is a generative adversarial network (GAN), such as this experiment from Cornell University in which distance, measured in the form of correct and incorrect bits, is used as a loss function to encrypt messages between two neural networks and avoid eavesdropping by an unauthorized third neural network.

Artificial Intelligence Depends on Good Data

Machine learning is predicated on learning from data, so having the right quantity and quality is essential. Security leaders should ask the following questions about their data sources to optimize their machine learning deployments:

  • Is there enough data? You’ll need a sufficient amount to represent all possible scenarios that a system will encounter.
  • Does the data contain patterns that machine learning systems can learn from? Good data sets should have frequently recurring values, clear and obvious meanings, few out-of-range values and persistence, meaning that they change little over time.
  • Is the data sparse? Are certain expected values missing? This can create misleading results.
  • Is the data categorical or numeric in nature? This dictates the choice of the classifier we can use.
  • Are labels available?
  • Is the data current? This is particularly important in AI security systems because threats change so quickly. For example, a malware detection system that has been trained on old samples will have difficulty detecting new malware variations.
  • Is the source of the data trusted? You don’t want to train your model from publicly available data of origins you don’t trust. Data sample poisoning is just one attack vector through which machine learning-based security models are compromised.

Choosing the Right Platforms and Tools

There is a wide variety of platforms and tools available on the market, but how do you know which is the right one for you? Ask the following questions to help inform your choice:

  • How comfortable are you in a given language?
  • Does the tool integrate well with your existing environment?
  • Is the tool well-suited for big data analytics?
  • Does it provide built-in data parsing capabilities that enable the model to understand the structure of data?
  • Does it use a graphical or command-line interface?
  • Is it a complete machine learning platform or just a set of libraries that you can use to build models? The latter provides more flexibility, but also has a steeper learning curve.

What About the Algorithm?

You’ll also need to select an algorithm to employ. Try a few different algorithms and compare to determine which delivers the most accurate results. Here are some factors that can help you decide which algorithm to start with:

  • How much data do you have, and is it of good quality? Data with many missing values will deliver lower-quality results.
  • Is the learning problem supervised, unsupervised or reinforcement learning? You’ll want to match the data set to the use case as described above.
  • Determine the type of problem being solved, such as classification, regression, anomaly detection or dimensionality reduction. There are different AI algorithms that work best for each type of problem.
  • How important is accuracy versus speed? If approximations are acceptable, you can get by with smaller data sets and lower-quality data. If accuracy is paramount, you’ll need higher quality data and more time to run the machine learning algorithms.
  • How much visibility do you need into the process? Algorithms that provide decision trees show you clearly how the model reached a decision, while neural networks are a bit of a black box.

How to Train, Test and Evaluate AI Security Systems

Training samples should be constantly updated as new exploits are discovered, so it’s often necessary to perform training on the fly. However, training in real time opens up the risk of adversarial machine learning attacks in which bad actors attempt to disrupt the results by introducing misleading input data.

While it is often impossible to perform training offline, it is desirable to do so when possible so the quality of the data can be regulated. Once the training process is complete, the model can be deployed into production.

One common method of testing trained models is to split the data set and devote a portion of the data — say, 70 percent — to training and the rest to testing. If the model is robust, the output from both data sets should be similar.

A somewhat more refined approach called cross-validation divides the data set into groups of equal sizes and trains on all but one of the groups. For example, if the number of groups is “n,” then you would train on n-1 groups and test with the one set that is left out. This process is repeated many times, leaving out a different group for testing each time. Performance is measured by averaging results across all repetitions.

Choice of evaluation metrics also depends on the type of problem you’re trying to solve. For example, a regression problem tries to find the range of error between the actual value and the predicted value, so the metrics you might use include mean absolute error, root mean absolute error, relative absolute error and relative squared error.

For a classification problem, the objective is to determine which categories new observations belong in — which requires a different set of quality metrics, such as accuracy, precision, recall, F1 score and area under the curve (AUC).

Deployment on the Cloud or On-Premises?

Lastly, you’ll need to select a location for deployment. Cloud machine learning platforms certainly have advantages, such as speed of provisioning, choice of tools and the availability of third-party training data. However, you may not want to share data in the cloud for security and compliance reasons. Consider these factors before choosing whether to deploy on-premises or in a public cloud.

These are just a few of the many factors to consider when building security systems with artificial intelligence. Remember, the best solution for one organization or security problem is not necessarily the best solution for everyone or every situation.

The post How to Choose the Right Artificial Intelligence Solution for Your Security Problems appeared first on Security Intelligence.