Category Archives: machine learning

Training Transformers for Cyber Security Tasks: A Case Study on Malicious URL Prediction

Highlights       

  • Perform a case study on using Transformer models to solve cyber security problems
  • Train a Transformer model to detect malicious URLs under multiple training regimes
  • Compare our model against other deep learning methods and show that it performs on par with other top-scoring models
  • Identify issues with applying generative pre-training, a cornerstone of Transformer training in natural language processing (NLP) tasks, to malicious URL detection
  • Introduce a novel loss function that balances classification and generative loss to achieve improved performance on the malicious URL detection task

Introduction

Over the past three years, Transformer machine learning (ML) models, or “Transformers” for short, have yielded impressive breakthroughs in a variety of sequence modeling problems, particularly in natural language processing (NLP). For example, OpenAI’s latest GPT-3 model is capable of generating long segments of grammatically correct prose from scratch. Spinoff models, such as those developed for question answering, are capable of correlating context across multiple sentences. AI Dungeon, a single- and multiplayer text adventure game, uses Transformers to generate plausible, unlimited content in a variety of fantasy settings. Transformers’ NLP modeling capabilities are apparently so powerful that they pose security risks in their own right, given their potential to spread disinformation; on the other side of the coin, however, they can be used as powerful tools to detect and mitigate disinformation campaigns. For example, in previous research by the FireEye Data Science team, an NLP Transformer was fine-tuned to detect disinformation on social media sites.

Given the power of these Transformer models, it seems natural to wonder whether we can apply them to other types of cyber security problems that do not necessarily involve natural language. In this blog post, we discuss a case study in which we apply Transformers to malicious URL detection. Studying Transformer performance on the URL detection problem is a logical first step toward extending Transformers to more generic cyber security tasks, since URLs are not technically natural language sequences but share some common characteristics with them.

In the following sections, we outline a typical Transformer architecture and discuss how we adapt it to URLs with a character-focused tokenization. We then discuss loss functions we employ to guide the training of the model, and finally compare our training approaches to more conventional ML-based modeling options.

Adapting Transformers to URLs

Our URL Transformer operates at the character level, where each character in the URL corresponds to an input token. When a URL is input to our Transformer, it is appended with special tokens—a classification token (“CLS”) that conditions the model to produce a prediction and padding tokens (“PAD”) that normalize the input to a fixed length to allow for parallel training. Each token in the input string is then projected into a character embedding space, followed by a stack of Attention and Feed-Forward Neural Network (FFNN) layers. This stack of layers is similar to the architecture introduced in the original Transformers paper. At a high level, the Attention layers allow each input to be associated with long-distance context of other characters that are important for the classification task, similar to the notion of attention in humans, while the FFNN layers provide capacity for learning the relationships among the combination of inputs and their respective contexts. An illustration of our architecture is shown in Figure 1.
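
To make the pieces concrete, below is a minimal sketch of such an architecture in PyTorch. All names, dimensions, and layer counts are illustrative assumptions for exposition, not the settings of our actual model, and positional encodings are omitted for brevity.

    import torch
    import torch.nn as nn

    PAD, CLS = 0, 1   # special token ids (assumed); URLs contain no control characters
    VOCAB = 128       # printable ASCII plus the two special tokens
    MAX_LEN = 256     # fixed length after padding, enabling parallel training

    class URLTransformer(nn.Module):
        def __init__(self, d_model=128, nhead=4, num_layers=4):
            super().__init__()
            self.embed = nn.Embedding(VOCAB, d_model, padding_idx=PAD)
            layer = nn.TransformerEncoderLayer(d_model, nhead, dim_feedforward=4 * d_model)
            self.encoder = nn.TransformerEncoder(layer, num_layers)  # Attention + FFNN stack

        def forward(self, tokens, attn_mask=None):
            x = self.embed(tokens).transpose(0, 1)  # (batch, seq) -> (seq, batch, d_model)
            return self.encoder(x, mask=attn_mask)  # one embedding per input position

    def tokenize(url: str) -> torch.Tensor:
        # one token per character, then the CLS token, then PAD out to MAX_LEN
        ids = [min(ord(c), VOCAB - 1) for c in url[: MAX_LEN - 1]]
        ids.append(CLS)
        ids += [PAD] * (MAX_LEN - len(ids))
        return torch.tensor([ids])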

Additionally, the URL Transformer employs a masking strategy in its Attention calculation, which enforces a left-to-right (L-R) dependence. This means that only input characters from the left of a given character influence that character’s representation in each layer of the attention stack. The network outputs one embedding for each input character, which captures all information learned by the model about the character sequence up to that point in the input.
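
Under the assumptions of the sketch above, this L-R constraint is a standard causal attention mask: position j may attend only to positions at or before j.

    def causal_mask(seq_len: int) -> torch.Tensor:
        # -inf above the diagonal blocks attention to characters to the right
        mask = torch.full((seq_len, seq_len), float('-inf'))
        return torch.triu(mask, diagonal=1)

    # usage: model(tokens, attn_mask=causal_mask(tokens.size(1)))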

Once the model is trained, we can use the URL Transformer to perform several different tasks, such as generatively predicting the next character in the input sequence by feeding the sequence embedding into another neural network with a softmax output over the possible vocabulary of characters. A specific example of this is shown in Figure 1, where we take the embedding of the input “firee” and use it to predict the next most likely character, “y.” Similarly, we can use the embedding produced after the classification token to predict other properties of the input sequence, such as its likelihood of maliciousness.


Figure 1: High-level overview of the URL Transformer architecture

Loss Functions and Training Regimes

With the model architecture in hand, we now turn to the question of how to train the model to most effectively detect malicious URLs. Of course, we can train this model much like other supervised deep learning classifiers, by: (1) making predictions on samples from a labeled training set, (2) using a loss function to measure the quality of our predictions, and (3) tuning model parameters (i.e., weights) via backpropagation. However, the nature of the Transformer model allows for several interesting variations on this training regime. In fact, one of the reasons Transformers have become so popular for NLP tasks is that they allow for self-supervised generative pre-training, which takes advantage of massive amounts of unlabeled data to help the model learn general characteristics of the input language before being fine-tuned on the ultimate task at hand (e.g., question answering, sentiment analysis, etc.). Here, we outline some of the training regimes we explored for our URL Transformer model.

Direct Label Prediction (Decode-To-Label)

Using a training set of URLs with malicious and benign labels, we can treat the URL Transformer architecture as a feature extractor, whose outputs we use as the input to a traditional classifier (e.g., FFNN or even a random forest). When using a FFNN as our classifier, we can backpropagate the classification loss (e.g., binary cross-entropy) through both the classifier and the Transformer network to adjust the weights to perform classification. This training regime is the baseline for our experiments and is how most deep learning models are trained for classification tasks.
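
Continuing the earlier sketch (and reusing its hypothetical model and causal_mask), one decode-to-label training step might look like the following; the head sizes and optimizer settings are again illustrative assumptions.

    import torch.nn.functional as F

    clf_head = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))
    opt = torch.optim.Adam(list(model.parameters()) + list(clf_head.parameters()))

    def train_step(tokens, labels, cls_positions):
        h = model(tokens, attn_mask=causal_mask(tokens.size(1)))   # (seq, batch, d)
        cls_emb = h[cls_positions, torch.arange(tokens.size(0))]   # embedding at each CLS token
        logits = clf_head(cls_emb).squeeze(-1)
        loss = F.binary_cross_entropy_with_logits(logits, labels.float())
        opt.zero_grad()
        loss.backward()   # gradients flow through both the head and the Transformer
        opt.step()
        return loss.item()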

Next-Character Prediction Pre-Training and Fine-Tuning

Beyond the baseline classification training regime, the NLP literature suggests that one can learn a self-supervised embedding of the input sequence by training the Transformer to perform a next-character prediction task, then fine-tuning the learned representation for the classification problem. A key advantage of this approach is that data used for pre-training does not require malicious or benign labels; instead, the next characters in a URL serve as the labels to be predicted from prior characters in the sequence. This is similar to the example given in Figure 1, where the embedding output is used to predict the next character, “y,” in “fireeye.com.” Overall, this training regime allows us to take advantage of the massive amount of unlabeled data that is typically available in cyber security-related problems.

The overall structure of the architecture for this regime is similar to that of the aforementioned binary classification task, with FFNN layers added for the prediction task. However, since we are now predicting one of many classes (i.e., one class per character in the vocabulary), we apply a softmax function to the output to induce a probability distribution over the potential output characters. Once the Transformer portion of the network is pre-trained in this way, we can swap the FFNN layers used for character prediction with new layers that are then trained for the malicious URL classification problem, as in the decode-to-label case.
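
In the same sketch, the pre-training objective replaces the classification head with a linear layer over the character vocabulary; cross-entropy (which applies the softmax internally) scores each position's prediction of the next character.

    lm_head = nn.Linear(128, VOCAB)  # logits over the character vocabulary

    def pretrain_step(tokens):
        h = model(tokens, attn_mask=causal_mask(tokens.size(1)))  # (seq, batch, d)
        logits = lm_head(h[:-1])                 # predictions from positions 0..n-2
        targets = tokens.transpose(0, 1)[1:]     # the characters at positions 1..n-1
        return F.cross_entropy(logits.reshape(-1, VOCAB),
                               targets.reshape(-1), ignore_index=PAD)

    # after pre-training, lm_head is set aside and a fresh clf_head is trained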

Balanced Mixed-Objective Training

Prior work has shown that imbuing the training process with additional knowledge outside of the primary task can help constrain the learning process, and ultimately result in better models. For instance, a malware classifier might train using loss functions that capture malicious/benign classification, malware family prediction, and tag prediction tasks as a mechanism to provide the classifier with broader understanding of the problem than looking at malicious/benign labels in isolation.

Inspired by these findings, we also introduced a mixed-objective training regime for our URL Transformer, in which we train for binary classification and next-character prediction simultaneously. At each iteration of training, we compute loss multipliers such that each loss term's contribution to the net loss is fixed prior to backpropagation, ensuring that neither loss term dominates during training. Specifically, for minibatch i, the net loss L_Mixed is computed as

L_Mixed = α_i · L_CLS + β_i · L_Next

where, given hyperparameters a and b defined such that a + b := 1, the constants α_i and β_i are chosen so that the net contribution of L_CLS to L_Mixed is a and the net contribution of L_Next to L_Mixed is b. For our evaluations, we set a := b := 0.5, effectively requiring that the model balance its ability to generate the next character against its ability to accurately predict malicious URLs.
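
One way to realize this balancing in code is to scale each loss by its own detached value, so that the scaled contributions are exactly a and b for every minibatch. This is our reading of the scheme above; the exact normalization in the white paper may differ.

    a, b = 0.5, 0.5
    EPS = 1e-8

    def mixed_loss(l_cls, l_next):
        # detached values act as constants, so each term contributes a fixed amount
        w_cls = a / (l_cls.detach() + EPS)
        w_next = b / (l_next.detach() + EPS)
        return w_cls * l_cls + w_next * l_next   # evaluates to a + b = 1 by construction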

Evaluation

To evaluate our URL Transformer model and better understand the impact of the three training regimes discussed earlier, we collected a dataset of over 1M labeled malicious and benign URLs, which we split into roughly 700K training samples, 100K validation samples, and 200K test samples. We also assembled an unlabeled pre-training dataset of 20M URLs.

Using this data, we performed four different training runs for our Transformer model:

  1. DecodeToLabel (Baseline): Using strictly the binary cross-entropy loss on the embedded classification features over the entire sequence, we trained the model for 15 epochs using the training set.
  2. MixedObjective: We trained the model for 15 epochs on the training set, using both the embedded classification features and the embedded next-character prediction features.
  3. FineTune: We pre-trained the model for 15 epochs on the next-character prediction task using the training set, ignoring the malicious/benign labels. We then froze the weights of the first 16 layers of the model (see the sketch after this list) and trained for an additional 15 epochs using a binary cross-entropy loss on the classification labels.
  4. FineTune 20M: We performed pre-training on the next-character prediction task using the 20M URL dataset, pre-training for 2 epochs. We then froze weights over the first 16 layers of the Transformer and trained for 15 epochs on the binary classification task.
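
The layer freezing referenced in regimes 3 and 4 amounts to disabling gradients for the early layers. This sketch assumes the encoder stack from the earlier skeleton, scaled to at least 16 layers; module names are illustrative.

    # freeze the first 16 encoder layers; only later layers and the new head train
    for layer in model.encoder.layers[:16]:
        for p in layer.parameters():
            p.requires_grad = False

    opt = torch.optim.Adam([p for p in model.parameters() if p.requires_grad]
                           + list(clf_head.parameters()))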

The ROC curves shown in Figure 2 compare the performance of these four training regimes. Our baseline DecodeToLabel model (red) yielded an ROC curve with an AUC of 0.9484, while the MixedObjective model (green) slightly outperformed the baseline with an AUC of 0.956. Interestingly, both fine-tuning models yielded poor classification results, which runs counter to established practice with Transformer models in the NLP domain.


Figure 2: ROC curves for four URL Transformer training regimes

To assess the relative efficacy of our Transformer models on this dataset, we also fit several other types of benchmark models developed for URL classification: (1) a Random Forest model on SME-derived features, (2) a 1D Convolutional Neural Network (CNN) model on character embeddings, and (3) a Long Short-Term Memory (LSTM) neural network on character embeddings. Details of these models can be found in our white paper; notably, our top-performing Transformer model performs on par with the best-performing non-Transformer baseline (the 1D CNN model), which perhaps indicates that the long-range dependencies typically learned by Transformer models are not as useful for malicious URL detection.


Figure 3: ROC curves comparing URL Transformer to other benchmark URL classification models

Summary

Our experiments suggest that Transformers can achieve performance comparable to or better than that of other top-performing models for URL classification, though the details of how to achieve that performance differ from common practice. Contrary to findings in the NLP domain, where self-supervised pre-training substantially enhances performance on a fine-tuned classification task, similar pre-training approaches actually diminished performance on malicious URL detection. This suggests that the next-character prediction task correlates too weakly with malicious/benign prediction for effective, stable transfer.

Interestingly, utilizing next-character prediction as an auxiliary loss function in conjunction with a malicious/benign loss yields improvements over training solely to predict the label. We hypothesize that while pre-training leads to a relatively poor generative model due to randomized content in the URLs within our dataset, a malicious/benign loss may serve to better condition the generative model learned by the next-character prediction task, distilling a subset of relevant information. It may also be the case that the long-distance relationships that are key to the generative pre-training task are not as important for the final malicious URL classification, as evidenced by the performance of the 1D CNN model.

Note that we did not perform a rigorous hyperparameter search for our Transformer, since this research was primarily concerned with loss functions and training regimes. Therefore, it is still an open question as to whether a more optimal architecture, specifically designed for this classification task, could substantially outperform the models described here.

While our URL dataset is not representative of all data in the cyber security space, the difficulty of obtaining a readily fine-tuned model from self-supervised pre-training suggests that this approach is unlikely to work well for training Transformers on longer sequences, or on sequences that bear less resemblance to natural language (e.g., PE files); an auxiliary loss, however, might still work.

Details about this research and additional results can be found in our associated white paper.

The Bots That Stole Christmas

Intro

Who remembers heading out the night before ticket sales opened for your favorite band and camping out with all the other crazy fans who were in queue to buy the best seats when it opened the following morning? Or doing the same at a game store because a new game was coming out the next day and you needed to be the first to finish the campaign?! I do.

These scenarios are quickly becoming a thing of the past, as these environments are now mechanized and favor machines, not humans. Machines will not take over in the form of Skynet, but in the form of everyday automation, and this machine-scale world is already here today. This holiday season, I found myself in that exact position as I tried to get the new PlayStation 5 (PS5) via every single avenue I could. Each time, I was met with machines beating me to the punch. Online retail is no longer a human-scale offering, but rather an opportunity for bots and machines to outmaneuver and outperform the average buyer and help someone with often less-than-scrupulous morals make a quick buck on people’s fear of missing out (FOMO). In this blog, I want to share that experience and then show how this extends to what is coming for information security. It’s time to defend at machine-scale or die!

This whole scenario makes me think back to a quote from The Matrix:

“Throughout human history, we have been dependent on machines to survive. Fate, it seems, is not without a sense of irony.”

The plan: get the new PS5 via an online retailer, wrap it, and have it ready for Christmas morning. Sounds easy enough. Yet Christmas has passed and still no PS5 in sight. I’m a Distinguished Engineer, so it is not that I am new to technology; my failure came down to shopping in the traditional manner, showing up at a website at a certain time and transacting with my browser until my order is complete. That’s the old way. The new way is to employ software automation on your behalf so that your shopping task operates at machine-scale, not human-scale. No matter how fast you can get that item into your cart and through checkout, odds are you’re not faster than a series of bots doing the same thing en masse.

The first community to harness this unfair advantage is the folks who don’t want the item for themselves, but instead want to exploit its scarcity by reselling it on online auction sites for a profit. In the case of the PS5, the item retails at 499.99 USD, while scalpers now regularly sell units at 1,100.00 USD on places like eBay. They have rightfully earned the name Grinch Bots. Many online retailers are aware of this activity and actively trying to thwart it, blocking tens of millions of bot attempts within the first 30 minutes of a new batch going on sale.

There’s a bot for that!

When mobile phones were coming of age, everyone would say “there’s an app for that!” These days, it is more likely that you will want to claim that “there’s a bot for that!” Yes, that is right: you can find services on the internet that will use bots to do your bidding, allowing you to operate at machine speed and machine-scale. There are even services out there that compare bot services to one another. So, the question becomes: to shop for high-demand items on the internet, will I need to employ bots?

My experience says YES, you will.

These shopping bot services are not illegal (yet). The US already has legislation in this vein: the 2016 BOTS Act made it illegal to use software to scalp tickets, and a similar Stopping Grinch Bots Act has now been proposed to target people who use bots to circumvent retailers’ anti-bot protections.

And before you start thinking that this is just someone’s home project or a side-hustle, some of these bot groups have been known to make millions in profits over the course of a few weeks!

The machine-scale megatrend

The megatrend here is what we used to call “digitization,” but there is a bit more to it than that. Retail, once a completely manual process, was first augmented by machines and is now almost fully automated by them, which brings huge advantages – both for the good guys and the bad guys. At what point are you automated enough to consider your business to be operating at machine-scale? The fact of the matter is that, like online shopping, you can no longer defend your business at human-scale. I’m not talking about a future that is years out; I am talking about right now. You are facing an adversary with easy access to machine-speed, machine-scale perception, and machine-scale operations. Are you ready for this next level of threat actor?

A few questions you may want to consider when assessing your readiness:

  • What percentage of threat detection is automated versus manual?
  • For the automated detection, is the fidelity high enough to be safe to automate a response?
  • How much of your infrastructure can be automated safely?
    • How much is still too dangerous to automate and why?
  • What are your automation goals for this year, in 3 years, and in 5 years? Will you ever get to 70% automated? 80%?

Automating what was once manual is always considered progress – at least when it works as designed.

As security professionals, we must also do our threat modeling and design systems that can operate in a hostile environment with an active, learning set of adversaries.

While I still don’t have a subscription to a bot service to buy a PS5, the game of cybersecurity is one that I consider more fun, more engaging, and one that I am subscribed to whether I like it or not.

Enterprises move on from legacy approaches to software development

Application development and maintenance services in the U.S. are evolving to meet changing demands from enterprises that need dynamic applications with rich user interfaces, according to a report published by Information Services Group. The report finds that the growing ranks of companies undergoing digital transformation want to modernize their software portfolios and continuously update their applications. Service providers are meeting these requirements through next-generation ADM services, which … More


Repurposing Neural Networks to Generate Synthetic Media for Information Operations

FireEye’s Data Science and Information Operations Analysis teams released this blog post to coincide with our Black Hat USA 2020 Briefing, which details how open source, pre-trained neural networks can be leveraged to generate synthetic media for malicious purposes. To summarize our presentation, we first demonstrate three successive proof of concepts for how machine learning models can be fine-tuned in order to generate customizable synthetic media in the text, image, and audio domains. Next, we illustrate examples in which synthetically generated media have been weaponized for information operations (IO), as detected on the front lines by Mandiant Threat Intelligence. Finally, we outline challenges in detecting synthetically generated content, and lay out potential paths forward in a future where synthetically generated media will increasingly look, speak, and write like us.

Highlights

  • Open source, pre-trained natural language processing, computer vision, and speech recognition neural networks can be weaponized for offensive social media-driven IO campaigns.
  • Detection, attribution, and response is challenging in scenarios where actors can anonymously generate and distribute credible fake content using proprietary training datasets.
  • The security community can and should help AI researchers, policy makers, and other stakeholders mitigate the harmful use of open source models.

Background: Synthetic Media, Generative Models, and Transfer Learning

Synthetic media is by no means a new development; methods for manipulating media for specific agendas are as old as the media themselves. In the 1930s, the chief of the Soviet secret police was photographed walking alongside Joseph Stalin, then retouched out of an official press photo after he himself was arrested and executed during the Great Purge. Digital graphic manipulation like this became prominent with the advent of Photoshop. Later, in the 2010s, the term “deepfake” was coined. While deepfake videos, including techniques like face swapping and lip syncing, are concerning in the long term, this blog post focuses on more basic, but, we argue, more believable, advancements in synthetic media generation in the text, static image, and audio domains. Machine learning approaches for creating synthetic media are underpinned by generative models, which have been misused to fabricate high-volume submissions to federal public comment websites and to clone a voice that tricked an executive into handing over $240,000.

The pre-training required to produce models capable of synthetic media generation can cost thousands of dollars, take weeks or months, and require access to expensive GPU clusters. However, the application of transfer learning can drastically reduce the amount of time and effort involved. In transfer learning, we start from a large generic model that has been pre-trained on an initial task for which copious data is available. We then leverage the model’s acquired knowledge to train it further on a different, smaller dataset so that it excels at a subsequent, related task. This process of training the model further is referred to as fine-tuning, and it typically requires fewer resources than pre-training from scratch. You can think of this in more relatable terms—if you’re a professional tennis player, you don’t need to completely relearn how to swing a racket in order to excel at badminton.
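
As a concrete illustration of the fine-tuning step, here is a minimal sketch using the open source Hugging Face transformers library, with the small GPT-2 checkpoint standing in as the generic pre-trained model; any causal language model would do, and the example text is purely hypothetical. A single update on a task-specific example looks like this:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")  # generic pre-training, reused
    opt = torch.optim.AdamW(model.parameters(), lr=5e-5)

    batch = tok("An example drawn from the smaller, task-specific dataset.",
                return_tensors="pt")
    out = model(**batch, labels=batch["input_ids"])  # causal LM loss on the new domain
    out.loss.backward()
    opt.step()

Looping this update over the smaller dataset adapts the model's acquired knowledge to the new domain at a fraction of the cost of pre-training from scratch.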
