Category Archives: machine learning

Three reasons why Stealthwatch Cloud is a modern-day cloud security solution

The world is changing, and so are your workloads. What was once bound to the on-prem data center, a large cluster of tech stacks at a local branch office, is now effortlessly deployed in the ‘cloud’. The cloud is a way to store data that requires no physical infrastructure for the end-user to manage. Cloud migration is a hot topic and has been on CISO’s minds for years. As you could imagine, this is a cheaper and easier way to manage workloads and store sensitive data. Why wouldn’t you want to migrate your business into the cloud? In a recent episode of the Cloud Unfiltered podcast, John Heintz, Technical Solutions Architect at Cisco, sat down to discuss the state of the cloud and the challenges that organizations are facing as they attempt to migrate and leverage cloud resources, secure their remote workers and more. Further, Heintz explains why Cisco Stealthwatch Cloud supports a holistic approach to security, is simple to use and compatible with hybrid-cloud deployments, and is even better with Cisco SecureX.

Stealthwatch Cloud supports a holistic approach to security

Stealthwatch Cloud, Tetration and AppDynamics come together to provide full protection at the network, workload, and application layers respectively. Stealthwatch Cloud gives users confidence that, whether they are breached using stolen credentials or top-tier workarounds to firewall policies, they will be alerted to any malicious behavior. Even with these tools activated, threats can still get through. Stealthwatch Cloud uses the network itself as a sensor and detect threats through various methods of behavioral modeling. After a number of days, Stealthwatch Cloud understands what is normal and will alert users of deviations or anomalies. That is, if a printer is not acting in a way that a printer would normally act, Stealthwatch Cloud will detect it and flag an alert.

Multi-cloud is complicated. Stealthwatch Cloud is not.

Stealthwatch Cloud ingests all of your public cloud telemetry and detects threats like no other security solution. Solutions from other vendors support multiple clouds but only Stealthwatch Cloud can see into native telemetry like VPC and NSG logs. Additionally, it uses this information to generate alerts that are unique to various public cloud vendors like AWS, Azure and GCP. Deployment is simple and users will even receive a report of all findings during the free trial period. Stealthwatch Cloud is valuable to all kinds of businesses, from large enterprises to “mom and pop pizza shops” according to Heintz. The tool is SaaS-delivered and priced uniquely for each customer so that even those with very small deployments can feel confident that their cloud instance is secure.

Cisco SecureX and Stealthwatch Cloud are better together

SecureX connects the breadth of Cisco’s integrated security portfolio and your entire security infrastructure for a consistent experience that unifies visibility, enables automation, and strengthens security across your network, endpoint, cloud, and applications. Stealthwatch Cloud’s power is in network traffic analysis.

It provides visibility into the public cloud and applies advanced security analytics to detect threats in real-time. SecureX increases the effectiveness of Stealthwatch Cloud, as it ties it to tools like AMP for Endpoints and ISE, that allow for quick host isolation and other remediation methods. In return, SecureX can use east/west data generated from Stealthwatch Cloud to see how threats are moving laterally through the network.

This is only the beginning of cloud migration. A rise in remote workers has shown businesses that they are capable of leaving on-prem data centers behind and Stealthwatch Cloud is the perfect tool to help them utilize the public cloud securely.

Be sure to listen to the Cloud Unfiltered podcast, and sign up for a 60-day free trial today.

The post Three reasons why Stealthwatch Cloud is a modern-day cloud security solution appeared first on Cisco Blogs.

Key cybersecurity industry challenges in the next five years

What key challenges will the cybersecurity industry be dealing with in the next five years? Pete Herzog, Managing Director at ISECOM, is so sure that artificial intelligence could be the biggest security problem to solve and the biggest answer to the privacy problem that he cofounded a company, Urvin.ai, with an eclectic group of coders and scientists to explore this. AI (and machine learning with it) is like a naive child that trusts what you … More

The post Key cybersecurity industry challenges in the next five years appeared first on Help Net Security.

Does analyzing employee emails run afoul of the GDPR?

A desire to remain compliant with the European Union’s General Data Protection Regulation (GDPR) and other privacy laws has made HR leaders wary of any new technology that digs too deeply into employee emails. This is understandable, as GDPR non-compliance pay lead to stiff penalties. At the same time, new technologies are applying artificial intelligence (AI) and machine learning (ML) to solve HR problems like analyzing employee data to help with hiring, completing performance reviews … More

The post Does analyzing employee emails run afoul of the GDPR? appeared first on Help Net Security.

Choosing the right security analytics solution as networks expand: competitive considerations and customer validation

May you live in interesting times…

We’re currently witnessing a fundamental shift towards a more remote workforce amidst the tumultuous world events of 2020. This recent development made me recall the tongue-in-cheek, 19th century adage – “May you live in interesting times”. This old proverb served as a curse disguised as a fortuitous blessing, as the concept of “interesting times” referred to living through turbulent and unsavory world events. And while I personally don’t believe in curses, the shift that’s currently underway with respect to how and where users work does stand to increase cyber-risks for organizations that are ill-equipped to securely monitor network footprints that have expanded to remote workers’ homes.

 Choosing the right security analytics solution as your networks expand

As networks continue to expand and become more complex, oftentimes at the cost of visibility and control, the recent transition towards remote work as the norm has only further accelerated the demand for network security analytics solutions. As a quick side-note, various terms are used to refer to these solutions – Network Traffic Analysis (NTA), Network Analysis and Visibility (NAV), Network Detection and Response (NDR).

In a recent and timely Business Insider article on the rise of Network Traffic Analysis (NTA) tools, Cisco Stealthwatch was recognized as the top solution. Business Insider’s rationale for listing Cisco Stealthwatch as the top solution was based off of a review it conducted on IT Central Station’s Ranking Scores and customer testimonials. IT Central Station awarded Stealthwatch a Ranking Score of 72 versus scores that ranged from 22 – 54 for other vendors, which came as no surprise due to the tool’s multitude of competitive differentiators.

Selecting the right security analytics solution is critical, especially now, as organizations face evolving network dynamics such as expansion in breadth and complexity, the transition to the cloud, and the shift to remote workers. With that in mind, below are 5 features that exemplify key competitive differentiators that make Stealthwatch the most suitable solution to meet these new network challenges:

Five reasons why Stealthwatch is the most competitive solution:

Scalability – Whereas most tools require purchasing and deploying sensors at every control point, Stealthwatch is agentless so that as your network grows, the solution is both scalable and cost effective.

Encrypted Traffic Analytics (ETA) – Stealthwatch is the only tool that can perform analytics on encrypted traffic without decryption to detect stealthy malware and to ensure that compliance standards are met.

Comprehensive visibility – Stealthwatch not only ingests telemetry from multiple network devices like routers, switches and firewalls, but is also the only solution that can monitor all major cloud environments such as Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform, as well as private cloud environments, Kubernetes, and serverless networks through its SaaS-based offering Stealthwatch Cloud. Additionally, it gets user contextual data from Cisco Identity Services Engine (ISE) and also ingests proxy, web, and endpoint data to provide comprehensive visibility.

Best-in-class analytics capabilities – Having been around for 20 years, Stealthwatch has a mature and proven anomaly detection system with nearly 100 time-tested behavioral models and approximately 400 machine learning algorithms. It’s also updated hourly by intelligence from Cisco Talos which has over 300 researchers and is the largest non-governmental threat intelligence team in the world.

Industry-leading response capabilities – Stealthwatch comes with the Cisco SecureX platform built-in to provide additional rich contextual data from your environment. The integrated platform approach also includes SecureX threat response, which automates integrations and centralizes information across Cisco security tools to expedite incident investigations and remediations. Stealthwatch also integrates with Cisco Identity Services Engine (ISE) so that upon detection of a threat, it can automatically invoke ISE to change the access policy of a user or device.

 Still not convinced?

Stealthwatch has consistently been recognized as the leading network security analytics solution, but if you’re still not convinced, that’s fair. Why would you take my word for it? Naturally, as an ambassador of Stealthwatch and Cisco’s integrated security portfolio, I’m biased. Your methodology for evaluating solutions should be data-driven and rely on information from objective third-party sources. So, if you would like further validation, check out Stealthwatch’s TechValidate research and Gartner Peer Insights results to hear from actual customers.

Summary: Caveat Emptor

I’ll close by referencing another age-old idiom that comes to mind when I think about the current marketplace – “Caveat Emptor” – Latin for “Buyer Beware”. Over the course of your research, you’ll likely come to the realization that not all solutions have been created equal. Although Stealthwatch’s rankings and customer testimonials speak for themselves, I’d be remiss to not say that you should still get under the hood and perform your own due diligence.

New vulnerabilities have emerged as a result of the ongoing transition towards a more remote workforce and malicious actors aren’t standing by idly or waiting for organizations to adapt their security postures to this new reality.

Don’t let attackers take advantage of your expanding network – learn more about how Cisco Stealthwatch stacks up against the competition and try the solution out for yourself today with a free visibility assessment.

The post Choosing the right security analytics solution as networks expand: competitive considerations and customer validation appeared first on Cisco Blogs.

The COVID-19 Pandemic Dominates the Cybersecurity World

Cybersecurity is not a static world. You can say that it is a social system, it affects and is affected by its surrounding environment. For example, back in 2018, it was the GDPR that shook the foundations of security and privacy by making the protection of our personal data a fundamental human right. But that […]… Read More

The post The COVID-19 Pandemic Dominates the Cybersecurity World appeared first on The State of Security.

OPTIMUSCLOUD: Cost and performance efficiency for cloud-hosted databases

A Purdue University data science and machine learning innovator wants to help organizations and users get the most for their money when it comes to cloud-based databases. Her same technology may help self-driving vehicles operate more safely on the road when latency is the primary concern. Somali Chaterji, a Purdue assistant professor of agricultural and biological engineering who directs the Innovatory for Cells and Neural Machines [ICAN], and her team created a technology called OPTIMUSCLOUD. … More

The post OPTIMUSCLOUD: Cost and performance efficiency for cloud-hosted databases appeared first on Help Net Security.

Increased attacks and the power of a fully staffed cybersecurity team

The cybersecurity landscape is constantly evolving, and even more so during this time of disruption. According to ISACA’s survey, most respondents believe that their enterprise will be hit by a cyberattack soon – with 53 percent believing it is likely they will experience one in the next 12 months. Cyberattacks continuing to increase The survey found cyberattacks are also continuing to increase, with 32 percent of respondents reporting an increase in the number of attacks … More

The post Increased attacks and the power of a fully staffed cybersecurity team appeared first on Help Net Security.

A math formula could help 5G networks efficiently share communications frequencies

Researchers at the National Institute of Standards and Technology (NIST) have developed a mathematical formula that, computer simulations suggest, could help 5G and other wireless networks select and share communications frequencies about 5,000 times more efficiently than trial-and-error methods. NIST engineer Jason Coder makes mathematical calculations for a machine learning formula that may help 5G and other wireless networks select and share communications frequencies efficiently The novel formula is a form of machine learning that … More

The post A math formula could help 5G networks efficiently share communications frequencies appeared first on Help Net Security.

AI and ML in Cybersecurity: Adoption is Rising, but Confusion Remains

Reading Time: ~ 3 min.

If you’ve been working in the technology space for any length of time, you’ve undoubtedly heard about the rising importance of artificial intelligence (AI) and machine learning (ML). But what can these tools really do for you? More specifically, what kinds of benefits do they offer for cybersecurity and business operations?

If you’re not so sure, you’re not alone. As it turns out, although 96% of global IT decision-makers have adopted AI/ML-based cybersecurity tools, nearly 7 in 10 admit they’re not sure what these technologies do.

We surveyed 800 global IT decision-makers across the U.S., U.K., Japan, and Australia/New Zealand about their thoughts on AI and ML in cybersecurity. The report highlighted a number of interesting (and contradictory) findings, all of which indicated a general confusion about these tools and whether or not they make a difference for the businesses who use them. Additionally, nearly 3 out of 4 respondents (74%) agreed that, as long as their protection keeps them safe from cybercriminals, they really don’t care if it uses AI/ML.

Here’s a recap of key findings based on responses from all 4 regions.

  • 91% say they understand and research their security tools, and specifically look for ones that use AI/ML.
  • Yet 68% say that, although their tools claim to use AI/ML, they aren’t sure what that means.
  • 84% think their business has all it needs to successfully stop AI/ML-based cyberattacks.
  • But 86% believe they could be doing more to prevent cyberattacks.
  • 72% say it is very important that cybersecurity advertising mention the use of AI/ML.
  • However, 70% of respondents believe cybersecurity vendors’ marketing is intentionally deceptive about their AI/ML-based services.

AI and ML matter because automation matters

As we’ve all had to adjust to “the new normal”, IT professionals have had to tackle a variety of challenges. Not only have they had to figure out how to support a massive shift to working from home, but they also have to deal with the onslaught of opportunistic online scams and other cyberattacks that have surged amidst the chaos around COVID-19.

With all of us working to adapt to these new working conditions, it’s become clear tools that enable automation and productivity are pretty important. That’s where I want to highlight AI and ML. In addition to how AI/ML-based cybersecurity can drastically accelerate threat detection—and even predict shifts and emerging threat sources—these technologies can also make your workforce more efficient, more effective, and more confident.

While many of our survey respondents weren’t sure if AI/ML benefits their cybersecurity strategy, a solid percentage saw notable improvements in workforce efficiency after implementing these tools. Let’s go over those numbers.

  • 42% reported an increase in worker productivity
  • 39% saw increases in automated tasks
  • 39% felt they had more time for training, learning new skills, and other tasks
  • 38% felt more effective in their jobs
  • 37% reported a decrease in human error

As you can see, the benefits of AI and ML aren’t just hype, and they extend well beyond the cybersecurity gains. Real numbers around productivity, automation, time savings, and efficacy are pretty compelling at the best of times, let alone when we’re dealing with sudden and drastic shifts to the ways we conduct business. That’s why I can’t stress the importance of these technologies enough—not only in your security strategy, but across your entire toolset.

Where to learn more

Ultimately, AI and ML-based tools can help businesses of all sizes become more resilient against cyberattacks—not to mention increase automation and operational efficiencies—but it’s important to understand them better to fully reap the benefits they offer.

While there’s clearly still a lot of confusion about what these tools do, I think we’re going to see a continuation of the upward trend in AI/ML adoption. That’s why it’s important that IT decision-makers have the resources to educate themselves about the best ways to implement these tools, and also look to vendors who have the historical knowledge and expertise in the space to guide them.

“Realistically, we can’t expect to stop sophisticated attacks if more than half of IT decision makers don’t understand AI/ML-based cybersecurity tools. We need to do better. That means more training and more emphasis not only on our tools and their capabilities, but also on our teams’ ability to use them to their best advantage.”

– Hal Lonas, SVP and CTO for SMB and Consumer at OpenText.

For further details about how businesses around the world are using AI and ML, their plans for cybersecurity spending, and use cases, download a copy of the full AI/ML report.

And if you still aren’t sure about AI/ML-based cybersecurity, I encourage you to read our white paper, Demystifying AI in Cybersecurity, to gain a better understanding of the technology, myth vs. reality, and how it benefits the cybersecurity industry.

The post AI and ML in Cybersecurity: Adoption is Rising, but Confusion Remains appeared first on Webroot Blog.

Microsoft researchers work with Intel Labs to explore new deep learning approaches for malware classification

The opportunities for innovative approaches to threat detection through deep learning, a category of algorithms within the larger framework of machine learning, are vast. Microsoft Threat Protection today uses multiple deep learning-based classifiers that detect advanced threats, for example, evasive malicious PowerShell.

In continued exploration of novel detection techniques, researchers from Microsoft Threat Protection Intelligence Team and Intel Labs are collaborating to study new applications of deep learning for malware classification, specifically:

  • Leveraging deep transfer learning technique from computer vision to static malware classification
  • Optimizing deep learning techniques in terms of model size and leveraging platform hardware capabilities to improve execution of deep-learning malware detection approaches

For the first part of the collaboration, the researchers built on Intel’s prior work on deep transfer learning for static malware classification and used a real-world dataset from Microsoft to ascertain the practical value of approaching the malware classification problem as a computer vision task. The basis for this study is the observation that if malware binaries are plotted as grayscale images, the textural and structural patterns can be used to effectively classify binaries as either benign or malicious, as well as cluster malicious binaries into respective threat families.

The researchers used an approach that they called static malware-as-image network analysis (STAMINA). Using the dataset from Microsoft, the study showed that the STAMINA approach achieves high accuracy in detecting malware with low false positives.

The results and further technical details of the research are listed in the paper STAMINA: Scalable deep learning approach for malware classification and set the stage for further collaborative exploration.

The role of static analysis in deep learning-based malware classification

While static analysis is typically associated with traditional detection methods, it remains to be an important building block for AI-driven detection of malware. It is especially useful for pre-execution detection engines: static analysis disassembles code without having to run applications or monitor runtime behavior.

Static analysis produces metadata about a file. Machine learning classifiers on the client and in the cloud then analyze the metadata and determine whether a file is malicious. Through static analysis, most threats are caught before they can even run.

For more complex threats, dynamic analysis and behavior analysis build on static analysis to provide more features and build more comprehensive detection. Finding ways to perform static analysis at scale and with high effectiveness benefits overall malware detection methodologies.

To this end, the research borrowed knowledge from  computer vision domain to build an enhanced static malware detection framework that leverages deep transfer learning to train directly on portable executable (PE) binaries represented as images.

Analyzing malware represented as image

To establish the practicality of the STAMINA approach, which posits that malware can be classified at scale by performing static analysis on malware codes represented as images, the study covered three main steps: image conversion, transfer learning, and evaluation.

Diagram showing the steps for the STAMINA approach: pre-processing, transfer learning, and evaluation

First, the researchers prepared the binaries by converting them into two-dimensional images. This step involved pixel conversion, reshaping, and resizing. The binaries were converted into a one-dimensional pixel stream by assigning each byte a value between 0 and 255, corresponding to pixel intensity. Each pixel stream was then transformed into a two-dimensional image by using the file size to determine the width and height of the image.

The second step was to use transfer learning, a technique for overcoming the isolated learning paradigm and utilizing knowledge acquired for one task to solve related ones. Transfer learning has enjoyed tremendous success within several different computer vision applications. It accelerates training time by bypassing the need to search for optimized hyperparameters and different architectures—all this while maintaining high classification performance. For this study, the researchers used Inception-v1 as the base model.

The study was performed on a dataset of 2.2 million PE file hashes provided by Microsoft. This dataset was temporally split into 60:20:20 segments for training, validation, and test sets, respectively.

Diagram showing a DNN with pre-trained weights on natural images, and the last portion fine-tuned with new data

Finally, the performance of the system was measured and reported on the holdout test set. The metrics captured include recall at specific false positive range, along with accuracy, F1 score, and area under the receiver operating curve (ROC).

Findings

The joint research showed that applying STAMINA to real-world hold-out test data set achieved a recall of 87.05% at 0.1% false positive rate, and 99.66% recall and 99.07% accuracy at 2.58% false positive rate overall. The results certainly encourage the use of deep transfer learning for the purpose of malware classification. It helps accelerate training by bypassing the search for optimal hyperparameters and architecture searches, saving time and compute resources in the process.

The study also highlights the pros and cons of sample-based methods like STAMINA and metadata-based classification methods. For example, STAMINA can go in-depth into samples and extract additional signals that might not be captured in the metadata.  However, for bigger size applications, STAMINA becomes less effective due to limitations in converting billions of pixels into JPEG images and then resizing them. In such cases, metadata-based methods show advantages over our research.

Conclusion and future work

The use of deep learning methods for detecting threats drives a lot of innovation across Microsoft. The collaboration with Intel Labs researchers is just one of the ways in which Microsoft researchers and data scientists continue to explore novel ways to improve security overall.

This joint research is a good starting ground for more collaborative work. For example, the researchers plan to collaborate further on platform acceleration optimizations that can allow deep learning models to be deployed on client machines with minimal performance impact. Stay tuned.

 

Jugal Parikh, Marc Marino

Microsoft Threat Protection Intelligence Team

 

The post Microsoft researchers work with Intel Labs to explore new deep learning approaches for malware classification appeared first on Microsoft Security.

Attention is All They Need: Combatting Social Media Information Operations With Neural Language Models

Information operations have flourished on social media in part because they can be conducted cheaply, are relatively low risk, have immediate global reach, and can exploit the type of viral amplification incentivized by platforms. Using networks of coordinated accounts, social media-driven information operations disseminate and amplify content designed to promote specific political narratives, manipulate public opinion, foment discord, or achieve strategic ideological or geopolitical objectives. FireEye’s recent public reporting illustrates the continually evolving use of social media as a vehicle for this activity, highlighting information operations supporting Iranian political interests such as one that leveraged a network of inauthentic news sites and social media accounts and another that impersonated real individuals and leveraged legitimate news outlets.

Identifying sophisticated activity of this nature often requires the subject matter expertise of human analysts. After all, such content is purposefully and convincingly manufactured to imitate authentic online activity, making it difficult for casual observers to properly verify. The actors behind such operations are not transparent about their affiliations, often undertaking concerted efforts to mask their origins through elaborate false personas and the adoption of other operational security measures. With these operations being intentionally designed to deceive humans, can we turn towards automation to help us understand and detect this growing threat? Can we make it easier for analysts to discover and investigate this activity despite the heterogeneity, high traffic, and sheer scale of social media?

In this blog post, we will illustrate an example of how the FireEye Data Science (FDS) team works together with FireEye’s Information Operations Analysis team to better understand and detect social media information operations using neural language models.

Highlights

  • A new breed of deep neural networks uses an attention mechanism to home in on patterns within text, allowing us to better analyze the linguistic fingerprints and semantic stylings of information operations using modern Transformer models.
  • By fine-tuning an open source Transformer known as GPT-2, we can detect social media posts being leveraged in information operations despite their syntactic differences to the model’s original training data.
  • Transfer learning from pre-trained neural language models lowers the barrier to entry for generating high-quality synthetic text at scale, and this has implications for the future of both red and blue team operations as such models become increasingly commoditized.

Background: Using GPT-2 for Transfer Learning

OpenAI’s updated Generative Pre-trained Transformer (GPT-2) is an open source deep neural network that was trained in an unsupervised manner on the causal language modeling task. The objective of this language modeling task is to predict the next word in a sentence from previous context, meaning that a trained model ends up being capable of language generation. If the model can predict the next word accurately, it can be used in turn to predict the following word, and then so on and so forth until eventually, the model produces fully coherent sentences and paragraphs. Figure 1 depicts an example of language model (LM) predictions we generated using GPT-2. To generate text, single words are successively sampled from distributions of candidate words predicted by the model until it predicts an <|endoftext|> word, which signals the end of the generation.


Figure 1: An example GPT-2 generation prior to fine-tuning after priming the model with the phrase “It’s disgraceful that.”  

The quality of this synthetically generated text along with GPT-2’s state of the art accuracy on a host of other natural language processing (NLP) benchmark tasks is due in large part to the model’s improvements over prior 1) neural network architectures and 2) approaches to representing text. GPT-2 uses an attention mechanism to selectively focus the model on relevant pieces of text sequences and identify relationships between positionally distant words. In terms of architectures, Transformers use attention to decrease the time required to train on enormous datasets; they also tend to model lengthy text and scale better than other competing feedforward and recurrent neural networks. In terms of representing text, word embeddings were a popular way to initialize just the first layer of neural networks, but such shallow representations required being trained from scratch for each new NLP task and in order to deal with new vocabulary. GPT-2 instead pre-trains all the model’s layers using hierarchical representations, which better capture language semantics and are readily transferable to other NLP tasks and new vocabulary.

This transfer learning method is advantageous because it allows us to avoid starting from scratch for each and every new NLP task. In transfer learning, we start from a large generic model that has been pre-trained for an initial task where copious data is available. We then leverage the model’s acquired knowledge to train it further on a different, smaller dataset so that it excels at a subsequent, related task. This process of training the model further is referred to as fine-tuning, which involves re-learning portions of the model by adjusting its underlying parameters. Fine-tuning not only requires less data compared to training from scratch, but typically also requires less compute time and resources.

In this blog post, we will show how to perform transfer learning from a pre-trained GPT-2 model in order to better understand and detect information operations on social media. Transformers have shown that Attention is All You Need, but here we will also show that Attention is All They Need: while transfer learning may allow us to more easily detect information operations activity, it likewise lowers the barrier to entry for actors seeking to engage in this activity at scale.

Understanding Information Operations Activity Using Fine-Tuned Neural Generations

In order to study the thematic and linguistic characteristics of a common type of social media-driven information operations activity, we first fine-tuned an LM that could perform text generation. Since the pre-trained GPT-2 model's dataset consisted of 40+ GB of Internet text data extracted from 8+ million reputable web pages, its generations display relatively formal grammar, punctuation, and structure that corresponds to the text present within that original dataset (e.g. Figure 1). To make it appear like social media posts with their shorter length, informal grammar, erratic punctuation, and syntactic quirks including @mentions, #hashtags, emojis, acronyms, and abbreviations, we fine-tuned the pre-trained GPT-2 model on a new language modeling task using additional training data.

For the set of experiments presented in this blog post, this additional training data was obtained from the following open source datasets of identified accounts operated by Russia’s famed Internet Research Agency (IRA) “troll factory”:

  • NBCNews, over 200,000 tweets posted between 2014 and 2017 tied to IRA “malicious activity.”
  • FiveThirtyEight, over 1.8 million tweets associated with IRA activity between 2012 and 2018; we used accounts categorized as Left Troll, Right Troll, or Fearmonger.
  • Twitter Elections Integrity, almost 3 million tweets that were part of the influence effort by the IRA around the 2016 U.S. presidential election.
  • Reddit Suspicious Accounts, consisting of comments and submissions emanating from 944 accounts of suspected IRA origin.

After combining these four datasets, we sampled English-language social media posts from them to use as input for our fine-tuned LM. Fine-tuning experiments were carried out in PyTorch using the 355 million parameter pre-trained GPT-2 model from HuggingFace’s transformers library, and were distributed over up to 8 GPUs.

As opposed to other pre-trained LMs, GPT-2 conveniently requires minimal architectural changes and parameter updates in order to be fine-tuned on new downstream tasks. We simply processed social media posts from the above datasets through the pre-trained model, whose activations were then fed through adjustable weights into a linear output layer. The fine-tuning objective here was the same that GPT-2 was originally trained on (i.e. the language modeling task of predicting the next word, see Figure 1), except now its training dataset included text from social media posts. We also added the <|endoftext|> string as a suffix to each post to adapt the model to the shorter length of social media text, meaning posts were fed into the model according to:

“#Fukushima2015 Zaporozhia NPP can explode at any time
and that's awful! OMG! No way! #Nukraine<|endoftext|>”

Figure 2 depicts a few example generations made after fine-tuning GPT-2 on the IRA datasets. Observe how these text generations are formatted like something we might expect to encounter scrolling through social media – they are short yet biting, express certainty and outrage regarding political issues, and contain emphases like an exclamation point. They also contain idiosyncrasies like hashtags and emojis that positionally manifest at the end of the generated text, depicting a semantic style regularly exhibited by actual users.


Figure 2: Fine-tuning GPT-2 using the IRA datasets for the language modeling task. Example generations are primed with the same phrase from Figure 1, “It’s disgraceful that.” Hyphens are added for readability and not produced by the model.

How does the model produce such credible generations? Besides the weights that were adjusted during LM fine-tuning, some of the heavy lifting is also done by the underlying attention scores that were learned by GPT-2’s Transformer. Attention scores are computed between all words in a text sequence, and represent how important one word is when determining how important its nearby words will be in the next learning iteration. To compute attention scores, the Transformer performs a dot product between a Query vector q and a Key vector k:

  • q encodes the current hidden state, representing the word that searches for other words in the sequence to pay attention to that may help supply context for it.
  • k encodes the previous hidden states, representing the other words that receive attention from the query word and might contribute a better representation for it in its current context.

Figure 3 displays how this dot product is computed based on single neuron activations in q and k using an attention visualization tool called bertviz. Columns in Figure 3 trace the computation of attention scores from the highlighted word on the left, “America,” to the complete sequence of words on the right. For example, to decide to predict “#” following the word “America,” this part of the model focuses its attention on preceding words like “ban,” “Immigrants,” and “disgrace,” (note that the model has broken “Immigrants” into “Imm” and “igrants” because “Immigrants” is an uncommon word relative to its component word pieces within pre-trained GPT-2's original training dataset).  The element-wise product shows how individual elements in q and k contribute to the dot product, which encodes the relationship between each word and every other context-providing word as the network learns from new text sequences. The dot product is finally normalized by a softmax function that outputs attention scores to be fed into the next layer of the neural network.


Figure 3: The attention patterns for the query word highlighted in grey from one of the fine-tuned GPT-2 generations in Figure 2. Individual vertical bars represent neuron activations, horizontal bars represent vectors, and lines represent the strength of attention between words. Blue indicates positive values, red indicates negative values, and color intensity represents the magnitude of these values.

Syntactic relationships between words like “America,” “ban,” and “Immigrants“ are valuable from an analysis point of view because they can help identify an information operation’s interrelated keywords and phrases. These indicators can be used to pivot between suspect social media accounts based on shared lexical patterns, help identify common narratives, and even to perform more proactive threat hunting. While the above example only scratches the surface of this complex, 355 million parameter model, qualitatively visualizing attention to understand the information learned by Transformers can help provide analysts insights into linguistic patterns being deployed as part of broader information operations activity.

Detecting Information Operations Activity by Fine-Tuning GPT-2 for Classification

In order to further support FireEye Threat Analysts’ work in discovering and triaging information operations activity on social media, we next fine-tuned a detection model to perform classification. Just like when we adapted GPT-2 for a new language modeling task in the previous section, we did not need to make any drastic architectural changes or parameter updates to fine-tune the model for the classification task. However, we did need to provide the model with a labeled dataset, so we grouped together social media posts based on whether they were leveraged in information operations (class label CLS = 1) or were benign (CLS = 0).

Benign, English-language posts were gathered from verified social media accounts, which generally corresponded to public figures and other prominent individuals or organizations whose posts contained diverse, innocuous content. For the purposes of this blog post, information operations-related posts were obtained from the previously mentioned open source IRA datasets. For the classification task, we separated the IRA datasets that were previously combined for LM fine-tuning, and selected posts from only one of them for the group associated with CLS = 1. To perform dataset selection quantitatively, we fine-tuned LMs on each IRA dataset to produce three different LMs while keeping 33% of the posts from each dataset held out as test data. Doing so allowed us to quantify the overlap between the individual IRA datasets based on how well one dataset’s LM was able to predict post content originating from the other datasets.


Figure 4: Confusion matrix representing perplexities of the LMs on their test datasets. The LM corresponding to the GPT-2 row was not fine-tuned; it corresponds to the pretrained GPT-2 model with reported perplexity of 18.3 on its own test set, which was unavailable for evaluation using the LMs. The Reddit dataset was excluded due to the low volume of samples.

In Figure 4, we show the result of computing perplexity scores for each of the three LMs and the original pre-trained GPT-2 model on held out test data from each dataset. Lower scores indicate better perplexity, which captures the probability of the model choosing the correct next word. The lowest scores fell along the main diagonal of the perplexity confusion matrix, meaning that the fine-tuned LMs were best at predicting the next word on test data originating from within their own datasets. The LM fine-tuned on Twitter’s Elections Integrity dataset displayed the lowest perplexity scores when averaged across all held out test datasets, so we selected posts sampled from this dataset to demonstrate classification fine-tuning.


Figure 5: (A) Training loss histories during GPT-2 fine-tuning for the classification (red) and LM (grey, inset) tasks. (B) ROC curve (red) evaluated on the held out fine-tuning test set, contrasted with random guess (grey dotted).

To fine-tune for the classification task, we once again processed the selected dataset’s posts through the pre-trained GPT-2 model. This time, activations were fed through adjustable weights into two linear output layers instead of just the single one used for the language modeling task in the previous section. Here, fine-tuning was formulated as a multi-task objective with classification loss together with an auxiliary LM loss, which helped accelerate convergence during training and improved the generalization of the model. We also prepended posts with a new [BOS] (i.e. Beginning Of Sentence) string and suffixed posts with the previously mentioned [CLS] class label string, so that each post was fed into the model according to:

“[BOS]Kevin Mandia was on @CNBC’s @MadMoneyOnCNBC with @jimcramer discussing targeted disinformation heading into the… https://t.co/l2xKQJsuwk[CLS]”

The [BOS] string played a similar delimiting role to the <|endoftext|> string used previously in LM fine-tuning, and the [CLS] string encoded the hidden state ∈ {0, 1} that was the label fed to the model’s classification layer. The example social media post above came from the benign dataset, so this sample’s label was set to CLS = 0 during fine-tuning. Figure 5A shows the evolution of classification and auxiliary LM losses during fine-tuning, and Figure 5B displays the ROC curve for the fine-tuned classifier on its test set consisting of around 66,000 social media posts. The convergence of the losses to low values, together with a high Area Under the ROC Curve (i.e. AUC), illustrates that transfer learning allowed this model to accurately detect social media posts associated with IRA information operations activity versus benign ones. Taken together, these metrics indicate that the fine-tuned classifier should generalize well to newly ingested social media posts, providing analysts a capability they can use to separate signal from noise.

Conclusion

In this blog post, we demonstrated how to fine-tune a neural LM on open source datasets containing social media posts previously leveraged in information operations. Transfer learning allowed us to classify these posts with a high AUC score, and FireEye’s Threat Analysts can utilize this detection capability in order to discover and triage similar emergent operations. Additionally, we showed how Transformer models assign scores to different pieces of text via an attention mechanism. This visualization can be used by analysts to tease apart adversary tradecraft based on posts’ linguistic fingerprints and semantic stylings.

Transfer learning also allowed us to generate credible synthetic text with low perplexity scores. One of the barriers actors face when devising effective information operations is adequately capturing the nuances and context of the cultural climate in which their targets are situated. Our exercise here suggests this costly step could be bypassed using pre-trained LMs, whose generations can be fine-tuned to embody the zeitgeist of social media. GPT-2’s authors and subsequent researchers have warned about potential malicious use cases enabled by this powerful natural language generation technology, and while it was conducted here for a defensive application in a controlled offline setting using readily available open source data, our research reinforces this concern. As trends towards more powerful and readily available language generation models continue, it is important to redouble efforts towards detection as demonstrated by Figure 5 and other promising approaches such as Grover.

This research was conducted during a three-month FireEye IGNITE University Program summer internship, and represents a collaboration between the FDS and FireEye Threat Intelligence’s Information Operations Analysis teams. If you are interested in working on multidisciplinary projects at the intersection of cyber security and machine learning, please consider applying to one of our 2020 summer internships.

Definitive Dossier of Devilish Debug Details – Part Deux: A Didactic Deep Dive into Data Driven Deductions

In Part One of this blog series, Steve Miller outlined what PDB paths are, how they appear in malware, how we use them to detect malicious files, and how we sometimes use them to make associations about groups and actors.

As Steve continued his research into PDB paths, we became interested in applying more general statistical analysis. The PDB path as an artifact poses an intriguing use case for a couple of reasons.

­First, the PDB artifact is not directly tied to the functionality of the binary. As a byproduct of the compilation process, it contains information about the development environment, and by proxy, the malware author themselves. Rarely do we encounter static malware features with such an interesting tie to the human behind the keyboard, rather than the functionality of the file.

Second, file paths are an incredibly complex artifact with many different possible encodings. We had personally been dying to find an excuse to spend more time figuring out how to parse and encode paths in a more useful way. This presented an opportunity to dive into this space and test different approaches to representing file paths in various models.

The objectives of our project were:

  1. Build a large data set of PDB paths and apply some statistical methods to find potentially new signature terms and logic.
  2. Investigate whether applying machine learning classification approaches to this problem could improve our detection above writing hand-crafted signatures.
  3. Build a PDB classifier as a weak signal for binary analysis.

To start, we began gathering data. Our dataset, pulled from internal and external sources, started with over 200,000 samples. Once we deduplicated by PDB path, we had around 50,000 samples. Next, we needed to consistently label these samples, so we considered various labeling schemes.

Labeling Binaries With PDB Paths

For many of the binaries we had internal FireEye labels, and for others we looked up hashes on VirusTotal (VT) to have a look at their detection rates. This covered the majority of our samples. For a relatively small subset we had disagreements between our internal engine and VT results, which merited a slightly more nuanced policy. The disagreement was most often that our internal assessment determined a file to be benign, but the VT results showed a nonzero percentage of vendors detecting the file as malicious. In these cases we plotted the ‘VT ratio”: that is, the percentage of vendors labeling the files as malicious (Figure 1).


Figure 1: Ratio of vendors calling file bad/total number of vendors

The vast majority of these samples had VT detection ratios below 0.3, and in those cases we labeled the binaries as benign. For the remainder of samples we tried two strategies – marking them all as malicious, or removing them from the training set entirely. Our classification performance did not change much between these two policies, so in the end we scrapped the remainder of the samples to reduce label noise.

Building Features

Next, we had to start building features. This is where the fun began. Looking at dozens and dozens of PDB paths, we simply started recording various things that ‘pop out’ to an analyst. As noted earlier, a file path contains tons of implicit information, beyond simply being a string-based artifact. Some analogies we have found useful is that a file path is more akin to a geographical location in its representation of a location on the file system, or like a sentence in that it reflects a series of dependent items.

To further illustrate this point, consider a simple file path such as:

C:\Users\World\Desktop\duck\Zbw138ht2aeja2.pdb (source file)

This path tells us several things:

  • This software was compiled on the system drive of the computer
  • In a user profile, under user ‘World’
  • The project is managed on the Desktop, in a folder called ‘duck’
  • The filename has a high degree of entropy and is not very easy to remember

In contrast, consider something such as:

D:\VSCORE5\BUILD\VSCore\release\EntVUtil.pdb (source file)

This indicates:

  • Compilation on an external or secondary drive
  • Within a non-user directory
  • Contains development terms such as ‘BUILD’ and ‘release’
  • With a sensible, semi-memorable file name

These differences seem relatively straightforward and make intuitive sense as to why one might be representative of malware development whereas the other represents a more “legitimate-looking” development environment.

Feature Representations

How do we represent these differences to a model? The easiest and most obvious option is to calculate some statistics on each path. Features such as folder depth, path length, entropy, and counting things such as numbers, letters, and special characters in the PDB filename are easy to compute.

However, upon evaluation against our dataset, these features did not help to separate the classes very well. The following are some graphics detailing the distributions of these features between our classes of malicious and benign samples:

While there is potentially some separation between benign and malicious distributions, these features alone would likely not lead to an effective classifier (we tried). Additionally, we couldn’t easily translate these differences into explicit detection rules. There was more information in the paths that we needed to extract, so we began to look at how to encode the directory names themselves.

Normalization

As with any dataset, we had to undertake some steps to normalize the paths. For example, the occurrence of individual usernames, while perhaps interesting from an intelligence perspective, would be represented as distinct entities when in fact they have the same semantic meaning. Thus, we had to detect and replace usernames with <username> to normalize this representation. Other folder idiosyncrasies such as version numbers or randomly generated directories could similarly be normalized into <version> or <random>.

A typical normalized path might therefore go from this:

C:\Users\jsmith\Documents\Visual Studio 2013\Projects\mkzyu91952\mkzyu91952\obj\x86\Debug\mkzyu91952.pdb

To this:

c:\users\<username>\documents\visual studio 2013\projects\<random>\<random>\obj\x86\debug\mkzyu91952.pdb

You may notice that the PDB filename itself was not normalized. In this case we wanted to derive features from the filename itself, so we left it. Other approaches could be to normalize it, or even to make note that the same filename string ‘mkzyu91952’ appears earlier in the path. There are endless possible features when dealing with file paths.

Directory Analysis

Once we had normalized directories, we could start to “tokenize” each directory term, to start performing some statistical analysis. Our main goal of this analysis was to see if there were any directory terms that highly corresponded to maliciousness, or see if there were any simple combinations, such as pairs or triplets, that exhibited similar behavior.

We did not find any single directory name that easily separated the classes. That would be too easy. However, we did find some general correlations with directories such as “Desktop” being somewhat more likely to be malicious, and use of shared drives such as Z: to be more indicative of a benign file. This makes intuitive sense given the more collaborative environment a “legitimate” software development process might require. There are, of course, many exceptions and this is what makes the problem tricky.

Another strong signal we found, at least in our dataset, is that when the word “Desktop” was in a non-English language and particularly in a different alphabet, the likelihood of that PDB path being tied to a malicious file was very high (Figure 2). While potentially useful, this can be indicative of geographical bias in our dataset, and further research would need to be done to see if this type of signature would generalize.


Figure 2: Unicode desktop folders from malicious samples

Various Tokenizing Schemes

In recording the directories of a file path, there are several ways you can represent the path. Let’s use this path to illustrate these different approaches:

c:\Leave\smell\Long\ruleThis.pdb (file)

Bag of Words

One very simple way is the “bag-of-words” approach, which simply treats the path as the distinct set of directory names it contains. Therefore, the aforementioned path would be represented as:

[‘c:’,’leave’,’smell’,’long’,’rulethis’]

Positional Analysis

Another approach we considered was recording the position of each directory name, as a distance from the drive. This retained more information about depth, such that a ‘build’ directory on the desktop would be treated differently than a ‘build’ directory nine directories further down. For this purpose, we excluded the drives since they would always have the same depth.

[’leave_1’,’smell_2’,’long_3’,’rulethis_4’]

N-Gram Analysis

Finally, we explored breaking paths into n-grams; that is, as a distinct set of n- adjacent directories. For example, a 2-gram representation of this path might look like:

[‘c:\leave’,’leave\smell’,’smell\long’,’long\rulethis’]

We tested each of these approaches and while positional analysis and n-grams contained more information, in the end, bag-of-words seemed to generalize best. Additionally, using the bag-of-words approach made it easier to extract simple signature logic from the resultant models, as will be shown in a later section.

Term Co-Occurrence

Since we had the bag-of-words vectors created for each path, we were also able to evaluate term co-occurrence across benign and malicious files. When we evaluated the co-occurrence of pairs of terms, we found some other interesting pairings that indeed paint two very different pictures of development environments (Figure 3).

Correlated with Malicious Files

Correlated with Benign Files

users, desktop

src, retail

documents, visual studio 2012

obj, x64

local, temporary projects

src, x86

users, projects

src, win32

users, documents

retail, dynamic

appdata, temporary projects

src, amd64

users, x86

src, x64

Figure 3: Correlated pairs with malicious and benign files

Keyword Lists

Our bag-of-words representation of the PDB paths then gave us a distinct set of nearly 70,000 distinct terms. The vast majority of these terms occurred once or twice in the entire dataset, resulting in what is known as a ‘long-tailed’ distribution. Figure 4 is a graph of only the top 100 most common terms in descending order.


Figure 4: Long tailed distribution of term occurrence

As you can see, the counts drop off quickly, and you are left dealing with an enormous amount of terms that may only appear a handful of times. One very simple way to solve this problem, without losing a ton of information, is to simply cut off a keyword list after a certain number of entries. For example, take the top 50 occurring folder names (across both good and bad files), and save them as a keyword list. Then match this list against every path in the dataset. To create features, one-hot encode each match.  

Rather than arbitrarily setting a cutoff, we wanted to know a bit more about the distribution and understand where might be a good place to set a limit – such that we would cover enough of the samples without drastically increasing the number of features for our model. We therefore calculated the cumulative number of samples covered by each term, as we iterated down the list from most common to least common. Figure 5 is a graph showing the result.


Figure 5: Cumulative share of samples covered by distinct terms

As you can see, with only a small fraction of the terms, we can arrive at a significant percentage of the cumulative total PDB paths. Setting a simple cutoff at about 70% of the dataset resulted in roughly 230 terms for our total vocabulary. This gave us enough information about the dataset without blowing up our model with too many features (and therefore, dimensions). One-hot encoding the presence of these terms was then the final step in featurizing the directory names present in the paths.

YARA Signatures Do Grow on Trees

Armed with some statistical features, as well as one-hot encoded keyword matches, we began to train some models on our now-featurized dataset. In doing so, we hoped to use the model training and evaluation process to give us insights into how to build better signatures. If we developed an effective classification model, that would be an added benefit.

We felt that tree-based models made sense for this use case for two reasons. First, tree-based models have worked well in the past in domains requiring a certain amount of interpretability and using a blend of quantitative and categorical features. Second, the features we used are largely things we could represent in a YARA signature. Therefore, if our models built boolean logic branches that separated large numbers of PDB files, we could potentially translate these into signatures. This is not to say that other model families could not be used to build strong classifiers. Many other options ranging from Logistic Regression to Deep Learning could be considered.

We fed our featurized training set into a Decision Tree, having set a couple ‘hyperparameters’ such as max depth and minimum samples per leaf, etc. We were also able to use a sliding scale of these hyperparameters to dynamically create trees and, essentially, see what shook out. Examining a trained decision tree such as the one in Figure 6 allowed us to immediately build new signatures.


Figure 6: Example decision tree and decision paths

We found several other interesting tidbits within our decision trees. Some terms that resulted in completely or almost-completely malicious subgroups are:

Directory Term

Example Hashes

\poe\

a6b2aa2b489fb481c3cd9eab2f4f4f5c

92904dc99938352525492cd5133b9917

444be936b44cc6bd0cd5d0c88268fa77

\xampp\

4d093061c172b32bf8bef03ac44515ae

4e6c2d60873f644ef5e06a17d85ec777

52d2a08223d0b5cc300f067219021c90

\temporary projects\

a785bd1eb2a8495a93a2f348c9a8ca67

c43c79812d49ca0f3b4da5aca3745090

e540076f48d7069bacb6d607f2d389d9

\stub\

5ea538dfc64e28ad8c4063573a46800c

adf27ce5e67d770321daf90be6f4d895

c6e23da146a6fa2956c3dd7a9314fc97

We also found the term ‘WindowsApplication1’ to be quite useful. 89% of the files in our dataset containing this directory were malicious. Cursory research indicates that this is the default directory generated when using Visual Studio to compile a Windows binary. Once again, this makes some intuitive sense for finding malware authors. Training and evaluating decision trees with various parameters turned out to be a hugely productive exercise in discovering potential new signature terms and logic.

Classification Accuracy and Findings

Since we now had a large dataset of PDB paths and features, we wanted to see if we could train a traditional classifier to separate good files from bad. Using a Random Forest with some tuning, we were able to achieve an average accuracy of 87% over 10 cross validations. However, while our recall (the percentage of bad things we could identify with the model) was relatively high at 89%, our malware precision (the share of those things we called bad that were actually bad) was far too low, hovering at or below 50%. This indicates that using this model alone for malware detection would result in an unacceptably large number of false positives, were we to deploy it in the wild as a singular detection platform. However, used in conjunction with other tools, this could be a useful weak signal to assist with analysis.

Conclusion and Next Steps

While our journey of statistical PDB analysis did not yield a magic malware classifier, it did yield a number of useful findings that we were hoping for:

  1. We developed several file path feature functions which are transferable to other models under development.
  2. By diving into statistical analysis of the dataset, we were able to identify new keywords and logic branches to include in YARA signatures. These signatures have since been deployed and discovered new malware samples.
  3. We answered a number of our own general research questions about PDB paths, and were able to dispel some theories we had not fully tested with data.

While building an independent classifier was not the primary goal, improvements can surely be made to improve the end model accuracy. Generating an even larger, more diverse dataset would likely make the biggest impact on our accuracy, recall, and precision. Further hyperparameter tuning and feature engineering could also help. There is a large amount of established research into text classification using various deep learning methods such as LSTMs, which could be applied effectively to a larger dataset.

PDB paths are only one small family of file paths that we encounter in the field of cyber security. Whether in initial infection, staging, or another part of the attack lifecycle, the file paths found during forensic analysis can reveal incredibly useful information about adversary activity. We look forward to further community research on how to properly extract and represent that information.

Open Sourcing StringSifter

Malware analysts routinely use the Strings program during static analysis in order to inspect a binary's printable characters. However, identifying relevant strings by hand is time consuming and prone to human error. Larger binaries produce upwards of thousands of strings that can quickly evoke analyst fatigue, relevant strings occur less often than irrelevant ones, and the definition of "relevant" can vary significantly among analysts. Mistakes can lead to missed clues that would have reduced overall time spent performing malware analysis, or even worse, incomplete or incorrect investigatory conclusions.

Earlier this year, the FireEye Data Science (FDS) and FireEye Labs Reverse Engineering (FLARE) teams published a blog post describing a machine learning model that automatically ranked strings to address these concerns. Today, we publicly release this model as part of StringSifter, a utility that identifies and prioritizes strings according to their relevance for malware analysis.

Goals

StringSifter is built to sit downstream from the Strings program; it takes a list of strings as input and returns those same strings ranked according to their relevance for malware analysis as output. It is intended to make an analyst's life easier, allowing them to focus their attention on only the most relevant strings located towards the top of its predicted output. StringSifter is designed to be seamlessly plugged into a user’s existing malware analysis stack. Once its GitHub repository is cloned and installed locally, it can be conveniently invoked from the command line with its default arguments according to:

strings <sample_of_interest> | rank_strings

We are also providing Docker command line tools for additional portability and usability. For a more detailed overview of how to use StringSifter, including how to specify optional arguments for customizable functionality, please view its README file on GitHub.

We have received great initial internal feedback about StringSifter from FireEye’s reverse engineers, SOC analysts, red teamers, and incident responders. Encouragingly, we have also observed users at the opposite ends of the experience spectrum find the tool to be useful – from beginners detonating their first piece of malware as part of a FireEye training course – to expert malware researchers triaging incoming samples on the front lines. By making StringSifter publicly available, we hope to enable a broad set of personas, use cases, and creative downstream applications. We will also welcome external contributions to help improve the tool’s accuracy and utility in future releases.

Conclusion

We are releasing StringSifter to coincide with our presentation at DerbyCon 2019 on Sept. 7, and we will also be doing a technical dive into the model at the Conference on Applied Machine Learning for Information Security this October. With its release, StringSifter will join FLARE VM, FakeNet, and CommandoVM as one of many recent malware analysis tools that FireEye has chosen to make publicly available. If you are interested in developing data-driven tools that make it easier to find evil and help benefit the security community, please consider joining the FDS or FLARE teams by applying to one of our job openings.

Showing Vulnerability to a Machine: Automated Prioritization of Software Vulnerabilities

Introduction

If a software vulnerability can be detected and remedied, then a potential intrusion is prevented. While not all software vulnerabilities are known, 86 percent of vulnerabilities leading to a data breach were patchable, though there is some risk of inadvertent damage when applying software patches. When new vulnerabilities are identified they are published in the Common Vulnerabilities and Exposures (CVE) dictionary by vulnerability databases, such as the National Vulnerability Database (NVD).

The Common Vulnerabilities Scoring System (CVSS) provides a metric for prioritization that is meant to capture the potential severity of a vulnerability. However, it has been criticized for a lack of timeliness, vulnerable population representation, normalization, rescoring and broader expert consensus that can lead to disagreements. For example, some of the worst exploits have been assigned low CVSS scores. Additionally, CVSS does not measure the vulnerable population size, which many practitioners have stated they expect it to score. The design of the current CVSS system leads to too many severe vulnerabilities, which causes user fatigue. ­

To provide a more timely and broad approach, we use machine learning to analyze users’ opinions about the severity of vulnerabilities by examining relevant tweets. The model predicts whether users believe a vulnerability is likely to affect a large number of people, or if the vulnerability is less dangerous and unlikely to be exploited. The predictions from our model are then used to score vulnerabilities faster than traditional approaches, like CVSS, while providing a different method for measuring severity, which better reflects real-world impact.

Our work uses nowcasting to address this important gap of prioritizing early-stage CVEs to know if they are urgent or not. Nowcasting is the economic discipline of determining a trend or a trend reversal objectively in real time. In this case, we are recognizing the value of linking social media responses to the release of a CVE after it is released, but before it is scored by CVSS. Scores of CVEs should ideally be available as soon as possible after the CVE is released, while the current process often hampers prioritization of triage events and ultimately slows response to severe vulnerabilities. This crowdsourced approach reflects numerous practitioner observations about the size and widespread nature of the vulnerable population, as shown in Figure 1. For example, in the Mirai botnet incident in 2017 a massive number of vulnerable IoT devices were compromised leading to the largest Denial of Service (DoS) attack on the internet at the time.


Figure 1: Tweet showing social commentary on a vulnerability that reflects severity

Model Overview

Figure 2 illustrates the overall process that starts with analyzing the content of a tweet and concludes with two forecasting evaluations. First, we run Named Entity Recognition (NER) on tweet contents to extract named entities. Second, we use two classifiers to test the relevancy and severity towards the pre-identified entities. Finally, we match the relevant and severe tweets to the corresponding CVE.


Figure 2: Process overview of the steps in our CVE score forecasting

Each tweet is associated to CVEs by inspecting URLs or the contents hosted at a URL. Specifically, we link a CVE to a tweet if it contains a CVE number in the message body, or if the URL content contains a CVE. Each tweet must be associated with a single CVE and must be classified as relevant to security-related topics to be scored. The first forecasting task considers how well our model can predict the CVSS rankings ahead of time. The second task is predicting future exploitation of the vulnerability for a CVE based on Symantec Antivirus Signatures and Exploit DB. The rationale is that eventual presence in these lists indicates not just that exploits can exist or that they do exist, but that they also are publicly available.

Modeling Approach

Predicting the CVSS scores and exploitability from Twitter data involves multiple steps. First, we need to find appropriate representations (or features) for our natural language to be processed by machine learning models. In this work, we use two natural language processing methods in natural language processing for extracting features from text: (1) N-grams features, and (2) Word embeddings. Second, we use these features to predict if the tweet is relevant to the cyber security field using a classification model. Third, we use these features to predict if the relevant tweets are making strong statements indicative of severity. Finally, we match the severe and relevant tweets up to the corresponding CVE.

N-grams are word sequences, such as word pairs for 2-gram or word triples for 3-grams. In other words, they are contiguous sequence of n words from a text. After we extract these n-grams, we can represent original text as a bag-of-ngrams. Consider the sentence:

A criticial vulnerability was found in Linux.

If we consider all 2-gram features, then the bag-of-ngrams representation contains “A critical”, “critical vulnerability”, etc.

Word embeddings are a way to learn the meaning of a word by how it was used in previous contexts, and then represent that meaning in a vector space. Word embeddings know the meaning of a word by the company it keeps, more formally known as the distribution hypothesis. These word embedding representations are machine friendly, and similar words are often assigned similar representations. Word embeddings are domain specific. In our work, we additionally train terminology specific to cyber security topics, such as related words to threats are defenses, cyberrisk, cybersecurity, threat, and iot-based. The embedding would allow a classifier to implicitly combine the knowledge of similar words and the meaning of how concepts differ. Conceptually, word embeddings may help a classifier use these embeddings to implicitly associate relationships such as:

device + infected = zombie

where an entity called device has a mechanism applied called infected (malicious software infecting it) then it becomes a zombie.

To address issues where social media tweets differ linguistically from natural language, we leverage previous research and software from the Natural Language Processing (NLP) community. This addresses specific nuances like less consistent capitalization, and stemming to account for a variety of special characters like ‘@’ and ‘#’.


Figure 3: Tweet demonstrating value of identifying named entities in tweets in order to gauge severity

Named Entity Recognition (NER) identifies the words that construct nouns based on their context within a sentence, and benefits from our embeddings incorporating cyber security words. Correctly identifying the nouns using NER is important to how we parse a sentence. In Figure 3, for instance, NER facilitates Windows 10 to be understood as an entity while October 2018 is treated as elements of a date. Without this ability, the text in Figure 3 may be confused with the physical notion of windows in a building.

Once NER tokens are identified, they are used to test if a vulnerability affects them. In the Windows 10 example, Windows 10 is the entity and the classifier will predict whether the user believes there is a serious vulnerability affecting Windows 10. One prediction is made per entity, even if a tweet contains multiple entities. Filtering tweets that do not contain named entities reduces tweets to only those relevant to expressing observations on a software vulnerability.

From these normalized tweets, we can gain insight into how strongly users are emphasizing the importance of the vulnerability by observing their choice of words. The choice of adjective is instrumental in the classifier capturing the strong opinions. Twitter users often use strong adjectives and superlatives to convey magnitude in a tweet or when stressing the importance of something related to a vulnerability like in Figure 4. This magnitude often indicates to the model when a vulnerability’s exploitation is widespread. Table 1 shows our analysis of important adjectives that tend to indicate a more severe vulnerability.


Figure 4: Tweet showing strong adjective use


Table 1: Log-odds ratios for words correlated with highly-severe CVEs

Finally, the processed features are evaluated with two different classifiers to output scores to predict relevancy and severity. When a named entity is identified all words comprising it are replaced with a single token to prevent the model from biasing toward that entity. The first model uses an n-gram approach where sequences of two, three, and four tokens are input into a logistic regression model. The second approach uses a one-dimensional Convolutional Neural Network (CNN), comprised of an embedding layer, a dropout layer then a fully connected layer, to extract features from the tweets.

Evaluating Data

To evaluate the performance of our approach, we curated a dataset of 6,000 tweets containing the keywords vulnerability or ddos from Dec 2017 to July 2018. Workers on Amazon’s Mechanical Turk platform were asked to judge whether a user believed a vulnerability they were discussing was severe. For all labeling, multiple users must independently agree on a label, and multiple statistical and expert-oriented techniques are used to eliminate spurious annotations. Five annotators were used for the labels in the relevancy classifier and ten annotators were used for the severity annotation task. Heuristics were used to remove unserious respondents; for example, when users did not agree with other annotators for a majority of the tweets. A subset of tweets were expert-annotated and used to measure the quality of the remaining annotations.

Using the features extracted from tweet contents, including word embeddings and n-grams, we built a model using the annotated data from Amazon Mechanical Turk as labels. First, our model learns if tweets are relevant to a security threat using the annotated data as ground truth. This would remove a statement like “here is how you can #exploit tax loopholes” from being confused with a cyber security-related discussion about a user exploiting a software vulnerability as a malicious tool. Second, a forecasting model scores the vulnerability based on whether annotators perceived the threat to be severe.

CVSS Forecasting Results

Both the relevancy classifier and the severity classifier were applied to various datasets. Data was collected from December 2017 to July 2018. Most notably 1,000 tweets were held-out from the original 6,000 to be used for the relevancy classifier and 466 tweets were held-out for the severity classifier. To measure the performance, we use the Area Under the precision-recall Curve (AUC), which is a correctness score that summarizes the tradeoffs of minimizing the two types of errors (false positive vs false negative), with scores near 1 indicating better performance.

  • The relevancy classifier scored 0.85
  • The severity classifier using the CNN scored 0.65
  • The severity classifier using a Logistic Regression model, without embeddings, scored 0.54

Next, we evaluate how well this approach can be used to forecast CVSS ratings. In this evaluation, all tweets must occur a minimum of five days ahead of CVSS scores. The severity forecast score for a CVE is defined as the maximum severity score among the tweets which are relevant and associated with the CVE. Table 1 shows the results of three models: randomly guessing the severity, modeling based on the volume of tweets covering a CVE, and the ML-based approach described earlier in the post. The scoring metric in Table 2 is precision at top K using our logistic regression model. For example, where K=100, this is a way for us to identify what percent of the 100 most severe vulnerabilities were correctly predicted. The random model would predicted 59, while our model predicted 78 of the top 100 and all ten of the most severe vulnerabilities.


Table 2: Comparison of random simulated predictions, a model based just on quantitative features like “likes”, and the results of our model

Exploit Forecasting Results

We also measured the practical ability of our model to identify the exploitability of a CVE in the wild, since this is one of the motivating factors for tracking. To do this, we collected severe vulnerabilities that have known exploits by their presence in the following data sources:

  • Symantec Antivirus signatures
  • Symantec Intrusion Prevention System signatures
  • ExploitDB catalog

The dataset for exploit forecasting was comprised of 377,468 tweets gathered from January 2016 to November 2017. Of the 1,409 CVEs used in our forecasting evaluation, 134 publicly weaponized vulnerabilities were found across all three data sources.

Using CVEs from the aforementioned sources as ground truth, we find our CVE classification model is more predictive of detecting operationalized exploits from the vulnerabilities than CVSS. Table 3 shows precision scores illustrating seven of the top ten most severe CVEs and 21 of the top 100 vulnerabilities were found to have been exploited in the wild. Compare that to one of the top ten and 16 of the top 100 from using the CVSS score itself. The recall scores show the percentage of our 134 weaponized vulnerabilities found in our K examples. In our top ten vulnerabilities, seven were found to be in the 134 (5.2%), while the CVSS scoring’s top ten included only one (0.7%) CVE being exploited.


Table 3: Precision and recall scores for the top 10, 50 and 100 vulnerabilities when comparing CVSS scoring, our simplistic volume model and our NLP model

Conclusion

Preventing vulnerabilities is critical to an organization’s information security posture, as it effectively mitigates some cyber security breaches. In our work, we found that social media content that pre-dates CVE scoring releases can be effectively used by machine learning models to forecast vulnerability scores and prioritize vulnerabilities days before they are made available. Our approach incorporates a novel social sentiment component, which CVE scores do not, and it allows scores to better predict real-world exploitation of vulnerabilities. Finally, our approach allows for a more practical prioritization of software vulnerabilities effectively indicating the few that are likely to be weaponized by attackers. NIST has acknowledged that the current CVSS methodology is insufficient. The current process of scoring CVSS is expected to be replaced by ML-based solutions by October 2019, with limited human involvement. However, there is no indication of utilizing a social component in the scoring effort.

This work was led by researchers at Ohio State under the IARPA CAUSE program, with support from Leidos and FireEye. This work was originally presented at NAACL in June 2019, our paper describes this work in more detail and was also covered by Wired.

Churning Out Machine Learning Models: Handling Changes in Model Predictions

Introduction

Machine learning (ML) is playing an increasingly important role in cyber security. Here at FireEye, we employ ML for a variety of tasks such as: antivirus, malicious PowerShell detection, and correlating threat actor behavior. While many people think that a data scientist’s job is finished when a model is built, the truth is that cyber threats constantly change and so must our models. The initial training is only the start of the process and ML model maintenance creates a large amount of technical debt. Google provides a helpful introduction to this topic in their paper “Machine Learning: The High-Interest Credit Card of Technical Debt.” A key concept from the paper is the principle of CACE: change anything, change everything. Because ML models deliberately find nonlinear dependencies between input data, small changes in our data can create cascading effects on model accuracy and downstream systems that consume those model predictions. This creates an inherent conflict in cyber security modeling: (1) we need to update models over time to adjust to current threats and (2) changing models can lead to unpredictable outcomes that we need to mitigate.

Ideally, when we update a model, the only change in model outputs are improvements, e.g. fixes to previous errors. Both false negatives (missing malicious activity) and false positives (alerts on benign activity), have significant impact and should be minimized. Since no ML model is perfect, we mitigate mistakes with orthogonal approaches: whitelists and blacklists, external intelligence feeds, rule-based systems, etc. Combining with other information also provides context for alerts that may not otherwise be present. However, CACE! These integrated systems can suffer unintended side effects from a model update. Even when the overall model accuracy has increased, individual changes in model output are not guaranteed to be improvements. Introduction of new false negatives or false positives in an updated model, called churn, creates the potential for new vulnerabilities and negative interactions with cyber security infrastructure that consumes model output. In this article, we discuss churn, how it creates technical debt when considering the larger cyber security product, and methods to reduce it.

Prediction Churn

Whenever we retrain our cyber security-focused ML models, we need to able to calculate and control for churn. Formally, prediction churn is defined as the expected percent difference between two different model predictions (note that prediction churn is not the same as customer churn, the loss of customers over time, which is the more common usage of the term in business analytics). It was originally defined by Cormier et al. for a variety of applications. For cyber security applications, we are often concerned with just those differences where the newer model performs worse than the older model. Let’s define bad churn when retraining a classifier as the percentage of misclassified samples in the test set which the original model correctly classified.

Churn is often a surprising and non-intuitive concept. After all, if the accuracy of our new model is better than the accuracy of our old model, what’s the problem? Consider the simple linear classification problem of malicious red squares and benign blue circles in Figure 1. The original model, A, makes three misclassifications while the newer model, B, makes only two errors. B is the more accurate model. Note, however, that B introduces a new mistake in the lower right corner, misclassifying a red square as benign. That square was correctly classified by model A and represents an instance of bad churn. Clearly, it’s possible to reduce the overall error rate while introducing a small number of new errors which did not exist in the older model.


Figure 1: Two linear classifiers with errors highlighted in orange. The original classifier A has lower accuracy than B. However, B introduces a new error in the bottom right corner.

Practically, churn introduces two problems in our models. First, bad churn may require changes to whitelist/blacklists used in conjunction with ML models. As we previously discussed, these are used to handle the small but inevitable number of incorrect classifications. Testing on large repositories of data is necessary to catch such changes and update associated whitelists and blacklists. Second, churn may create issues for other ML models or rule-based systems which rely on the output of the ML model. For example, consider a hypothetical system which evaluates URLs using both a ML model and a noisy blacklist. The system generates an alert if

  • P(URL = ‘malicious’) > 0.9 or
  • P(URL = ‘malicious’) > 0.5 and the URL is on the blacklist

After retraining, the distribution of P(URL=‘malicious’) changes and all .com domains receive a higher score. The alert rules may need to be readjusted to maintain the required overall accuracy of the combined system. Ultimately, finding ways of reducing churn minimizes this kind of technical debt.

Experimental Setup

We’re going to explore churn and churn reduction techniques using EMBER, an open source malware classification data set. It consists of 1.1 million PE files first seen in 2017, along with their labels and features. The objective is to classify the files as either goodware or malware. For our purposes we need to construct not one model, but two, in order to calculate the churn between models. We have split the data set into three pieces:

  1. January through August is used as training data
  2. September and October are used to simulate running the model in production and retraining (test 1 in Figure 2).
  3. November and December are used to evaluate the models from step 1 and 2 (test 2 in Figure 2).


Figure 2: A comparison of our experimental setup versus the original EMBER data split. EMBER has a ten-month training set and a two-month test set. Our setup splits the data into three sets to simulate model training, then retraining while keeping an independent data set for final evaluation.

Figure 2 shows our data split and how it compares to the original EMBER data split. We have built a LightGBM classifier on the training data, which we’ll refer to as the baseline model. To simulate production testing, we run the baseline model on test 1 and record the FPs and FNs. Then, we retrain our model using both the training data and the FPs/FNs from test 1. We’ll refer to this model as the standard retrain. This is a reasonably realistic simulation of actual production data collection and model retraining. Finally, both the baseline model and the standard retrain are evaluated on test 2. The standard retrain has a higher accuracy than the baseline on test 2, 99.33% vs 99.10% respectively. However, there are 246 misclassifications made by the retrain model that were not made by the baseline or 0.12% bad churn.

Incremental Learning

Since our rationale for retraining is that cyber security threats change over time, e.g. concept drift, it’s a natural suggestion to use techniques like incremental learning to handle retraining. In incremental learning we take new data to learn new concepts without forgetting (all) previously learned concepts. That also suggests that an incrementally trained model may not have as much churn, as the concepts learned in the baseline model still exist in the new model. Not all ML models support incremental learning, but linear and logistic regression, neural networks, and some decision trees do. Other ML models can be modified to implement incremental learning. For our experiment, we incrementally trained the baseline LightGBM model by augmenting the training data with FPs and FNs from test 1 and then trained an additional 100 trees on top of the baseline model (for a total of 1,100 trees). Unlike the baseline model we use regularization (L2 parameter of 1.0); using no regularization resulted in overfitting to the new points. The incremental model has a bad churn of 0.05% (113 samples total) and 99.34% accuracy on test 2. Another interesting metric is the model’s performance on the new training data; how many of the baseline FPs and FNs from test 1 does the new model fix? The incrementally trained model correctly classifies 84% of the previous incorrect classifications. In a very broad sense, incrementally training on a previous model’s mistake provides a “patch” for the “bugs” of the old model.

Churn-Aware Learning

Incremental approaches only work if the features of the original and new model are identical. If new features are added, say to improve model accuracy, then alternative methods are required. If what we desire is both accuracy and low churn, then the most straightforward solution is to include both of these requirements when training. That’s the approach taken by Cormier et al., where samples received different weights during training in such a way as to minimize churn. We have made a few deviations in our approach: (1) we are interested in reducing bad churn (churn involving new misclassifications) as opposed to all churn and (2) we would like to avoid the extreme memory requirements of the original method. In a similar manner to Cormier et al., we want to reduce the weight, e.g. importance, of previously misclassified samples during training of a new model. Practically, the model sees making the same mistakes as the previous model as cheaper than making a new mistake. Our weighing scheme gives all samples correctly classified by the original model a weight of one and all other samples have a weight of: w = α – β |0.5 – Pold (χi)|, where Pold (χi) is the output of the old model on sample χi and αβ are adjustable hyperparameters. We train this reduced churn operator model (RCOP) using an α of 0.9, a β of 0.6 and the same training data as the incremental model. RCOP produces 0.09% bad churn, 99.38% accuracy on test 2.

Results

Figure 3 shows both accuracy and bad churn of each model on test set 2. We compare the baseline model, the standard model retrain, the incrementally learned model and the RCOP model.


Figure 3: Bad churn versus accuracy on test set 2.

Table 1 summarizes each of these approaches, discussed in detail above.

Name

Trained on

Method

Total # of trees

Baseline

train

LightGBM

1000

Standard retrain

train + FPs/FNs from baseline on test 1

LightGBM

1100

Incremental model

train + FPs/FNs from baseline on test 1

Trained 100 new trees, starting from the baseline model

1100

RCOP

train + FPs/FNs from baseline on test 1

LightGBM with altered sample weights

1100

Table 1: A description of the models tested

The baseline model has 100 fewer trees than the other models, which could explain the comparatively reduced accuracy. However, we tried increasing the number of trees which resulted in only a minor increase in accuracy of < 0.001%. The increase in accuracy for the non-baseline methods is due to the differences in data set and training methods. Both incremental training and RCOP work as expected producing less churn than the standard retrain, while showing accuracy improvements over the baseline. In general, there is usually a trend of increasing accuracy being correlated with increasing bad churn: there is no free lunch. That increasing accuracy occurs due to changes in the decision boundary, the more improvement the more changes occur. It seems reasonable the increasing decision boundary changes correlate with an increase in bad churn although we see no theoretical justification for why that must always be the case.

Unexpectedly, both the incremental model and RCOP produce more accurate models with less churn than the standard retrain. We would have assumed that given their additional constraints both models would have less accuracy with less churn. The most direct comparison is RCOP versus the standard retrain. Both models use identical data sets and model parameters, varying only by the weights associated with each sample. RCOP reduces the weight of incorrectly classified samples by the baseline model. That reduction is responsible for the improvement in accuracy. A possible explanation of this behavior is mislabeled training data. Multiple authors have suggested identifying and removing points with label noise, often using the misclassifications of a previously trained model to identify those noisy points. Our scheme, which reduces the weight of those points instead of removing them, is not dissimilar to those other noise reduction approaches which could explain the accuracy improvement.

Conclusion

ML models experience an inherent struggle: not retraining means being vulnerable to new classes of threats, while retraining causes churn and potentially reintroduces old vulnerabilities. In this blog post, we have discussed two different approaches to modifying ML model training in order to reduce churn: incremental model training and churn-aware learning. Both demonstrate effectiveness in the EMBER malware classification data set by reducing the bad churn, while simultaneously improving accuracy. Finally, we also demonstrated the novel conclusion that reducing churn in a data set with label noise can result in a more accurate model. Overall, these approaches provide low technical debt solutions to updating models that allow data scientists and machine learning engineers to keep their models up-to-date against the latest cyber threats at minimal cost. At FireEye, our data scientists work closely with the FireEye Labs detection analysts to quickly identify misclassifications and use these techniques to reduce the impact of churn on our customers.

What are Deep Neural Networks Learning About Malware?

An increasing number of modern antivirus solutions rely on machine learning (ML) techniques to protect users from malware. While ML-based approaches, like FireEye Endpoint Security’s MalwareGuard capability, have done a great job at detecting new threats, they also come with substantial development costs. Creating and curating a large set of useful features takes significant amounts of time and expertise from malware analysts and data scientists (note that in this context a feature refers to a property or characteristic of the executable that can be used to distinguish between goodware and malware). In recent years, however, deep learning approaches have shown impressive results in automatically learning feature representations for complex problem domains, like images, speech, and text. Can we take advantage of these advances in deep learning to automatically learn how to detect malware without costly feature engineering?

As it turns out, deep learning architectures, and in particular convolutional neural networks (CNNs), can do a good job of detecting malware simply by looking at the raw bytes of Windows Portable Executable (PE) files. Over the last two years, FireEye has been experimenting with deep learning architectures for malware classification, as well as methods to evade them. Our experiments have demonstrated surprising levels of accuracy that are competitive with traditional ML-based solutions, while avoiding the costs of manual feature engineering. Since the initial presentation of our findings, other researchers have published similarly impressive results, with accuracy upwards of 96%.

Since these deep learning models are only looking at the raw bytes without any additional structural, semantic, or syntactic context, how can they possibly be learning what separates goodware from malware? In this blog post, we answer this question by analyzing FireEye’s deep learning-based malware classifier.

Highlights

  • FireEye’s deep learning classifier can successfully identify malware using only the unstructured bytes of the Windows PE file.
  • Import-based features, like names and function call fingerprints, play a significant role in the features learned across all levels of the classifier.
  • Unlike other deep learning application areas, where low-level features tend to generally capture properties across all classes, many of our low-level features focused on very specific sequences primarily found in malware.
  • End-to-end analysis of the classifier identified important features that closely mirror those created through manual feature engineering, which demonstrates the importance of classifier depth in capturing meaningful features.

Background

Before we dive into our analysis, let’s first discuss what a CNN classifier is doing with Windows PE file bytes. Figure 1 shows the high-level operations performed by the classifier while “learning” from the raw executable data. We start with the raw byte representation of the executable, absent any structure that might exist (1). This raw byte sequence is embedded into a high-dimensional space where each byte is replaced with an n-dimensional vector of values (2). This embedding step allows the CNN to learn relationships among the discrete bytes by moving them within the n-dimensional embedding space. For example, if the bytes 0xe0 and 0xe2 are used interchangeably, then the CNN can move those two bytes closer together in the embedding space so that the cost of replacing one with the other is small. Next, we perform convolutions over the embedded byte sequence (3). As we do this across our entire training set, our convolutional filters begin to learn the characteristics of certain sequences that differentiate goodware from malware (4). In simpler terms, we slide a fixed-length window across the embedded byte sequence and the convolutional filters learn the important features from across those windows. Once we have scanned the entire sequence, we can then pool the convolutional activations to select the best features from each section of the sequence (i.e., those that maximally activated the filters) to pass along to the next level (5). In practice, the convolution and pooling operations are used repeatedly in a hierarchical fashion to aggregate many low-level features into a smaller number of high-level features that are more useful for classification. Finally, we use the aggregated features from our pooling as input to a fully-connected neural network, which classifies the PE file sample as either goodware or malware (6).


Figure 1: High-level overview of a convolutional neural network applied to raw bytes from a Windows PE files.

The specific deep learning architecture that we analyze here actually has five convolutional and max pooling layers arranged in a hierarchical fashion, which allows it to learn complex features by combining those discovered at lower levels of the hierarchy. To efficiently train such a deep neural network, we must restrict our input sequences to a fixed length – truncating any bytes beyond this length or using special padding symbols to fill out smaller files. For this analysis, we chose an input length of 100KB, though we have experimented with lengths upwards of 1MB. We trained our CNN model on more than 15 million Windows PE files, 80% of which were goodware and the remainder malware. When evaluated against a test set of nearly 9 million PE files observed in the wild from June to August 2018, the classifier achieves an accuracy of 95.1% and an F1 score of 0.96, which are on the higher end of scores reported by previous work.

In order to figure out what this classifier has learned about malware, we will examine each component of the architecture in turn. At each step, we use either a sample of 4,000 PE files taken from our training data to examine broad trends, or a smaller set of six artifacts from the NotPetya, WannaCry, and BadRabbit ransomware families to examine specific features.

Bytes in (Embedding) Space

The embedding space can encode interesting relationships that the classifier has learned about the individual bytes and determine whether certain bytes are treated differently than others because of their implied importance to the classifier’s decision. To tease out these relationships, we will use two tools: (1) a dimensionality reduction technique called multi-dimensional scaling (MDS) and (2) a density-based clustering method called HDBSCAN. The dimensionality reduction technique allows us to move from the high-dimensional embedding space to an approximation in two-dimensional space that we can easily visualize, while still retaining the overall structure and organization of the points. Meanwhile, the clustering technique allows us to identify dense groups of points, as well as outliers that have no nearby points. The underlying intuition being that outliers are treated as “special” by the model since there are no other points that can easily replace them without a significant change in upstream calculations, while dense clusters of points can be used interchangeably.


Figure 2: Visualization of the byte embedding space using multi-dimensional scaling (MDS) and clustered with hierarchical density-based clustering (HDBSCAN) with clusters (Left) and outliers labeled (Right).

On the left side of Figure 2, we show the two-dimensional representation of our byte embedding space with each of the clusters labeled, along with an outlier cluster labeled as -1. As you can see, the vast majority of bytes fall into one large catch-all class (Cluster 3), while the remaining three clusters have just two bytes each. Though there are no obvious semantic relationships in these clusters, the bytes that were included are interesting in their own right – for instance, Cluster 0 includes our special padding byte that is only used when files are smaller than the fixed-length cutoff, and Cluster 1 includes the ASCII character ‘r.’

What is more fascinating, however, is the set of outliers that the clustering produced, which are shown in the right side of Figure 3.  Here, there are a number of intriguing trends that start to appear. For one, each of the bytes in the range 0x0 to 0x6 are present, and these bytes are often used in short forward jumps or when registers are used as instruction arguments (e.g., eax, ebx, etc.). Interestingly, 0x7 and 0x8 are grouped together in Cluster 2, which may indicate that they are used interchangeably in our training data even though 0x7 could also be interpreted as a register argument. Another clear trend is the presence of several ASCII characters in the set of outliers, including ‘\n’, ‘A’, ‘e’, ‘s’, and ‘t.’ Finally, we see several opcodes present, including the call instruction (0xe8), loop and loopne (0xe0, 0xe2), and a breakpoint instruction (0xcc).

Given these findings, we immediately get a sense of what the classifier might be looking for in low-level features: ASCII text and usage of specific types of instructions.

Deciphering Low-Level Features

The next step in our analysis is to examine the low-level features learned by the first layer of convolutional filters. In our architecture, we used 96 convolutional filters at this layer, each of which learns basic building-block features that will be combined across the succeeding layers to derive useful high-level features. When one of these filters sees a byte pattern that it has learned in the current convolution, it will produce a large activation value and we can use that value as a method for identifying the most interesting bytes for each filter. Of course, since we are examining the raw byte sequences, this will merely tell us which file offsets to look at, and we still need to bridge the gap between the raw byte interpretation of the data and something that a human can understand. To do so, we parse the file using PEFile and apply BinaryNinja’s disassembler to executable sections to make it easier to identify common patterns among the learned features for each filter.

Since there are a large number of filters to examine, we can narrow our search by getting a broad sense of which filters have the strongest activations across our sample of 4,000 Windows PE files and where in those files those activations occur. In Figure 3, we show the locations of the 100 strongest activations across our 4,000-sample dataset. This shows a couple of interesting trends, some of which could be expected and others that are perhaps more surprising. For one, the majority of the activations at this level in our architecture occur in the ‘.text’ section, which typically contains executable code. When we compare the ‘.text’ section activations between malware and goodware subsets, there are significantly more activations for the malware set, meaning that even at this low level there appear to be certain filters that have keyed in on specific byte sequences primarily found in malware. Additionally, we see that the ‘UNKNOWN’ section– basically, any activation that occurs outside the valid bounds of the PE file – has many more activations in the malware group than in goodware. This makes some intuitive sense since many obfuscation and evasion techniques rely on placing data in non-standard locations (e.g., embedding PE files within one another).


Figure 3: Distribution of low-level activation locations across PE file headers and sections. Overall distribution of activations (Left), and activations for goodware/malware subsets (Right). UNKNOWN indicates an area outside the valid bounds of the file and NULL indicates an empty section name.

We can also examine the activation trends among the convolutional filters by plotting the top-100 activations for each filter across our 4,000 PE files, as shown in Figure 4. Here, we validate our intuition that some of these filters are overwhelmingly associated with features found in our malware samples. In this case, the activations for Filter 57 occur almost exclusively in the malware set, so that will be an important filter to look at later in our analysis. The other main takeaway from the distribution of filter activations is that the distribution is quite skewed, with only two filters handling the majority of activations at this level in our architecture. In fact, some filters are not activated at all on the set of 4,000 files we are analyzing.


Figure 4: Distribution of activations over each of the 96 low-level convolutional filters. Overall distribution of activations (Left), and activations for goodware/malware subsets (Right).

Now that we have identified the most interesting and active filters, we can disassemble the areas surrounding their activation locations and see if we can tease out some trends. In particular, we are going to look at Filters 83 and 57, both of which were important filters in our model based on activation value. The disassembly results for these filters across several of our ransomware artifacts is shown in Figure 5.

For Filter 83, the trend in activations becomes pretty clear when we look at the ASCII encoding of the bytes, which shows that the filter has learned to detect certain types of imports. If we look closer at the activations (denoted with a ‘*’), these always seem to include characters like ‘r’, ‘s’, ‘t’, and ‘e’, all of which were identified as outliers or found in their own unique clusters during our embedding analysis.  When we look at the disassembly of Filter 57’s activations, we see another clear pattern, where the filter activates on sequences containing multiple push instructions and a call instruction – essentially, identifying function calls with multiple parameters.

In some ways, we can look at Filters 83 and 57 as detecting two sides of the same overarching behavior, with Filter 83 detecting the imports and 57 detecting the potential use of those imports (i.e., by fingerprinting the number of parameters and usage). Due to the independent nature of convolutional filters, the relationships between the imports and their usage (e.g., which imports were used where) is lost, and that the classifier treats these as two completely independent features.


Figure 5: Example disassembly of activations for filters 83 (Left) and 57 (Right) from ransomware samples. Lines prepended with '*' contain the actual filter activations, others are provided for context.

Aside from the import-related features described above, our analysis also identified some filters that keyed in on particular byte sequences found in functions containing exploit code, such as DoublePulsar or EternalBlue. For instance, Filter 94 activated on portions of the EternalRomance exploit code from the BadRabbit artifact we analyzed. Note that these low-level filters did not necessarily detect the specific exploit activity, but instead activate on byte sequences within the surrounding code in the same function.

These results indicate that the classifier has learned some very specific byte sequences related to ASCII text and instruction usage that relate to imports, function calls, and artifacts found within exploit code. This finding is surprising because in other machine learning domains, such as images, low-level filters often learn generic, reusable features across all classes.

Bird’s Eye View of End-to-End Features

While it seems that lower layers of our CNN classifier have learned particular byte sequences, the larger question is: does the depth and complexity of our classifier (i.e., the number of layers) help us extract more meaningful features as we move up the hierarchy? To answer this question, we have to examine the end-to-end relationships between the classifier’s decision and each of the input bytes. This allows us to directly evaluate each byte (or segment thereof) in the input sequence and see whether it pushed the classifier toward a decision of malware or goodware, and by how much. To accomplish this type of end-to-end analysis, we leverage the SHapley Additive exPlanations (SHAP) framework developed by Lundberg and Lee. In particular, we use the GradientSHAP method that combines a number of techniques to precisely identify the contributions of each input byte, with positive SHAP values indicating areas that can be considered to be malicious features and negative values for benign features.

After applying the GradientSHAP method to our ransomware dataset, we noticed that many of the most important end-to-end features were not directly related to the types of specific byte sequences that we discovered at lower layers of the classifier. Instead, many of the end-to-end features that we discovered mapped closely to features developed from manual feature engineering in our traditional ML models. As an example, the end-to-end analysis on our ransomware samples identified several malicious features in the checksum portion of the PE header, which is commonly used as a feature in traditional ML models. Other notable end-to-end features included the presence or absence of certain directory information related to certificates used to sign the PE files, anomalies in the section table that define the properties of the various sections of the PE file, and specific imports that are often used by malware (e.g., GetProcAddress and VirtualAlloc).

In Figure 6, we show the distribution of SHAP values across the file offsets for the worm artifact of the WannaCry ransomware family. Many of the most important malicious features found in this sample are focused in the PE header structures, including previously mentioned checksum and directory-related features. One particularly interesting observation from this sample, though, is that it contains another PE file embedded within it, and the CNN discovered two end-to-end features related to this. First, it identified an area of the section table that indicated the ‘.data’ section had a virtual size that was more than 10x larger than the stated physical size of the section. Second, it discovered maliciously-oriented imports and exports within the embedded PE file itself. Taken as a whole, these results show that the depth of our classifier appears to have helped it learn more abstract features and generalize beyond the specific byte sequences we observed in the activations at lower layers.


Figure 6: SHAP values for file offsets from the worm artifact of WannaCry. File offsets with positive values are associated with malicious end-to-end features, while offsets with negative values are associated with benign features.

Summary

In this blog post, we dove into the inner workings of FireEye’s byte-based deep learning classifier in order to understand what it, and other deep learning classifiers like it, are learning about malware from its unstructured raw bytes. Through our analysis, we have gained insight into a number of important aspects of the classifier’s operation, weaknesses, and strengths:

  • Import Features: Import-related features play a large role in classifying malware across all levels of the CNN architecture. We found evidence of ASCII-based import features in the embedding layer, low-level convolutional features, and end-to-end features.
  • Low-Level Instruction Features: Several features discovered at the lower layers of our CNN classifier focused on sequences of instructions that capture specific behaviors, such as particular types of function calls or code surrounding certain types of exploits. In many cases, these features were primarily associated with malware, which runs counter to the typical use of CNNs in other domains, such as image classification, where low-level features capture generic aspects of the data (e.g., lines and simple shapes). Additionally, many of these low-level features did not appear in the most malicious end-to-end features.
  • End-to-End Features: Perhaps the most interesting result of our analysis is that many of the most important maliciously-oriented end-to-end features closely map to common manually-derived features from traditional ML classifiers. Features like the presence or absence of certificates, obviously mangled checksums, and inconsistencies in the section table do not have clear analogs to the lower-level features we uncovered. Instead, it appears that the depth and complexity of our CNN classifier plays a key role in generalizing from specific byte sequences to meaningful and intuitive features.

It is clear that deep learning offers a promising path toward sustainable, cutting-edge malware classification. At the same time, significant improvements will be necessary to create a viable real-world solution that addresses the shortcomings discussed in this article. The most important next step will be improving the architecture to include more information about the structural, semantic, and syntactic context of the executable rather than treating it as an unstructured byte sequence. By adding this specialized domain knowledge directly into the deep learning architecture, we allow the classifier to focus on learning relevant features for each context, inferring relationships that would not be possible otherwise, and creating even more robust end-to-end features with better generalization properties.

The content of this blog post is based on research presented at the Conference on Applied Machine Learning for Information Security (CAMLIS) in Washington, DC on Oct. 12-13, 2018. Additional material, including slides and a video of the presentation, can be found on the conference website.

An extended version of the research presented in this blog post can be found in our peer-reviewed paper from the IEEE Deep Learning and Security workshop. A publicly-available version of the paper is also available.