Category Archives: machine learning

Machine learning trumps AI for security analysts

While machine learning is one of the biggest buzzwords in cybersecurity and the tech industry in general, the phrase itself is often overused and mis-applied, leaving many to have their own, incorrect definition of what machine learning actually is. So, how do you cut through all the noise to separate fact from fiction? And how can this tool be best applied to security operations? What is machine learning? Machine learning (ML) is an algorithm that … More

The post Machine learning trumps AI for security analysts appeared first on Help Net Security.

McAfee Blogs: Artificial Intelligence & Your Family: The Wows & the Risks

artificial intelligenceAm I the only one? When I hear or see the word Artificial Intelligence (AI), my mind instantly defaults to images from sci-fi movies I’ve seen like I, Robot, Matrix, and Ex Machina. There’s always been a futuristic element — and self-imposed distance — between AI and myself.

But AI is anything but futuristic or distant. AI is here, and it’s now. And, we’re using it in ways we may not even realize.

AI has been woven throughout our lives for years in various expressions of technology. AI is in our homes, workplaces, and our hands every day via our smartphones.

Just a few everyday examples of AI:

  • Cell phones with built-in smart assistants
  • Toys that listen and respond to children
  • Social networks that determine what content you see
  • Social networking apps with fun filters
  • GPS apps that help you get where you need to go
  • Movie apps that predict what show you’d enjoy next
  • Music apps that curate playlists that echo your taste
  • Video games that deploy bots to play against you
  • Advertisers who follow you online with targeted ads
  • Refrigerators that alert you when food is about to expire
  • Home assistants that carry out voice commands
  • Flights you take that operate via an AI autopilot

The Technology

While AI sounds a little intimidating, it’s not when you break it down. AI is technology that can be programmed to accomplish a specific set of goals without assistance. In short, it’s a computer’s ability to be predictive — to process data, evaluate it, and take action.

AI is being implemented in education, business, manufacturing, retail, transportation, and just about any other sector of industry and culture you can imagine. It’s the smarter, faster, more profitable way to accomplish manual tasks.

An there’s tons of AI-generated good going on. Instagram — the #2 most popular social network — is now using AI technology to detect and combat cyberbullying on in both comments and photos.

No doubt, AI is having a significant impact on everyday life and is positioned to transform the future.

Still, there are concerns. The self-driving cars. The robots that malfunction. The potential jobs lost to AI robots.

So, as quickly as this popular new technology is being applied, now is a great time to talk with your family about both the exciting potential of AI and the risks that may come with it.

Talking points for families

Fake videos, images. AI is making it easier for people to face swap within images and videos. A desktop application called FakeApp allows users to seamlessly swap faces and share fake videos and images. This has led to the rise in “deep fake” videos that appear remarkably realistic (many of which go viral). Tip: Talk to your family about the power of AI technology and the responsibility and critical thinking they must exercise as they consume and share online content.

Privacy breaches. Following the Cambridge Analytica/Facebook scandal of 2018 that allegedly used AI technology unethically to collect Facebook user data, we’re reminded of those out to gather our private (and public) information for financial or political gain. Tip: Discuss locking down privacy settings on social networks and encourage your kids to be hyper mindful about the information they share in the public feed. That information includes liking and commenting on other content — all of which AI technology can piece together into a broader digital picture for misuse.

Cybercrime. As outlined in McAfee’s 2019 Threats Prediction Report, AI technology will likely allow hackers more ease to bypass security measures on networks undetected. This can lead to data breaches, malware attacks, ransomware, and other criminal activity. Additionally, AI-generated phishing emails are scamming people into handing over sensitive data. Tip: Bogus emails can be highly personalized and trick intelligent users into clicking malicious links. Discuss the sophistication of the AI-related scams and warn your family to think about every click — even those from friends.

IoT security. With homes becoming “smarter” and equipped with AI-powered IoT products, the opportunity for hackers to get into these devices to steal sensitive data is growing. According to McAfee’s Threat Prediction Report, voice-activated assistants are especially vulnerable as a point-of-entry for hackers. Also at risk, say security experts, are routers, smartphones, and tablets. Tip: Be sure to keep all devices updated. Secure all of your connected devices and your home internet at its source — the network. Avoid routers that come with your ISP (Internet Security Provider) since they are often less secure. And, be sure to change the default password and secure your primary network and guest network with strong passwords.

The post Artificial Intelligence & Your Family: The Wows & the Risks appeared first on McAfee Blogs.



McAfee Blogs

AI & Your Family: The Wows and Potential Risks

artificial intelligenceAm I the only one? When I hear or see the word Artificial Intelligence (AI), my mind instantly defaults to images from sci-fi movies I’ve seen like I, Robot, Matrix, and Ex Machina. There’s always been a futuristic element — and self-imposed distance — between AI and myself.

But AI is anything but futuristic or distant. AI is here, and it’s now. And, we’re using it in ways we may not even realize.

AI has been woven throughout our lives for years in various expressions of technology. AI is in our homes, workplaces, and our hands every day via our smartphones.

Just a few everyday examples of AI:

  • Cell phones with built-in smart assistants
  • Toys that listen and respond to children
  • Social networks that determine what content you see
  • Social networking apps with fun filters
  • GPS apps that help you get where you need to go
  • Movie apps that predict what show you’d enjoy next
  • Music apps that curate playlists that echo your taste
  • Video games that deploy bots to play against you
  • Advertisers who follow you online with targeted ads
  • Refrigerators that alert you when food is about to expire
  • Home assistants that carry out voice commands
  • Flights you take that operate via an AI autopilot

The Technology

While AI sounds a little intimidating, it’s not when you break it down. AI is technology that can be programmed to accomplish a specific set of goals without assistance. In short, it’s a computer’s ability to be predictive — to process data, evaluate it, and take action.

AI is being implemented in education, business, manufacturing, retail, transportation, and just about any other sector of industry and culture you can imagine. It’s the smarter, faster, more profitable way to accomplish manual tasks.

An there’s tons of AI-generated good going on. Instagram — the #2 most popular social network — is now using AI technology to detect and combat cyberbullying on in both comments and photos.

No doubt, AI is having a significant impact on everyday life and is positioned to transform the future.

Still, there are concerns. The self-driving cars. The robots that malfunction. The potential jobs lost to AI robots.

So, as quickly as this popular new technology is being applied, now is a great time to talk with your family about both the exciting potential of AI and the risks that may come with it.

Talking points for families

Fake videos, images. AI is making it easier for people to face swap within images and videos. A desktop application called FakeApp allows users to seamlessly swap faces and share fake videos and images. This has led to the rise in “deep fake” videos that appear remarkably realistic (many of which go viral). Tip: Talk to your family about the power of AI technology and the responsibility and critical thinking they must exercise as they consume and share online content.

Privacy breaches. Following the Cambridge Analytica/Facebook scandal of 2018 that allegedly used AI technology unethically to collect Facebook user data, we’re reminded of those out to gather our private (and public) information for financial or political gain. Tip: Discuss locking down privacy settings on social networks and encourage your kids to be hyper mindful about the information they share in the public feed. That information includes liking and commenting on other content — all of which AI technology can piece together into a broader digital picture for misuse.

Cybercrime. As outlined in McAfee’s 2019 Threats Prediction Report, AI technology will likely allow hackers more ease to bypass security measures on networks undetected. This can lead to data breaches, malware attacks, ransomware, and other criminal activity. Additionally, AI-generated phishing emails are scamming people into handing over sensitive data. Tip: Bogus emails can be highly personalized and trick intelligent users into clicking malicious links. Discuss the sophistication of the AI-related scams and warn your family to think about every click — even those from friends.

IoT security. With homes becoming “smarter” and equipped with AI-powered IoT products, the opportunity for hackers to get into these devices to steal sensitive data is growing. According to McAfee’s Threat Prediction Report, voice-activated assistants are especially vulnerable as a point-of-entry for hackers. Also at risk, say security experts, are routers, smartphones, and tablets. Tip: Be sure to keep all devices updated. Secure all of your connected devices and your home internet at its source — the network. Avoid routers that come with your ISP (Internet Security Provider) since they are often less secure. And, be sure to change the default password and secure your primary network and guest network with strong passwords.

The post AI & Your Family: The Wows and Potential Risks appeared first on McAfee Blogs.

10 Cybersecurity Conference Trips You Should Make Time for This Year

Cybersecurity remains a top priority for chief information security officers (CISOs) worldwide, but it’s easy to get out of touch as the industry evolves at breakneck speed and attackers discover new and innovative ways to compromise corporate networks. That’s why it’s worth investing in cybersecurity conference trips to help IT professionals stay up-to-date by networking with vendors, thought leaders and colleagues.

Top Cybersecurity Conference Trips You Should Book in 2019

Not sure where to distribute your IT budgets for ideal returns? Here’s a roundup of some of the top cybersecurity conferences happening this year.

Cybertech Israel

Cybertech Israel will once again descend on Tel Aviv from Jan. 28-30. One of the premier B2B networking conferences for security professionals, Cybertech offers both a major exhibition and full conference schedule over the course of three days. This year, speakers will include Prime Minister of Israel Benjamin Netanyahu, Professor Dieter Kempf, president of the Federation of German Industries, and Dr. Sridhar Muppidi, IBM fellow and chief technology officer at IBM Security.

HIMSS 2019

Up next for the new year is HIMSS19, which will take place from Feb. 11–15 in Orlando, Florida. This year’s theme, “Champions of Health Unite,” will bring together insights from trailblazers, game-changers and strategizers to help health IT professionals set the stage for a secure and successful 2019. Topics will range from privacy and telehealth to care culture and clinician engagement. Given the critical role of technology in delivering and empowering health services, HIMSS19 promises to be a great starting point for this year’s conference lineup in the U.S.

Think 2019

IBM Think 2019, happening Feb. 12–15, is making the move this year to San Francisco. With more than 160 security-focused sessions across the conference’s dedicated Security and Resiliency Campus, there’s something for everyone. Key offerings include sessions on making security relevant to the C-suite, understanding the value of collaborative defense and transforming the role of incident response (IR) with new technologies such as IBM’s Watson.

View the Think 2019 security and resiliency curriculum roadmap

RSA Conference

One of the industry’s biggest annual conferences, RSAC is also held in San Francisco and will run from March 4–8. This year’s theme is “Better” — building better solutions, creating better connections and developing better responses. From securing robot-designed code to measuring data breach impacts and examining the value of human risk management, this massive conference (40,000+ attendees) always delivers value.

Cyphercon 4.0

Demonstrating that bigger isn’t always better, Cyphercon 4.0 will be held in Milwaukee from April 11–12. This cryptography and information security-focused offering strives to create an informal, welcoming environment that offers benefits for experts and beginners alike. All session abstracts are reviewed without speaker names attached, ensuring that only high-quality (not merely high-profile) presentations make the cut.

40th IEEE Symposium on Security and Privacy

With the General Data Protection Regulation (GDPR) now in full effect and privacy legislation a top priority for many countries, enterprises would be well served by any cybersecurity conference that tackles this increasingly complex field. The Institute of Electrical and Electronics Engineers (IEEE)’s 40th symposium will take place in San Francisco from May 20–22 and wil lbring together some of the industry’s leading researchers and practitioners to help organizations evaluate their current privacy policies and prepare for the next generation of personal data defense.

Gartner Security and Risk Management Summit

Happening in National Harbor, Maryland, from June 17–20, Gartner’s yearly conference includes sessions about emerging information security priorities such as machine learning, analytics and blockchain. More generally, the conference tackles the critical need to make security and risk top organizational priorities by offering a combination of meaningful networks, expert guidance and real-world scenarios.

Black Hat

One of two premier hacker conferences taking place in Las Vegas each summer — DEF CON is the other — Black Hat is more formal and also one of the most popular conferences every year. This year, the conference will be held from Aug. 3–8. Topics are wide-ranging; last year’s event examined the potential of voting machine compromise, and in 2015, researchers hacked a moving Jeep.

BSides

BSides, scheduled for Aug. 6–7 in Las Vegas, is a free conference that will celebrate its 10th year in 2019 and offers the benefit of small-group participation for all attendees. Walk-in passes are snapped up quickly, so if you’re in town for Black Hat or DEF CON, make sure to stop by the Tuscany Suites; this year, BSides has the entire hotel booked.

GrrCon

Rounding out the year is the more informal GrrCon, scheduled for Oct. 24–25 in Grand Rapids, Michigan. This conference is small — just 1,500 attendees — and focuses on creating a fun atmosphere where executives, security professionals, students and hackers can exchange ideas and uncover new insights.

Start the Year Off Strong

Less than 24 hours after the ball dropped in Times Square, this year saw its first data breach: As reported by CBR Online, more than 30,000 Australian civil servants had their data stolen. It’s a bellwether for 2019 — a not-so-subtle sign that threat actors will continue to compromise corporate data to leverage or generate profit. More importantly, it’s a reminder to start the year off strong — to revisit existing security polices, design more holistic defenses and make time for the best cybersecurity conference offerings of 2019.

The post 10 Cybersecurity Conference Trips You Should Make Time for This Year appeared first on Security Intelligence.

Stay Ahead of the Growing Security Analytics Market With These Best Practices

As breach rates climb and threat actors continue to evolve their techniques, many IT security teams are turning to new tools in the fight against corporate cybercrime. The proliferation of internet of things (IoT) devices, network services and other technologies in the enterprise has expanded the attack surface every year and will continue to do so. This evolving landscape is prompting organizations to seek out new ways of defending critical assets and gathering threat intelligence.

The Security Analytics Market Is Poised for Massive Growth

Enter security analytics, which mixes threat intelligence with big data capabilities to help detect, analyze and mitigate targeted attacks and persistent threats from outside actors as well as those already inside corporate walls.

“It’s no longer enough to protect against outside attacks with perimeter-based cybersecurity solutions,” said Hani Mustafa, CEO and co-founder of Jazz Networks. “Cybersecurity tools that blend user behavior analytics (UBA), machine learning and data visibility will help security professionals contextualize data and demystify human behavior, allowing them to predict, prevent and protect against insider threats.”

Security analytics can also provide information about attempted breaches from outside sources. Analytics tools work together with existing network defenses and strategies and offer a deeper view into suspicious activity, which could be missed or overlooked for long periods due to the massive amount of superfluous data collected each day.

Indeed, more security teams are seeing the value of analytics as the market appears poised for massive growth. According to Global Market Insights, the security analytics market was valued at more than $2 billion in 2015, and it is estimated to grow by more than 26 percent over the coming years — exceeding $8 billion by 2023. ABI Research put that figure even higher, estimating that the need for these tools will drive the security analytics market toward a revenue of $12 billion by 2024.

Why Are Security Managers Turning to Analytics?

For most security managers, investment in analytics tools represents a way to fill the need for more real-time, actionable information that plays a role in a layered, robust security strategy. Filtering out important information from the massive amounts of data that enterprises deal with daily is a primary goal for many leaders. Businesses are using these tools for many use cases, including analyzing user behavior, examining network traffic, detecting insider threats, uncovering lost data, and reviewing user roles and permissions.

“There has been a shift in cybersecurity analytics tooling over the past several years,” said Ray McKenzie, founder and managing director of Red Beach Advisors. “Companies initially were fine with weekly or biweekly security log analytics and threat identification. This has morphed to real-time analytics and tooling to support vulnerability awareness.”

Another reason for analytics is to gain better insight into the areas that are most at risk within an IT environment. But in efforts to cull important information from a wide variety of potential threats, these tools also present challenges to the teams using them.

“The technology can also cause alert fatigue,” said Simon Whitburn, global senior vice president, cybersecurity services at Nominet. “Effective analytics tools should have the ability to reduce false positives while analyzing data in real-time to pinpoint and eradicate malicious activity quickly. At the end of the day, the key is having access to actionable threat intelligence.”

Personalization Is Paramount

Obtaining actionable threat intelligence means configuring these tools with your unique business needs in mind.

“There is no ‘plug and play’ solution in the security analytics space,” said Liviu Arsene, senior cybersecurity analyst at Bitdefender. “Instead, the best way forward for organizations is to identify and deploy the analytics tools that best fits an organization’s needs.”

When evaluating security analytics tools, consider the company’s size and the complexity of the challenges the business hopes to address. Organizations that use analytics may need to include features such as deployment models, scope and depth of analysis, forensics, and monitoring, reporting and visualization. Others may have simpler needs with minimal overhead and a smaller focus on forensics and advanced persistent threats (APTs).

“While there is no single analytics tool that works for all organizations, it’s important for organizations to fully understand the features they need for their infrastructure,” said Arsene.

Best Practices for Researching and Deploying Analytics Solutions

Once you have established your organization’s needs and goals for investing in security analytics, there are other important considerations to keep in mind.

Emphasize Employee Training

Chief information security officers (CISOs) and security managers must ensure that their staffs are prepared to use the tools at the outset of deployment. Training employees on how to make sense of information among the noise of alerts is critical.

“Staff need to be trained to understand the results being generated, what is important, what is not and how to respond,” said Steve Tcherchian, CISO at XYPRO Technology Corporation.

Look for Tools That Can Change With the Threat Landscape

Security experts know that criminals are always one step ahead of technology and tools and that the threat landscape is always evolving. It’s essential to invest in tools that can handle relevant data needs now, but also down the line in several years. In other words, the solutions must evolve alongside the techniques and methodologies of threat actors.

“If the security tools an organization uses remain stagnant in their programming and update schedule, more vulnerabilities will be exposed through other approaches,” said Victor Congionti of Proven Data.

Understand That Analytics Is Only a Supplement to Your Team

Analytics tools are by no means a replacement for your security staff. Having analysts who can understand and interpret data is necessary to get the most out of these solutions.

Be Mindful of the Limitations of Security Analytics

Armed with security analytics tools, organizations can benefit from big data capabilities to analyze data and enhance detection with proactive alerts about potential malicious activity. However, analytics tools have their limitations, and enterprises that invest must evaluate and deploy these tools with their unique business needs in mind. The data obtained from analytics requires context, and trained staff need to understand how to make sense of important alerts among the noise.

The post Stay Ahead of the Growing Security Analytics Market With These Best Practices appeared first on Security Intelligence.

Tech Support Scams: What are They and How do I Stay Safe?

If you read this blog regularly you’re no doubt aware that cyber-criminals are a determined bunch, with a large range of tools and tactics at their disposal to rob you of your identity and hard-earned cash. Tech support scams (TSS) are an increasingly popular way for them to do just this. In 2017, Microsoft Customer Support Services received 153,000 reports from customers around the world who encountered or fell victim to these scams, a 24 percent increase on the year previous. Many lost hundreds of dollars in the process.

Yet the real scale of the problem is likely to be many times bigger.

If you’re still unsure what tech support scams are, and how you can protect yourself, this handy guide will tell you everything you need to know.

What types of tech support scam are there?

Tech support scams target users of any devices, platforms and software and can involve a variety of tactics. Typically, they include both an online element and/or a phone call with the scammer, who pretends to be technical support worker for a reputable company like Microsoft or your ISP. They try to trick you into believing there’s something wrong with your computer so that you agree either to hand over money (and credit card details) to ‘fix’ it, and/or allow them remote access to your machine — which enables them to download covert info-stealing malware.

Here are the two main ways a TSS can begin:

  • Cold calling: You could get a call at any time from one of these fake ‘tech support’ workers. They may even hijack Caller ID to appear legitimate. They’ll try to bamboozle you with tech jargon and create a sense of urgency that your machine and the data on it is in danger if you don’t act immediately.
    They’ll usually persuade you to download a special tool so they can remotely access your PC. They’ll then pretend your machine is infected with malware and ask for payment to remove it, or to buy a meaningless maintenance, support, or security package. Ironically, by giving them access to your PC, you’ve provided an opportunity for the scammers to download real malware to steal more of your personal information.
  • Online issues: A scam could also start online, if you accidentally visit a malicious website. How might you do this? Potentially, by mistyping the address of your favorite site into the address bar, or by clicking on a scam link in an unsolicited email. You might even have been searching for some breaking news on a particular high-profile story, only to find a link high up on the search listings took you to a malicious website.
    After doing so you might suddenly be presented with pop-ups saying your computer is infected with malware or malfunctioning. Sometimes they put your browser onto full screen mode with alerts which can’t be removed, effectively locking your screen. The message they display is likely to have a ‘tech support’ phone number you’re urged to call to sort the non-existent problem out. That will put you through to those same scammers that cold call users in scenario 1.

The bottom line is that if you fall for one of these tactics, you may lose an initial sum of money by paying the scammer, but also be exposed to further fraud on that card in the future as they’ll have your details on file. You could also be at risk of identity theft if the bad guys have downloaded malware to steal more personal info from your machine, like banking log-ins, Social Security numbers and more.

Microsoft claimed last year that three million users are subject to these scams every month, and more than half (56%) are from the US. The FBI, meanwhile, estimated tech support fraud losses in 2017 amounted to $15 million, an 86 percent increase on the previous year.

How do I stay safe, or recover, from a scam?

Fortunately, there are several things you can do to prevent the scammers getting what they want, and even if you are caught out, some quick thinking can help to minimize the impact on your life and finances.

Staying safe:

  • If you receive an unsolicited phone call claiming to come from Apple, Microsoft, Verizon or similar, hang up, or get more details and call the company back directly. Don’t hand over any personal or financial information and don’t allow the caller to download anything to your computer.
  • Stay up-to-date with the latest browser and software/OS versions to minimize the chances the bad guys can take you to malicious sites or launch pop-ups on your machine.
  • Take extra care when typing website names into your address bar.
  • Be cautious online: don’t click on any links in unsolicited emails or on websites.
  • Only download software from legitimate vendor websites/app stores.
  • Invest in third-party security software from a reputable supplier like Trend Micro, to detect TSS malware.

If you’ve been scammed:

  • Immediately delete any remote-access software the scammer may have encouraged you to install.
  • Download and use software from a provider like Trend Micro to detect and remove any installed malware.
  • Once malware has been fully removed, change all your computer and online account passwords.
  • Call your bank/credit or debit card provider to cancel relevant cards and claim back any money already lost.
  • Continue to monitor bank and online account activity and take action if there’s anything suspicious.
  • Upgrade your software, OS and browser to the latest versions.
  • Beware of follow-on scams in the coming days, weeks, or months.
  • Report the scam to Microsoft, Apple or other relevant provider.

How can Trend Micro help?

For the online side of tech support scams, Trend Micro Security offers comprehensive multi-layered protection from the malicious sites, pop-ups, browser takeovers and malware associated with tech support scams. Here are just some of the techniques we use to keep you safe:

  • Web Reputation Service: Blocks access to any malicious URLs linked to scams.
  • Script Analyzer Lineup: Scans websites for any malicious code run on the web pages, to detect the presence of potential tech support threats.
  • Real-time Virus Scanner: Blocks any suspected malware downloads from support scam sites.
  • Static Intelligence Engine: Leverages machine learning to greatly enhance the detection of tech support scams.
  • Scanning/malware removal: Cleans-up any malware installed on infected machines if you have been caught out by a support scam.

Visit Trend Micro Security to find out more about how TMS protects you, or to buy the product.

The post Tech Support Scams: What are They and How do I Stay Safe? appeared first on .

Obfuscated Command Line Detection Using Machine Learning

This blog post presents a machine learning (ML) approach to solving an emerging security problem: detecting obfuscated Windows command line invocations on endpoints. We start out with an introduction to this relatively new threat capability, and then discuss how such problems have traditionally been handled. We then describe a machine learning approach to solving this problem and point out how ML vastly simplifies development and maintenance of a robust obfuscation detector. Finally, we present the results obtained using two different ML techniques and compare the benefits of each.

Introduction

Malicious actors are increasingly “living off the land,” using built-in utilities such as PowerShell and the Windows Command Processor (cmd.exe) as part of their infection workflow in an effort to minimize the chance of detection and bypass whitelisting defense strategies. The release of new obfuscation tools makes detection of these threats even more difficult by adding a layer of indirection between the visible syntax and the final behavior of the command. For example, Invoke-Obfuscation and Invoke-DOSfuscation are two recently released tools that automate the obfuscation of Powershell and Windows command lines respectively.

The traditional pattern matching and rule-based approaches for detecting obfuscation are difficult to develop and generalize, and can pose a huge maintenance headache for defenders. We will show how using ML techniques can address this problem.

Detecting obfuscated command lines is a very useful technique because it allows defenders to reduce the data they must review by providing a strong filter for possibly malicious activity. While there are some examples of “legitimate” obfuscation in the wild, in the overwhelming majority of cases, the presence of obfuscation generally serves as a signal for malicious intent.

Background

There has been a long history of obfuscation being employed to hide the presence of malware, ranging from encryption of malicious payloads (starting with the Cascade virus) and obfuscation of strings, to JavaScript obfuscation. The purpose of obfuscation is two-fold:

  • Make it harder to find patterns in executable code, strings or scripts that can easily be detected by defensive software.
  • Make it harder for reverse engineers and analysts to decipher and fully understand what the malware is doing.

In that sense, command line obfuscation is not a new problem – it is just that the target of obfuscation (the Windows Command Processor) is relatively new. The recent release of tools such as Invoke-Obfuscation (for PowerShell) and Invoke-DOSfuscation (for cmd.exe) have demonstrated just how flexible these commands are, and how even incredibly complex obfuscation will still run commands effectively.

There are two categorical axes in the space of obfuscated vs. non-obfuscated command lines: simple/complex and clear/obfuscated (see Figure 1 and Figure 2). For this discussion “simple” means generally short and relatively uncomplicated, but can still contain obfuscation, while “complex” means long, complicated strings that may or may not be obfuscated. Thus, the simple/complex axis is orthogonal to obfuscated/unobfuscated. The interplay of these two axes produce many boundary cases where simple heuristics to detect if a script is obfuscated (e.g. length of a command) will produce false positives on unobfuscated samples. The flexibility of the command line processor makes classification a difficult task from an ML perspective.


Figure 1: Dimensions of obfuscation


Figure 2: Examples of weak and strong obfuscation

Traditional Obfuscation Detection

Traditional obfuscation detection can be split into three approaches. One approach is to write a large number of complex regular expressions to match the most commonly abused syntax of the Windows command line. Figure 3 shows one such regular expression that attempts to match ampersand chaining with a call command, a common pattern seen in obfuscation. Figure 4 shows an example command sequence this regex is designed to detect.


Figure 3: A common obfuscation pattern captured as a regular expression


Figure 4: A common obfuscation pattern (calling echo in obfuscated fashion in this example)

There are two problems with this approach. First, it is virtually impossible to develop regular expressions to cover every possible abuse of the command line. The flexibility of the command line results in a non-regular language, which is feasible yet impractical to express using regular expressions. A second issue with this approach is that even if a regular expression exists for the technique a malicious sample is using, a determined attacker can make minor modifications to avoid the regular expression. Figure 5 shows a minor modification to the sequence in Figure 4, which avoids the regex detection.


Figure 5: A minor change (extra carets) to an obfuscated command line that breaks the regular expression in Figure 3

The second approach, which is closer to an ML approach, involves writing complex if-then rules. However, these rules are hard to derive, are complex to verify, and pose a significant maintenance burden as authors evolve to escape detection by such rules. Figure 6 shows one such if-then rule.


Figure 6: An if-then rule that *may* indicate obfuscation (notice how loose this rule is, and how false positives are likely)

A third approach is to combine regular expressions and if-then rules. This greatly complicates the development and maintenance burden, and still suffers from the same weaknesses that make the first two approaches fragile. Figure 7 shows an example of an if-then rule with regular expressions. Clearly, it is easy to appreciate how burdensome it is to generate, test, maintain and determine the efficacy of such rules.


Figure 7: A combination of an if-then rule with regular expressions to detect obfuscation (a real hand-built obfuscation detector would consist of tens or hundreds of rules and still have gaps in its detection)

The ML Approach – Moving Beyond Pattern Matching and Rules

Using ML simplifies the solution to these problems. We will illustrate two ML approaches: a feature-based approach and a feature-less end-to-end approach.

There are some ML techniques that can work with any kind of raw data (provided it is numeric), and neural networks are a prime example. Most other ML algorithms require the modeler to extract pertinent information, called features, from raw data before they are fed into the algorithm. Some examples of this latter type are tree-based algorithms, which we will also look at in this blog (we described the structure and uses of Tree-Based algorithms in a previous blog post, where we used a Gradient-Boosted Tree-Based Model).

ML Basics – Neural Networks

Neural networks are a type of ML algorithm that have recently become very popular and consist of a series of elements called neurons. A neuron is essentially an element that takes a set of inputs, computes a weighted sum of these inputs, and then feeds the sum into a non-linear function. It has been shown that a relatively shallow network of neurons can approximate any continuous mapping between input and output. The specific type of neural network we used for this research is what is called a Convolutional Neural Network (CNN), which was developed primarily for computer vision applications, but has also found success in other domains including natural language processing. One of the main benefits of a neural network is that it can be trained without having to manually engineer features.

Featureless ML

While neural networks can be used with feature data, one of the attractions of this approach is that it can work with raw data (converted into numeric form) without doing any feature design or extraction. The first step in the model is converting text data into numeric form. We used a character-based encoding where each character type was encoded by a real valued number. The value was automatically derived during training and conveys semantic information about the relationships between characters as they apply to cmd.exe syntax.

Feature-Based ML

We also experimented with hand-engineered features and a Gradient Boosted Decision Tree algorithm. The features developed for this model were largely statistical in nature – derived from the presence and frequency of character sets and keywords. For example, the presence of dozens of ‘%’ characters or long, contiguous strings might contribute to detecting potential obfuscation. While any single feature will not perfectly separate the two classes, a combination of features as present in a tree-based model can learn flexible patterns in the data. The expectation is that those patterns are robust and can generalize to future obfuscation variants.

Data and Experiments

To develop our models, we collected non-obfuscated data from tens of thousands of endpoint events and generated obfuscated data using a variety of methods in Invoke-DOSfuscation. We developed our models using roughly 80 percent of the data as training data, and tested them on the remaining 20 percent. We ensured that our train-test split was stratified. For featureless ML (i.e. neural networks), we simply input Unicode code points into the first layer of the CNN model. The first layer converts the code point into semantically meaningful numerical representations (called embeddings) before feeding it into the rest of the neural network.

For the Gradient Boosted Tree method, we generated a number of features from the raw command lines. The following are some of them:

  • Length of the command line
  • The number of carets in the command line
  • The count of pipe symbols
  • The fraction of white space in the command line
  • The fraction of special characters
  • Entropy of the string
  • The frequency of the strings “cmd” and “power” in the command line

While each of these features individually is a weak signal and could not possibly be a good discriminator on its own, a flexible classifier such as a Gradient Boosted Tree – trained on sufficient data with these features – is able to classify obfuscated and non-obfuscated command lines in spite of the aforementioned difficulties.

Results

Evaluated against our test set, we were able to get nearly identical results from our Gradient Boosted Tree and neural network models.

The results for the GBT model were near perfect with metrics such as F1-score, precision, and recall all being close to 1.0. The CNN model was slightly less accurate.

While we certainly do not expect perfect results in a real-world scenario, these lab results were nonetheless encouraging. Recall that all of our obfuscated examples were generated by one source, namely the Invoke-DOSfuscation tool. While Invoke-DOSfuscation generates a wide variety of obfuscated samples, in the real world we expect to see at least some samples that are quite dissimilar from any that Invoke-DOSfuscation generates. We are currently collecting real world obfuscated command lines to get a more accurate picture of the generalizability of this model on obfuscated samples from actual malicious actors. We expect that command obfuscation, similar to PowerShell obfuscation before it, will continue to emerge in new malware families.

As an additional test we asked Daniel Bohannon (author of Invoke-DOSfuscation, the Windows command line obfuscation tool) to come up with obfuscated samples that in his experience would be difficult for a traditional obfuscation detector. In every case, our ML detector was still able to detect obfuscation. Some examples are shown in Figure 8.


Figure 8: Some examples of obfuscated text used to test and attempt to defeat the ML obfuscation detector (all were correctly identified as obfuscated text)

We also created very cryptic looking texts that, although valid Windows command lines and non-obfuscated, appear slightly obfuscated to a human observer. This was done to test efficacy of the detector with boundary examples. The detector was correctly able to classify the text as non-obfuscated in this case as well. Figure 9 shows one such example.


Figure 9: An example that appears on first glance to be obfuscated, but isn't really and would likely fool a non-ML solution (however, the ML obfuscation detector currently identifies it as non-obfuscated)

Finally, Figure 10 shows a complicated yet non-obfuscated command line that is correctly classified by our obfuscation detector, but would likely fool a non-ML detector based on statistical features (for example a rule-based detector with a hand-crafted weighing scheme and a threshold, using features such as the proportion of special characters, length of the command line or entropy of the command line).


Figure 10: An example that would likely be misclassified by an ML detector that uses simplistic statistical features; however, our ML obfuscation detector currently identifies it as non-obfuscated

CNN vs. GBT Results

We compared the results of a heavily tuned GBT classifier built using carefully selected features to those of a CNN trained with raw data (featureless ML). While the CNN architecture was not heavily tuned, it is interesting to note that with samples such as those in Figure 10, the GBT classifier confidently predicted non-obfuscated with a score of 19.7 percent (the complement of the measure of the classifier’s confidence in non-obfuscation). Meanwhile, the CNN classifier predicted non-obfuscated with a confidence probability of 50 percent – right at the boundary between obfuscated and non-obfuscated. The number of misclassifications of the CNN model was also more than that of the Gradient Boosted Tree model. Both of these are most likely the result of inadequate tuning of the CNN, and not a fundamental shortcoming of the featureless approach.

Conclusion

In this blog post we described an ML approach to detecting obfuscated Windows command lines, which can be used as a signal to help identify malicious command line usage. Using ML techniques, we demonstrated a highly accurate mechanism for detecting such command lines without resorting to the often inadequate and costly technique of maintaining complex if-then rules and regular expressions. The more comprehensive ML approach is flexible enough to catch new variations in obfuscation, and when gaps are detected, it can usually be handled by adding some well-chosen evader samples to the training set and retraining the model.

This successful application of ML is yet another demonstration of the usefulness of ML in replacing complex manual or programmatic approaches to problems in computer security. In the years to come, we anticipate ML to take an increasingly important role both at FireEye and in the rest of the cyber security industry.

Ethics In Artificial Intelligence: Introducing The SHERPA Consortium

In May of this year, Horizon 2020 SHERPA project activities kicked off with a meeting in Brussels. F-Secure is a partner in the SHERPA consortium – a group consisting of 11 members from six European countries – whose mission is to understand how the combination of artificial intelligence and big data analytics will impact ethics and human rights issues today, and in the future (https://www.project-sherpa.eu/).

As part of this project, one of F-Secure’s first tasks will be to study security issues, dangers, and implications of the use of data analytics and artificial intelligence, including applications in the cyber security domain. This research project will examine:

  • ways in which machine learning systems are commonly mis-implemented (and recommendations on how to prevent this from happening)
  • ways in which machine learning models and algorithms can be adversarially attacked (and mitigations against such attacks)
  • how artificial intelligence and data analysis methodologies might be used for malicious purposes

We’ve already done a fair bit of this research*, so expect to see more articles on this topic in the near future!

 

As strange as it sounds, I sometimes find powerpoint a good tool for arranging my thoughts, especially before writing a long document. As an added bonus, I have a presentation ready to go, should I need it.

 

 

Some members of the SHERPA project recently attended WebSummit in Lisbon – a four day event with over 70,000 attendees and over 70 dedicated discussions and panels. Topics related to artificial intelligence were prevalent this year, ranging from tech presentations on how to develop better AI, to existential debates on the implications of AI on the environment and humanity. The event attracted a wide range of participants, including many technologists, politicians, and NGOs.

During WebSummit, SHERPA members participated in the Social Innovation Village, where they joined forces with projects and initiatives such as Next Generation Internet, CAPPSI, MAZI, DemocratieOuverte, grassroots radio, and streetwize to push for “more social good in technology and more technology in social good”. Here, SHERPA researchers showcased the work they’ve already done to deepen the debate on the implications of AI in policing, warfare, education, health and social care, and transport.

The presentations attracted the keen interest of representatives from more than 100 large and small organizations and networks in Europe and further afield, including the likes of Founder’s Institute, Google, and Amazon, and also led to a public commitment by Carlos Moedas, the European Commissioner for Research, Science and Innovation. You can listen to the highlights of the conversation here.

To get a preview of SHERPA’s scenario work and take part in the debate click here.

 


* If you’re wondering why I haven’t blogged in a long while, it’s because I’ve been hiding away, working on a bunch of AI-related research projects (such as this). Down the road, I’m hoping to post more articles and code – if and when I have results to share 😉

AI and the future of cybersecurity work

In February 2014, journalist Martin Wolf wrote a piece for the London Financial Times[1] titled Enslave the robots and free the poor. He began the piece with the following quote:

“In 1955, Walter Reuther, head of the US car workers’ union, told of a visit to a new automatically operated Ford plant. Pointing to all the robots, his host asked: How are you going to collect union dues from those guys? Mr. Reuther replied: And how are you going to get them to buy Fords?”

Near and long-term directions for adversarial AI in cybersecurity

The frenetic pace at which artificial intelligence (AI) has advanced in the past few years has begun to have transformative effects across a wide variety of fields. Coupled with an increasingly (inter)-connected world in which cyberattacks occur with alarming frequency and scale, it is no wonder that the field of cybersecurity has now turned its eye to AI and machine learning (ML) in order to detect and defend against adversaries.

The use of AI in cybersecurity not only expands the scope of what a single security expert is able to monitor, but importantly, it also enables the discovery of attacks that would have otherwise been undetectable by a human. Just as it was nearly inevitable that AI would be used for defensive purposes, it is undeniable that AI systems will soon be put to use for attack purposes.

Choosing an optimal algorithm for AI in cybersecurity

In the last blog post, we alluded to the No-Free-Lunch (NFL) theorems for search and optimization. While NFL theorems are criminally misunderstood and misrepresented in the service of crude generalizations intended to make a point, I intend to deploy a crude NFL generalization to make just such a point.

You see, NFL theorems (roughly) state that given a universe of problem sets where an algorithm’s goal is to learn a function that maps a set of input data X to a set of target labels Y, for any subset of problems where algorithm A outperforms algorithm B, there will be a subset of problems where B outperforms A. In fact, averaging their results over the space of all possible problems, the performance of algorithms A and B will be the same.

With some hand waving, we can construct an NFL theorem for the cybersecurity domain:  Over the set of all possible attack vectors that could be employed by a hacker, no single detection algorithm can outperform all others across the full spectrum of attacks.

Malicious PowerShell Detection via Machine Learning

Introduction

Cyber security vendors and researchers have reported for years how PowerShell is being used by cyber threat actors to install backdoors, execute malicious code, and otherwise achieve their objectives within enterprises. Security is a cat-and-mouse game between adversaries, researchers, and blue teams. The flexibility and capability of PowerShell has made conventional detection both challenging and critical. This blog post will illustrate how FireEye is leveraging artificial intelligence and machine learning to raise the bar for adversaries that use PowerShell.

In this post you will learn:

  • Why malicious PowerShell can be challenging to detect with a traditional “signature-based” or “rule-based” detection engine.
  • How Natural Language Processing (NLP) can be applied to tackle this challenge.
  • How our NLP model detects malicious PowerShell commands, even if obfuscated.
  • The economics of increasing the cost for the adversaries to bypass security solutions, while potentially reducing the release time of security content for detection engines.

Background

PowerShell is one of the most popular tools used to carry out attacks. Data gathered from FireEye Dynamic Threat Intelligence (DTI) Cloud shows malicious PowerShell attacks rising throughout 2017 (Figure 1).


Figure 1: PowerShell attack statistics observed by FireEye DTI Cloud in 2017 – blue bars for the number of attacks detected, with the red curve for exponentially smoothed time series

FireEye has been tracking the malicious use of PowerShell for years. In 2014, Mandiant incident response investigators published a Black Hat paper that covers the tactics, techniques and procedures (TTPs) used in PowerShell attacks, as well as forensic artifacts on disk, in logs, and in memory produced from malicious use of PowerShell. In 2016, we published a blog post on how to improve PowerShell logging, which gives greater visibility into potential attacker activity. More recently, our in-depth report on APT32 highlighted this threat actor's use of PowerShell for reconnaissance and lateral movement procedures, as illustrated in Figure 2.


Figure 2: APT32 attack lifecycle, showing PowerShell attacks found in the kill chain

Let’s take a deep dive into an example of a malicious PowerShell command (Figure 3).


Figure 3: Example of a malicious PowerShell command

The following is a quick explanation of the arguments:

  • -NoProfile – indicates that the current user’s profile setup script should not be executed when the PowerShell engine starts.
  • -NonI – shorthand for -NonInteractive, meaning an interactive prompt to the user will not be presented.
  • -W Hidden – shorthand for “-WindowStyle Hidden”, which indicates that the PowerShell session window should be started in a hidden manner.
  • -Exec Bypass – shorthand for “-ExecutionPolicy Bypass”, which disables the execution policy for the current PowerShell session (default disallows execution). It should be noted that the Execution Policy isn’t meant to be a security boundary.
  • -encodedcommand – indicates the following chunk of text is a base64 encoded command.

What is hidden inside the Base64 decoded portion? Figure 4 shows the decoded command.


Figure 4: The decoded command for the aforementioned example

Interestingly, the decoded command unveils a stealthy fileless network access and remote content execution!

  • IEX is an alias for the Invoke-Expression cmdlet that will execute the command provided on the local machine.
  • The new-object cmdlet creates an instance of a .NET Framework or COM object, here a net.webclient object.
  • The downloadstring will download the contents from <url> into a memory buffer (which in turn IEX will execute).

It’s worth mentioning that a similar malicious PowerShell tactic was used in a recent cryptojacking attack exploiting CVE-2017-10271 to deliver a cryptocurrency miner. This attack involved the exploit being leveraged to deliver a PowerShell script, instead of downloading the executable directly. This PowerShell command is particularly stealthy because it leaves practically zero file artifacts on the host, making it hard for traditional antivirus to detect.

There are several reasons why adversaries prefer PowerShell:

  1. PowerShell has been widely adopted in Microsoft Windows as a powerful system administration scripting tool.
  2. Most attacker logic can be written in PowerShell without the need to install malicious binaries. This enables a minimal footprint on the endpoint.
  3. The flexible PowerShell syntax imposes combinatorial complexity challenges to signature-based detection rules.

Additionally, from an economics perspective:

  • Offensively, the cost for adversaries to modify PowerShell to bypass a signature-based rule is quite low, especially with open source obfuscation tools.
  • Defensively, updating handcrafted signature-based rules for new threats is time-consuming and limited to experts.

Next, we would like to share how we at FireEye are combining our PowerShell threat research with data science to combat this threat, thus raising the bar for adversaries.

Natural Language Processing for Detecting Malicious PowerShell

Can we use machine learning to predict if a PowerShell command is malicious?

One advantage FireEye has is our repository of high quality PowerShell examples that we harvest from our global deployments of FireEye solutions and services. Working closely with our in-house PowerShell experts, we curated a large training set that was comprised of malicious commands, as well as benign commands found in enterprise networks.

After we reviewed the PowerShell corpus, we quickly realized this fit nicely into the NLP problem space. We have built an NLP model that interprets PowerShell command text, similar to how Amazon Alexa interprets your voice commands.

One of the technical challenges we tackled was synonym, a problem studied in linguistics. For instance, “NOL”, “NOLO”, and “NOLOGO” have identical semantics in PowerShell syntax. In NLP, a stemming algorithm will reduce the word to its original form, such as “Innovating” being stemmed to “Innovate”.

We created a prefix-tree based stemmer for the PowerShell command syntax using an efficient data structure known as trie, as shown in Figure 5. Even in a complex scripting language such as PowerShell, a trie can stem command tokens in nanoseconds.


Figure 5: Synonyms in the PowerShell syntax (left) and the trie stemmer capturing these equivalences (right)

The overall NLP pipeline we developed is captured in the following table:

NLP Key Modules

Functionality

Decoder

Detect and decode any encoded text

Named Entity Recognition (NER)

Detect and recognize any entities such as IP, URL, Email, Registry key, etc.

Tokenizer

Tokenize the PowerShell command into a list of tokens

Stemmer

Stem tokens into semantically identical token, uses trie

Vocabulary Vectorizer

Vectorize the list of tokens into machine learning friendly format

Supervised classifier

Binary classification algorithms:

  • Kernel Support Vector Machine
  • Gradient Boosted Trees
  • Deep Neural Networks

Reasoning

The explanation of why the prediction was made. Enables analysts to validate predications.

The following are the key steps when streaming the aforementioned example through the NLP pipeline:

  • Detect and decode the Base64 commands, if any
  • Recognize entities using Named Entity Recognition (NER), such as the <URL>
  • Tokenize the entire text, including both clear text and obfuscated commands
  • Stem each token, and vectorize them based on the vocabulary
  • Predict the malicious probability using the supervised learning model


Figure 6: NLP pipeline that predicts the malicious probability of a PowerShell command

More importantly, we established a production end-to-end machine learning pipeline (Figure 7) so that we can constantly evolve with adversaries through re-labeling and re-training, and the release of the machine learning model into our products.


Figure 7: End-to-end machine learning production pipeline for PowerShell machine learning

Value Validated in the Field

We successfully implemented and optimized this machine learning model to a minimal footprint that fits into our research endpoint agent, which is able to make predictions in milliseconds on the host. Throughout 2018, we have deployed this PowerShell machine learning detection engine on incident response engagements. Early field validation has confirmed detections of malicious PowerShell attacks, including:

  • Commodity malware such as Kovter.
  • Red team penetration test activities.
  • New variants that bypassed legacy signatures, while detected by our machine learning with high probabilistic confidence.

The unique values brought by the PowerShell machine learning detection engine include:  

  • The machine learning model automatically learns the malicious patterns from the curated corpus. In contrast to traditional detection signature rule engines, which are Boolean expression and regex based, the NLP model has lower operation cost and significantly cuts down the release time of security content.
  • The model performs probabilistic inference on unknown PowerShell commands by the implicitly learned non-linear combinations of certain patterns, which increases the cost for the adversaries to bypass.

The ultimate value of this innovation is to evolve with the broader threat landscape, and to create a competitive edge over adversaries.

Acknowledgements

We would like to acknowledge:

  • Daniel Bohannon, Christopher Glyer and Nick Carr for the support on threat research.
  • Alex Rivlin, HeeJong Lee, and Benjamin Chang from FireEye Labs for providing the DTI statistics.
  • Research endpoint support from Caleb Madrigal.
  • The FireEye ICE-DS Team.

Neural networks and deep learning

Deep learning refers to a family of machine learning algorithms that can be used for supervised, unsupervised and reinforcement learning. 

These algorithms are becoming popular after many years in the wilderness. The name comes from the realization that the addition of increasing numbers of layers typically in a neural network enables a model to learn increasingly complex representations of the data.

Reverse Engineering the Analyst: Building Machine Learning Models for the SOC

Many cyber incidents can be traced back to an original alert that was either missed or ignored by the Security Operations Center (SOC) or Incident Response (IR) team. While most analysts and SOCs are vigilant and responsive, the fact is they are often overwhelmed with alerts. If a SOC is unable to review all the alerts it generates, then sooner or later, something important will slip through the cracks.

The core issue here is scalability. It is far easier to create more alerts than to create more analysts, and the cyber security industry is far better at alert generation than resolution. More intel feeds, more tools, and more visibility all add to the flood of alerts. There are things that SOCs can and should do to manage this flood, such as increasing automation of forensic tasks (pulling PCAP and acquiring files, for example) and using aggregation filters to group alerts into similar batches. These are effective strategies and will help reduce the number of required actions a SOC analyst must take. However, the decisions the SOC makes still form a critical bottleneck. This is the “Analyze/ Decide” block in Figure 1.


Figure 1: Basic SOC triage stages

In this blog post, we propose machine learning based strategies to help mitigate this bottleneck and take back control of the SOC. We have implemented these strategies in our FireEye Managed Defense SOC, and our analysts are taking advantage of this approach within their alert triaging workflow. In the following sections, we will describe our process to collect data, capture alert analysis, create a model, and build an efficacy workflow – all with the ultimate goal of automating alert triage and freeing up analyst time.

Reverse Engineering the Analyst

Every alert that comes into a SOC environment contains certain bits of information that an analyst uses to determine if the alert represents malicious activity. Often, there are well-paved analytical processes and pathways used when evaluating these forensic artifacts over time. We wanted to explore if, in an effort to truly scale our SOC operations, we could extract these analytical pathways, train a machine to traverse them, and potentially discover new ones.

Think of a SOC as a self-contained machine that inputs unlabeled alerts and outputs the alerts labeled as “malicious” or “benign”. How can we capture the analysis and determine that something is indeed malicious, and then recreate that analysis at scale? In other words, what if we could train a machine to make the same analytical decisions as an analyst, within an acceptable level of confidence?

Basic Supervised Model Process

The data science term for this is a “Supervised Classification Model”. It is “supervised” in the sense that it learns by being shown data already labeled as benign or malicious, and it is a “classification model” in the sense that once it has been trained, we want it to look at a new piece of data and make a decision between one of several discrete outcomes. In our case, we only want it to decide between two “classes” of alerts: malicious and benign.

In order to begin creating such a model, a dataset must be collected. This dataset forms the “experience” of the model, and is the information we will use to “train” the model to make decisions. In order to supervise the model, each unit of data must be labeled as either malicious or benign, so that the model can evaluate each observation and begin to figure out what makes something malicious versus what makes it benign. Typically, collecting a clean, labeled dataset is one of the hardest parts of the supervised model pipeline; however, in the case of our SOC, our analysts are constantly triaging (or “labeling”) thousands of alerts every week, and so we were lucky to have an abundance of clean, standardized, labeled alerts.

Once a labeled dataset has been defined, the next step is to define “features” that can be used to portray the information resident in each alert. A “feature” can be thought of as an aspect of a bit of information. For example, if the information is represented as a string, a natural “feature” could be the length of the string. The central idea behind building features for our alert classification model was to find a way to represent and record all the aspects that an analyst might consider when making a decision.

Building the model then requires choosing a model structure to use, and training the model on a subset of the total data available. The larger and more diverse the training data set, generally the better the model will perform. The remaining data is used as a “test set” to see if the trained model is indeed effective. Holding out this test set ensures the model is evaluated on samples it has never seen before, but for which the true labels are known.

Finally, it is critical to ensure there is a way to evaluate the efficacy of the model over time, as well as to investigate mistakes so that appropriate adjustments can be made. Without a plan and a pipeline to evaluate and retrain, the model will almost certainly decay in performance.

Feature Engineering

Before creating any of our own models, we interviewed experienced analysts and documented the information they typically evaluate before making a decision on an alert. Those interviews formed the basis of our feature extraction. For example, when an analyst says that reviewing an alert is “easy”, we ask: “Why? And what helps you make that decision?” It is this reverse engineering of sorts that gives insight into features and models we can use to capture analysis.

For example, consider a process execution event. An alert on a potentially malicious process execution may contain the following fields:

  • Process Path
  • Process MD5
  • Parent Process
  • Process Command Arguments

While this may initially seem like a limited feature space, there is a lot of useful information that one can extract from these fields.

Beginning with the process path of, say, “C:\windows\temp\m.exe”, an analyst can immediately see some features:

  • The process resides in a temporary folder: C:\windows\temp\
  • The process is two directories deep in the file system
  • The process executable name is one character long
  • The process has an .exe extension
  • The process is not a “common” process name

While these may seem simple, over a vast amount of data and examples, extracting these bits of information will help the model to differentiate between events. Even the most basic aspects of an artifact must be captured in order to “teach” the model to view processes the way an analyst does.

The features are then encoded into a more discrete representation, similar to this:

Temp_folder

Depth

Name_Length

Extension

common_process_name

TRUE

2

1

exe

FALSE

Another important feature to consider about a process execution event is the combination of parent process and child process. Deviation from expected “lineage” can be a strong indicator of malicious activity.

Say the parent process of the aforementioned example was ‘powershell.exe’. Potential new features could then be derived from the concatenation of the parent process and the process itself: ‘powershell.exe_m.exe’. This functionally serves as an identity for the parent-child relation and captures another key analysis artifact.

The richest field, however, is probably the process arguments. Process arguments are their own sort of language, and language analysis is a well-tread space in predictive analytics.

We can look for things including, but not limited to:

  • Network connection strings (such as ‘http://’, ‘https://’, ‘ftp://’).
  • Base64 encoded commands
  • Reference to Registry Keys (‘HKLM’, ‘HKCU’)
  • Evidence of obfuscation (ticks, $, semicolons) (read Daniel Bohannon’s work for more)

The way these features and their values appear in a training dataset will define the way the model learns. Based on the distribution of features across thousands of alerts, relationships will start to emerge between features and labels. These relationships will then be recorded in our model, and ultimately used to influence the predictions for new alerts. Looking at distributions of features in the training set can give insight into some of these potential relationships.

For example, Figure 2 shows how the distribution of Process Command Length may appear when grouping by malicious (red) and benign (blue).


Figure 2: Distribution of Process Event alerts grouped by Process Command Length

This graph shows that over a subset of samples, the longer the command length, the more likely it is to be malicious. This manifests as red on the right and blue on the left. However, process length is not the only factor.

As part of our feature set, we also thought it would be useful to approximate the “complexity” of each command. For this, we used “Shannon entropy”, a commonly used metric that measures the degree of randomness present in a string of characters.

Figure 3 shows a distribution of command entropy, broken out into malicious and benign. While the classes do not separate entirely, we can see that for this sample of data, samples with higher entropy generally have a higher chance of being malicious.


Figure 3: Distribution of Process Event alerts grouped by entropy

Model Selection and Generalization

Once features have been generated for the whole dataset, it is time to use them to train a model. There is no perfect procedure for picking the best model, but looking at the type of features in our data can help narrow it down. In the case of a process event, we have a combination of features represented as strings and numbers. When an analyst evaluates each artifact, they ask questions about each of these features, and combine the answers to estimate the probability that the process is malicious.

For our use case, it also made sense to prioritize an ‘interpretable’ model – that is, one that can more easily expose why it made a certain decision about an artifact. This way analysts can build confidence in the model, as well as detect and fix analytical mistakes that the model is making. Given the nature of the data, the decisions analysts make, and the desire for interpretability, we felt that a decision tree-based model would be well-suited for alert classification.

There are many publicly available resources to learn about decision trees, but the basic intuition behind a decision tree is that it is an iterative process, asking a series of questions to try to arrive at a highly confident answer. Anyone who has played the game “Twenty Questions” is familiar with this concept. Initially, general questions are asked to help eliminate possibilities, and then more specific questions are asked to narrow down the possibilities. After enough questions are asked and answered, the ‘questioner’ feels they have a high probability of guessing the right answer.

Figure 4 shows an example of a decision tree that one might use to evaluate process executions.


Figure 4: Decision tree for deciding whether an alert is benign or malicious

For the example alert in the diagram, the “decision path” is marked in red. This is how this decision tree model makes a prediction. It first asks: “Is the length greater than 100 characters?” If so, it moves to the next question “Does it contain the string ‘http’?” and so on until it feels confident in making an educated guess. In the example in Figure 4, given that 95 percent of all the training alerts traveling this decision path were malicious, the model predicts a 95 percent chance that this alert will also be malicious.

Because they can ask such detailed combinations of questions, it is possible that decision trees can “overfit”, or learn rules that are too closely tied to the training set. This reduces the model’s ability to “generalize” to new data. One way to mitigate this effect is to use many slightly different decision trees and have them each “vote” on the outcome. This “ensemble” of decision trees is called a Random Forest, and it can improve performance for the model when deployed in the wild. This is the algorithm we ultimately chose for our model.

How the SOC Alert Model Works

When a new alert appears, the data in the artifact is transformed into a vector of the encoded features, with the same structure as the feature representations used to train the model. The model then evaluates this “feature vector” and applies a confidence level for the predicted label. Based on thresholds we set, we can then classify the alert as malicious or benign.


Figure 5: An alert presented to the analyst with its raw values captured

As an example, the event shown in Figure 5 might create the following feature values:

  • Parent Process: ‘wscript’
  • Command Entropy: 5.08
  • Command Length =103

Based on how they were trained, the trees in the model each ask a series of questions of the new feature vector. As the feature vector traverses each tree, it eventually converges on a terminal “leaf” classifying it as either benign or malicious. We can then evaluate the aggregated decisions made by each tree to estimate which features in the vector played the largest role in the ultimate classification.

For the analysts in the SOC, we then present the features extracted from the model, showing the distribution of those features over the entire dataset. This gives the analysts insight into “why” the model thought what it thought, and how those features are represented across all alerts we have seen. For example, the “explanation” for this alert might look like:

  • Command Entropy = 5.08 > 4.60:  51.73% Threat
  • occuranceOfChar “\”= 9.00 > 4.50:  64.09% Threat
  • occuranceOfChar:“)” (=0.00) <= 0.50: 78.69% Threat
  • NOT processTree=”cmd.exe_to_cscript.exe”: 99.6% Threat

Thus, at the time of analysis, the analysts can see the raw data of the event, the prediction from the model, an approximation of the decision path, and a simplified, interpretable view of the overall feature importance.

How the SOC Uses the Model

Showing the features the model used to reach the conclusion allows experienced analysts to compare their approach with the model, and give feedback if the model is doing something wrong. Conversely, a new analyst may learn to look at features they may have otherwise missed: the parent-child relationship, signs of obfuscation, or network connection strings in the arguments. After all, the model has learned on the collective experience of every analyst over thousands of alerts. Therefore, the model provides an actionable reflection of the aggregate analyst experience back to the SOC, so that each analyst can transitively learn from their colleagues.

Additionally, it is possible to write rules using the output of the model as a parameter. If the model is particularly confident on a subset of alerts, and the SOC feels comfortable automatically classifying that family of threats, it is possible to simply write a rule to say: “If the alert is of this type, AND for this malware family, AND the model confidence is above 99, automatically call this alert bad and generate a report.” Or, if there is a storm of probable false positives, one could write a rule to cull the herd of false positives using a model score below 10.

How the Model Stays Effective

The day the model is trained, it stops learning. However, threats – and therefore alerts – are constantly evolving. Thus, it is imperative to continually retrain the model with new alert data to ensure it continues to learn from changes in the environment.

Additionally, it is critical to monitor the overall efficacy of the model over time. Building an efficacy analysis pipeline to compare model results against analyst feedback will help identify if the model is beginning to drift or develop structural biases. Evaluating and incorporating analyst feedback is also critical to identify and address specific misclassifications, and discover potential new features that may be necessary.

To accomplish these goals, we run a background job that updates our training database with newly labeled events. As we get more and more alerts, we periodically retrain our model with the new observations. If we encounter issues with accuracy, we diagnose and work to address them. Once we are satisfied with the overall accuracy score of our retrained model, we store the model object and begin using that model version.

We also provide a feedback mechanism for analysts to record when the model is wrong. An analyst can look at the label provided by the model and the explanation, but can also make their own decision. Whether they agree with the model or not, they can input their own label through the interface. We store this label provided by the analyst along with any optional explanation given by them regarding the explanation.

Finally, it should be noted that these manual labels may require further evaluation. As an example, consider a commodity malware alert, in which network command and control communications were sinkholed. An analyst may evaluate the alert, pull back triage details, including PCAP samples, and see that while the malware executed, the true threat to the environment was mitigated. Since it does not represent an exigent threat, the analyst may mark this alert as ‘benign’. However, the fact that it was sinkholed does not change that the artifacts of execution still represent malicious activity. Under different circumstances, this infection could have had a negative impact on the organization. However, if the benign label is used when retraining the model, that will teach the model that something inherently malicious is in fact benign, and potentially lead to false negatives in the future.

Monitoring efficacy over time, updating and retraining the model with new alerts, and evaluating manual analyst feedback gives us visibility into how the model is performing and learning over time. Ultimately this helps to build confidence in the model, so we can automate more tasks and free up analyst time to perform tasks such as hunting and investigation.

Conclusion

A supervised learning model is not a replacement for an experienced analyst. However, incorporating predictive analytics and machine learning into the SOC workflow can help augment the productivity of analysts, free up time, and ensure they utilize investigative skills and creativity on the threats that truly require expertise.

This blog post outlines the major components and considerations of building an alert classification model for the SOC. Data collection, labeling, feature generation, model training, and efficacy analysis must all be carefully considered when building such a model. FireEye continues to iterate on this research to improve our detection and response capabilities, continually improve the detection efficacy of our products, and ultimately protect our clients.

The process and examples shown discussed in this post are not mere research. Within our FireEye Managed Defense SOC, we use alert classification models built using the aforementioned processes to increase our efficiency and ensure we apply our analysts’ expertise where it is needed most. In a world of ever increasing threats and alerts, increasing SOC efficiency may mean the difference between missing and catching a critical intrusion.

Acknowledgements

A big thank you to Seth Summersett and Clara Brooks.

***

The FireEye ICE Data Science Team is a small, highly trained team of data scientists and engineers, focused on delivering impactful capabilities to our analysts, products, and customers. ICE-DS is always looking for exceptional candidates interested in researching and solving difficult problems in cybersecurity. If you’re interested, check out  FireEye careers.

NLP Analysis Of Tweets Using Word2Vec And T-SNE

In the context of some of the Twitter research I’ve been doing, I decided to try out a few natural language processing (NLP) techniques. So far, word2vec has produced perhaps the most meaningful results. Wikipedia describes word2vec very precisely:

“Word2vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space. Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located in close proximity to one another in the space.”

During the two weeks leading up to the  January 2018 Finnish presidential elections, I performed an analysis of user interactions and behavior on Twitter, based on search terms relevant to that event. During the course of that analysis, I also dumped each Tweet’s raw text field to a text file, one item per line. I then wrote a small tool designed to preprocess the collected Tweets, feed that processed data into word2vec, and finally output some visualizations. Since word2vec creates multidimensional tensors, I’m using T-SNE for dimensionality reduction (the resulting visualizations are in two dimensions, compared to the 200 dimensions of the original data.)

The rest of this blog post will be devoted to listing and explaining the code used to perform these tasks. I’ll present the code as it appears in the tool. The code starts with a set of functions that perform processing and visualization tasks. The main routine at the end wraps everything up by calling each routine sequentially, passing artifacts from the previous step to the next one. As such, you can copy-paste each section of code into an editor, save the resulting file, and the tool should run (assuming you’ve pip installed all dependencies.) Note that I’m using two spaces per indent purely to allow the code to format neatly in this blog. Let’s start, as always, with importing dependencies. Off the top of my head, you’ll probably want to install tensorflow, gensim, six, numpy, matplotlib, and sklearn (although I think some of these install as part of tensorflow’s installation).

# -*- coding: utf-8 -*-
from tensorflow.contrib.tensorboard.plugins import projector
from sklearn.manifold import TSNE
from collections import Counter
from six.moves import cPickle
import gensim.models.word2vec as w2v
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
import multiprocessing
import os
import sys
import io
import re
import json

The next listing contains a few helper functions. In each processing step, I like to save the output. I do this for two reasons. Firstly, depending on the size of your raw data, each step can take some time. Hence, if you’ve performed the step once, and saved the output, it can be loaded from disk to save time on subsequent passes. The second reason for saving each step is so that you can examine the output to check that it looks like what you want. The try_load_or_process() function attempts to load the previously saved output from a function. If it doesn’t exist, it runs the function and then saves the output. Note also the rather odd looking implementation in save_json(). This is a workaround for the fact that json.dump() errors out on certain non-ascii characters when paired with io.open().

def try_load_or_process(filename, processor_fn, function_arg):
  load_fn = None
  save_fn = None
  if filename.endswith("json"):
    load_fn = load_json
    save_fn = save_json
  else:
    load_fn = load_bin
    save_fn = save_bin
  if os.path.exists(filename):
    return load_fn(filename)
  else:
    ret = processor_fn(function_arg)
    save_fn(ret, filename)
    return ret

def print_progress(current, maximum):
  sys.stdout.write("\r")
  sys.stdout.flush()
  sys.stdout.write(str(current) + "/" + str(maximum))
  sys.stdout.flush()

def save_bin(item, filename):
  with open(filename, "wb") as f:
    cPickle.dump(item, f)

def load_bin(filename):
  if os.path.exists(filename):
    with open(filename, "rb") as f:
      return cPickle.load(f)

def save_json(variable, filename):
  with io.open(filename, "w", encoding="utf-8") as f:
    f.write(unicode(json.dumps(variable, indent=4, ensure_ascii=False)))

def load_json(filename):
  ret = None
  if os.path.exists(filename):
    try:
      with io.open(filename, "r", encoding="utf-8") as f:
        ret = json.load(f)
    except:
      pass
  return ret

Moving on, let’s look at the first preprocessing step. This function takes the raw text strings dumped from Tweets, removes unwanted characters and features (such as user names and URLs), removes duplicates, and returns a list of sanitized strings. Here, I’m not using string.printable for a list of characters to keep, since Finnish includes additional letters that aren’t part of the english alphabet (äöåÄÖÅ). The regular expressions used in this step have been somewhat tailored for the raw input data. Hence, you may need to tweak them for your own input corpus.

def process_raw_data(input_file):
  valid = u"0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ#@.:/ äöåÄÖÅ"
  url_match = "(https?:\/\/[0-9a-zA-Z\-\_]+\.[\-\_0-9a-zA-Z]+\.?[0-9a-zA-Z\-\_]*\/?.*)"
  name_match = "\@[\_0-9a-zA-Z]+\:?"
  lines = []
  print("Loading raw data from: " + input_file)
  if os.path.exists(input_file):
    with io.open(input_file, 'r', encoding="utf-8") as f:
      lines = f.readlines()
  num_lines = len(lines)
  ret = []
  for count, text in enumerate(lines):
    if count % 50 == 0:
      print_progress(count, num_lines)
    text = re.sub(url_match, u"", text)
    text = re.sub(name_match, u"", text)
    text = re.sub("\&amp\;?", u"", text)
    text = re.sub("[\:\.]{1,}$", u"", text)
    text = re.sub("^RT\:?", u"", text)
    text = u''.join(x for x in text if x in valid)
    text = text.strip()
    if len(text.split()) > 5:
      if text not in ret:
        ret.append(text)
  return ret

The next step is to tokenize each sentence (or Tweet) into words.

def tokenize_sentences(sentences):
  ret = []
  max_s = len(sentences)
  print("Got " + str(max_s) + " sentences.")
  for count, s in enumerate(sentences):
    tokens = []
    words = re.split(r'(\s+)', s)
    if len(words) > 0:
      for w in words:
        if w is not None:
          w = w.strip()
          w = w.lower()
          if w.isspace() or w == "\n" or w == "\r":
            w = None
          if len(w) < 1:
            w = None
          if w is not None:
            tokens.append(w)
    if len(tokens) > 0:
      ret.append(tokens)
    if count % 50 == 0:
      print_progress(count, max_s)
  return ret

The final text preprocessing step removes unwanted tokens. This includes numeric data and stop words. Stop words are the most common words in a language. We omit them from processing in order to bring out the meaning of the text in our analysis. I downloaded a json dump of stop words for all languages from here, and placed it in the same directory as this script. If you plan on trying this code out yourself, you’ll need to perform the same steps. Note that I included extra stopwords of my own. After looking at the output of this step, I noticed that Twitter’s truncation of some tweets caused certain word fragments to occur frequently.

def clean_sentences(tokens):
  all_stopwords = load_json("stopwords-iso.json")
  extra_stopwords = ["ssä", "lle", "h.", "oo", "on", "muk", "kov", "km", "ia", "täm", "sy", "but", ":sta", "hi", "py", "xd", "rr", "x:", "smg", "kum", "uut", "kho", "k", "04n", "vtt", "htt", "väy", "kin", "#8", "van", "tii", "lt3", "g", "ko", "ett", "mys", "tnn", "hyv", "tm", "mit", "tss", "siit", "pit", "viel", "sit", "n", "saa", "tll", "eik", "nin", "nii", "t", "tmn", "lsn", "j", "miss", "pivn", "yhn", "mik", "tn", "tt", "sek", "lis", "mist", "tehd", "sai", "l", "thn", "mm", "k", "ku", "s", "hn", "nit", "s", "no", "m", "ky", "tst", "mut", "nm", "y", "lpi", "siin", "a", "in", "ehk", "h", "e", "piv", "oy", "p", "yh", "sill", "min", "o", "va", "el", "tyn", "na", "the", "tit", "to", "iti", "tehdn", "tlt", "ois", ":", "v", "?", "!", "&"]
  stopwords = None
  if all_stopwords is not None:
    stopwords = all_stopwords["fi"]
    stopwords += extra_stopwords
  ret = []
  max_s = len(tokens)
  for count, sentence in enumerate(tokens):
    if count % 50 == 0:
      print_progress(count, max_s)
    cleaned = []
    for token in sentence:
      if len(token) > 0:
        if stopwords is not None:
          for s in stopwords:
            if token == s:
              token = None
        if token is not None:
            if re.search("^[0-9\.\-\s\/]+$", token):
              token = None
        if token is not None:
            cleaned.append(token)
    if len(cleaned) > 0:
      ret.append(cleaned)
  return ret

The next function creates a vocabulary from the processed text. A vocabulary, in this context, is basically a list of all unique tokens in the data. This function creates a frequency distribution of all tokens (words) by counting the number of occurrences of each token. We will use this later to “trim” the vocabulary down to a manageable size.

def get_word_frequencies(corpus):
  frequencies = Counter()
  for sentence in corpus:
    for word in sentence:
      frequencies[word] += 1
  freq = frequencies.most_common()
  return freq

Now we’re done with all preprocessing steps, let’s get into the more interesting analysis functions. The following function accepts the tokenized and cleaned data generated from the steps above, and uses it to train a word2vec model. The num_features parameter sets the number of features each word is assigned (and hence the dimensionality of the resulting tensor.) It is recommended to set it between 100 and 1000. Naturally, larger values take more processing power and memory/disk space to handle. I found 200 to be enough, but I normally start with a value of 300 when looking at new datasets. The min_count variable passed to word2vec designates how to trim the vocabulary. For example, if min_count is set to 3, all words that appear in the data set less than 3 times will be discarded from the vocabulary used when training the word2vec model. In the dimensionality reduction step we perform later, large vocabulary sizes cause T-SNE iterations to take a long time. Hence, I tuned min_count to generate a vocabulary of around 10,000 words. Increasing the value of sample, will cause word2vec to randomly omit words with high frequency counts. I decided that I wanted to keep all of those words in my analysis, so it’s set to zero. Increasing epoch_count will cause word2vec to train for more iterations, which will, naturally take longer. Increase this if you have a fast machine or plenty of time on your hands 🙂

def get_word2vec(sentences):
  num_workers = multiprocessing.cpu_count()
  num_features = 200
  epoch_count = 10
  sentence_count = len(sentences)
  w2v_file = os.path.join(save_dir, "word_vectors.w2v")
  word2vec = None
  if os.path.exists(w2v_file):
    print("w2v model loaded from " + w2v_file)
    word2vec = w2v.Word2Vec.load(w2v_file)
  else:
    word2vec = w2v.Word2Vec(sg=1,
                            seed=1,
                            workers=num_workers,
                            size=num_features,
                            min_count=min_frequency_val,
                            window=5,
                            sample=0)

    print("Building vocab...")
    word2vec.build_vocab(sentences)
    print("Word2Vec vocabulary length:", len(word2vec.wv.vocab))
    print("Training...")
    word2vec.train(sentences, total_examples=sentence_count, epochs=epoch_count)
    print("Saving model...")
    word2vec.save(w2v_file)
  return word2vec

Tensorboard has some good tools to visualize word embeddings in the word2vec model we just created. These visualizations can be accessed using the “projector” tab in the interface. Here’s code to create tensorboard embeddings:

def create_embeddings(word2vec):
  all_word_vectors_matrix = word2vec.wv.syn0
  num_words = len(all_word_vectors_matrix)
  vocab = word2vec.wv.vocab.keys()
  vocab_len = len(vocab)
  dim = word2vec.wv[vocab[0]].shape[0]
  embedding = np.empty((num_words, dim), dtype=np.float32)
  metadata = ""
  for i, word in enumerate(vocab):
    embedding[i] = word2vec.wv[word]
    metadata += word + "\n"
  metadata_file = os.path.join(save_dir, "metadata.tsv")
  with io.open(metadata_file, "w", encoding="utf-8") as f:
    f.write(metadata)

  tf.reset_default_graph()
  sess = tf.InteractiveSession()
  X = tf.Variable([0.0], name='embedding')
  place = tf.placeholder(tf.float32, shape=embedding.shape)
  set_x = tf.assign(X, place, validate_shape=False)
  sess.run(tf.global_variables_initializer())
  sess.run(set_x, feed_dict={place: embedding})

  summary_writer = tf.summary.FileWriter(save_dir, sess.graph)
  config = projector.ProjectorConfig()
  embedding_conf = config.embeddings.add()
  embedding_conf.tensor_name = 'embedding:0'
  embedding_conf.metadata_path = 'metadata.tsv'
  projector.visualize_embeddings(summary_writer, config)

  save_file = os.path.join(save_dir, "model.ckpt")
  print("Saving session...")
  saver = tf.train.Saver()
  saver.save(sess, save_file)

Once this code has been run, tensorflow log entries will be created in save_dir. To start a tensorboard session, run the following command from the same directory where this script was run from:

tensorboard –logdir=save_dir

You should see output like the following once you’ve run the above command:

TensorBoard 0.4.0rc3 at http://node.local:6006 (Press CTRL+C to quit)

Navigate your web browser to localhost:<port_number> to see the interface. From the “Inactive” pulldown menu, select “Projector”.

tensorboard projector menu item

The “projector” menu is often hiding under the “inactive” pulldown.

Once you’ve selected “projector”, you should see a view like this:

Tensorboard's projector view

Tensorboard’s projector view allows you to interact with word embeddings, search for words, and even run t-sne on the dataset.

There are a lot of things to play around with in this view. You can search for words, fly around the embeddings, and even run t-sne (on the bottom left) on the dataset. If you get to this step, have fun playing with the interface!

And now, back to the code. One of word2vec’s most interesting functions is to find similarities between words. This is done via the word2vec.wv.most_similar() call. The following function calls word2vec.wv.most_similar() for a word and returns num-similar words. The returned value is a list containing the queried word, and a list of similar words. ( [queried_word, [similar_word1, similar_word2, …]] ).

def most_similar(input_word, num_similar):
  sim = word2vec.wv.most_similar(input_word, topn=num_similar)
  output = []
  found = []
  for item in sim:
    w, n = item
    found.append(w)
  output = [input_word, found]
  return output

The following function takes a list of words to be queried, passes them to the above function, saves the output, and also passes the queried words to t_sne_scatterplot(), which we’ll show later. It also writes a csv file – associations.csv – which can be imported into Gephi to generate graphing visualizations. You can see some Gephi-generated visualizations in the accompanying blog post.

I find that manually viewing the word2vec_test.json file generated by this function is a good way to read the list of similarities found for each word queried with wv.most_similar().

def test_word2vec(test_words):
  vocab = word2vec.wv.vocab.keys()
  vocab_len = len(vocab)
  output = []
  associations = {}
  test_items = test_words
  for count, word in enumerate(test_items):
    if word in vocab:
      print("[" + str(count+1) + "] Testing: " + word)
      if word not in associations:
        associations[word] = []
      similar = most_similar(word, num_similar)
      t_sne_scatterplot(word)
      output.append(similar)
      for s in similar[1]:
        if s not in associations[word]:
          associations[word].append(s)
    else:
      print("Word " + word + " not in vocab")
  filename = os.path.join(save_dir, "word2vec_test.json")
  save_json(output, filename)
  filename = os.path.join(save_dir, "associations.json")
  save_json(associations, filename)
  filename = os.path.join(save_dir, "associations.csv")
  handle = io.open(filename, "w", encoding="utf-8")
  handle.write(u"Source,Target\n")
  for w, sim in associations.iteritems():
    for s in sim:
      handle.write(w + u"," + s + u"\n")
  return output

The next function implements standalone code for creating a scatterplot from the output of T-SNE on a set of data points obtained from a word2vec.wv.most_similar() query. The scatterplot is visualized with matplotlib. Unfortunately, my matplotlib skills leave a lot to be desired, and these graphs don’t look great. But they’re readable.

def t_sne_scatterplot(word):
  vocab = word2vec.wv.vocab.keys()
  vocab_len = len(vocab)
  dim0 = word2vec.wv[vocab[0]].shape[0]
  arr = np.empty((0, dim0), dtype='f')
  w_labels = [word]
  nearby = word2vec.wv.similar_by_word(word, topn=num_similar)
  arr = np.append(arr, np.array([word2vec[word]]), axis=0)
  for n in nearby:
    w_vec = word2vec[n[0]]
    w_labels.append(n[0])
    arr = np.append(arr, np.array([w_vec]), axis=0)

  tsne = TSNE(n_components=2, random_state=1)
  np.set_printoptions(suppress=True)
  Y = tsne.fit_transform(arr)
  x_coords = Y[:, 0]
  y_coords = Y[:, 1]

  plt.rc("font", size=16)
  plt.figure(figsize=(16, 12), dpi=80)
  plt.scatter(x_coords[0], y_coords[0], s=800, marker="o", color="blue")
  plt.scatter(x_coords[1:], y_coords[1:], s=200, marker="o", color="red")

  for label, x, y in zip(w_labels, x_coords, y_coords):
    plt.annotate(label.upper(), xy=(x, y), xytext=(0, 0), textcoords='offset points')
  plt.xlim(x_coords.min()-50, x_coords.max()+50)
  plt.ylim(y_coords.min()-50, y_coords.max()+50)
  filename = os.path.join(plot_dir, word + "_tsne.png")
  plt.savefig(filename)
  plt.close()

In order to create a scatterplot of the entire vocabulary, we need to perform T-SNE over that whole dataset. This can be a rather time-consuming operation. The next function performs that operation, attempting to save and re-load intermediate steps (since some of them can take over 30 minutes to complete).

def calculate_t_sne():
  vocab = word2vec.wv.vocab.keys()
  vocab_len = len(vocab)
  arr = np.empty((0, dim0), dtype='f')
  labels = []
  vectors_file = os.path.join(save_dir, "vocab_vectors.npy")
  labels_file = os.path.join(save_dir, "labels.json")
  if os.path.exists(vectors_file) and os.path.exists(labels_file):
    print("Loading pre-saved vectors from disk")
    arr = load_bin(vectors_file)
    labels = load_json(labels_file)
  else:
    print("Creating an array of vectors for each word in the vocab")
    for count, word in enumerate(vocab):
      if count % 50 == 0:
        print_progress(count, vocab_len)
      w_vec = word2vec[word]
      labels.append(word)
      arr = np.append(arr, np.array([w_vec]), axis=0)
    save_bin(arr, vectors_file)
    save_json(labels, labels_file)

  x_coords = None
  y_coords = None
  x_c_filename = os.path.join(save_dir, "x_coords.npy")
  y_c_filename = os.path.join(save_dir, "y_coords.npy")
  if os.path.exists(x_c_filename) and os.path.exists(y_c_filename):
    print("Reading pre-calculated coords from disk")
    x_coords = load_bin(x_c_filename)
    y_coords = load_bin(y_c_filename)
  else:
    print("Computing T-SNE for array of length: " + str(len(arr)))
    tsne = TSNE(n_components=2, random_state=1, verbose=1)
    np.set_printoptions(suppress=True)
    Y = tsne.fit_transform(arr)
    x_coords = Y[:, 0]
    y_coords = Y[:, 1]
    print("Saving coords.")
    save_bin(x_coords, x_c_filename)
    save_bin(y_coords, y_c_filename)
 return x_coords, y_coords, labels, arr

The next function takes the data calculated in the above step, and data obtained from test_word2vec(), and plots the results from each word queried on the scatterplot of the entire vocabulary. These plots are useful for visualizing which words are closer to others, and where clusters commonly pop up. This is the last function before we get onto the main routine.

def show_cluster_locations(results, labels, x_coords, y_coords):
  for item in results:
    name = item[0]
    print("Plotting graph for " + name)
    similar = item[1]
    in_set_x = []
    in_set_y = []
    out_set_x = []
    out_set_y = []
    name_x = 0
    name_y = 0
    for count, word in enumerate(labels):
      xc = x_coords[count]
      yc = y_coords[count]
      if word == name:
        name_x = xc
        name_y = yc
      elif word in similar:
        in_set_x.append(xc)
        in_set_y.append(yc)
      else:
        out_set_x.append(xc)
        out_set_y.append(yc)
    plt.figure(figsize=(16, 12), dpi=80)
    plt.scatter(name_x, name_y, s=400, marker="o", c="blue")
    plt.scatter(in_set_x, in_set_y, s=80, marker="o", c="red")
    plt.scatter(out_set_x, out_set_y, s=8, marker=".", c="black")
    filename = os.path.join(big_plot_dir, name + "_tsne.png")
    plt.savefig(filename)
    plt.close()

Now let’s write our main routine, which will call all the above functions, process our collected Twitter data, and generate visualizations. The first few lines take care of our three preprocessing steps, and generation of a frequency distribution / vocabulary. The script expects the raw Twitter data to reside in a relative path (data/tweets.txt). Change those variables as needed. Also, all output is saved to a subdirectory in the relative path (analysis/). Again, tailor this to your needs.

if __name__ == '__main__':
  input_dir = "data"
  save_dir = "analysis"
  if not os.path.exists(save_dir):
    os.makedirs(save_dir)

  print("Preprocessing raw data")
  raw_input_file = os.path.join(input_dir, "tweets.txt")
  filename = os.path.join(save_dir, "data.json")
  processed = try_load_or_process(filename, process_raw_data, raw_input_file)
  print("Unique sentences: " + str(len(processed)))

  print("Tokenizing sentences")
  filename = os.path.join(save_dir, "tokens.json")
  tokens = try_load_or_process(filename, tokenize_sentences, processed)

  print("Cleaning tokens")
  filename = os.path.join(save_dir, "cleaned.json")
  cleaned = try_load_or_process(filename, clean_sentences, tokens)

  print("Getting word frequencies")
  filename = os.path.join(save_dir, "frequencies.json")
  frequencies = try_load_or_process(filename, get_word_frequencies, cleaned)
  vocab_size = len(frequencies)
  print("Unique words: " + str(vocab_size))

Next, I trim the vocabulary, and save the resulting list of words. This allows me to look over the trimmed list and ensure that the words I’m interested in survived the trimming operation. Due to the nature of the Finnish language, (and Twitter), the vocabulary of our “cleaned” set, prior to trimming, was over 100,000 unique words. After trimming it ended up at around 11,000 words.

  trimmed_vocab = []
  min_frequency_val = 6
  for item in frequencies:
    if item[1] >= min_frequency_val:
      trimmed_vocab.append(item[0])
  trimmed_vocab_size = len(trimmed_vocab)
  print("Trimmed vocab length: " + str(trimmed_vocab_size))
  filename = os.path.join(save_dir, "trimmed_vocab.json")
  save_json(trimmed_vocab, filename)

The next few lines do all the compute-intensive work. We’ll create a word2vec model with the cleaned token set, create tensorboard embeddings (for the visualizations mentioned above), and calculate T-SNE. Yes, this part can take a while to run, so go put the kettle on.

  print
  print("Instantiating word2vec model")
  word2vec = get_word2vec(cleaned)
  vocab = word2vec.wv.vocab.keys()
  vocab_len = len(vocab)
  print("word2vec vocab contains " + str(vocab_len) + " items.")
  dim0 = word2vec.wv[vocab[0]].shape[0]
  print("word2vec items have " + str(dim0) + " features.")

  print("Creating tensorboard embeddings")
  create_embeddings(word2vec)

  print("Calculating T-SNE for word2vec model")
  x_coords, y_coords, labels, arr = calculate_t_sne()

Finally, we’ll take the top 50 most frequent words from our frequency distrubution, query them for 40 most similar words, and plot both labelled graphs of each set, and a “big plot” of that set on the entire vocabulary.

  plot_dir = os.path.join(save_dir, "plots")
  if not os.path.exists(plot_dir):
    os.makedirs(plot_dir)

  num_similar = 40
  test_words = []
  for item in frequencies[:50]:
    test_words.append(item[0])
  results = test_word2vec(test_words)

  big_plot_dir = os.path.join(save_dir, "big_plots")
  if not os.path.exists(big_plot_dir):
    os.makedirs(big_plot_dir)
  show_cluster_locations(results, labels, x_coords, y_coords)

And that’s it! Rather a lot of code, but it does quite a few useful tasks. If you’re interested in seeing the visualizations I created using this tool against the Tweets collected from the January 2018 Finnish presidential elections, check out this blog post.