Category Archives: machine learning

The Ethical Hacker Network: Book Review: Malware Data Science

Neural network learning malware vs. benignware

“Malware Data Science: Attack Detection and Attribution” (MDS) is a book every information security professional should consider reading, given the rapid growth and variation of malware and the increasing reliance on data science to defend information systems. Known malware executables have grown from 1 million in 2008 to more than 700 million in 2018. Intrusion Detection Systems (IDS) are moving away from purely signature-based approaches as code packing, encryption, dynamic linking, and obfuscation push security toward tools that apply heuristic methods supported by data science. This article is a summary and a review, but my primary goal is to encourage you to read the book and complete the activities. If you do, I am sure your security toolkit will be better equipped.

Overview of Malware Data Science

MDS identifies Data Science as a growing set of algorithmic tools that allow us to understand and make predictions about data using statistics, mathematics, and artful statistical data visualizations. While these terms may imply a difficult read, authors Joshua Saxe (Chief Data Scientist at Sophos) and Hillary Sanders (Infrastructure Data Science Team Lead at Sophos) prepare the reader well, building upon key concepts with Python code examples and walking through the code to reinforce learning. At points they identify additional resources or refer back to prior chapters in a way that both supports the reader and encourages further study.

The code is downloadable from a site dedicated to MDS, and executing the code as you read helps to cement the concepts. I found working directly with the code itself to be surprisingly encouraging and even fun. Of course, some of the code is malware obtained from VirusTotal or Kaspersky Labs. That code is de-fanged with some flipped bits, but it should still be treated with due care inside VirtualBox. The book offers a provisioned VirtualBox image for download.

The post Book Review: Malware Data Science appeared first on The Ethical Hacker Network.



The Ethical Hacker Network

Key weapon for closing IoT-era cybersecurity gaps? Artificial intelligence

As businesses struggle to combat increasingly sophisticated cybersecurity attacks, the severity of which is exacerbated by both the vanishing IT perimeters in today’s mobile and IoT era, and an acute shortage of skilled

The post Key weapon for closing IoT-era cybersecurity gaps? Artificial intelligence appeared first on The Cyber Security Place.

How Digital Transformation can Save Cybersecurity

Traditional security controls, now too manual and slow to keep pace with digital business processes, customer experiences and workflows, are being relegated to the compliance toolbox. They will satisfy regulators,

The post How Digital Transformation can Save Cybersecurity appeared first on The Cyber Security Place.

Near and long-term directions for adversarial AI in cybersecurity

The frenetic pace at which artificial intelligence (AI) has advanced in the past few years has begun to have transformative effects across a wide variety of fields. Coupled with an increasingly (inter)-connected world in which cyberattacks occur with alarming frequency and scale, it is no wonder that the field of cybersecurity has now turned its eye to AI and machine learning (ML) in order to detect and defend against adversaries.

The use of AI in cybersecurity not only expands the scope of what a single security expert is able to monitor, but importantly, it also enables the discovery of attacks that would have otherwise been undetectable by a human. Just as it was nearly inevitable that AI would be used for defensive purposes, it is undeniable that AI systems will soon be put to use for attack purposes.

Legal AI: How Machine Learning Is Aiding — and Concerning — Law Practitioners

Law firms tasked with analyzing mounds of data and interpreting dense legal texts can vastly improve their efficiency by training artificial intelligence (AI) tools to complete this processing for them. While AI is making headlines in a wide range of industries, legal AI may not come to mind for many. But the technology, which is already prevalent in the manufacturing, cybersecurity, retail and healthcare sectors, is quickly becoming a must-have tool in the legal industry.

Due to the sheer volume of sensitive data belonging to both clients and firms themselves, legal organizations are in a prickly position when it comes to their responsibility to uphold data privacy. Legal professionals are still learning what the privacy threats are and how they intersect with data security regulations. For this reason, it’s critical to understand security best practices for operations involving AI.

Before tackling the cybersecurity implications, let’s explore some reasons why the legal industry is such a compelling use case for AI.

How Do Legal Organizations Use AI?

If you run a law firm, imagine how much more efficient you could be if you could train your software to recognize and predict patterns that not only improve client engagement, but also streamline the workflow of your legal team. Or what if that software could learn to delegate tasks to itself?

With some AI applications already on the market, this is only the beginning of what the technology can do. For example, contract analysis automation solutions can read contracts in seconds, highlight key information visually with easy-to-read graphs and charts, and get “smarter” with each contract reviewed. Other tools use AI to scan legal documents, case files and decisions to predict how courts will rule in tax decisions.

In fact, the use of AI in the legal industry has been around for years, according to Sherry Askin, CEO of Omni Software Systems. Askin has deep roots in the AI field, including work with IBM’s Watson.

“AI is all about increasing efficiency, and is being touted as the next revolution,” she said. “We’ve squeezed as much as we can from human productivity through automation. The next plateau from productivity and the next threshold is AI.”

Why Machine Learning Is Critical

Law is all about words: natural language, the coded form of unstructured information, said Askin. While we know how to handle the coded versions, she explained, the challenge with legal AI is that outputs are so tightly tailored to the past results described by their inputs. That’s where machine learning comes in, to predict how these inputs might change.

Askin compared machine learning to the process of intellectual development by which children soak up new words, paragraphs, long arguments, vocabulary and, most importantly, context. With deep learning, not only are you inputting data, but you’re giving the machine context and relevance.

“The machine is no longer a vessel of information,” Askin explained. “It figures out what to do with that information and it can predict things for you.”

Although machines can’t make decisions the same way that humans can, the more neural processing and training they conduct, the more sophisticated their learning and deliverables can become. Some legal AI tools can process and analyze thousands of lease agreements, doing in seconds what humans would do in weeks.

How Do Privacy Regulations Impact Legal Firms?

For any industry, protecting privileged client data is a paramount concern. The American Bar Association, which requires practitioners to employ reasonable efforts to prevent unauthorized access to client data, has implemented periodic changes and updates to address the advances of technology. In addition, the Legal Cloud Computing Association (LCCA) issued 21 standards to assist law firms and attorneys in addressing these needs, including testing, limitations on third-party access, data retention policy, encryption, end user authentication and modifications to data.

Askin urged legal organizations to evaluate strategies impacting security and privacy in the context of what they modify or replace.

“I believe this is a major factor in legal because the profession has a deep legacy of expert-led art,” she said. “Traditional IT automation solutions perform best with systematized process and structured data. Unfortunately, systematization and structure are not historically compatible with the practice of law or any other professional disciplines that rely on human intelligence and dynamic reasoning.”

How to Keep Legal AI Tools in the Right Hands

Legal organizations are tempting targets for malicious actors because they handle troves of sensitive and confidential information. Rod Soto, director of security research for Jask, recommended several key strategies: employ defense in depth principles at the infrastructure level, train personnel in security awareness and use AI to significantly enhance security posture overall. To protect automated operations conducted by AI, Soto warned, we must understand that while these AI systems are trained to be effective, they can also be steered off course.

“Malicious actors can and will approach AI learning models and will attempt to mistrain them, hence the importance of feedback loops and sanity checks from experienced analysts,” he said. “You cannot trust AI blindly.”

Finally, it’s crucial for legal organizations to understand that AI does not replace a trained analyst.

“AI is there to help the analyst in things that humans have limitations, such as processing very large amounts of alarms or going through thousands of events in a timely manner,” said Soto. “Ultimately, it is upon the trained analyst to make the call. An analyst should always exercise judgment based on his experience when using AI systems.”

Because the pressure to transform is industrywide, profound changes are taking shape to help security experts consistently identify the weakest link in the security chain: people.

“It’s nearly impossible to control all data and privacy risks where decentralized data and human-managed processes are prevalent,” Askin said. “The greater the number of endpoints, the higher the risk of breach. This is where the nature of AI can precipitate a reduction in security and privacy vulnerabilities, particularly where prior IT adoption or data protection practices were limited.”

The post Legal AI: How Machine Learning Is Aiding — and Concerning — Law Practitioners appeared first on Security Intelligence.

Small businesses targeted by highly localized Ursnif campaign

Cyber thieves are continuously looking for new ways to get people to click on a bad link, open a malicious file, or install a poisoned update in order to steal valuable data. In the past, they cast as wide a net as possible to increase the pool of potential victims. But attacks that create a lot of noise are often easier to spot and stop. Cyber thieves are catching on that we are watching them, so they are trying something different. Now we’re seeing a growing trend of small-scale, localized attacks that use specially crafted social engineering to stay under the radar and compromise more victims.

In social engineering attacks, is less really more?

A new malware campaign puts that to the test by targeting home users and small businesses in specific US cities. This was a focused, highly localized attack that aimed to steal sensitive info from just under 200 targets. Macro-laced documents masqueraded as statements from legitimate businesses and were distributed via email to intended victims in the cities where those businesses are located.

With Windows Defender AV’s next-gen defense, however, the size of the attack doesn’t really matter.

Several cloud-based machine learning algorithms detected and blocked the malicious documents at the onset, stopping the attack and protecting customers from what would have been the payload, info-stealing malware Ursnif.

The map below shows the location of the targets.

Figure 1. Geographic distribution of target victims

Highly localized social engineering attack

Here’s how the attack played out: Malicious, macro-enabled documents were delivered as email attachments to targeted small businesses and users. Each document had a file name that spoofed a legitimate business name and masqueraded as a statement from that business. In total, we saw 21 unique document file names used in this campaign.

The attackers sent these emails to intended victims in the city or general geographic area where the businesses are located. For example, the attachment named Dolan_Care_Statement.doc was sent almost exclusively to targets in Missouri. The document file name spoofs a known establishment in St. Louis. While we do not believe the establishment itself was affected or targeted by this attack, the document purports to be from the said establishment when it’s really not.

The intended effect is for recipients to believe they are receiving documents from local, familiar businesses or service providers. It’s part of the social engineering scheme to increase the likelihood that recipients will think the document is legitimate and take the bait, when in reality it is a malicious document.

Most common lure document file names and their top target cities:

  • Dockery_FloorCovering_Statement: Johnson City, TN; Kingsport, TN; Knoxville, TN
  • Dolan_Care_Statement: St. Louis, MO; Chesterfield, MO; Lees Summit, MO
  • DMS_Statement: Omaha, NE; Wynot, NE; Norwalk, OH
  • Dmo_Statement: New Braunfels, TX; Seguin, TX; San Antonio, TX
  • DJACC_Statement: Miami, FL; Flagler Beach, FL; Niles, MI
  • Donovan_Construction_Statement: Alexandria, VA; Mclean, VA; Manassas, VA

Table 1. Top target cities of most common document file names

When recipients open the document, they are shown a message that tricks the person into enabling the macro.

Figure 2. Document tricks victim into enabling the macro

As is typical in social engineering attacks, the message is not true. If the recipient does enable the macro, no content is shown. Instead, the following process is launched to deobfuscate a PowerShell command.

Figure 3. Process to deobfuscate PowerShell

Figure 4. PowerShell command

The PowerShell script connects to any of 12 different URLs that all deliver the payload.

Figure 5. Deobfuscated PowerShell command

The payload is Ursnif, info-stealing malware. When run, Ursnif steals information about infected devices, as well as sensitive information like passwords. Notably, this infection sequence (a cmd.exe process deobfuscates a PowerShell command, which in turn downloads the payload) is a common method used by other info-stealing malware like Emotet and Trickbot.
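As an aside, this chain (a cmd.exe process spawning PowerShell with an encoded or download-capable command line) is itself a useful hunting signal. Below is a minimal, illustrative Python sketch of such a check; the event field names, sample values, and keyword list are hypothetical stand-ins, not an actual product rule.

import re

# Encoded-command or download primitives commonly seen in this chain
SUSPICIOUS_ARGS = re.compile(
    r"(-enc(odedcommand)?\b|downloadstring|invoke-expression|\biex\b)",
    re.IGNORECASE,
)

def is_suspicious_chain(event):
    # Flag cmd.exe -> powershell.exe with risky command-line arguments
    parent = event.get("parent_process", "").lower()
    child = event.get("process", "").lower()
    args = event.get("command_line", "")
    return (parent.endswith("cmd.exe")
            and child.endswith("powershell.exe")
            and bool(SUSPICIOUS_ARGS.search(args)))

# Hypothetical event mirroring the infection sequence described above
event = {
    "parent_process": r"C:\Windows\System32\cmd.exe",
    "process": r"C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe",
    "command_line": "powershell -NoProfile -enc <base64>",
}
print(is_suspicious_chain(event))  # True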

How machine learning stopped this small-scale, localized attack

As the malware campaign got under way, four different cloud-based machine learning models gave the verdict that the documents were malicious. These four models are among a diverse set of models that help ensure we catch a wide range of new and emerging threats. Different models have different areas of expertise; they use different algorithms and are trained on their unique set of features.

One of the models that gave the malicious verdict is a generic model designed to detect non-portable executable (PE) threats. We have found that models like this are effective in catching social engineering attacks, which typically use non-PE files like scripts and, as is the case for this campaign, macro-laced documents.

The said non-PE model is a simple averaged perceptron algorithm that uses various features, including expert features, fuzzy hashes of various file sections, and contextual data. The simplicity of the model makes it fast, enabling it to give split-second verdicts before suspicious files can execute. Our analysis of this specific model showed that the expert features and fuzzy hashes had the biggest impact on the model’s verdict and the eventual blocking of the attack.
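For readers unfamiliar with the algorithm, here is a minimal Python sketch of an averaged perceptron, with random placeholder data standing in for the expert features and fuzzy hashes; it illustrates the technique, not the production model.

import numpy as np

class AveragedPerceptron:
    def __init__(self, n_features):
        self.w = np.zeros(n_features)       # current weight vector
        self.w_sum = np.zeros(n_features)   # running sum used for averaging
        self.n_seen = 0

    def fit(self, X, y, epochs=5):
        # y uses +1 for malicious, -1 for benign; mistakes drive additive updates
        for _ in range(epochs):
            for xi, yi in zip(X, y):
                if yi * (self.w @ xi) <= 0:
                    self.w += yi * xi
                self.w_sum += self.w
                self.n_seen += 1
        return self

    def decision(self, X):
        # Average of all intermediate weights: more stable than the final vector
        w_avg = self.w_sum / max(self.n_seen, 1)
        return X @ w_avg

# Toy demo on synthetic data (placeholder for real file features)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))
y = np.sign(X[:, 0] + 0.5 * X[:, 1])
model = AveragedPerceptron(16).fit(X, y)
print((np.sign(model.decision(X)) == y).mean())  # training accuracy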

Figure 6. Impact of features used by one ML model that detected the attack

Next-generation protection against malware campaigns regardless of size

Machine learning and artificial intelligence power Windows Defender Antivirus to detect and stop new and emerging attacks before they can wreak havoc. Every day, we protect customers from millions of distinct, first-seen malware. Our layered approach to intelligent, cloud-based protection employs a diverse set of machine learning models designed to catch the wide range of threats: from massive malware campaigns to small-scale, localized attacks.

The latter is a growing trend, and we continue to watch the threat landscape to keep machine learning effective against attacks. In a recent blog post, we discussed how we continue to harden machine learning defenses.

Windows Defender AV delivers next-gen protection capabilities in Windows Defender Advanced Threat Protection (Windows Defender ATP). Windows Defender ATP integrates attack surface reduction, next-gen protection, endpoint detection and response (EDR), automatic investigation and response, security posture, and advanced hunting capabilities.

Because of this integration, antivirus detections, such as those related to this campaign, are surfaced in Windows Defender Security Center. Using EDR capabilities, security operations teams can then investigate and respond to the incident. Attack surface reduction rules also block this campaign, and these detections are likewise surfaced in Windows Defender ATP. To test how Windows Defender ATP can help your organization detect, investigate, and respond to advanced attacks, sign up for a free trial.

Across the whole of Microsoft 365 threat protection, detections and other security signals are shared among Office 365 ATP, Windows Defender ATP, and Azure ATP. In this Ursnif campaign, the antivirus detection also enables the blocking of related emails in Office 365. This demonstrates how signal sharing and orchestration of remediation across solutions in Microsoft 365 results in better integrated threat protection.

Bhavna Soman
Windows Defender Research

Indicators of compromise (IOCs)

Infector:

Hashes
407a6c99581f428634f9d3b9ec4b79f79c29c79fdea5ea5e97ab3d280b2481a1
77bee1e5c383733efe9d79173ac1de83e8accabe0f2c2408ed3ffa561d46ffd7
e9426252473c88d6a6c5031fef610a803bce3090b868d9a29a38ce6fa5a4800a
f8de4ebcfb8aa7c7b84841efd9a5bcd0935c8c3ee8acf910b3f096a5e8039b1f

File names
CSC_Statement.doc
DBC_Statement.doc
DDG_Statement.doc
DJACC_Statement.doc
DKDS_Statement.doc
DMII_Statement.doc
dmo_statement.doc
DMS_Statement.doc
Dockery_Floorcovering_Statement.doc
Docktail_Bar_Statement.doc
doe_statement.doc
Dolan_Care_Statement.doc
Donovan_Construction_Statement.doc
Donovan_Engineering_Statement.doc
DSD_Statement.doc
dsh_statement.doc
realty_group_statement.doc
statement.doc
tri-lakes_motors_statement.doc
TSC_Statement.doc
UCP_Statement.doc

Payload (Ursnif)

Hashes
31835c6350177eff88265e81335a50fcbe0dc46771bf031c836947851dcebb4f
bd23a2eec4f94c07f4083455f022e4d58de0c2863fa6fa19d8f65bfe16fa19aa
75f31c9015e0f03f24808dca12dd90f4dfbbbd7e0a5626971c4056a07ea1b2b9
070d70d39f310d7b8842f645d3ba2d44b2f6a3d7347a95b3a47d34c8e955885d
15743d098267ce48e934ed0910bc299292754d02432ea775957c631170778d71

URLs
hxxp://vezopilan[.]com/tst/index[.]php?l=soho6[.]tkn
hxxp://cimoselin[.]com/tst/index[.]php?l=soho2[.]tkn
hxxp://cimoselin[.]com/tst/index[.]php?l=soho4[.]tkn
hxxp://vedoriska[.]com/tst/index[.]php?l=soho6[.]tkn
hxxp://baberonto[.]com/tst/index[.]php?l=soho3[.]tkn

hxxp://hertifical[.]com/tst/index[.]php?l=soho8[.]tkn
hxxp://hertifical[.]com/tst/index[.]php?l=soho6[.]tkn
hxxp://condizer[.]com/tst/index[.]php?l=soho1[.]tkn
hxxp://vezeronu[.]com/tst/index[.]php?l=soho2[.]tkn
hxxp://vezeronu[.]com/tst/index[.]php?l=soho5[.]tkn

hxxp://zedrevo[.]com/tst/index[.]php?l=soho8[.]tkn
hxxp://zedrevo[.]com/tst/index[.]php?l=soho10[.]tkn

*Note: The first four domains above are all registered in Russia and are hosted on the IP address 185[.]212[.]44[.]114. The other domains follow the same URL pattern and are also pushing Ursnif, but no registration info is available.

Talk to us

Questions, concerns, or insights on this story? Join discussions at the Microsoft community and Windows Defender Security Intelligence.

Follow us on Twitter @WDSecurity and Facebook Windows Defender Security Intelligence.

The post Small businesses targeted by highly localized Ursnif campaign appeared first on Microsoft Secure.

Top cloud mining trends of 2018: altcoins, ASIC chips and machine learning

Despite the recent downward trends, the cryptocurrency boom goes on. The most popular coins, such as Bitcoin and Ethereum, are widely accepted as payment. As the number of miners in these cryptocurrencies’ networks increases, the competition grows, driving up the difficulty of the hash calculations. It’s not an easy time for mining, and experts suggest that its future lies in the cloud.

Previously, cloud mining was simply an easier way to enter the crypto mining market without making large investments. Now, in some cases, it is the only way. Popular cryptocurrencies are becoming less and less profitable to mine independently, or are balancing on the edge of unprofitability, as in the case of Bitcoin. To mine successfully today, we have to mine effectively. This means using specialized equipment and optimizing the process by all available means.

What’s up, Bitcoin?

Bitcoin is the first and, to this day, the most sought-after coin. Just a few years ago you could mine BTC on a regular home PC, but now that is impossible. High demand set off a crazy equipment race, and in this kind of competition the big players win.

To stay in the game, today’s Bitcoin miners need to make large investments and perform constant power upgrades. Some take advantage of their location, but most solo miners have little flexibility as to where to build a rig.

With the emergence of application-specific integrated circuit (ASIC) chips, which are more efficient than GPUs, the stakes rose. The actual cost of solving a block is getting close to the value of the reward, which poses a problem even for cloud BTC mining.

This is why many individual miners turned to emerging or ASIC-resistant coins, which promise smaller rewards but also less competition. Still, with a proper approach, even Bitcoin mining remains profitable. The key is efficiency: increasing the power and lowering the overall cost of mining. The same applies to less popular coins, increasing the total revenue for miners.

How to get the highest ROI in cloud mining

Hashtoro.com is aimed at Bitcoin, Ethereum and Litecoin cloud mining. As these cryptocurrencies are in wide use now, the project was carefully created with all the aforementioned issues in mind. Here are the top trends the team has picked to develop a profitable cloud mining system for popular coins:

  • cutting electricity costs by using renewable energy,
  • using ASIC miners with immersion cooling,
  • using cutting-edge technology to optimize the mining process.

First, Hashtoro uses renewable sources of energy to power their equipment. The farms are located in European countries with access to cheaper, clean electricity. The excess heat will be used to heat water for local communities, which cuts the energy expenses even more.

Second, ASIC miners with immersion cooling are used for mining, and the team also plans to create its own ASIC chip in the near future to optimize the process even further. This allows Hashtoro to keep mining costs as low as possible and offer its clients cheaper contracts.

Third, due to the high volatility of the crypto market, it is hard to make any long-term forecasts about rate dynamics. At the beginning of this year Ethereum and Litecoin were leading the race, but when their prices dropped, Bitcoin came to the fore once again.

The project’s software is based on cutting-edge machine learning and neural network technologies. “The system determines which currency is more profitable to mine at a given time and dynamically switches to it. It also chooses the appropriate pool to make the highest profit. This way the miners will gain the most from the hashrate they buy,” comments Alexander Petersons, product director of Hashtoro.
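For intuition, here is a purely illustrative Python sketch of profit-based coin switching; the revenue formula and market numbers are invented for the example and are not Hashtoro’s actual model, which the company says uses machine learning.

# Simplified profit model: revenue per unit of hashrate scales with coin
# price and block reward, and inversely with network difficulty.
def expected_revenue(price_usd, block_reward, difficulty):
    return price_usd * block_reward / difficulty

# Hypothetical market snapshot (not real quotes)
coins = {
    "BTC": {"price_usd": 6500.0, "block_reward": 12.5, "difficulty": 5.9e12},
    "ETH": {"price_usd": 280.0, "block_reward": 3.0, "difficulty": 3.4e15},
    "LTC": {"price_usd": 55.0, "block_reward": 25.0, "difficulty": 1.1e7},
}

# Switch to whichever coin currently pays best
best = max(coins, key=lambda name: expected_revenue(**coins[name]))
print("Most profitable to mine right now:", best)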

At the moment, the joint market capitalization of cryptocurrencies is around 254 billion US dollars. That is roughly a third of the 830 billion maximum we saw in January, but we can be sure: cryptocurrencies are here to stay. Under certain conditions, the crypto market could enter a new stage of growth very soon. In the coming months we will see more cloud mining services appear. And with the right approach, regardless of all the challenges we face at the moment, mining will remain not just an interesting hobby, but also a good source of income.

The post Top cloud mining trends of 2018: altcoins, ASIC chips and machine learning appeared first on TechWorm.

Choosing an optimal algorithm for AI in cybersecurity

In the last blog post, we alluded to the No-Free-Lunch (NFL) theorems for search and optimization. While NFL theorems are criminally misunderstood and misrepresented in the service of crude generalizations intended to make a point, I intend to deploy a crude NFL generalization to make just such a point.

You see, NFL theorems (roughly) state that given a universe of problem sets where an algorithm’s goal is to learn a function that maps a set of input data X to a set of target labels Y, for any subset of problems where algorithm A outperforms algorithm B, there will be a subset of problems where B outperforms A. In fact, averaging their results over the space of all possible problems, the performance of algorithms A and B will be the same.
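For reference, the usual formal statement (following Wolpert and Macready’s 1997 formulation; the notation below is theirs, not the post’s) is:

\[
\sum_{f} P\left(d_m^{y} \mid f, m, A\right) \;=\; \sum_{f} P\left(d_m^{y} \mid f, m, B\right)
\]

where the sum ranges over all possible objective functions f, m is the number of function evaluations, d_m^y is the observed sample of cost values, and A and B are any two algorithms. Averaged over every possible problem, no algorithm wins.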

With some hand waving, we can construct an NFL theorem for the cybersecurity domain: Over the set of all possible attack vectors that could be employed by a hacker, no single detection algorithm can outperform all others across the full spectrum of attacks.

Protecting the protector: Hardening machine learning defenses against adversarial attacks

Harnessing the power of machine learning and artificial intelligence has enabled Windows Defender Advanced Threat Protection (Windows Defender ATP) next-generation protection to stop new malware attacks before they can get started, often within milliseconds. These predictive technologies are central to scaling protection and delivering effective threat prevention in the face of unrelenting attacker activity.

Consider this: On a recent typical day, 2.6 million people encountered newly discovered malware in 232 different countries (Figure 1). These attacks comprised 1.7 million distinct, first-seen malware files, and 60% of the campaigns were finished within the hour.

Figure 1. A single day of malware attacks: 2.6M people from 232 countries encountering malware

While intelligent, cloud-based approaches represent a sea change in the fight against malware, attackers are not sitting idly by and letting advanced ML and AI systems eat their Bitcoin-funded lunch. If they can find a way to defeat machine learning models at the heart of next-gen AV solutions, even for a moment, they’ll gain the breathing room to launch a successful campaign.

Today at Black Hat USA 2018, in our talk Protecting the Protector: Hardening Machine Learning Defenses Against Adversarial Attacks [PDF], we presented a series of lessons learned from our experience investigating attackers attempting to defeat our ML and AI protections. We share these lessons in this blog post; we use a case study to demonstrate how these same lessons have hardened Microsoft’s defensive solutions in the real world. We hope these lessons will help provide defensive strategies for deploying ML in the fight against emerging threats.

Lesson: Use a multi-layered approach

In our layered ML approach, defeating one layer does not mean evading detection, as there are still opportunities to detect the attack at the next layer, albeit with an increase in time to detect. To prevent detection of first-seen malware, an attacker would need to find a way to defeat each of the first three layers in our ML-based protection stack.

Figure 2. Layered ML protection

Even if the first three layers were circumvented, leading to patient zero being infected by the malware, the next layers can still uncover the threat and start protecting other users as soon as these layers reach a malware verdict.

Lesson: Leverage the power of the cloud

ML models trained on the backend and shipped to the client are the first (and fastest) layer in our ML-based stack. They come with some drawbacks, not least of which is that an attacker can take the model and apply pressure until it gives up its secrets. This is a very old trick in the malware author’s playbook: iteratively tweak a prospective threat and keep scanning it until it’s no longer detected, then unleash it.

Figure 3. Client vs. cloud models

With models hosted in the cloud, it becomes more challenging to brute-force the model. Because the only way to understand what the models may be doing is to keep sending requests to the cloud protection system, such attempts to game the system are out in the open and can be detected and mitigated in the cloud.

Lesson: Use a diverse set of models

In addition to having multiple layers of ML-based protection, within each layer we run numerous individual ML models trained to recognize new and emerging threats. Each model has its own focus, or area of expertise. Some may focus on a specific file type (for example, PE files, VBA macros, JavaScript, etc.) while others may focus on attributes of a potential threat (for example, behavioral signals, fuzzy hash/distance to known malware, etc.). Different models use different ML algorithms and train on their own unique set of features.

Figure 4. Diversity of machine learning models

Each stand-alone model gives its own independent verdict about the likelihood that a potential threat is malware. The diversity, in addition to providing a robust and multi-faceted look at potential threats, offers stronger protection against attackers finding some underlying weakness in any single algorithm or feature set.

Lesson: Use stacked ensemble models

Another effective approach we’ve found to add resilience against adversarial attacks is to use ensemble models. While individual models provide a prediction scoped to a particular area of expertise, we can treat those individual predictions as features for additional ensemble machine learning models, combining the results from our diverse set of base classifiers to create even stronger predictions that are more resilient to attacks.

In particular, we’ve found that logistic stacking, where we include the individual probability scores from each base classifier in the ensemble feature set, provides increased effectiveness in malware prediction.
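As a rough illustration of logistic stacking, here is a minimal scikit-learn sketch in which base classifier probabilities become features for a logistic-regression meta-model; the dataset and base models are placeholders, not Microsoft’s production setup.

from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for labeled malware/benign feature vectors
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("gb", GradientBoostingClassifier(random_state=0)),
    ],
    final_estimator=LogisticRegression(),
    stack_method="predict_proba",  # feed probabilities, not hard labels
)
stack.fit(X, y)
print(stack.predict_proba(X[:1]))  # meta-model verdict for one sample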

Figure 5. Ensemble machine learning model with individual model probabilities as feature inputs

As discussed in detail in our Black Hat talk, experimental verification and real-world performance shows this approach helps us resist adversarial attacks. In June, the ensemble models represented nearly 12% of our total malware blocks from cloud protection, which translates into tens of thousands of computers protected by these new models every day.

Figure 6. Blocks by ensemble models vs. other cloud blocks

Case study: Ensemble models vs. regional banking Trojan

“The idea of ensemble learning is to build a prediction model by combining the strengths of a collection of simpler base models.”
— Trevor Hastie, Robert Tibshirani, Jerome Friedman

One of the key advantages of ensemble models is the ability to make high-fidelity predictions from a series of lower-fidelity inputs. This can sometimes seem a little spooky and counterintuitive to researchers, but the use cases we’ve studied show this approach can catch malware that singular models cannot. That’s what happened in early June when a new banking trojan (detected by Windows Defender ATP as TrojanDownloader:VBS/Bancos) targeting users in Brazil was unleashed.

The attack

The attack started with spam email sent to users in Brazil, directing them to download an important document with a name like Doc062018.zip, inside of which was a document that is really a highly obfuscated .vbs script.

Figure 7. Initial infection chain

Figure 8. Obfuscated malicious .vbs script

While the script contains several Base64-encoded Brazilian poems, its true purpose is to:

  • Check to make sure it’s running on a machine in Brazil
  • Check with its command-and-control server to see if the computer has already been infected
  • Download other malicious components, including a Google Chrome extension
  • Modify the shortcut to Google Chrome to run a different malicious .vbs file

Now, whenever the user launches Chrome, this new .vbs malware runs instead.

Figure 9. Modified shortcut to Google Chrome

This new .vbs file runs a .bat file that:

  • Kills any running instances of Google Chrome
  • Copies the malicious Chrome extension into %UserProfile%\Chrome
  • Launches Google Chrome with the --load-extension= parameter pointing to the malicious extension

Figure 10. Malicious .bat file that loads the malicious Chrome extension

With the .bat file’s work done, the user’s Chrome instance is now running the malicious extension.

Figure 11. The installed Chrome extension

The extension itself runs malicious JavaScript (.js) files on every web page visited.

Figure 12. Inside the malicious Chrome extension

The .js files are highly obfuscated to avoid detection:

Figure 13. Obfuscated .js file

Decoding the hex at the start of the script, we can start to see some clues that this is a banking trojan:

Figure 14. Clues in script show its true intention

The .js files detect whether the website visited is a Brazilian banking site. If it is, the POST to the site is intercepted and sent to the attackers’ C&C to gather the user’s login credentials, credit card info, and other data before being passed on to the actual banking site. This activity happens behind the scenes; to the user, they’re just going about their normal routine with their bank.

Ensemble models and the malicious JavaScript

As the attack got under way, our cloud protection service received thousands of queries about the malicious .js files, triggered by a client-side ML model that considered these files suspicious. The files were highly polymorphic, with every potential victim receiving a unique, slightly altered version of the threat:

Figure 15. Polymorphic malware

The interesting part of the story is the malicious JavaScript files themselves. How did our ML models detect these highly obfuscated scripts as malware? Let’s look at one of the instances. At the time of the query, we received metadata about the file. Here’s a snippet:

Report time: 2018-06-14 01:16:03Z
SHA-256: 1f47ec030da1b7943840661e32d0cb7a59d822e400063cd17dc5afa302ab6a52
Client file type model: SUSPICIOUS
File name: vNSAml.js
File size: 28074 bytes
Extension: .js
Is PE file: FALSE
File age: 0
File prevalence: 0
Path: C:\Users\<user>\Chrome\1.9.6\vNSAml.js
Process name: xcopy.exe

Figure 16. File metadata sent during query to cloud protection service

Based on the process name, this query was sent when the .bat file copied the .js files into the %UserProfile%\Chrome directory.

Individual metadata-based classifiers evaluated the metadata and provided their probability scores. Ensemble models then used these probabilities, along with other features, to reach their own probability scores:

Probability that the file is malware, by model:

Fuzzy hash 1: 0.01
Fuzzy hash 2: 0.06
ResearcherExpertise: 0.64
Ensemble 1: 0.85
Ensemble 2: 0.91

Figure 17. Probability scores by individual classifiers

In this case, the second ensemble model had a strong enough score for the cloud to issue a blocking decision. Even though none of the individual classifiers in this case had a particularly strong score, the ensemble model had learned from training on millions of clean and malicious files that this combination of scores, in conjunction with a few other non-ML based features, indicated the file had a very strong likelihood of being malware.
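To make that concrete, here is a purely illustrative Python sketch of how a logistic meta-model can turn individually weak scores like those in Figure 17 into a confident verdict; the weights, bias, and threshold are invented for illustration, not the trained model’s parameters.

import math

# Hypothetical learned parameters, chosen only to illustrate the idea
WEIGHTS = [1.5, 2.0, 6.5]   # fuzzy hash 1, fuzzy hash 2, ResearcherExpertise
BIAS = -2.0
BLOCK_THRESHOLD = 0.85

def ensemble_probability(scores):
    # Weighted sum of base scores pushed through the logistic function
    z = BIAS + sum(w * s for w, s in zip(WEIGHTS, scores))
    return 1.0 / (1.0 + math.exp(-z))

p = ensemble_probability([0.01, 0.06, 0.64])
print(f"p(malware) = {p:.2f}, block = {p >= BLOCK_THRESHOLD}")
# p(malware) = 0.91, block = True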

Figure 18. Ensemble models issue a blocking decision

As the queries on the malicious .js files rolled in, the cloud issued blocking decisions within a few hundred milliseconds using the ensemble model’s strong probability score, enabling Windows Defender ATP’s antivirus capabilities to prevent the malicious .js from running and to remove it. Here is a map overlay of the actual ensemble-based blocks of the malicious JavaScript files at the time:

Figure 19. Blocks by ensemble model of malicious JavaScript used in the attack

Ensemble ML models enabled Windows Defender ATP’s next-gen protection to defend thousands of customers in Brazil targeted by the unscrupulous attackers from having a potentially bad day, while ensuring the frustrated malware authors didn’t hit the big payday they were hoping for. Bom dia.

Further reading on machine learning and artificial intelligence in Windows Defender ATP

Indicators of compromise (IoCs)

  • Doc062018.zip (SHA-256: 93f488e4bb25977443ff34b593652bea06e7914564af5721727b1acdd453ced9)
  • Doc062018-2.vbs (SHA-256: 7b1b7b239f2d692d5f7f1bffa5626e8408f318b545cd2ae30f44483377a30f81)
  • zobXhz.js (SHA-256: 1f47ec030da1b7943840661e32d0cb7a59d822e400063cd17dc5afa302ab6a52)

Randy Treit, Holly Stewart, Jugal Parikh
Windows Defender Research
with special thanks to Allan Sepillo and Samuel Wakasugui


Talk to us

Questions, concerns, or insights on this story? Join discussions at the Microsoft community and Windows Defender Security Intelligence.

Follow us on Twitter @WDSecurity and Facebook Windows Defender Security Intelligence.

The post Protecting the protector: Hardening machine learning defenses against adversarial attacks appeared first on Microsoft Secure.

Malicious PowerShell Detection via Machine Learning

Introduction

Cyber security vendors and researchers have reported for years how PowerShell is being used by cyber threat actors to install backdoors, execute malicious code, and otherwise achieve their objectives within enterprises. Security is a cat-and-mouse game between adversaries, researchers, and blue teams. The flexibility and capability of PowerShell has made conventional detection both challenging and critical. This blog post will illustrate how FireEye is leveraging artificial intelligence and machine learning to raise the bar for adversaries that use PowerShell.

In this post you will learn:

  • Why malicious PowerShell can be challenging to detect with a traditional “signature-based” or “rule-based” detection engine.
  • How Natural Language Processing (NLP) can be applied to tackle this challenge.
  • How our NLP model detects malicious PowerShell commands, even if obfuscated.
  • The economics of increasing the cost for the adversaries to bypass security solutions, while potentially reducing the release time of security content for detection engines.

Background

PowerShell is one of the most popular tools used to carry out attacks. Data gathered from FireEye Dynamic Threat Intelligence (DTI) Cloud shows malicious PowerShell attacks rising throughout 2017 (Figure 1).


Figure 1: PowerShell attack statistics observed by FireEye DTI Cloud in 2017 – blue bars for the number of attacks detected, with the red curve for exponentially smoothed time series

FireEye has been tracking the malicious use of PowerShell for years. In 2014, Mandiant incident response investigators published a Black Hat paper that covers the tactics, techniques and procedures (TTPs) used in PowerShell attacks, as well as forensic artifacts on disk, in logs, and in memory produced from malicious use of PowerShell. In 2016, we published a blog post on how to improve PowerShell logging, which gives greater visibility into potential attacker activity. More recently, our in-depth report on APT32 highlighted this threat actor's use of PowerShell for reconnaissance and lateral movement procedures, as illustrated in Figure 2.


Figure 2: APT32 attack lifecycle, showing PowerShell attacks found in the kill chain

Let’s take a deep dive into an example of a malicious PowerShell command (Figure 3).


Figure 3: Example of a malicious PowerShell command

The following is a quick explanation of the arguments:

  • -NoProfile – indicates that the current user’s profile setup script should not be executed when the PowerShell engine starts.
  • -NonI – shorthand for -NonInteractive, meaning an interactive prompt to the user will not be presented.
  • -W Hidden – shorthand for “-WindowStyle Hidden”, which indicates that the PowerShell session window should be started in a hidden manner.
  • -Exec Bypass – shorthand for “-ExecutionPolicy Bypass”, which disables the execution policy for the current PowerShell session (default disallows execution). It should be noted that the Execution Policy isn’t meant to be a security boundary.
  • -encodedcommand – indicates the following chunk of text is a base64 encoded command.

What is hidden inside the Base64 decoded portion? Figure 4 shows the decoded command.


Figure 4: The decoded command for the aforementioned example

Interestingly, the decoded command unveils stealthy fileless network access and remote content execution!

  • IEX is an alias for the Invoke-Expression cmdlet that will execute the command provided on the local machine.
  • The new-object cmdlet creates an instance of a .NET Framework or COM object, here a net.webclient object.
  • The downloadstring method will download the contents from <url> into a memory buffer (which IEX will in turn execute).
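Returning to the -encodedcommand argument from Figure 3: here is a minimal Python sketch of how an analyst might decode such a payload, which is base64 over UTF-16LE text. The sample string below is a harmless stand-in, not the command from the figure.

import base64

def decode_powershell_command(b64):
    # -encodedcommand payloads are base64-encoded UTF-16LE text
    return base64.b64decode(b64).decode("utf-16-le")

# Harmless stand-in payload for demonstration
sample = base64.b64encode("Write-Output 'hello'".encode("utf-16-le")).decode()
print(decode_powershell_command(sample))  # Write-Output 'hello'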

It’s worth mentioning that a similar malicious PowerShell tactic was used in a recent cryptojacking attack exploiting CVE-2017-10271 to deliver a cryptocurrency miner. This attack involved the exploit being leveraged to deliver a PowerShell script, instead of downloading the executable directly. This PowerShell command is particularly stealthy because it leaves practically zero file artifacts on the host, making it hard for traditional antivirus to detect.

There are several reasons why adversaries prefer PowerShell:

  1. PowerShell has been widely adopted in Microsoft Windows as a powerful system administration scripting tool.
  2. Most attacker logic can be written in PowerShell without the need to install malicious binaries. This enables a minimal footprint on the endpoint.
  3. The flexible PowerShell syntax imposes combinatorial complexity challenges to signature-based detection rules.

Additionally, from an economics perspective:

  • Offensively, the cost for adversaries to modify PowerShell to bypass a signature-based rule is quite low, especially with open source obfuscation tools.
  • Defensively, updating handcrafted signature-based rules for new threats is time-consuming and limited to experts.

Next, we would like to share how we at FireEye are combining our PowerShell threat research with data science to combat this threat, thus raising the bar for adversaries.

Natural Language Processing for Detecting Malicious PowerShell

Can we use machine learning to predict if a PowerShell command is malicious?

One advantage FireEye has is our repository of high quality PowerShell examples that we harvest from our global deployments of FireEye solutions and services. Working closely with our in-house PowerShell experts, we curated a large training set that was comprised of malicious commands, as well as benign commands found in enterprise networks.

After we reviewed the PowerShell corpus, we quickly realized this fit nicely into the NLP problem space. We have built an NLP model that interprets PowerShell command text, similar to how Amazon Alexa interprets your voice commands.

One of the technical challenges we tackled was synonyms, a problem studied in linguistics. For instance, “NOL”, “NOLO”, and “NOLOGO” have identical semantics in PowerShell syntax. In NLP, a stemming algorithm reduces a word to its original form, such as “Innovating” being stemmed to “Innovate”.

We created a prefix-tree based stemmer for the PowerShell command syntax using an efficient data structure known as a trie, as shown in Figure 5. Even in a complex scripting language such as PowerShell, a trie can stem command tokens in nanoseconds.
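To illustrate the idea (not FireEye’s implementation), here is a minimal Python sketch of a trie-based stemmer that maps accepted variants of a token to one canonical form; the vocabulary is a tiny hypothetical sample.

class TrieStemmer:
    def __init__(self):
        self.root = {}

    def add(self, canonical, variants):
        # Insert each accepted variant; an end marker stores the stem
        for variant in variants:
            node = self.root
            for ch in variant.lower():
                node = node.setdefault(ch, {})
            node["$"] = canonical

    def stem(self, token):
        # Walk the trie; unknown tokens are returned unchanged
        node = self.root
        for ch in token.lower():
            if ch not in node:
                return token
            node = node[ch]
        return node.get("$", token)

stemmer = TrieStemmer()
stemmer.add("nologo", ["nol", "nolo", "nologo"])
print(stemmer.stem("NOLO"))  # nologo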


Figure 5: Synonyms in the PowerShell syntax (left) and the trie stemmer capturing these equivalences (right)

The overall NLP pipeline we developed comprises the following key modules and their functionality:

  • Decoder: Detect and decode any encoded text
  • Named Entity Recognition (NER): Detect and recognize entities such as IPs, URLs, emails, registry keys, etc.
  • Tokenizer: Tokenize the PowerShell command into a list of tokens
  • Stemmer: Stem tokens into semantically identical tokens (uses the trie)
  • Vocabulary Vectorizer: Vectorize the list of tokens into a machine-learning-friendly format
  • Supervised classifier: Binary classification algorithms: Kernel Support Vector Machine, Gradient Boosted Trees, Deep Neural Networks
  • Reasoning: The explanation of why the prediction was made; enables analysts to validate predictions

The following are the key steps when streaming the aforementioned example through the NLP pipeline:

  • Detect and decode the Base64 commands, if any
  • Recognize entities using Named Entity Recognition (NER), such as the <URL>
  • Tokenize the entire text, including both clear text and obfuscated commands
  • Stem each token, and vectorize them based on the vocabulary
  • Predict the malicious probability using the supervised learning model


Figure 6: NLP pipeline that predicts the malicious probability of a PowerShell command
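As a rough end-to-end illustration of these stages, here is a minimal scikit-learn sketch; the tokenizer, stem table, and toy training commands are hypothetical stand-ins for the production pipeline.

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

# Tiny hypothetical stem table (see the trie stemmer above)
STEMS = {"nol": "nologo", "nolo": "nologo", "noni": "noninteractive"}

def tokenize_and_stem(command):
    # Split on whitespace/dashes, then map variants to canonical stems
    tokens = command.lower().replace("-", " ").split()
    return [STEMS.get(t, t) for t in tokens]

pipeline = Pipeline([
    ("vectorize", CountVectorizer(tokenizer=tokenize_and_stem)),
    ("classify", GradientBoostingClassifier()),
])

# Toy labeled commands: 1 = malicious, 0 = benign
commands = [
    "powershell -nol -noni -w hidden -encodedcommand aQBlAHgA",
    "powershell -nologo -noninteractive -w hidden -enc dwBnAGUAdAA",
    "powershell Get-ChildItem C:\\logs",
    "powershell Get-Process | Sort-Object CPU",
]
labels = [1, 1, 0, 0]

pipeline.fit(commands, labels)
print(pipeline.predict_proba([commands[0]]))  # malicious probability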

More importantly, we established a production end-to-end machine learning pipeline (Figure 7) so that we can constantly evolve with adversaries through re-labeling and re-training, and the release of the machine learning model into our products.


Figure 7: End-to-end machine learning production pipeline for PowerShell machine learning

Value Validated in the Field

We successfully implemented and optimized this machine learning model to a minimal footprint that fits into our research endpoint agent, which is able to make predictions in milliseconds on the host. Throughout 2018, we have deployed this PowerShell machine learning detection engine on incident response engagements. Early field validation has confirmed detections of malicious PowerShell attacks, including:

  • Commodity malware such as Kovter.
  • Red team penetration test activities.
  • New variants that bypassed legacy signatures but were detected by our machine learning model with high probabilistic confidence.

The unique values brought by the PowerShell machine learning detection engine include:  

  • The machine learning model automatically learns the malicious patterns from the curated corpus. In contrast to traditional detection signature rule engines, which are Boolean expression and regex based, the NLP model has lower operation cost and significantly cuts down the release time of security content.
  • The model performs probabilistic inference on unknown PowerShell commands by the implicitly learned non-linear combinations of certain patterns, which increases the cost for the adversaries to bypass.

The ultimate value of this innovation is to evolve with the broader threat landscape, and to create a competitive edge over adversaries.

Acknowledgements

We would like to acknowledge:

  • Daniel Bohannon, Christopher Glyer and Nick Carr for the support on threat research.
  • Alex Rivlin, HeeJong Lee, and Benjamin Chang from FireEye Labs for providing the DTI statistics.
  • Research endpoint support from Caleb Madrigal.
  • The FireEye ICE-DS Team.

Microsoft to acquire Bonsai in move to build ‘brains’ for autonomous systems

Group shot of Bonsai's team members
Bonsai’s team members. Photo courtesy of Bonsai.

With AI’s meteoric rise, autonomous systems have been projected to grow to more than 800 million in operation by 2025. However, while envisioned in science fiction for a long time, truly intelligent autonomous systems are still elusive and remain a holy grail. The reality today is that training autonomous systems that function amidst the many unforeseen situations in the real world is very hard and requires deep expertise in AI — essentially making it unscalable.

To achieve this inflection point in AI’s growth, traditional machine learning methodologies aren’t enough. Bringing intelligence to autonomous systems at scale will require a unique combination of the new practice of machine teaching, advances in deep reinforcement learning and leveraging simulation for training. Microsoft has been on a path to make this a reality through continued AI research breakthroughs; the development of the powerful Azure AI platform of tools, services and infrastructure; advances in deep learning, including our acquisition of Maluuba; and the impressive efficiencies we’ve achieved in simulation-based training with Microsoft Research’s AirSim tool. With software developers at the center of digital transformation, our pending acquisition of GitHub further underscores just how imperative it is that we empower developers to break through and lead this next wave of innovation.

Today we are excited to take another major step forward in our vision to make it easier for developers and subject matter experts to build the “brains” (machine learning models) for autonomous systems of all kinds, with the signing of an agreement to acquire Bonsai. Based in Berkeley, California, and an M12 portfolio company, Bonsai has developed a novel approach using machine teaching that abstracts the low-level mechanics of machine learning, so that subject matter experts, regardless of AI aptitude, can specify and train autonomous systems to accomplish tasks. The actual training takes place inside a simulated environment.

The company is building a general-purpose, deep reinforcement learning platform especially suited for enterprises leveraging industrial control systems such as robotics, energy, HVAC, manufacturing and autonomous systems in general. This includes unique machine-teaching innovations, automated model generation and management, a host of APIs and SDKs for simulator integration, as well as pre-built support for leading simulations all packaged in one end-to-end platform.

Bonsai’s platform combined with rich simulation tools and reinforcement learning work in Microsoft Research becomes the simplest and richest AI toolchain for building any kind of autonomous system for control and calibration tasks. This toolchain will compose with Azure Machine Learning running on the Azure Cloud with GPUs and Brainwave, and models built with it will be deployed and managed in Azure IoT, giving Microsoft an end-to-end solution for building, operating and enhancing “brains” for autonomous systems.

What I find exciting is that Bonsai has achieved some remarkable breakthroughs with their approach that will have a profound impact on AI development. Last fall, they established a new reinforcement learning benchmark for programming industrial control systems. Using a robotics task to demonstrate the achievement, the platform successfully trained a simulated robotic arm to grasp and stack blocks on top of one another by breaking down the task into simpler sub-concepts. Their novel technique performed 45 times faster than a comparable approach from Google’s DeepMind. Then, earlier this year, they extended deep reinforcement learning’s capabilities beyond traditional game play, where it’s often demonstrated, to real-world applications. Using Bonsai’s AI Platform and machine teaching, subject matter experts from Siemens, with no AI expertise, trained an AI model to autocalibrate a Computer Numerical Control machine 30 times faster than the traditional approach. This represented a huge milestone in industrial AI, and the implications when considered across the broader sector are just staggering.

To realize this vision of making AI more accessible and valuable for all, we have to remove the barriers to development, empowering every developer, regardless of machine learning expertise, to be an AI developer. Bonsai has made tremendous progress here and Microsoft remains committed to furthering this work. We already deliver the most comprehensive collection of AI tools and services that make it easier for any developer to code and integrate pre-built and custom AI capabilities into applications and extend to any scenario. There are over a million developers using our pre-built Microsoft Cognitive Services, a collection of intelligent APIs that enable developers to easily leverage high-quality vision, speech, language, search and knowledge technologies in their apps with a few lines of code. And last fall, we led a combined industry push to foster a more open AI ecosystem, bringing AI advances to all developers, on any platform, using any language through the introduction of the Open Neural Network Exchange (ONNX) format and Gluon open source interface for deep learning.

We’re really confident this unique marriage of research, novel approach and technology will have a tremendous effect toward removing barriers and accelerating the current state of AI development. We look forward to having Bonsai and their team join us to help realize this collective vision.

The post Microsoft to acquire Bonsai in move to build ‘brains’ for autonomous systems appeared first on The Official Microsoft Blog.

Neural networks and deep learning

Deep learning refers to a family of machine learning algorithms that can be used for supervised, unsupervised and reinforcement learning. 

These algorithms are becoming popular again after many years in the wilderness. The name comes from the realization that adding more layers, typically in a neural network, enables a model to learn increasingly complex representations of the data.
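A toy illustration of that idea: each layer re-represents the output of the previous one, so depth buys increasingly complex functions of the input. The Python sketch below uses random weights purely to show the mechanics.

import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

x = rng.normal(size=(1, 8))                            # one example, 8 features
weights = [rng.normal(size=(8, 8)) for _ in range(4)]  # four stacked layers

h = x
for W in weights:
    h = relu(h @ W)    # each layer transforms the previous representation
print(h.shape)         # (1, 8): the final (here random) representation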

Reverse Engineering the Analyst: Building Machine Learning Models for the SOC

Many cyber incidents can be traced back to an original alert that was either missed or ignored by the Security Operations Center (SOC) or Incident Response (IR) team. While most analysts and SOCs are vigilant and responsive, the fact is they are often overwhelmed with alerts. If a SOC is unable to review all the alerts it generates, then sooner or later, something important will slip through the cracks.

The core issue here is scalability. It is far easier to create more alerts than to create more analysts, and the cyber security industry is far better at alert generation than resolution. More intel feeds, more tools, and more visibility all add to the flood of alerts. There are things that SOCs can and should do to manage this flood, such as increasing automation of forensic tasks (pulling PCAP and acquiring files, for example) and using aggregation filters to group alerts into similar batches. These are effective strategies and will help reduce the number of required actions a SOC analyst must take. However, the decisions the SOC makes still form a critical bottleneck. This is the “Analyze/Decide” block in Figure 1.


Figure 1: Basic SOC triage stages

In this blog post, we propose machine learning based strategies to help mitigate this bottleneck and take back control of the SOC. We have implemented these strategies in our FireEye Managed Defense SOC, and our analysts are taking advantage of this approach within their alert triaging workflow. In the following sections, we will describe our process to collect data, capture alert analysis, create a model, and build an efficacy workflow – all with the ultimate goal of automating alert triage and freeing up analyst time.

Reverse Engineering the Analyst

Every alert that comes into a SOC environment contains certain bits of information that an analyst uses to determine if the alert represents malicious activity. Often, there are well-paved analytical processes and pathways used when evaluating these forensic artifacts over time. We wanted to explore if, in an effort to truly scale our SOC operations, we could extract these analytical pathways, train a machine to traverse them, and potentially discover new ones.

Think of a SOC as a self-contained machine that inputs unlabeled alerts and outputs the alerts labeled as “malicious” or “benign”. How can we capture the analysis and determine that something is indeed malicious, and then recreate that analysis at scale? In other words, what if we could train a machine to make the same analytical decisions as an analyst, within an acceptable level of confidence?

Basic Supervised Model Process

The data science term for this is a “Supervised Classification Model”. It is “supervised” in the sense that it learns by being shown data already labeled as benign or malicious, and it is a “classification model” in the sense that once it has been trained, we want it to look at a new piece of data and make a decision between one of several discrete outcomes. In our case, we only want it to decide between two “classes” of alerts: malicious and benign.

In order to begin creating such a model, a dataset must be collected. This dataset forms the “experience” of the model, and is the information we will use to “train” the model to make decisions. In order to supervise the model, each unit of data must be labeled as either malicious or benign, so that the model can evaluate each observation and begin to figure out what makes something malicious versus what makes it benign. Typically, collecting a clean, labeled dataset is one of the hardest parts of the supervised model pipeline; however, in the case of our SOC, our analysts are constantly triaging (or “labeling”) thousands of alerts every week, and so we were lucky to have an abundance of clean, standardized, labeled alerts.

Once a labeled dataset has been defined, the next step is to define “features” that can be used to portray the information resident in each alert. A “feature” can be thought of as an aspect of a bit of information. For example, if the information is represented as a string, a natural “feature” could be the length of the string. The central idea behind building features for our alert classification model was to find a way to represent and record all the aspects that an analyst might consider when making a decision.

Building the model then requires choosing a model structure to use, and training the model on a subset of the total data available. The larger and more diverse the training data set, generally the better the model will perform. The remaining data is used as a “test set” to see if the trained model is indeed effective. Holding out this test set ensures the model is evaluated on samples it has never seen before, but for which the true labels are known.
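A minimal sketch of this split using scikit-learn; the encoded features and labels here are toy values for illustration, not FireEye’s actual dataset:

from sklearn.model_selection import train_test_split

# Toy feature matrix X (one row of encoded features per alert) and
# analyst-assigned labels y (1 = malicious, 0 = benign).
X = [[1, 120, 1], [0, 30, 0], [1, 200, 1], [0, 25, 0],
     [1, 95, 1], [0, 40, 0], [1, 180, 1], [0, 15, 0]]
y = [1, 0, 1, 0, 1, 0, 1, 0]

# Hold out 25% of the labeled alerts; the model never sees them in training.
X_train, X_test, y_train, y_test = train_test_split(
  X, y, test_size=0.25, random_state=42)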

Finally, it is critical to ensure there is a way to evaluate the efficacy of the model over time, as well as to investigate mistakes so that appropriate adjustments can be made. Without a plan and a pipeline to evaluate and retrain, the model will almost certainly decay in performance.

Feature Engineering

Before creating any of our own models, we interviewed experienced analysts and documented the information they typically evaluate before making a decision on an alert. Those interviews formed the basis of our feature extraction. For example, when an analyst says that reviewing an alert is “easy”, we ask: “Why? And what helps you make that decision?” It is this reverse engineering of sorts that gives insight into features and models we can use to capture analysis.

For example, consider a process execution event. An alert on a potentially malicious process execution may contain the following fields:

  • Process Path
  • Process MD5
  • Parent Process
  • Process Command Arguments

While this may initially seem like a limited feature space, there is a lot of useful information that one can extract from these fields.

Beginning with the process path of, say, “C:\windows\temp\m.exe”, an analyst can immediately see some features:

  • The process resides in a temporary folder: C:\windows\temp\
  • The process is two directories deep in the file system
  • The process executable name is one character long
  • The process has an .exe extension
  • The process is not a “common” process name

While these may seem simple, over a vast amount of data and examples, extracting these bits of information will help the model to differentiate between events. Even the most basic aspects of an artifact must be captured in order to “teach” the model to view processes the way an analyst does.
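As a hedged sketch of what such extraction might look like in Python (the helper function and the list of “common” process names are illustrative assumptions, not FireEye’s implementation):

import ntpath

COMMON_NAMES = {"svchost.exe", "explorer.exe", "chrome.exe"}  # illustrative list

def path_features(process_path):
  folder, name = ntpath.split(process_path.lower())
  base, ext = ntpath.splitext(name)
  return {
    "temp_folder": "\\temp" in folder,
    "depth": folder.count("\\"),
    "name_length": len(base),
    "extension": ext.lstrip("."),
    "common_process_name": name in COMMON_NAMES,
  }

# path_features("C:\\windows\\temp\\m.exe") returns:
# {'temp_folder': True, 'depth': 2, 'name_length': 1,
#  'extension': 'exe', 'common_process_name': False}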

The features are then encoded into a more discrete representation, similar to this:

Temp_folder | Depth | Name_Length | Extension | common_process_name
----------- | ----- | ----------- | --------- | -------------------
TRUE        | 2     | 1           | exe       | FALSE

Another important feature to consider about a process execution event is the combination of parent process and child process. Deviation from expected “lineage” can be a strong indicator of malicious activity.

Say the parent process of the aforementioned example was ‘powershell.exe’. Potential new features could then be derived from the concatenation of the parent process and the process itself: ‘powershell.exe_m.exe’. This functionally serves as an identity for the parent-child relation and captures another key analysis artifact.

The richest field, however, is probably the process arguments. Process arguments are their own sort of language, and language analysis is a well-trodden space in predictive analytics.

We can look for things including, but not limited to (a short sketch follows this list):

  • Network connection strings (such as ‘http://’, ‘https://’, ‘ftp://’)
  • Base64 encoded commands
  • Reference to Registry Keys (‘HKLM’, ‘HKCU’)
  • Evidence of obfuscation (ticks, $, semicolons) (read Daniel Bohannon’s work for more)
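A rough sketch of how such argument checks might be encoded as boolean or count features; the patterns and the 40-character base64 heuristic are illustrative assumptions, not FireEye’s implementation:

import re

def argument_features(cmd):
  return {
    "has_url": bool(re.search(r"(https?|ftp)://", cmd, re.I)),
    "base64_blob": bool(re.search(r"[A-Za-z0-9+/]{40,}={0,2}", cmd)),
    "registry_key": bool(re.search(r"\b(HKLM|HKCU)\b", cmd, re.I)),
    "obfuscation_chars": sum(cmd.count(c) for c in "`$;"),
  }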

The way these features and their values appear in a training dataset will define the way the model learns. Based on the distribution of features across thousands of alerts, relationships will start to emerge between features and labels. These relationships will then be recorded in our model, and ultimately used to influence the predictions for new alerts. Looking at distributions of features in the training set can give insight into some of these potential relationships.

For example, Figure 2 shows how the distribution of Process Command Length may appear when grouping by malicious (red) and benign (blue).


Figure 2: Distribution of Process Event alerts grouped by Process Command Length

This graph shows that over a subset of samples, the longer the command length, the more likely it is to be malicious. This manifests as red on the right and blue on the left. However, process length is not the only factor.

As part of our feature set, we also thought it would be useful to approximate the “complexity” of each command. For this, we used “Shannon entropy”, a commonly used metric that measures the degree of randomness present in a string of characters.
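Shannon entropy is straightforward to compute; here is a small self-contained sketch:

import math
from collections import Counter

def shannon_entropy(s):
  # Entropy in bits per character; higher values mean more randomness.
  if not s:
    return 0.0
  n = float(len(s))
  return -sum((c / n) * math.log(c / n, 2) for c in Counter(s).values())

# shannon_entropy("aaaaaaaa") -> 0.0
# Random-looking command strings score substantially higher.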

Figure 3 shows a distribution of command entropy, broken out into malicious and benign. While the classes do not separate entirely, we can see that for this sample of data, samples with higher entropy generally have a higher chance of being malicious.


Figure 3: Distribution of Process Event alerts grouped by entropy

Model Selection and Generalization

Once features have been generated for the whole dataset, it is time to use them to train a model. There is no perfect procedure for picking the best model, but looking at the type of features in our data can help narrow it down. In the case of a process event, we have a combination of features represented as strings and numbers. When an analyst evaluates each artifact, they ask questions about each of these features, and combine the answers to estimate the probability that the process is malicious.

For our use case, it also made sense to prioritize an ‘interpretable’ model – that is, one that can more easily expose why it made a certain decision about an artifact. This way analysts can build confidence in the model, as well as detect and fix analytical mistakes that the model is making. Given the nature of the data, the decisions analysts make, and the desire for interpretability, we felt that a decision tree-based model would be well-suited for alert classification.

There are many publicly available resources to learn about decision trees, but the basic intuition behind a decision tree is that it is an iterative process, asking a series of questions to try to arrive at a highly confident answer. Anyone who has played the game “Twenty Questions” is familiar with this concept. Initially, general questions are asked to help eliminate possibilities, and then more specific questions are asked to narrow down the possibilities. After enough questions are asked and answered, the ‘questioner’ feels they have a high probability of guessing the right answer.

Figure 4 shows an example of a decision tree that one might use to evaluate process executions.


Figure 4: Decision tree for deciding whether an alert is benign or malicious

For the example alert in the diagram, the “decision path” is marked in red. This is how this decision tree model makes a prediction. It first asks: “Is the length greater than 100 characters?” If so, it moves to the next question “Does it contain the string ‘http’?” and so on until it feels confident in making an educated guess. In the example in Figure 4, given that 95 percent of all the training alerts traveling this decision path were malicious, the model predicts a 95 percent chance that this alert will also be malicious.

Because they can ask such detailed combinations of questions, it is possible that decision trees can “overfit”, or learn rules that are too closely tied to the training set. This reduces the model’s ability to “generalize” to new data. One way to mitigate this effect is to use many slightly different decision trees and have them each “vote” on the outcome. This “ensemble” of decision trees is called a Random Forest, and it can improve performance for the model when deployed in the wild. This is the algorithm we ultimately chose for our model.
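A minimal sketch of training such an ensemble with scikit-learn’s RandomForestClassifier; the toy feature vectors and the 0.95 confidence threshold are illustrative, not FireEye’s production values:

from sklearn.ensemble import RandomForestClassifier

# Toy encoded alerts: [temp_folder, command_length, uncommon_name]
X_train = [[1, 120, 1], [0, 30, 0], [1, 200, 1], [0, 25, 0]]
y_train = [1, 0, 1, 0]  # 1 = malicious, 0 = benign

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)

# Each tree votes; predict_proba reports the fraction voting "malicious".
score = forest.predict_proba([[1, 150, 1]])[0][1]
auto_classify = score > 0.95  # act automatically only at high confidence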

How the SOC Alert Model Works

When a new alert appears, the data in the artifact is transformed into a vector of the encoded features, with the same structure as the feature representations used to train the model. The model then evaluates this “feature vector” and applies a confidence level for the predicted label. Based on thresholds we set, we can then classify the alert as malicious or benign.


Figure 5: An alert presented to the analyst with its raw values captured

As an example, the event shown in Figure 5 might create the following feature values:

  • Parent Process: ‘wscript’
  • Command Entropy: 5.08
  • Command Length: 103

Based on how they were trained, the trees in the model each ask a series of questions of the new feature vector. As the feature vector traverses each tree, it eventually converges on a terminal “leaf” classifying it as either benign or malicious. We can then evaluate the aggregated decisions made by each tree to estimate which features in the vector played the largest role in the ultimate classification.

For the analysts in the SOC, we then present the features extracted from the model, showing the distribution of those features over the entire dataset. This gives the analysts insight into “why” the model thought what it thought, and how those features are represented across all alerts we have seen. For example, the “explanation” for this alert might look like:

  • Command Entropy = 5.08 > 4.60:  51.73% Threat
  • occuranceOfChar “\”= 9.00 > 4.50:  64.09% Threat
  • occuranceOfChar:“)” (=0.00) <= 0.50: 78.69% Threat
  • NOT processTree=”cmd.exe_to_cscript.exe”: 99.6% Threat

Thus, at the time of analysis, the analysts can see the raw data of the event, the prediction from the model, an approximation of the decision path, and a simplified, interpretable view of the overall feature importance.
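Continuing the toy forest from the earlier sketch, scikit-learn exposes a coarse global view of feature influence via feature_importances_ (per-alert explanations would instead walk each tree’s decision_path()); this is illustrative, not FireEye’s actual explanation pipeline:

# A rough global ranking of which features drive the forest's decisions.
feature_names = ["temp_folder", "command_length", "uncommon_name"]
ranked = sorted(zip(feature_names, forest.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, weight in ranked:
  print("%-16s %.3f" % (name, weight))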

How the SOC Uses the Model

Showing the features the model used to reach the conclusion allows experienced analysts to compare their approach with the model, and give feedback if the model is doing something wrong. Conversely, a new analyst may learn to look at features they may have otherwise missed: the parent-child relationship, signs of obfuscation, or network connection strings in the arguments. After all, the model has learned on the collective experience of every analyst over thousands of alerts. Therefore, the model provides an actionable reflection of the aggregate analyst experience back to the SOC, so that each analyst can transitively learn from their colleagues.

Additionally, it is possible to write rules using the output of the model as a parameter. If the model is particularly confident on a subset of alerts, and the SOC feels comfortable automatically classifying that family of threats, it is possible to simply write a rule to say: “If the alert is of this type, AND for this malware family, AND the model confidence is above 99, automatically call this alert bad and generate a report.” Or, if there is a storm of probable false positives, one could write a rule to cull the herd of false positives using a model score below 10.

How the Model Stays Effective

The day the model is trained, it stops learning. However, threats – and therefore alerts – are constantly evolving. Thus, it is imperative to continually retrain the model with new alert data to ensure it continues to learn from changes in the environment.

Additionally, it is critical to monitor the overall efficacy of the model over time. Building an efficacy analysis pipeline to compare model results against analyst feedback will help identify if the model is beginning to drift or develop structural biases. Evaluating and incorporating analyst feedback is also critical to identify and address specific misclassifications, and discover potential new features that may be necessary.

To accomplish these goals, we run a background job that updates our training database with newly labeled events. As we get more and more alerts, we periodically retrain our model with the new observations. If we encounter issues with accuracy, we diagnose and work to address them. Once we are satisfied with the overall accuracy score of our retrained model, we store the model object and begin using that model version.
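A hedged sketch of what such a retrain-and-promote step might look like; fetch_labeled_alerts, the holdout set, the accuracy threshold, and the use of joblib for persistence are all assumptions for illustration:

import joblib
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier

def retrain(fetch_labeled_alerts, X_holdout, y_holdout, threshold=0.95):
  # fetch_labeled_alerts is an assumed helper returning newly labeled data.
  X_new, y_new = fetch_labeled_alerts()
  candidate = RandomForestClassifier(n_estimators=100).fit(X_new, y_new)
  # Promote the retrained model only if holdout accuracy is acceptable.
  if accuracy_score(y_holdout, candidate.predict(X_holdout)) >= threshold:
    joblib.dump(candidate, "alert_model_candidate.joblib")
    return candidate
  return None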

We also provide a feedback mechanism for analysts to record when the model is wrong. An analyst can look at the label provided by the model and the explanation, but can also make their own decision. Whether they agree with the model or not, they can input their own label through the interface. We store the analyst’s label, along with any optional explanation they give for their decision.

Finally, it should be noted that these manual labels may require further evaluation. As an example, consider a commodity malware alert, in which network command and control communications were sinkholed. An analyst may evaluate the alert, pull back triage details, including PCAP samples, and see that while the malware executed, the true threat to the environment was mitigated. Since it does not represent an exigent threat, the analyst may mark this alert as ‘benign’. However, the fact that it was sinkholed does not change that the artifacts of execution still represent malicious activity. Under different circumstances, this infection could have had a negative impact on the organization. However, if the benign label is used when retraining the model, that will teach the model that something inherently malicious is in fact benign, and potentially lead to false negatives in the future.

Monitoring efficacy over time, updating and retraining the model with new alerts, and evaluating manual analyst feedback gives us visibility into how the model is performing and learning over time. Ultimately this helps to build confidence in the model, so we can automate more tasks and free up analyst time to perform tasks such as hunting and investigation.

Conclusion

A supervised learning model is not a replacement for an experienced analyst. However, incorporating predictive analytics and machine learning into the SOC workflow can help augment the productivity of analysts, free up time, and ensure they utilize investigative skills and creativity on the threats that truly require expertise.

This blog post outlines the major components and considerations of building an alert classification model for the SOC. Data collection, labeling, feature generation, model training, and efficacy analysis must all be carefully considered when building such a model. FireEye continues to iterate on this research to improve our detection and response capabilities, continually improve the detection efficacy of our products, and ultimately protect our clients.

The process and examples discussed in this post are not mere research. Within our FireEye Managed Defense SOC, we use alert classification models built using the aforementioned processes to increase our efficiency and ensure we apply our analysts’ expertise where it is needed most. In a world of ever-increasing threats and alerts, increasing SOC efficiency may mean the difference between missing and catching a critical intrusion.

Acknowledgements

A big thank you to Seth Summersett and Clara Brooks.

***

The FireEye ICE Data Science Team is a small, highly trained team of data scientists and engineers, focused on delivering impactful capabilities to our analysts, products, and customers. ICE-DS is always looking for exceptional candidates interested in researching and solving difficult problems in cybersecurity. If you’re interested, check out FireEye careers.

How algorithms learn and adapt

There are numerous techniques for creating algorithms that are capable of learning and adapting over time. Broadly speaking, we can organize these algorithms into one of three categories – supervised, unsupervised, and reinforcement learning.

Supervised learning refers to situations in which each instance of input data is accompanied by a desired or target value for that input. When the target values are a set of finite discrete categories, the learning task is often known as a classification problem. When the targets are one or more continuous variables, the task is called regression.
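For a concrete, if toy, illustration of the distinction (hypothetical data, using scikit-learn):

from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X = [[0.1], [0.3], [0.6], [0.9]]                          # inputs
y_class = ["benign", "benign", "malicious", "malicious"]  # discrete targets
y_value = [0.05, 0.30, 0.60, 0.95]                        # continuous targets

DecisionTreeClassifier().fit(X, y_class)  # classification
DecisionTreeRegressor().fit(X, y_value)   # regression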

AI vs. machine learning

“The original question ‘Can machines think?’ I believe to be too meaningless to deserve discussion. Nevertheless, I believe that at the end of the century, the use of words and general educated opinion will have altered so much that one will be able to speak of machines thinking without expecting to be contradicted.” – Alan Turing

NLP Analysis Of Tweets Using Word2Vec And T-SNE

In the context of some of the Twitter research I’ve been doing, I decided to try out a few natural language processing (NLP) techniques. So far, word2vec has produced perhaps the most meaningful results. Wikipedia describes word2vec very precisely:

“Word2vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space. Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located in close proximity to one another in the space.”

During the two weeks leading up to the January 2018 Finnish presidential elections, I performed an analysis of user interactions and behavior on Twitter, based on search terms relevant to that event. During the course of that analysis, I also dumped each Tweet’s raw text field to a text file, one item per line. I then wrote a small tool designed to preprocess the collected Tweets, feed the processed data into word2vec, and finally output some visualizations. Since word2vec produces high-dimensional vectors, I’m using T-SNE for dimensionality reduction (the resulting visualizations are in two dimensions, compared to the 200 dimensions of the original data.)

The rest of this blog post will be devoted to listing and explaining the code used to perform these tasks. I’ll present the code as it appears in the tool. The code starts with a set of functions that perform processing and visualization tasks. The main routine at the end wraps everything up by calling each routine sequentially, passing artifacts from the previous step to the next one. As such, you can copy-paste each section of code into an editor, save the resulting file, and the tool should run (assuming you’ve pip installed all dependencies.) Note that I’m using two spaces per indent purely to allow the code to format neatly in this blog. Let’s start, as always, with importing dependencies. Off the top of my head, you’ll probably want to install tensorflow, gensim, six, numpy, matplotlib, and sklearn (although I think some of these install as part of tensorflow’s installation).

# -*- coding: utf-8 -*-
from tensorflow.contrib.tensorboard.plugins import projector
from sklearn.manifold import TSNE
from collections import Counter
from six.moves import cPickle
import gensim.models.word2vec as w2v
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
import multiprocessing
import os
import sys
import io
import re
import json

The next listing contains a few helper functions. In each processing step, I like to save the output. I do this for two reasons. Firstly, depending on the size of your raw data, each step can take some time. Hence, if you’ve performed the step once and saved the output, it can be loaded from disk to save time on subsequent passes. The second reason for saving each step is so that you can examine the output to check that it looks like what you want. The try_load_or_process() function attempts to load the previously saved output from a function. If it doesn’t exist, it runs the function and then saves the output. Note also the rather odd-looking implementation in save_json(). This is a workaround for the fact that json.dump() errors out on certain non-ASCII characters when paired with io.open().

def try_load_or_process(filename, processor_fn, function_arg):
  load_fn = None
  save_fn = None
  if filename.endswith("json"):
    load_fn = load_json
    save_fn = save_json
  else:
    load_fn = load_bin
    save_fn = save_bin
  if os.path.exists(filename):
    return load_fn(filename)
  else:
    ret = processor_fn(function_arg)
    save_fn(ret, filename)
    return ret

def print_progress(current, maximum):
  sys.stdout.write("\r")
  sys.stdout.flush()
  sys.stdout.write(str(current) + "/" + str(maximum))
  sys.stdout.flush()

def save_bin(item, filename):
  with open(filename, "wb") as f:
    cPickle.dump(item, f)

def load_bin(filename):
  if os.path.exists(filename):
    with open(filename, "rb") as f:
      return cPickle.load(f)

def save_json(variable, filename):
  with io.open(filename, "w", encoding="utf-8") as f:
    f.write(unicode(json.dumps(variable, indent=4, ensure_ascii=False)))

def load_json(filename):
  ret = None
  if os.path.exists(filename):
    try:
      with io.open(filename, "r", encoding="utf-8") as f:
        ret = json.load(f)
    except (IOError, ValueError):
      # Ignore unreadable or malformed JSON and fall through to return None
      pass
  return ret

Moving on, let’s look at the first preprocessing step. This function takes the raw text strings dumped from Tweets, removes unwanted characters and features (such as user names and URLs), removes duplicates, and returns a list of sanitized strings. Here, I’m not using string.printable for a list of characters to keep, since Finnish includes additional letters that aren’t part of the English alphabet (äöåÄÖÅ). The regular expressions used in this step have been somewhat tailored to the raw input data, so you may need to tweak them for your own input corpus.

def process_raw_data(input_file):
  valid = u"0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ#@.:/ äöåÄÖÅ"
  url_match = "(https?:\/\/[0-9a-zA-Z\-\_]+\.[\-\_0-9a-zA-Z]+\.?[0-9a-zA-Z\-\_]*\/?.*)"
  name_match = "\@[\_0-9a-zA-Z]+\:?"
  lines = []
  print("Loading raw data from: " + input_file)
  if os.path.exists(input_file):
    with io.open(input_file, 'r', encoding="utf-8") as f:
      lines = f.readlines()
  num_lines = len(lines)
  ret = []
  for count, text in enumerate(lines):
    if count % 50 == 0:
      print_progress(count, num_lines)
    text = re.sub(url_match, u"", text)
    text = re.sub(name_match, u"", text)
    text = re.sub("\&amp\;?", u"", text)
    text = re.sub("[\:\.]{1,}$", u"", text)
    text = re.sub("^RT\:?", u"", text)
    text = u''.join(x for x in text if x in valid)
    text = text.strip()
    if len(text.split()) > 5:
      if text not in ret:
        ret.append(text)
  return ret

The next step is to tokenize each sentence (or Tweet) into words.

def tokenize_sentences(sentences):
  ret = []
  max_s = len(sentences)
  print("Got " + str(max_s) + " sentences.")
  for count, s in enumerate(sentences):
    tokens = []
    words = re.split(r'(\s+)', s)
    if len(words) > 0:
      for w in words:
        if w is not None:
          w = w.strip().lower()
          # After strip(), whitespace-only tokens become empty strings,
          # so a single length check suffices to discard them.
          if len(w) < 1:
            w = None
          if w is not None:
            tokens.append(w)
    if len(tokens) > 0:
      ret.append(tokens)
    if count % 50 == 0:
      print_progress(count, max_s)
  return ret

The final text preprocessing step removes unwanted tokens. This includes numeric data and stop words. Stop words are the most common words in a language. We omit them from processing in order to bring out the meaning of the text in our analysis. I downloaded a json dump of stop words for all languages from here, and placed it in the same directory as this script. If you plan on trying this code out yourself, you’ll need to perform the same steps. Note that I included extra stopwords of my own. After looking at the output of this step, I noticed that Twitter’s truncation of some tweets caused certain word fragments to occur frequently.

def clean_sentences(tokens):
  all_stopwords = load_json("stopwords-iso.json")
  extra_stopwords = ["ssä", "lle", "h.", "oo", "on", "muk", "kov", "km", "ia", "täm", "sy", "but", ":sta", "hi", "py", "xd", "rr", "x:", "smg", "kum", "uut", "kho", "k", "04n", "vtt", "htt", "väy", "kin", "#8", "van", "tii", "lt3", "g", "ko", "ett", "mys", "tnn", "hyv", "tm", "mit", "tss", "siit", "pit", "viel", "sit", "n", "saa", "tll", "eik", "nin", "nii", "t", "tmn", "lsn", "j", "miss", "pivn", "yhn", "mik", "tn", "tt", "sek", "lis", "mist", "tehd", "sai", "l", "thn", "mm", "k", "ku", "s", "hn", "nit", "s", "no", "m", "ky", "tst", "mut", "nm", "y", "lpi", "siin", "a", "in", "ehk", "h", "e", "piv", "oy", "p", "yh", "sill", "min", "o", "va", "el", "tyn", "na", "the", "tit", "to", "iti", "tehdn", "tlt", "ois", ":", "v", "?", "!", "&"]
  stopwords = None
  if all_stopwords is not None:
    stopwords = all_stopwords["fi"]
    stopwords += extra_stopwords
  ret = []
  max_s = len(tokens)
  for count, sentence in enumerate(tokens):
    if count % 50 == 0:
      print_progress(count, max_s)
    cleaned = []
    for token in sentence:
      if len(token) > 0:
        if stopwords is not None and token in stopwords:
          token = None
        if token is not None:
          if re.search("^[0-9\.\-\s\/]+$", token):
            token = None
        if token is not None:
          cleaned.append(token)
    if len(cleaned) > 0:
      ret.append(cleaned)
  return ret

The next function creates a vocabulary from the processed text. A vocabulary, in this context, is basically a list of all unique tokens in the data. This function creates a frequency distribution of all tokens (words) by counting the number of occurrences of each token. We will use this later to “trim” the vocabulary down to a manageable size.

def get_word_frequencies(corpus):
  frequencies = Counter()
  for sentence in corpus:
    for word in sentence:
      frequencies[word] += 1
  freq = frequencies.most_common()
  return freq

Now that we’re done with all the preprocessing steps, let’s get into the more interesting analysis functions. The following function accepts the tokenized and cleaned data generated from the steps above, and uses it to train a word2vec model. The num_features parameter sets the number of features each word is assigned (and hence the dimensionality of the resulting vectors). It is recommended to set it between 100 and 1000. Naturally, larger values take more processing power and memory/disk space to handle. I found 200 to be enough, but I normally start with a value of 300 when looking at new datasets. The min_count variable passed to word2vec designates how to trim the vocabulary. For example, if min_count is set to 3, all words that appear in the dataset fewer than 3 times will be discarded from the vocabulary used when training the word2vec model. In the dimensionality reduction step we perform later, large vocabulary sizes cause T-SNE iterations to take a long time, so I tuned min_count to generate a vocabulary of around 10,000 words. Increasing the value of sample will cause word2vec to randomly omit words with high frequency counts. I decided that I wanted to keep all of those words in my analysis, so it’s set to zero. Increasing epoch_count will cause word2vec to train for more iterations, which will, naturally, take longer. Increase this if you have a fast machine or plenty of time on your hands 🙂

def get_word2vec(sentences):
  num_workers = multiprocessing.cpu_count()
  num_features = 200
  epoch_count = 10
  sentence_count = len(sentences)
  w2v_file = os.path.join(save_dir, "word_vectors.w2v")
  word2vec = None
  if os.path.exists(w2v_file):
    print("w2v model loaded from " + w2v_file)
    word2vec = w2v.Word2Vec.load(w2v_file)
  else:
    word2vec = w2v.Word2Vec(sg=1,
                            seed=1,
                            workers=num_workers,
                            size=num_features,
                            min_count=min_frequency_val,
                            window=5,
                            sample=0)

    print("Building vocab...")
    word2vec.build_vocab(sentences)
    print("Word2Vec vocabulary length:", len(word2vec.wv.vocab))
    print("Training...")
    word2vec.train(sentences, total_examples=sentence_count, epochs=epoch_count)
    print("Saving model...")
    word2vec.save(w2v_file)
  return word2vec

Tensorboard has some good tools to visualize word embeddings in the word2vec model we just created. These visualizations can be accessed using the “projector” tab in the interface. Here’s code to create tensorboard embeddings:

def create_embeddings(word2vec):
  all_word_vectors_matrix = word2vec.wv.syn0
  num_words = len(all_word_vectors_matrix)
  vocab = word2vec.wv.vocab.keys()
  vocab_len = len(vocab)
  dim = word2vec.wv[vocab[0]].shape[0]
  embedding = np.empty((num_words, dim), dtype=np.float32)
  metadata = ""
  for i, word in enumerate(vocab):
    embedding[i] = word2vec.wv[word]
    metadata += word + "\n"
  metadata_file = os.path.join(save_dir, "metadata.tsv")
  with io.open(metadata_file, "w", encoding="utf-8") as f:
    f.write(metadata)

  tf.reset_default_graph()
  sess = tf.InteractiveSession()
  X = tf.Variable([0.0], name='embedding')
  place = tf.placeholder(tf.float32, shape=embedding.shape)
  set_x = tf.assign(X, place, validate_shape=False)
  sess.run(tf.global_variables_initializer())
  sess.run(set_x, feed_dict={place: embedding})

  summary_writer = tf.summary.FileWriter(save_dir, sess.graph)
  config = projector.ProjectorConfig()
  embedding_conf = config.embeddings.add()
  embedding_conf.tensor_name = 'embedding:0'
  embedding_conf.metadata_path = 'metadata.tsv'
  projector.visualize_embeddings(summary_writer, config)

  save_file = os.path.join(save_dir, "model.ckpt")
  print("Saving session...")
  saver = tf.train.Saver()
  saver.save(sess, save_file)

Once this code has been run, tensorflow log entries will be created in save_dir. To start a tensorboard session, run the following command from the same directory where this script was run from:

tensorboard --logdir=analysis

You should see output like the following once you’ve run the above command:

TensorBoard 0.4.0rc3 at http://node.local:6006 (Press CTRL+C to quit)

Navigate your web browser to localhost:<port_number> to see the interface. From the “Inactive” pulldown menu, select “Projector”.

The “projector” menu is often hiding under the “inactive” pulldown.

Once you’ve selected “projector”, you should see a view like this:

Tensorboard’s projector view allows you to interact with word embeddings, search for words, and even run t-sne on the dataset.

There are a lot of things to play around with in this view. You can search for words, fly around the embeddings, and even run t-sne (on the bottom left) on the dataset. If you get to this step, have fun playing with the interface!

And now, back to the code. One of word2vec’s most interesting functions is finding similarities between words. This is done via the word2vec.wv.most_similar() call. The following function calls word2vec.wv.most_similar() for a word and returns the num_similar most similar words. The returned value is a list containing the queried word and a list of similar words ( [queried_word, [similar_word1, similar_word2, …]] ).

def most_similar(input_word, num_similar):
  sim = word2vec.wv.most_similar(input_word, topn=num_similar)
  output = []
  found = []
  for item in sim:
    w, n = item
    found.append(w)
  output = [input_word, found]
  return output

The following function takes a list of words to be queried, passes them to the above function, saves the output, and also passes the queried words to t_sne_scatterplot(), which we’ll show later. It also writes a csv file – associations.csv – which can be imported into Gephi to generate graphing visualizations. You can see some Gephi-generated visualizations in the accompanying blog post.

I find that manually viewing the word2vec_test.json file generated by this function is a good way to read the list of similarities found for each word queried with wv.most_similar().

def test_word2vec(test_words):
  vocab = word2vec.wv.vocab.keys()
  vocab_len = len(vocab)
  output = []
  associations = {}
  test_items = test_words
  for count, word in enumerate(test_items):
    if word in vocab:
      print("[" + str(count+1) + "] Testing: " + word)
      if word not in associations:
        associations[word] = []
      similar = most_similar(word, num_similar)
      t_sne_scatterplot(word)
      output.append(similar)
      for s in similar[1]:
        if s not in associations[word]:
          associations[word].append(s)
    else:
      print("Word " + word + " not in vocab")
  filename = os.path.join(save_dir, "word2vec_test.json")
  save_json(output, filename)
  filename = os.path.join(save_dir, "associations.json")
  save_json(associations, filename)
  filename = os.path.join(save_dir, "associations.csv")
  handle = io.open(filename, "w", encoding="utf-8")
  handle.write(u"Source,Target\n")
  for w, sim in associations.iteritems():
    for s in sim:
      handle.write(w + u"," + s + u"\n")
  return output

The next function implements standalone code for creating a scatterplot from the output of T-SNE on a set of data points obtained from a word2vec.wv.most_similar() query. The scatterplot is visualized with matplotlib. Unfortunately, my matplotlib skills leave a lot to be desired, and these graphs don’t look great. But they’re readable.

def t_sne_scatterplot(word):
  vocab = word2vec.wv.vocab.keys()
  vocab_len = len(vocab)
  dim0 = word2vec.wv[vocab[0]].shape[0]
  arr = np.empty((0, dim0), dtype='f')
  w_labels = [word]
  nearby = word2vec.wv.similar_by_word(word, topn=num_similar)
  arr = np.append(arr, np.array([word2vec[word]]), axis=0)
  for n in nearby:
    w_vec = word2vec[n[0]]
    w_labels.append(n[0])
    arr = np.append(arr, np.array([w_vec]), axis=0)

  tsne = TSNE(n_components=2, random_state=1)
  np.set_printoptions(suppress=True)
  Y = tsne.fit_transform(arr)
  x_coords = Y[:, 0]
  y_coords = Y[:, 1]

  plt.rc("font", size=16)
  plt.figure(figsize=(16, 12), dpi=80)
  plt.scatter(x_coords[0], y_coords[0], s=800, marker="o", color="blue")
  plt.scatter(x_coords[1:], y_coords[1:], s=200, marker="o", color="red")

  for label, x, y in zip(w_labels, x_coords, y_coords):
    plt.annotate(label.upper(), xy=(x, y), xytext=(0, 0), textcoords='offset points')
  plt.xlim(x_coords.min()-50, x_coords.max()+50)
  plt.ylim(y_coords.min()-50, y_coords.max()+50)
  filename = os.path.join(plot_dir, word + "_tsne.png")
  plt.savefig(filename)
  plt.close()

In order to create a scatterplot of the entire vocabulary, we need to perform T-SNE over that whole dataset. This can be a rather time-consuming operation. The next function performs that operation, attempting to save and re-load intermediate steps (since some of them can take over 30 minutes to complete).

def calculate_t_sne():
  vocab = word2vec.wv.vocab.keys()
  vocab_len = len(vocab)
  arr = np.empty((0, dim0), dtype='f')
  labels = []
  vectors_file = os.path.join(save_dir, "vocab_vectors.npy")
  labels_file = os.path.join(save_dir, "labels.json")
  if os.path.exists(vectors_file) and os.path.exists(labels_file):
    print("Loading pre-saved vectors from disk")
    arr = load_bin(vectors_file)
    labels = load_json(labels_file)
  else:
    print("Creating an array of vectors for each word in the vocab")
    for count, word in enumerate(vocab):
      if count % 50 == 0:
        print_progress(count, vocab_len)
      w_vec = word2vec[word]
      labels.append(word)
      arr = np.append(arr, np.array([w_vec]), axis=0)
    save_bin(arr, vectors_file)
    save_json(labels, labels_file)

  x_coords = None
  y_coords = None
  x_c_filename = os.path.join(save_dir, "x_coords.npy")
  y_c_filename = os.path.join(save_dir, "y_coords.npy")
  if os.path.exists(x_c_filename) and os.path.exists(y_c_filename):
    print("Reading pre-calculated coords from disk")
    x_coords = load_bin(x_c_filename)
    y_coords = load_bin(y_c_filename)
  else:
    print("Computing T-SNE for array of length: " + str(len(arr)))
    tsne = TSNE(n_components=2, random_state=1, verbose=1)
    np.set_printoptions(suppress=True)
    Y = tsne.fit_transform(arr)
    x_coords = Y[:, 0]
    y_coords = Y[:, 1]
    print("Saving coords.")
    save_bin(x_coords, x_c_filename)
    save_bin(y_coords, y_c_filename)
  return x_coords, y_coords, labels, arr

The next function takes the data calculated in the above step, and data obtained from test_word2vec(), and plots the results from each word queried on the scatterplot of the entire vocabulary. These plots are useful for visualizing which words are closer to others, and where clusters commonly pop up. This is the last function before we get onto the main routine.

def show_cluster_locations(results, labels, x_coords, y_coords):
  for item in results:
    name = item[0]
    print("Plotting graph for " + name)
    similar = item[1]
    in_set_x = []
    in_set_y = []
    out_set_x = []
    out_set_y = []
    name_x = 0
    name_y = 0
    for count, word in enumerate(labels):
      xc = x_coords[count]
      yc = y_coords[count]
      if word == name:
        name_x = xc
        name_y = yc
      elif word in similar:
        in_set_x.append(xc)
        in_set_y.append(yc)
      else:
        out_set_x.append(xc)
        out_set_y.append(yc)
    plt.figure(figsize=(16, 12), dpi=80)
    plt.scatter(name_x, name_y, s=400, marker="o", c="blue")
    plt.scatter(in_set_x, in_set_y, s=80, marker="o", c="red")
    plt.scatter(out_set_x, out_set_y, s=8, marker=".", c="black")
    filename = os.path.join(big_plot_dir, name + "_tsne.png")
    plt.savefig(filename)
    plt.close()

Now let’s write our main routine, which will call all the above functions, process our collected Twitter data, and generate visualizations. The first few lines take care of our three preprocessing steps, and generation of a frequency distribution / vocabulary. The script expects the raw Twitter data to reside in a relative path (data/tweets.txt). Change those variables as needed. Also, all output is saved to a subdirectory in the relative path (analysis/). Again, tailor this to your needs.

if __name__ == '__main__':
  input_dir = "data"
  save_dir = "analysis"
  if not os.path.exists(save_dir):
    os.makedirs(save_dir)

  print("Preprocessing raw data")
  raw_input_file = os.path.join(input_dir, "tweets.txt")
  filename = os.path.join(save_dir, "data.json")
  processed = try_load_or_process(filename, process_raw_data, raw_input_file)
  print("Unique sentences: " + str(len(processed)))

  print("Tokenizing sentences")
  filename = os.path.join(save_dir, "tokens.json")
  tokens = try_load_or_process(filename, tokenize_sentences, processed)

  print("Cleaning tokens")
  filename = os.path.join(save_dir, "cleaned.json")
  cleaned = try_load_or_process(filename, clean_sentences, tokens)

  print("Getting word frequencies")
  filename = os.path.join(save_dir, "frequencies.json")
  frequencies = try_load_or_process(filename, get_word_frequencies, cleaned)
  vocab_size = len(frequencies)
  print("Unique words: " + str(vocab_size))

Next, I trim the vocabulary and save the resulting list of words. This allows me to look over the trimmed list and ensure that the words I’m interested in survived the trimming operation. Due to the nature of the Finnish language (and Twitter), the vocabulary of our “cleaned” set, prior to trimming, was over 100,000 unique words. After trimming, it ended up at around 11,000 words.

  trimmed_vocab = []
  min_frequency_val = 6
  for item in frequencies:
    if item[1] >= min_frequency_val:
      trimmed_vocab.append(item[0])
  trimmed_vocab_size = len(trimmed_vocab)
  print("Trimmed vocab length: " + str(trimmed_vocab_size))
  filename = os.path.join(save_dir, "trimmed_vocab.json")
  save_json(trimmed_vocab, filename)

The next few lines do all the compute-intensive work. We’ll create a word2vec model with the cleaned token set, create tensorboard embeddings (for the visualizations mentioned above), and calculate T-SNE. Yes, this part can take a while to run, so go put the kettle on.

  print
  print("Instantiating word2vec model")
  word2vec = get_word2vec(cleaned)
  vocab = word2vec.wv.vocab.keys()
  vocab_len = len(vocab)
  print("word2vec vocab contains " + str(vocab_len) + " items.")
  dim0 = word2vec.wv[vocab[0]].shape[0]
  print("word2vec items have " + str(dim0) + " features.")

  print("Creating tensorboard embeddings")
  create_embeddings(word2vec)

  print("Calculating T-SNE for word2vec model")
  x_coords, y_coords, labels, arr = calculate_t_sne()

Finally, we’ll take the top 50 most frequent words from our frequency distribution, query each for its 40 most similar words, and plot both labelled graphs of each set and a “big plot” of that set on the entire vocabulary.

  plot_dir = os.path.join(save_dir, "plots")
  if not os.path.exists(plot_dir):
    os.makedirs(plot_dir)

  num_similar = 40
  test_words = []
  for item in frequencies[:50]:
    test_words.append(item[0])
  results = test_word2vec(test_words)

  big_plot_dir = os.path.join(save_dir, "big_plots")
  if not os.path.exists(big_plot_dir):
    os.makedirs(big_plot_dir)
  show_cluster_locations(results, labels, x_coords, y_coords)

And that’s it! Rather a lot of code, but it does quite a few useful tasks. If you’re interested in seeing the visualizations I created using this tool against the Tweets collected from the January 2018 Finnish presidential elections, check out this blog post.