This blog post is the next episode in the FireEye Labs Advanced Reverse Engineering (FLARE) team Script Series. Today, we are sharing something quite unusual. It is not a tool or a virtual machine distribution, nor is it a plugin or script for a popular reverse engineering tool or framework. Rather, it is a profile created for a consumer software application completely unrelated to reverse engineering or malware analysis… until now. The software is named VoiceAttack, and its purpose is to make it easy for users to control other software on their computer using voice commands. With FLARE’s new profile for VoiceAttack, users can completely control IDA Pro with their voice! Have you ever dreamed of telling IDA Pro to decompile a function or show you the strings of a binary? Well dream no more! Not only does our profile give you total control of the software, it also provides shortcuts and other cool features not previously available. It’s our hope that providing voice control for the world’s most popular disassembler will further empower users with repetitive stress injuries or disabilities to more effectively put their reverse engineering skills to use with this new accessibility option as well as helping the community at large work more efficiently.
Check out our video demonstration of some of the features of the profile to see it in action.
How Does It Work?
Voice attack is an inexpensive software application that utilizes the Windows Speech Recognition (WSR) feature to enable the creation of user-defined, voice-activated macros. The user specifies a key word or phrase, then defines one or more actions to be taken when that word or phrase is recognized. The most common types of actions to be taken include key presses, mouse movement and clicks, and clipboard manipulation. However, there are many other more advanced features available that provide a lot of flexibility to users including variables, loops, and conditionals. You can even have the computer speak to you in response to your commands! VoiceAttack requires an internet connection, but only during the registration process, after which the network adapter can be disabled or configured to a network that cannot reach the internet without issue.
To use VoiceAttack, you must first train Windows Speech Recognition to recognize your voice. Instructions on how to do so can be found here. This process only takes a few minutes at minimum, but the more time you spend training, the better the experience you will have with it.
What Does the IDA Pro Profile Provide?
FLARE’s IDA Pro profile for VoiceAttack maps every advertised keyboard shortcut in IDA Pro to a voice command. Although this is only one part of what the profile provides, many users will find this in itself very useful. When developing this profile, I was shocked to discover just how many keyboard shortcuts there really are for IDA Pro and what can be accomplished with them. Some of my favorite shortcuts are found under the View->Open Subviews and Windows menus. With this profile, I can simply say “show strings” or “show structures” or “show window x” to change the tab I am currently viewing or open a new view in a tab without having to move my mouse cursor anywhere. The next few paragraphs describe some other useful commands to make any reverse engineer’s job easier. For a more detailed description of the profile and commands available, see the Github page.
A series of voice commands can perform multi-step actions not otherwise reachable by individual keyboard shortcuts. For example, wouldn’t it be nice to have commands to toggle the visibility of opcode bytes (see Figure 1)? Currently, you have to open the Options menu, select the General menu item, input a value in the Number of opcode bytes text field, and click the OK button. Well, now you can simply say “show opcodes” or “hide opcodes” and it will be so!
Figure 1: Configuring the number of opcode bytes to show in IDA Pro's disassembly view
Defining a Unicode string in IDA Pro is a multi-step exercise, whether you navigate to the Edit->Strings menu or use the “string literals” keyboard shortcut Alt+A followed by pressing the U key as shown in Figure 2. Now you can simply say “make Unicode string” and the work is done for you.
Figure 2: String literals dialog in IDA Pro
Reversing a C++ application? The Create struct from selection action is a very helpful feature in this case, but it requires you to navigate to the Edit->Structs menu in order to use it. The voice command “create struct from selection” does this for you automatically. The “look it up” command will copy the currently highlighted token in the disassembly and search Google for it using your default browser. There are several other macros in the profile that are like this and save you a lot of time navigating menus and dialogs to perform simple actions.
Cursor Movement, Dialogs, and Navigation
The cursor movement commands allow the user to move the cursor up, down, left, or right, one or more times, in specified increments. These commands also allow for scrolling with a voice command that commences scrolling in a chosen direction, and another voice command for stopping scrolling. There are even voice commands to set the speed of the scroll to slow, medium, or fast. In the disassembly view, the cursor can also be moved per “word” on the current line of the disassembly or decompilation, or even per basic block or function.
Like many other applications, dialogs are a part of IDA Pro’s user interface. The ability to easily navigate and interact with items in a dialog with your voice is essential to a smooth user experience. Voice commands in the profile enable the user to easily click the OK or Cancel buttons, toggle checkboxes, and tab through controls in the dialog in both directions and in specified increments.
With the aid of a companion IDAPython plugin, additional navigation commands are supported. Commands that allow the user to move the cursor to the beginning or end of the current function, to the next or previous “call” instruction, to the previous or next instruction containing the highlighted token, or to a specified number of bytes forward or backwards from the current cursor position help to make voice-controlled navigation easier.
These cursor movement and navigation commands enable users to have full control of IDA Pro without the use of their hands. While this is true and an important goal for the profile, it is not practical for people who have full use of their hands to go completely hands-free. The commands that navigate the cursor in IDA Pro will never be as fast or easy as simply using the mouse to point and click somewhere on the screen. In any case, users will find themselves building up a collection of voice commands they prefer to use that will depend on personal tastes. However, enabling full voice control allows reverse engineers who do not have full use of their hands to still effectively operate IDA Pro, which we hope will be of great use to the community. Having such a capability is also useful for those who suffer from repetitive strain injury.
The commands described so far give you control over IDA Pro with your voice, but there is still the matter of providing textual input for items such as function and variable names, comments, and other text input fields. VoiceAttack does provide the ability in macros to enable and disable what is called “Dictation Mode”. When in Dictation Mode, any recognized words are added to a buffer of text until Dictation Mode is disabled. Then this text can be used elsewhere in the macro. Unfortunately, this feature is not designed to recognize the kinds of technical terms one would be using in the context of reverse engineering programs. Even if it were, there is still the issue of having to format the text to be a valid function or variable name. Instead of wrestling with this feature to try to make it work for this purpose, a very large and growing collection of “input recognition” commands was created. These commands are designed to recognize common words used in the names of functions and variables, as well as full function names as found in the C runtime libraries and the Windows APIs. Once recognized, the word or function name is copied to the user’s clipboard and pasted into the text field. To avoid the inadvertent triggering of such commands during the regular operation of IDA Pro, these commands are only active when the “input mode” is enabled. This mode is enabled automatically when certain commands are activated such as “rename” or “find”, and automatically disabled when dialog commands such as “OK” or “cancel” are activated. The input mode can also be manually manipulated with the “input mode on” and “input mode off” commands.
Today, the FLARE team is releasing a profile for VoiceAttack and a companion IDAPython plugin that enables full voice control of IDA Pro along with many added convenience features. The profile contains over 1000 defined commands and growing. It is easy to view, edit, and add commands to this profile to customize it to suit your needs or to improve it for the community at large. The VoiceAttack software is highly affordable and enables you to create profiles for any applications or games that you use. For installation instructions and usage information, see the project’s Github page. Give it a try today!
We are pleased to announce the conclusion of the sixth annual Flare-On Challenge. The popularity of this event continues to grow and this year we saw a record number of players as well as finishers. We will break down the numbers later in the post, but right now let’s look at the fun stuff: the prize! Each of the 308 dedicated and amazing players that finished our marathon of reverse engineering this year will receive a medal. These hard-earned awards will be shipping soon. Incidentally, the number of finishers smashed our estimates, so we have had to order more prizes.
We would like to thank the challenge authors individually for their great puzzles and solutions.
- Memecat Battlestation: Nick Harbour (@nickaharbour)
- Overlong: Eamon Walsh
- FlareBear: Mortiz Raabe (@m_r_tz)
- DnsChess: Eamon Walsh
- 4k demo: Christopher Gardner (@t00manybananas)
- Bmphide: Tyler Dean (@spresec)
- Wopr: Sandor Nemes (@sandornemes)
- Snake: Alex Rich (@AlexRRich)
- Reloaderd: Sebastian Vogl
- MugatuWare: Blaine Stancill (@MalwareMechanic)
- vv_max: Dhanesh Kizhakkinan (@dhanesh_k)
- help: Ryan Warns (@NOPAndRoll)
And now for the stats. As of 10:00am ET, participation was at an all-time high, with 5,790 players registered and 3,228 finishing at least one challenge. This year had the most finishers ever with 308 players completing all twelve challenges.
The U.S. reclaimed the top spot of total finishers with 29. Singapore strengthened its already unbelievable position of per-capita top Flare-On finishing country with one Flare-On finisher per every 224,000 persons living in Singapore. Rounding out the top five are the consistent high-finishing countries of Vietnam, Russia, and China.
All the binaries from this year’s challenge are now posted on the Flare-On website. Here are the solutions written by each challenge author:
Malware analysts routinely use the Strings program during static analysis in order to inspect a binary's printable characters. However, identifying relevant strings by hand is time consuming and prone to human error. Larger binaries produce upwards of thousands of strings that can quickly evoke analyst fatigue, relevant strings occur less often than irrelevant ones, and the definition of "relevant" can vary significantly among analysts. Mistakes can lead to missed clues that would have reduced overall time spent performing malware analysis, or even worse, incomplete or incorrect investigatory conclusions.
Earlier this year, the FireEye Data Science (FDS) and FireEye Labs Reverse Engineering (FLARE) teams published a blog post describing a machine learning model that automatically ranked strings to address these concerns. Today, we publicly release this model as part of StringSifter, a utility that identifies and prioritizes strings according to their relevance for malware analysis.
StringSifter is built to sit downstream from the Strings program; it takes a list of strings as input and returns those same strings ranked according to their relevance for malware analysis as output. It is intended to make an analyst's life easier, allowing them to focus their attention on only the most relevant strings located towards the top of its predicted output. StringSifter is designed to be seamlessly plugged into a user’s existing malware analysis stack. Once its GitHub repository is cloned and installed locally, it can be conveniently invoked from the command line with its default arguments according to:
strings <sample_of_interest> | rank_strings
We are also providing Docker command line tools for additional portability and usability. For a more detailed overview of how to use StringSifter, including how to specify optional arguments for customizable functionality, please view its README file on GitHub.
We have received great initial internal feedback about StringSifter from FireEye’s reverse engineers, SOC analysts, red teamers, and incident responders. Encouragingly, we have also observed users at the opposite ends of the experience spectrum find the tool to be useful – from beginners detonating their first piece of malware as part of a FireEye training course – to expert malware researchers triaging incoming samples on the front lines. By making StringSifter publicly available, we hope to enable a broad set of personas, use cases, and creative downstream applications. We will also welcome external contributions to help improve the tool’s accuracy and utility in future releases.
We are releasing StringSifter to coincide with our presentation at DerbyCon 2019 on Sept. 7, and we will also be doing a technical dive into the model at the Conference on Applied Machine Learning for Information Security this October. With its release, StringSifter will join FLARE VM, FakeNet, and CommandoVM as one of many recent malware analysis tools that FireEye has chosen to make publicly available. If you are interested in developing data-driven tools that make it easier to find evil and help benefit the security community, please consider joining the FDS or FLARE teams by applying to one of our job openings.
Reverse engineers, forensic investigators, and incident responders have an arsenal of tools at their disposal to dissect malicious software binaries. When performing malware analysis, they successively apply these tools in order to gradually gather clues about a binary’s function, design detection methods, and ascertain how to contain its damage. One of the most useful initial steps is to inspect its printable characters via the Strings program. A binary will often contain strings if it performs operations like printing an error message, connecting to a URL, creating a registry key, or copying a file to a specific location – each of which provide crucial hints that can help drive future analysis.
Manually filtering out these relevant strings can be time consuming and error prone, especially considering that:
- Relevant strings occur disproportionately less often than irrelevant strings.
- Larger binaries can output upwards of tens of thousands of individual strings.
- The definition of "relevant” can vary significantly across individual human analysts.
Investigators would never want to miss an important clue that could have reduced their time spent performing the malware analysis, or even worse, led them to draw incomplete or incorrect conclusions. In this blog post, we will demonstrate how the FireEye Data Science (FDS) and FireEye Labs Reverse Engineering (FLARE) teams recently collaborated to streamline this analyst pain point using machine learning.
Each string returned by the Strings program is represented by sequences of 3 characters or more ending with a null terminator, independent of any surrounding context and file formatting. These loose criteria mean that Strings may identify sequences of characters as strings when they are not human-interpretable. For example, if consecutive bytes 0x31, 0x33, 0x33, 0x37, 0x00 appear within a binary, Strings will interpret this as “1337.” However, those ASCII characters may not actually represent that string per se; they could instead represent a memory address, CPU instructions, or even data utilized by the program. Strings leaves it up to the analyst to filter out such irrelevant strings that appear within its output. For instance, only a handful of the strings listed in Figure 1 that originate from an example malicious binary are relevant from a malware analyst’s point of view.
Figure 1: An example Strings output containing 44 strings for a toy sample with a SHA-256 value of eb84360ca4e33b8bb60df47ab5ce962501ef3420bc7aab90655fd507d2ffcedd.
Ranking strings in terms of descending relevance would make an analyst’s life much easier. They would then only need to focus their attention on the most relevant strings located towards the top of the list, and simply disregard everything below. However, solving the task of automatically ranking strings is not trivial. The space of relevant strings is unstructured and vast, and devising finely tuned rules to robustly account for all the possible variations among them would be a tall order.
Learning to Rank Strings Output
This task can instead be formulated in a machine learning (ML) framework called learning to rank (LTR), which has been historically applied to problems like information retrieval, machine translation, web search, and collaborative filtering. One way to tackle LTR problems is by using Gradient Boosted Decision Trees (GBDTs). GBDTs successively learn individual decision trees that reduce the loss using a gradient descent procedure, and ultimately use a weighted sum of every trees’ prediction as an ensemble. GBDTs with an LTR objective function can learn class probabilities to compute each string’s expected relevance, which can then be used to rank a given Strings output. We provide a high-level overview of how this works in Figure 2.
In the initial train() step of Figure 2, over 25 thousand binaries are run through the Strings program to generate training data consisting of over 18 million total strings. Each training sample then corresponds to the concatenated list of ASCII and Unicode strings output by the Strings program on that input file. To train the model, these raw strings are transformed into numerical vectors containing natural language processing features like Shannon entropy and character co-occurrence frequencies, together with domain-specific signals like the presence of indicators of compromise (e.g. file paths, IP addresses, URLs, etc.), format strings, imports, and other relevant landmarks.
Figure 2: The ML-based LTR framework ranks strings based on their relevance for malware analysis. This figure illustrates different steps of the machine learning modeling process: the initial train() step is denoted by solid arrows and boxes, and the subsequent predict() and sort() steps are denoted by dotted arrows and boxes.
Each transformed string’s feature vector is associated with a non-negative integer label that represents their relevance for malware analysis. Labels range from 0 to 7, with higher numbers indicating increased relevance. To generate these labels, we leverage the subject matter knowledge of FLARE analysts to apply heuristics and impose high-level constraints on the resulting label distributions. While this weak supervision approach may generate noise and spurious errors compared to an ideal case where every string is manually labeled, it also provides an inexpensive and model-agnostic way to integrate domain expertise directly into our GBDT model.
Next during the predict() step of Figure 2, we use the trained GBDT model to predict ranks for the strings belonging to an input file that was not originally part of the training data, and in this example query we use the Strings output shown in Figure 1. The model predicts ranks for each string in the query as floating-point numbers that represent expected relevance scores, and in the final sort() step of Figure 2, strings are sorted in descending order by these scores. Figure 3 illustrates how this resulting prediction achieves the desired goal of ranking strings according to their relevance for malware analysis.
Figure 3: The resulting ranking on the strings depicted in both Figure 1 and in the truncated query of Figure 2. Contrast the relative ordering of the strings shown here to those otherwise identical lists.
The predicted and sorted string rankings in Figure 3 show network-based indicators on top of the list, followed by registry paths and entries. These reveal the potential C2 server and malicious behavior on the host. The subsequent output consisting of user-related information is more likely to be benign, but still worthy of investigation. Rounding out the list are common strings like Windows API functions and PE artifacts that tend to raise no red flags for the malware analyst.
While it seems like the model qualitatively ranks the above strings as expected, we would like some quantitative way to assess the model’s performance more holistically. What evaluation criteria can we use to convince ourselves that the model generalizes beyond the coverage of our weak supervision sources, and to compare models that are trained with different parameters?
We turn to the recommender systems literature, which uses the Normalized Discounted Cumulative Gain (NDCG) score to evaluate ranking of items (i.e. individual strings) in a collection (i.e. a Strings output). NDCG sounds complicated, but let’s boil it down one letter at a time:
- “G” is for gain, which corresponds to the magnitude of each string’s relevance.
- “C” is for cumulative, which refers to the cumulative gain or summed total of every string’s relevance.
- “D” is for discounted, which divides each string’s predicted relevance by a monotonically increasing function like the logarithm of its ranked position, reflecting the goal of having the most relevant strings ranked towards the top of our predictions.
- “N” is for normalized, which means dividing DCG scores by ideal DCG scores calculated for a ground truth holdout dataset, which we obtain from FLARE-identified relevant strings contained within historical malware reports. Normalization makes it possible to compare scores across samples since the number of strings within different Strings outputs can vary widely.
Figure 4: Kernel Density Estimate of NDCG@100 scores for Strings outputs from the holdout dataset. Scores are calculated for the original ordering after simply running the Strings program on each binary (gray) versus the predicted ordering from the trained GBDT model (red).
In practice, we take the first k strings indexed by their ranks within a single Strings output, where the k parameter is chosen based on how many strings a malware analyst will attend to or deem relevant on average. For our purposes we set k = 100 based on the approximate average number of relevant strings per Strings output. NDCG@k scores are bounded between 0 and 1, with scores closer to 1 indicating better prediction quality in which more relevant strings surface towards the top. This measurement allows us to evaluate the predictions from a given model versus those generated by other models and ranked with different algorithms.
To quantitatively assess model performance, we run the strings from each sample that have ground truth FLARE reports though the predict() step of Figure 2, and compare their predicted ranks with a baseline of the original ranking of strings output by Strings. The divergence in distributions of NDCG@100 scores between these two approaches demonstrates that the trained GBDT model learns a useful structure that generalizes well to the independent holdout set (Figure 4).
In this blog post, we introduced an ML model that learns to rank strings based on their relevance for malware analysis. Our results illustrate that it can rank Strings output based both on qualitative inspection (Figure 3) and quantitative evaluation of NDCG@k (Figure 4). Since Strings is so commonly applied during malware analysis at FireEye and elsewhere, this model could significantly reduce the overall time required to investigate suspected malicious binaries at scale. We plan on continuing to improve its NDCG@k scores by training it with more high fidelity labeled data, incorporating more sophisticated modeling and featurization techniques, and soliciting further analyst feedback from field testing.
It’s well known that malware authors go through great lengths to conceal useful strings from analysts, and a potential blind spot to consider for this model is that the utility of Strings itself can be thwarted by obfuscation. However, open source tools like the FireEye Labs Obfuscated Strings Solver (FLOSS) can be used as an in-line replacement for Strings. FLOSS automatically extracts printable strings just as Strings does, but additionally reveals obfuscated strings that have been encoded, packed, or manually constructed on the stack. The model can be readily trained on FLOSS outputs to rank even obfuscated strings. Furthermore, since it can be applied to arbitrary lists of strings, the model could also be used to rank strings extracted from live memory dumps and sandbox runs.
This work represents a collaboration between the FDS and FLARE teams, which together build predictive models to help find evil and improve outcomes for FireEye’s customers and products. If you are interested in this mission, please consider joining the team by applying to one of our job openings.
The FireEye Labs Advanced Reverse Engineering (FLARE) Team continues to share knowledge and tools with the community. We started this blog series with a script for Automatic Recovery of Constructed Strings in Malware. As always, you can download these scripts at the following location: https://github.com/fireeye/flare-ida. We hope you find all these scripts as useful as we do.
During my summer internship with the FLARE team, my goal was to develop IDAPython plug-ins that speed up the reverse engineering workflow in IDA Pro. While analyzing malware samples with the team, I realized that a lot of time is spent looking up information about functions, arguments, and constants at the Microsoft Developer Network (MSDN) website. Frequently switching to the developer documentation can interrupt the reverse engineering process, so we thought about ways to integrate MSDN information into IDA Pro automatically. In this blog post we will release a script that does just that, and we will show you how to use it.
The MSDN Annotations plug-in integrates information about functions, arguments and return values into IDA Pro’s disassembly listing in the form of IDA comments. This allows the information to be integrated as seamlessly as possible. Additionally, the plug-in is able to automatically rename constants, which further speeds up the analyst workflow. The plug-in relies on an offline XML database file, which is generated from Microsoft’s documentation and IDA type library files.
Table 1 shows what benefit the plug-in provides to an analyst. On the left you can see IDA Pro’s standard disassembly: seven arguments get pushed onto the stack and then the CreateFileA function is called. Normally an analyst would have to look up function, argument and possibly constant descriptions in the documentation to understand what this code snippet is trying to accomplish. To obtain readable constant values, an analyst would be required to research the respective argument, import the corresponding standard enumeration into IDA and then manually rename each value. The right side of Table 1 shows the result of executing our plug-in showing the support it offers to an analyst.
The most obvious change is that constants are renamed automatically. In this example, 40000000h was automatically converted to GENERIC_WRITE. Additionally, each function argument is renamed to a unique name, so the corresponding description can be added to the disassembly.
Table 1: Automatic labelling of standard symbolic constants
In Figure 1 you can see how the plug-in enables you to display function, argument, and constant information right within the disassembly. The top image shows how hovering over the CreateFileA function displays a short description and the return value. In the middle image, hovering over the hTemplateFile argument displays the corresponding description. And in the bottom image, you can see how hovering over dwShareMode, the automatically renamed constant displays descriptive information.
Figure 1: Hovering function names, arguments and constants displays the respective descriptions
How it works
Before the plug-in makes any changes to the disassembly, it creates a backup of the current IDA database file (IDB). This file gets stored in the same directory as the current database and can be used to revert to the previous markup in case you do not like the changes or something goes wrong.
The plug-in is designed to run once on a sample before you start your analysis. It relies on an offline database generated from the MSDN documentation and IDA Pro type library (TIL) files. For every function reference in the import table, the plug-in annotates the function’s description and return value, adds argument descriptions, and renames constants. An example of an annotated import table is depicted in Figure 2. It shows how a descriptive comment is added to each API function call. In order to identify addresses of instructions that position arguments prior to a function call, the plug-in relies on IDA Pro’s markup.
Figure 2: Annotated import table
Figure 3 shows the additional .msdn segment the plug-in creates in order to store argument descriptions. This only impacts the IDA database file and does not modify the original binary.
Figure 3: The additional segment added to the IDA database
The .msdn segment stores the argument descriptions as shown in Figure 4. The unique argument names and their descriptive comments are sequentially added to the segment.
Figure 4: Names and comments inserted for argument descriptions
To allow the user to see constant descriptions by hovering over constants in the disassembly, the plug-in imports IDA Pro’s relevant standard enumeration and adds descriptive comments to the enumeration members. Figure 5 shows this for the MACRO_CREATE enumeration, which stores constants passed as dwCreationDisposition to CreateFileA.
Figure 5: Descriptions added to the constant enumeration members
Preparing the MSDN database file
The plug-in’s graphical interface requires you to have the QT framework and Python scripting installed. This is included with the IDA Pro 6.6 release. You can also set it up for IDA 6.5 as described here (http://www.hexblog.com/?p=333).
As mentioned earlier, the plug-in requires an XML database file storing the MSDN documentation. We cannot distribute the database file with the plug-in because Microsoft holds the copyright for it. However, we provide a script to generate the database file. It can be cloned from the git repository at https://github.com/fireeye/flare-ida together with the annotation plug-in.
You can take the following steps to setup the database file. You only have to do this once.
- Download and install an offline version of the MSDN
documentationYou can download the Microsoft Windows SDK MSDN
documentation. The standalone installer can be downloaded from http://www.microsoft.com/en-us/download/details.aspx?id=18950.
Although it is not the newest SDK version, it includes all the
needed information and data extraction is straight-forward.As shown
in Figure 6, you can select to only install the help files. By
default they are located in C:\Program Files\Microsoft
Figure 6: Installing a local copy of the MSDN documentation
- Extract the files with an archive manager like 7-zip to a directory of your choice.
- Download and extract tilib.exe from Hex-Ray’s
download page at https://www.hex-rays.com/products/ida/support/download.shtml
To allow the plug-in to rename constants, it needs to know which enumerations to import. IDA Pro stores this information in TIL files located in %IDADIR%/til/. Hex-Rays provides a tool (tilib) to show TIL file contents via their download page for registered users. Download the tilib archive and extract the binary into %IDADIR%. If you run tilib without any arguments and it displays its help message, the program is running correctly.
- Run MSDN_crawler/msdn_crawler.py
<path to extracted MSDN documentation> <path to
tilib.exe> <path to til files>
With these prerequisites fulfilled, you can run the MSDN_crawler.py script, located in the MSDN_crawler directory. It expects the path to the TIL files you want to extract (normally %IDADIR%/til/pc/) and the path to the extracted MSDN documentation. After the script finishes execution the final XML database file should be located in the MSDN_data directory.
You can now run our plug-in to annotate your disassembly in IDA.
Running the MSDN annotations plug-in
In IDA, use File - Script file... (ALT + F7) to open the script named annotate_IDB_MSDN.py. This will display the dialog box shown in Figure 7 that allows you to configure the modifications the plug-in performs. By default, the plug-in annotates functions, arguments and rename constants. If you change the settings and execute the plug-in by clicking OK, your settings get stored in a configuration file in the plug-in’s directory. This allows you to quickly run the plug-in on other samples using your preferred settings. If you do not choose to annotate functions and/or arguments, you will not be able to see the respective descriptions by hovering over the element.
Figure 7: The plug-in’s configuration window showing the default settings
When you choose to use repeatable comments for function name annotations, the description is visible in the disassembly listing, as shown in Figure 8.
Figure 8: The plug-in’s preview of function annotations with repeatable comments
Similar Tools and Known Limitations
Parts of our solution were inspired by existing IDA Pro plug-ins, such as IDAScope and IDAAPIHelp. A special thank you goes out to Zynamics for their MSDN crawler and the IDA importer which greatly supported our development.
Our plug-in has mainly been tested on IDA Pro for Windows, though it should work on all platforms. Due to the structure of the MSDN documentation and limitations of the MSDN crawler, not all constants can be parsed automatically. When you encounter missing information you can extend the annotation database by placing files with supplemental information into the MSDN_data directory. In order to be processed correctly, they have to be valid XML following the schema given in the main database file (msdn_data.xml). However, if you want to extend partly existing function information, you only have to add the additional fields. Name tags are mandatory for this, as they get used to identify the respective element.
For example, if the parser did not recognize a commonly used constant, we could add the information manually. For the CreateFileA function’s dwDesiredAccess argument the additional information could look similar to Listing 1.
<?xml version="1.0" encoding="ISO-8859-1"?>
<description>All possible access rights</description>
Listing 1: Additional information enhancing the dwDesiredAccess argument for the CreateFileA function
In this post, we showed how you can generate a MSDN database file used by our plug-in to automatically annotate information about functions, arguments and constants into IDA Pro’s disassembly. Furthermore, we talked about how the plug-in works, and how you can configure and customize it. We hope this speeds up your analysis process!
Stay tuned for the FLARE Team’s next post where we will release solutions for the FLARE On Challenge (www.flare-on.com).
If you deal with the same threats that Mandiant does, you may have noticed a lot of malware lately named "fxsst.dll". If you're wondering why this is happening, this article is for you.
When I spend time working solely on reverse engineering malware, I don't often get the whole story with a malware sample. I wasn't curious about the name fxsst.dll until we had a plump stack of malware samples in our library that all had this name and were completely unrelated to each other. A quick check with our incident responders confirmed that in my latest specimen the malware was operational and the persistence mechanism was not immediately apparent. It gave me a flashback of ntshrui.dll from last year and I decided to investigate the issue.
If you try a quick Google search, you will find that fxsst.dll is a legitimate system DLL, specifically the fax server. The DLL is found on most systems that were installed with default settings, though it is an optional feature during the operating system installation process. It resides in the System32 folder, and a search for the DLL on a live system with Process Explorer shows that it is loaded inside the Explorer.exe process. At this point it sounds very similar to the ntshrui.dll problem, but we'll keep digging.
I found that my malware analysis VM did not contain fxsst.dll so I placed a malware DLL of this name in the System32 folder and rebooted. The malware was loaded by Explorer.exe upon reboot. Unlike ntshrui.dll though, there is no mention of fxsst.dll anywhere in the Registry. A string search through Explorer.exe reveals that it is also not loaded directly by the program through its import table or dynamically with LoadLibrary(). A complete search for fxsst.dll through all files revealed it to be referenced by the file stobject.dll in the System32 folder.
Stobject.dll is the System Tray component of the Windows Shell (the tray at the bottom right of your screen). It is loaded by Explorer.exe as an In-Process Server (InprocServer in Windows lingo) and utilized through a COM interface. It is referenced in the Registry as shown below.
With the mystery of stobject.dll solved, we can focus on how it loads fxsst.dll. We also need to figure out what a piece of malware must do to successfully replace the legitimate DLL and still ensure a stable system. The following is the decompiled fragment of stobject.dll which shows the loading of fxsst.dll.
According to this fragment of code if the fxsst.dll file is found it will be loaded and the two functions, IsFaxMessage() and FaxMonitorShutdown() will be resolved and stored in a pair of global function pointers. If the DLL is not found, or if the functions were not found within the DLL, then the global function pointers will be NULL. Let us now examine how these function pointers are used to see if a NULL pointer will adversely affect the program.
The IsFaxMessage() function is used only once, and in the fragment above you can see that the program checks to see if it is NULL before calling the function. If the pointer is NULL, then the program just skips this piece of functionality. The same is the case for FaxMonitorShutdown(). Why does this matter? If a malware replaces fxsst.dll and is loaded by Explorer through stobject.dll, then it doesn't need to implement the fax server function calls. The only repercussion is that the system can no longer act as a fax server, but that is a feature I don't think anyone would notice was missing. The malware authors obviously arrived at the same conclusion.
Since fxsst.dll resides in the System32 folder, is not protected by KnownDlls and is loaded by the Explorer.exe process, it is vulnerable to the same DLL load order hijack as ntshrui.dll. By that I mean you can place a malicious DLL in the C:WINDOWS folder and it will load instead of the legitimate copy in System32. You can also just overwrite the copy in System32 and the user of the system will not notice. Unlike with ntshrui.dll, the malware replacing fxsst.dll doesn't really need to take the extra step of loading the legitimate copy to restore its important functionality. In this case, the original functionality is useless to all but the last few people in the world who still use their Windows computer to receive faxes.
Conclusion (and TL;DR version):
You can take pretty much any malware DLL, name it fxsst.dll and drop it in C:WINDOWS or the System32 folder and Explorer.exe will load it at boot time. No hassle, no fuss. There are bound to be more problematic DLL locations like this, but I'll keep spreading the word as we find them.