Category Archives: data

What do IGA solutions have in common with listening to music anywhere?

Fifteen years ago, there was a revolution in personal music players. The market had slowly evolved from the Walkman to the Discman, when a bolt of innovation brought the MP3 player. Finally, the problem of having all of one’s music anywhere was solved with a single device, not a device plus a bag full of whatever physical media was popular at that time. History clearly shows that the iPod and a few of its competitors … More


Over half of security leaders still rely on spreadsheets

Senior security leaders within financial services companies are being challenged with a lack of trusted data to make effective security decisions and reduce their risk from cyber incidents, according to Panaseer. Results from a global external survey of over 400 security leaders who work in large financial services companies reveal concerns about security measurement and metrics, including data confidence, manual processes, resource wastage and request overload. Issues with processes, people and technologies: The results … More


How data intelligent organizations mitigate risk

Organizations that put data at the center of their vision and strategy realize a differentiated competitive advantage by mitigating cost and risk, growing revenue and improving the customer experience, a Collibra survey of more than 900 global business analysts reveals. Orgs rarely use data to guide business decisions: Despite a majority of companies saying they valued using data to drive decisions, many organizations are not consistently executing. While 84% of respondents said that it is … More


Shifting responsibility is causing uncertainty and more security breaches

Data security is creating fear and trust issues for IT professionals, according to a new Oracle and KPMG report. The study of 750 cybersecurity and IT professionals across the globe found that a patchwork approach to data security, misconfigured services and confusion around new cloud security models have created a crisis of confidence that will only be fixed by organizations making security part of the culture of their business. Data security is keeping IT professionals … More


New software enables existing sensors to detect ransomware

Engineers from SMU’s Darwin Deason Institute for Cybersecurity have developed software to detect ransomware attacks before attackers can inflict catastrophic damage. Ransomware is crippling cities and businesses all over the world, and the number of ransomware attacks has increased since the start of the coronavirus pandemic. Attackers are also threatening to publicly release sensitive data if ransom isn’t paid. The FBI estimates that ransomware victims have paid hackers more than $140 million in the last … More


U.S. Marshals Announced Data Breach of Prisoners’ Information

The United States Marshals Service announced a data breach involving the personal information of its former and current prisoners. In a data breach notification letter obtained by ZDNet, the U.S. Marshals Service revealed that it had first learned of the security incident in late 2019. On December 30, 2019, the United States Marshals Service (USMS), […]… Read More


Excelerating Analysis, Part 2 — X[LOOKUP] Gon’ Pivot To Ya

In December 2019, we published a blog post on augmenting analysis using Microsoft Excel for various data sets for incident response investigations. As we described, investigations often include custom or proprietary log formats and miscellaneous, non-traditional forensic artifacts. There are, of course, a variety of ways to tackle this task, but Excel stands out as a reliable way to analyze and transform a majority of data sets we encounter.

In our first post, we discussed summarizing verbose artifacts using the CONCAT function, converting timestamps using the TIME function, and using the COUNTIF function for log baselining. In this post, we will cover two additional versatile features of Excel: LOOKUP functions and PivotTables.

For this scenario, we will use a dataset of logon events for an example Microsoft Office 365 (O365) instance to demonstrate how an analyst can enrich information in the dataset. Then we will demonstrate some examples of how to use PivotTables to summarize information and highlight anomalies in the data quickly.

Our data contains the following columns:

  • Description – Event description
  • User – User’s name
  • User Principal Name – email address
  • App – such as Office 365, SharePoint, etc.
  • Location – Country
  • Date
  • IP address
  • User agent (simplified)
  • Organization – associated with IP address (as identified by O365)


Figure 1: O365 data set

LOOKUP for Data Enrichment

It may be useful to enrich the data with information that isn’t provided by the original log source but could help our analysis. A step FireEye Mandiant often performs during investigations is to take all unique IP addresses and query threat intelligence sources for each IP address for reputation, WHOIS information, connections to known threat actor activity, etc. This gives us more information about each IP address that we can take into consideration in our analysis.

While FireEye Mandiant is privy to historical engagement data and Mandiant Threat Intelligence, if security teams or organizations do not have access to commercial threat intelligence feeds, there are numerous open source intelligence services that can be leveraged.

We can also use IP address geolocation services to obtain latitude and longitude related to each source IP address. This information may be useful in identifying anomalous logons based on geographical location.

After taking all source IP addresses, running them against threat intelligence feeds and geolocating them, we have the following data added to a second sheet called “IP Address Intel” in our Excel document:


Figure 2: IP address enrichment

We can already see before we even dive into the logs themselves that we have suspicious activity: The five IP addresses in the 203.0.113.0/24 range in our data are known to be associated with activity connected to a fictional threat actor tracked as TMP.OGRE.

To enrich our original dataset, we will add three columns to our data to integrate the supplementary information: “Latitude,” “Longitude,” and “Threat Intel” (Figure 3). We can use the VLOOKUP or XLOOKUP functions to quickly retrieve the supplementary data and integrate it into our main O365 log sheet.


Figure 3: Enrichment columns

VLOOKUP

The traditional way to look up particular data in another array is by using the VLOOKUP function. We will use the following formula to reference the “Latitude” values for a given IP address:


Figure 4: VLOOKUP formula for Latitude
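Following the same pattern as the Longitude and Threat Intel formulas shown later in this section, the Latitude lookup (which returns column 2 of the intel sheet) would be: =VLOOKUP(G2,'IP Address Intel'!$A$2:$D$15,2,0)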

There are four parts to this formula:

  1. Value to look up:
    • This dictates what cell value we are going to look up more information for. In this case, it is cell G2, which is the IP address.
  2. Table array:
    • This defines the entire array in which we will look up our value and return data from. The first column in the array must contain the value being looked up. In the aforementioned example, we are searching in ‘IP Address Intel’!$A$2:$D$15. In other words, we are looking in the other sheet in this workbook we created earlier titled “IP Address Intel”, and then, within that sheet, searching the cell range A2 to D15.

      Figure 5: VLOOKUP table array

      Note the use of the “$” to ensure these are absolute references and will not be updated by Excel if we copy this formula to other cells.
  3. Column index number:
    • This identifies the column number from which to return data. The first column is considered column 1. We want to return the “Latitude” value for the given IP address, so in the aforementioned example, we tell Excel to return data from column 2.
  4. Range lookup (match type)
    • This part of the formula tells Excel what type of matching to perform on the value being looked up. Excel defaults to “Approximate” matching, which assumes the data is sorted and will match the closest value. We want to perform “Exact” matching, so we put “0” here (“FALSE” is also accepted).

With the VLOOKUP function complete for the “Latitude” data, we can use the fill handle to update this field for the rest of the data set.

To get the values for the “Longitude” and “Threat Intel” columns, we repeat the process with a similar formula, adjusting the column index number to reference the appropriate column, and then use the fill handle to fill in the rest of each column in our O365 data sheet:

  • For Longitude:
    • =VLOOKUP(G2,'IP Address Intel'!$A$2:$D$15,3,0)
  • For Threat Intel:
    • =VLOOKUP(G2,'IP Address Intel'!$A$2:$D$15,4,0)

Bonus Option: XLOOKUP

The XLOOKUP function in Excel is a more efficient way to reference the threat intelligence data sheet. XLOOKUP is a newer function introduced to Excel to replace the legacy VLOOKUP function and, at the time of writing this post, is only available to “O365 subscribers in the Monthly channel”, according to Microsoft. In this instance, we will also leverage Excel’s dynamic arrays and “spilling” to fill in this data more efficiently, instead of making an XLOOKUP function for each column.

NOTE: To utilize dynamic arrays and spilling, the data we are seeking to enrich cannot be in the form of a “Table” object. Instead, we will apply filters to the top row of our O365 data set by selecting the “Filter” option under “Sort & Filter” in the “Home” ribbon:


Figure 6: Filter option

To reference the threat intelligence data sheet using XLOOKUP, we will use the following formula:


Figure 7: XLOOKUP function for enrichment
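Reconstructing it from the three parts described below, the formula takes roughly this form, with dynamic array spilling returning the Latitude, Longitude, and Threat Intel values at once: =XLOOKUP(G2,'IP Address Intel'!$A$2:$A$15,'IP Address Intel'!$B$2:$D$15)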

There are three parts to this XLOOKUP formula:

  1. Value to lookup:
    • This dictates what cell value we are going to look up more information for. In this case, it is cell G2, which is the IP address.
  2. Array to look in:
    • This will be the array of data in which Excel will search for the value to look up. Excel does exact matching by default for XLOOKUP. In the aforementioned example, we are searching in ‘IP Address Intel’!$A$2:$A$15. In other words, we are looking in the other sheet in this workbook titled “IP Address Intel”, and then, within that sheet, searching the cell range A2 to A15:

      Figure 8: XLOOKUP array to look in

      Note the use of the “$” to ensure these are absolute references and will not be updated by Excel if we copy this formula to other cells.
  3. Array of data to return:
    • This part will be the array of data from which Excel will return data. In this case, Excel will return the data contained within the absolute range of B2 to D15 from the “IP Address Intel” sheet for the value that was looked up. In the aforementioned example formula, it will return the values in the row for the IP address 198.51.100.126:

      Figure 9: Data to be returned from ‘IP Address Intel’ sheet

      Because this is leveraging dynamic arrays and spilling, all three cells of the returned data will populate, as seen in Figure 4.

Now that our dataset is completely enriched by either using VLOOKUP or XLOOKUP, we can start hunting for anomalous activity. As a quick first step, since we know at least a handful of IP addresses are potentially malicious, we can filter on the “Threat Intel” column for all rows that match “TMP.OGRE” and reveal logons with source IP addresses related to known threat actors. Now we have timeframes and suspected compromised accounts to pivot off of for additional hunting through other data.

PIVOT! PIVOT! PIVOT!

One of the most useful tools for highlighting anomalies by summarizing data, performing frequency analysis and quickly obtaining other statistics about a given dataset is Excel’s PivotTable function.

Location Anomalies

Let’s utilize a PivotTable to perform frequency analysis on the location from which users logged in. This type of technique may highlight activity where a user account logged in from a location which is unusual for them.

To create a PivotTable for our data, we can select any cell in our O365 data and select the entire range with Ctrl+A. Then, under the “Insert” tab in the ribbon, select “PivotTable”:


Figure 10: PivotTable selection

This will bring up a window, as seen in Figure 11, to confirm the data for which we want to make a PivotTable (Step 1 in Figure 11). Since we selected our O365 log data set with Ctrl+A, this should be automatically populated. It will also ask where we want to put the PivotTable (Step 2 in Figure 11). In this instance, we created another sheet called “PivotTable 1” to place the PivotTable:


Figure 11: PivotTable creation

Now that the PivotTable is created, we must select how we want to populate the PivotTable using our data. Remember, we are trying to determine the locations from which all users logged in. We will want a row for each user and a sub-row for each location the user has logged in from. Let’s add a count of how many times they logged in from each location as well. We will use the “Date” field to do this for this example:


Figure 12: PivotTable field definitions
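In terms of the PivotTable Fields pane, this corresponds to placing “User” and “Location” in the Rows area (User first, so each location becomes a sub-row under the user), and “Date” in the Values area summarized as a Count.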

Examining this table, we can immediately see there are two users with source location anomalies: Ginger Breadman and William Brody have a small number of logons from “FarFarAway”, which is abnormal for these users based on this data set.

We can add more data to this PivotTable to get a timeframe of this suspicious activity by adding two more “Date” fields to the “Values” area. Excel defaults to “Count” of whatever field we drop in this area, but we will change this to the “Minimum” and “Maximum” values by using the “Value Field Settings”, as seen in Figure 13.


Figure 13: Adding min and max dates

Now we have a PivotTable that shows us anomalous locations for logons, as well as the timeframe in which the logons occurred, so we can hone our investigation. For this example, we also formatted all cells with timestamp values to reflect the format FireEye Mandiant typically uses during analysis by selecting all the appropriate cells, right-clicking and choosing “Format Cells”, and using a “Custom” format of “YYYY-MM-DD HH:MM:SS”.


Figure 14: PivotTable with suspicious locations and timeframe

IP Address Anomalies

Geolocation anomalies may not always be valuable on their own. However, using a similar configuration to the previous example, we can identify suspicious source IP addresses. We will add “User Principal Name” and “IP Address” fields as Rows, and “IP Address” as Values. Let’s also add the “App” field to Columns. Our field settings and resulting table are displayed in Figure 15:


Figure 15: PivotTable with IP addresses and apps
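In terms of the PivotTable Fields pane, this corresponds to placing “User Principal Name” and “IP Address” in the Rows area, “App” in the Columns area, and “IP Address” in the Values area summarized as a Count.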

With just a few clicks, we have a summarized table indicating which IP addresses each user logged in from, and which app they logged into. We can quickly see that two users logged in from IP addresses in the 203.0.113.0/24 range six times, as well as which applications they logged into from each of these IP addresses.

While these are just a couple use cases, there are many ways to format and view evidence using PivotTables. We recommend trying PivotTables on any data set being reviewed with Excel and experimenting with the Rows, Columns, and Values parameters.

We also recommend adjusting the PivotTable options, which can help reformat the table itself into a format that might fit requirements.

Conclusion

These Excel functions are used frequently during investigations at FireEye Mandiant and are considered important forensic analysis techniques. The examples we give here are just a glimpse into the utility of LOOKUP functions and PivotTables. LOOKUP functions can be used to reference a multitude of data sources and can be applied in other situations during investigations such as tracking remediation and analysis efforts.

PivotTables may be used in a variety of ways as well, depending on what data is available, and what sort of information is being analyzed to identify suspicious activity. Employing these techniques, alongside the ones we highlighted previously, on a consistent basis will go a long way in "excelerating" forensic analysis skills and efficiency.

Take Action This Data Privacy Day

We all know that data breaches have been on the rise, and hackers are finding clever, new ways to access our devices and information. But sometimes it takes a little push to get us to take action when it comes to protecting our most sensitive information. That’s why this Data Privacy Day, on January 28th, we have the perfect opportunity to own our privacy by taking the time to safeguard data, and help others do the same.

After all, there are now roughly four billion consumers connected online, living various moments of truth that could potentially put them at risk. From sharing photos and socializing with friends, to completing bank transactions—people expect to do what they desire online whenever and wherever they want. But as the saying goes, “with great power comes great responsibility”, and it is imperative that consumers take accountability, not just by enjoying the advantages of connecting online, but by protecting their online identities, too.

Remember, your personal information and online presence are as valuable as money, and what you post online can last a lifetime. Data Privacy Day is a reminder for everybody to make sure that they are protecting what matters most to them: their personal data, as well as their families and friends.

So, let’s get started. Even if you have a large online footprint, protecting this information doesn’t have to be overwhelming.

Here are a few tips:

Update your privacy and security settings—Begin with the websites and applications that you use the most. Check to see if your accounts are marked as private, or if they are open to the public. Also, look to see if your data is being leaked to third parties. You want to select the most secure settings available, while still being able to use these tools correctly.  Here’s a guide from StaySafeOnline to help you get started.

Start the New Year with a new digital you— When opening new online accounts for sharing personal information such as your email address or date of birth, create a new digital persona that has alternative answers that only you would know. This will limit online tracking of your real personal information.

Lock down your logins—At the same time, secure your logins by making sure that you are creating long and unique passphrases for all of your accounts. Use multi-factor authentication when available. This is a security measure that takes more than just one step to validate your login, such as a password plus a code sent to your mobile device, or a fingerprint. It is exponentially more secure than a simple password.

Spread the word and get involved— Once you have done your own privacy check, help others do the same. It’s important that we all feel empowered to protect our privacy, so share the safety tips in this article with your family, coworkers, and community. Here are some helpful resources to create privacy awareness where you live.

Protect your family and friends – If you are a parent, you can make a big difference by helping raise privacy-savvy kids. After all, today’s kids represent the future of online security. If they start building their digital footprints with solid safety habits, it makes all of us more secure.

Begin with this handy tip sheet.

Own your information—It’s time for everyone to feel empowered to own their information. While there will always be online threats, you can minimize any potential harm by committing yourself to the action steps we listed above. Once you have, spread the word by using the hashtag #privacyaware on Twitter, Instagram, or Facebook.

Let’s make this 12th annual international Data Privacy Day the most effective ever! Stay up to date with all the event happenings here, and keep informed year-round on the latest threats and security tips.


Excelerating Analysis – Tips and Tricks to Analyze Data with Microsoft Excel

Incident response investigations don’t always involve standard host-based artifacts with fully developed parsing and analysis tools. At FireEye Mandiant, we frequently encounter incidents that involve a number of systems and solutions that utilize custom logging or artifact data. Determining what happened in an incident involves taking a dive into whatever type of data we are presented with, learning about it, and developing an efficient way to analyze the important evidence.

One of the most effective tools to perform this type of analysis is one that is in almost everyone’s toolkit: Microsoft Excel. In this article we will detail some tips and tricks with Excel to perform analysis when presented with any type of data.

Summarizing Verbose Artifacts

Tools such as FireEye Redline include handy timeline features to combine multiple artifact types into one concise timeline. When we use individual parsers or custom artifact formats, it may be tricky to view multiple types of data in the same view. Normalizing artifact data with Excel to a specific set of easy-to-use columns makes for a smooth combination of different artifact types.

Consider trying to review parsed file system, event log, and Registry data in the same view using the following data.

$SI Created: 2019-10-14 23:13:04
$SI Modified: 2019-10-14 23:33:45
File Name: Default.rdp
File Path: C:\Users\attacker\Documents\
File Size: 485
File MD5: c482e563df19a401941c99888ac2f525
File Attributes: Archive
File Deleted: FALSE

Event Gen Time: 2019-10-14 23:13:06
Event ID: 4648
Event Message: A logon was attempted using explicit credentials.
Subject:
   Security ID:  DomainCorp\Administrator
   Account Name:  Administrator
   Account Domain:  DomainCorp
   Logon ID:  0x1b38fe
   Logon GUID:  {00000000-0000-0000-0000-000000000000}
Account Whose Credentials Were Used:
   Account Name:  VictimUser
   Account Domain:  DomainCorp
   Logon GUID:  {00000000-0000-0000-0000-000000000000}
Target Server:
   Target Server Name: DestinationServer
   Additional Information:
Process Information:
   Process ID:  0x5ac
   Process Name:  C:\Program Files\Internet Explorer\iexplore.exe
Network Information:
   Network Address: -
   Port:   -
Event Category: Logon
Event User: Administrator
Event System: SourceSystem

KeyModified: 2019-10-14 23:33:46
Key Path: HKEY_USER\Software\Microsoft\Terminal Server Client\Servers\
KeyName: DestinationServer
ValueName: UsernameHInt
ValueText: VictimUser
Type: REG_SZ

Since these raw artifact data sets have different column headings and data types, they would be difficult to review in one timeline. If we format the data using Excel string concatenation, we can make the data easy to combine into a single timeline view. To format the data, we can use the “&” operator in a formula to join the information we need into a single “Summary” field.

An example formula to join the relevant file system data, with pipe characters as delimiters, could be “=D2 & " | " & C2 & " | " & E2 & " | " & F2 & " | " & G2 & " | " & H2”. Combining this “Summary” column with a “Timestamp” and “Timestamp Type” column will give us everything we need for streamlined analysis.
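For reference, the same summary can be built with the CONCAT function mentioned later in this post. A minimal sketch, assuming the same column layout as above: “=CONCAT(D2, " | ", C2, " | ", E2, " | ", F2, " | ", G2, " | ", H2)”.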

Timestamp: 2019-10-14 23:13:04
Timestamp Type: $SI Created
Event: C:\Users\attacker\Documents\ | Default.rdp | 485 | c482e563df19a401941c99888ac2f525 | Archive | FALSE

Timestamp: 2019-10-14 23:13:06
Timestamp Type: Event Gen Time
Event: 4648 | A logon was attempted using explicit credentials.
Subject:
   Security ID:  DomainCorp\Administrator
   Account Name:  Administrator
   Account Domain:  DomainCorp
   Logon ID:  0x1b38fe
   Logon GUID:  {00000000-0000-0000-0000-000000000000}
Account Whose Credentials Were Used:
   Account Name:  VictimUser
   Account Domain:  DomainCorp
   Logon GUID:  {00000000-0000-0000-0000-000000000000}
Target Server:
   Target Server Name: DestinationServer
   Additional Information:
Process Information:
   Process ID:  0x5ac
   Process Name:  C:\Program Files\Internet Explorer\iexplore.exe
Network Information:
   Network Address: -
   Port:   - | Logon | Administrator | SourceSystem

Timestamp: 2019-10-14 23:33:45
Timestamp Type: $SI Modified
Event: C:\Users\attacker\Documents\ | Default.rdp | 485 | c482e563df19a401941c99888ac2f525 | Archive | FALSE

Timestamp: 2019-10-14 23:33:46
Timestamp Type: KeyModified
Event: HKEY_USER\Software\Microsoft\Terminal Server Client\Servers\ | DestinationServer | UsernameHInt | VictimUser

After sorting by timestamp, we can see evidence of the “DomainCorp\Administrator” account connecting from “SourceSystem” to “DestinationServer” with the “DomainCorp\VictimUser” account via RDP across three artifact types.

Time Zone Conversions

One of the most critical elements of incident response and forensic analysis is timelining. Temporal analysis will often turn up new evidence by identifying events that precede or follow an event of interest. Equally critical is producing an accurate timeline for reporting. Timestamps and time zones can be frustrating, and things can get confusing when the systems being analyzed span various time zones. Mandiant tracks all timestamps in Coordinated Universal Time (UTC) format in its investigations to eliminate any confusion of both time zones and time adjustments such as daylight savings and regional summer seasons. 

Of course, various sources of evidence do not always log time the same way. Some may be local time, some may be UTC, and as mentioned, data from sources in various geographical locations complicates things further. When compiling timelines, it is important to first know whether the evidence source is logged in UTC or local time. If it is logged in local time, we need to confirm which local time zone the evidence source is from. Then we can use the Excel TIME()  formula to convert timestamps to UTC as needed.

This example scenario is based on a real investigation where the target organization was compromised via phishing email, and employee direct deposit information was changed via an internal HR application. In this situation, we have three log sources: email receipt logs, application logins, and application web logs. 

The email logs are recorded in UTC and contain the following information:

The application logins are recorded in Eastern Daylight Time (EDT) and contain the following:

The application web logs are also recorded in Eastern Daylight Time (EDT) and contain the following:

To take this information and turn it into a master timeline, we can use the CONCAT function (an alternative to the ampersand concatenation used previously) to make a summary of the columns in one cell for each log source, such as this example formula for the email receipt logs:
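As a purely illustrative sketch (the actual email log columns are not reproduced here), if the sender were in column B, the recipient in column C and the subject in column D, such a formula might look like “=CONCAT(B2, " | ", C2, " | ", D2)”.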

This is where checking our time zones for each data source is critical. If we took the information as it is presented in the logs and assumed the timestamps were all in the same time zone and created a timeline of this information, it would look like this:

As it stands in the previous screenshot, we have some login events to the HR application, which may look like normal activity for the employees. Then, later in the day, they receive some suspicious emails. If this were hundreds of lines of log events, we would risk the login and web log events being overlooked, since the time of activity appears to precede our suspected initial compromise vector by a few hours. If this were a timeline used for reporting, it would also be inaccurate.

When we know which time zone our log sources are in, we can adjust the timestamps accordingly to reflect UTC. In this case, we confirmed through testing that the application logins and web logs are recorded in EDT, which is four hours behind UTC, or “UTC-4”. To change these to UTC time, we just need to add four hours to the time. The Excel TIME function makes this easy. We can just add a column to the existing tables, and in the first cell we type “=A2+TIME(4,0,0)”. Breaking this down:

  • =A2
    • Reference cell A2 (in this case our EDT timestamp). Note this is not an absolute reference, so we can use this formula for the rest of the rows.
  • +TIME
    • This tells Excel to take the value of the data in cell A2 as a “time” value type and add the following amount of time to it:
  • (4,0,0)
    • The TIME function in this instance requires three values, which are, from left to right: hours, minutes, seconds. In this example, we are adding 4 hours, 0 minutes, and 0 seconds.
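As a quick worked example (with a made-up timestamp), if cell A2 contained the EDT time 2019-06-13 09:30:00, the formula =A2+TIME(4,0,0) would return 2019-06-13 13:30:00, the equivalent time in UTC.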

Now we have a formula that takes the EDT timestamp and adds four hours to it to make it UTC. Then we can replicate this formula for the rest of the table. The end result looks like this:

When we have all of our logs in the same time zone, we are ready to compile our master timeline. Taking the UTC timestamps and the summary events we made, our new, accurate timeline looks like this:

Now we can clearly see suspicious emails sent to (fictional) employees Austin and Dave. A few minutes later, Austin’s account logs into the HR application and adds a new bank account. After this, we see the same email sent to Jake. Soon after this, Jake’s account logs into the HR application and adds the same bank account information as Austin’s. Converting all our data sources to the same time zone with Excel allowed us to quickly link these events together and easily identify what the attacker did. Additionally, it provided us with more indicators, such as the known-bad bank account number to search for in the rest of the logs.

Pro Tip: Be sure to account for log data spanning changes in UTC offset due to regional events such as daylight saving time or summer seasons. For example, for logs in United States Eastern Time (such as from Virginia, USA), the adjustment changes from +TIME(5,0,0) to +TIME(4,0,0) on the second Sunday in March each year, and back from +TIME(4,0,0) to +TIME(5,0,0) on the first Sunday in November, to account for the shifts between daylight and standard time.

CountIf for Log Baselining

When reviewing logs that record authentication in the form of a user account and timestamp, we can use COUNTIF to establish simple baselines to identify those user accounts with inconsistent activity.  

In the example of user logons that follows, we'll use the formula "=COUNTIF($B$2:$B$25,B2)" to establish a historical baseline. Here is a breakdown of the parameters for this COUNTIF formula located in C2 in our example: 

  • COUNTIF 
    • This Excel formula counts how many times a value exists in a range of cells. 
  • $B$2:$B$25 
    • This is the entire range of all cells, B2 through B25, that we want to use as a range to search for a specific value. Note the use of "$" to ensure that the start and end of the range are an absolute reference and are not automatically updated by Excel if we copy this formula to other cells. 
  • B2 
    • This is the cell that contains the value we want to search for and count occurrences of in our range of $B$2:$B$25. Note that this parameter is not an absolute reference with a preceding "$". This allows us to fill the formula down through all rows and ensure that we are counting the applicable user name. 

To summarize, this formula will search the username column of all logon data and count how many times the user of each logon has logged on in total across all data points. 
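For instance (using a made-up account name), if "jdoe" appears 12 times anywhere in B2:B25, every row whose username is "jdoe" will show 12 in column C, while an account that authenticated only once will show 1 and stand out.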

When most user accounts log on regularly, a compromised account being used to logon for the first time may clearly stand out when reviewing total log on counts. If we have a specific time frame in mind, it may be helpful to know which accounts first logged on during that time.  

The COUNTIF formula can also help track accounts through time to identify each account's first logon, which can reveal rarely used credentials that were abused for a limited time frame.

We'll start with the formula "=COUNTIF($B$2:$B2,B2)" in cell D2. Here is a breakdown of the parameters for this COUNTIF formula. Note that the use of "$" for absolute referencing is slightly different for the range used, and that is an important nuance:

  • COUNTIF 
    • This Excel formula counts how many times a value exists in a range of cells. 
  • $B$2:$B2 
    • This is the range of cells, B2 through B2, that we want to start with. Since we want to increase our range as we go through the rows of the log data, the ending cell row number (2 in this example) is not made absolute. As we fill this formula down through the rest of our log data, it will automatically expand the range to include the current log record and all previous logs. 
  • B2 
    • This cell contains the value we want to search for and provides a count of occurrences found in our defined range. Note that this parameter B2 is not an absolute reference with a preceding "$". This allows us to fill the formula down through all rows and ensure that we are counting the applicable user name. 

To summarize, this formula will search the username column of all logon data before and including the current log and count how many times the user of each logon has logged on up to that point in time. 
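Continuing the made-up example, the first "jdoe" logon in the data returns 1, the next returns 2, and so on. A value of 1 therefore marks the first time that account appears in the data set, which helps surface rarely used credentials that suddenly become active during the time frame of interest.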

The following example illustrates how Excel automatically updated the range for D15 to $B$2:$B15 using the fill handle.  


To help visualize a large data set, let's add color scale conditional formatting to each of these count columns individually. To do so:

  1. Select only the cells we want to compare with the color scale (such as D2 to D25). 
  2. On the Home menu, click the Conditional Formatting button in the Styles area. 
  3. Click Color Scales. 
  4. Click the type of color scale we would like to use. 

The following examples set the lowest values to red and the highest values to green. We can see how: 

  • Users with lower authentication counts contrast against users with more authentications. 
  • The first authentication times of users stand out in red. 

Whichever colors are used, be careful not to assume that one color, such as green, implies safety and another color, such as red, implies maliciousness.

Conclusion

The techniques described in this post are just a few ways to utilize Excel to perform analysis on arbitrary data. While these techniques may not leverage some of the more powerful features of Excel, as with any variety of skill set, mastering the fundamentals enables us to perform at a higher level. Employing fundamental Excel analysis techniques can empower an investigator to work through analysis of any presented data type as efficiently as possible.

Showing Vulnerability to a Machine: Automated Prioritization of Software Vulnerabilities

Introduction

If a software vulnerability can be detected and remedied, then a potential intrusion is prevented. While not all software vulnerabilities are known, 86 percent of vulnerabilities leading to a data breach were patchable, though there is some risk of inadvertent damage when applying software patches. When new vulnerabilities are identified they are published in the Common Vulnerabilities and Exposures (CVE) dictionary by vulnerability databases, such as the National Vulnerability Database (NVD).

The Common Vulnerability Scoring System (CVSS) provides a metric for prioritization that is meant to capture the potential severity of a vulnerability. However, it has been criticized for a lack of timeliness, vulnerable population representation, normalization, rescoring and broader expert consensus, which can lead to disagreements. For example, some of the worst exploits have been assigned low CVSS scores. Additionally, CVSS does not measure the vulnerable population size, which many practitioners have stated they expect it to score. The design of the current CVSS system also leads to too many vulnerabilities being rated severe, which causes user fatigue.

To provide a more timely and broad approach, we use machine learning to analyze users’ opinions about the severity of vulnerabilities by examining relevant tweets. The model predicts whether users believe a vulnerability is likely to affect a large number of people, or if the vulnerability is less dangerous and unlikely to be exploited. The predictions from our model are then used to score vulnerabilities faster than traditional approaches, like CVSS, while providing a different method for measuring severity, which better reflects real-world impact.

Our work uses nowcasting to address this important gap of prioritizing early-stage CVEs and determining whether they are urgent or not. Nowcasting is the economic discipline of determining a trend or a trend reversal objectively in real time. In this case, we are recognizing the value of linking social media responses to a CVE after it is released, but before it is scored by CVSS. CVE scores should ideally be available as soon as possible after the CVE is released, while the current process often hampers prioritization of triage events and ultimately slows response to severe vulnerabilities. This crowdsourced approach reflects numerous practitioner observations about the size and widespread nature of the vulnerable population, as shown in Figure 1. For example, in the Mirai botnet incident in 2016, a massive number of vulnerable IoT devices were compromised, leading to the largest Denial of Service (DoS) attack on the internet at the time.


Figure 1: Tweet showing social commentary on a vulnerability that reflects severity

Model Overview

Figure 2 illustrates the overall process that starts with analyzing the content of a tweet and concludes with two forecasting evaluations. First, we run Named Entity Recognition (NER) on tweet contents to extract named entities. Second, we use two classifiers to assess relevancy and severity with respect to the pre-identified entities. Finally, we match the relevant and severe tweets to the corresponding CVE.


Figure 2: Process overview of the steps in our CVE score forecasting

Each tweet is associated with CVEs by inspecting URLs or the contents hosted at a URL. Specifically, we link a CVE to a tweet if it contains a CVE number in the message body, or if the URL content contains a CVE. Each tweet must be associated with a single CVE and must be classified as relevant to security-related topics to be scored. The first forecasting task considers how well our model can predict the CVSS rankings ahead of time. The second task is predicting future exploitation of the vulnerability for a CVE based on Symantec Antivirus Signatures and Exploit DB. The rationale is that eventual presence in these lists indicates not just that exploits can exist or that they do exist, but that they are also publicly available.

Modeling Approach

Predicting the CVSS scores and exploitability from Twitter data involves multiple steps. First, we need to find appropriate representations (or features) for our natural language to be processed by machine learning models. In this work, we use two natural language processing methods for extracting features from text: (1) n-gram features, and (2) word embeddings. Second, we use these features to predict if the tweet is relevant to the cyber security field using a classification model. Third, we use these features to predict if the relevant tweets are making strong statements indicative of severity. Finally, we match the severe and relevant tweets to the corresponding CVE.

N-grams are word sequences, such as word pairs for 2-grams or word triples for 3-grams. In other words, they are contiguous sequences of n words from a text. After we extract these n-grams, we can represent the original text as a bag-of-ngrams. Consider the sentence:

A critical vulnerability was found in Linux.

If we consider all 2-gram features, then the bag-of-ngrams representation contains “A critical”, “critical vulnerability”, etc.
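Spelled out in full, the 2-grams for this example sentence are: “A critical”, “critical vulnerability”, “vulnerability was”, “was found”, “found in”, and “in Linux”.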

Word embeddings are a way to learn the meaning of a word from how it was used in previous contexts, and then represent that meaning in a vector space. Word embeddings capture the meaning of a word by the company it keeps, an idea more formally known as the distributional hypothesis. These word embedding representations are machine friendly, and similar words are often assigned similar representations. Word embeddings are domain specific. In our work, we additionally train on terminology specific to cyber security topics; for example, words related to threats include defenses, cyberrisk, cybersecurity, threat, and iot-based. The embedding allows a classifier to implicitly combine the knowledge of similar words and the meaning of how concepts differ. Conceptually, word embeddings may help a classifier implicitly associate relationships such as:

device + infected = zombie

where an entity called device has a mechanism applied called infected (malicious software infecting it) then it becomes a zombie.

To address issues where social media tweets differ linguistically from natural language, we leverage previous research and software from the Natural Language Processing (NLP) community. This addresses specific nuances like less consistent capitalization, and stemming to account for a variety of special characters like ‘@’ and ‘#’.


Figure 3: Tweet demonstrating value of identifying named entities in tweets in order to gauge severity

Named Entity Recognition (NER) identifies the words that construct nouns based on their context within a sentence, and benefits from our embeddings incorporating cyber security words. Correctly identifying the nouns using NER is important to how we parse a sentence. In Figure 3, for instance, NER facilitates Windows 10 to be understood as an entity while October 2018 is treated as elements of a date. Without this ability, the text in Figure 3 may be confused with the physical notion of windows in a building.

Once NER tokens are identified, they are used to test if a vulnerability affects them. In the Windows 10 example, Windows 10 is the entity and the classifier will predict whether the user believes there is a serious vulnerability affecting Windows 10. One prediction is made per entity, even if a tweet contains multiple entities. Filtering tweets that do not contain named entities reduces tweets to only those relevant to expressing observations on a software vulnerability.

From these normalized tweets, we can gain insight into how strongly users are emphasizing the importance of the vulnerability by observing their choice of words. The choice of adjective is instrumental in the classifier capturing the strong opinions. Twitter users often use strong adjectives and superlatives to convey magnitude in a tweet or when stressing the importance of something related to a vulnerability like in Figure 4. This magnitude often indicates to the model when a vulnerability’s exploitation is widespread. Table 1 shows our analysis of important adjectives that tend to indicate a more severe vulnerability.


Figure 4: Tweet showing strong adjective use


Table 1: Log-odds ratios for words correlated with highly-severe CVEs

Finally, the processed features are evaluated with two different classifiers to output scores to predict relevancy and severity. When a named entity is identified all words comprising it are replaced with a single token to prevent the model from biasing toward that entity. The first model uses an n-gram approach where sequences of two, three, and four tokens are input into a logistic regression model. The second approach uses a one-dimensional Convolutional Neural Network (CNN), comprised of an embedding layer, a dropout layer then a fully connected layer, to extract features from the tweets.

Evaluating Data

To evaluate the performance of our approach, we curated a dataset of 6,000 tweets containing the keywords vulnerability or ddos from Dec 2017 to July 2018. Workers on Amazon’s Mechanical Turk platform were asked to judge whether a user believed a vulnerability they were discussing was severe. For all labeling, multiple users must independently agree on a label, and multiple statistical and expert-oriented techniques are used to eliminate spurious annotations. Five annotators were used for the labels in the relevancy classifier and ten annotators were used for the severity annotation task. Heuristics were used to remove unserious respondents; for example, when users did not agree with other annotators for a majority of the tweets. A subset of tweets were expert-annotated and used to measure the quality of the remaining annotations.

Using the features extracted from tweet contents, including word embeddings and n-grams, we built a model using the annotated data from Amazon Mechanical Turk as labels. First, our model learns if tweets are relevant to a security threat using the annotated data as ground truth. This would remove a statement like “here is how you can #exploit tax loopholes” from being confused with a cyber security-related discussion about a user exploiting a software vulnerability as a malicious tool. Second, a forecasting model scores the vulnerability based on whether annotators perceived the threat to be severe.

CVSS Forecasting Results

Both the relevancy classifier and the severity classifier were applied to various datasets. Data was collected from December 2017 to July 2018. Most notably 1,000 tweets were held-out from the original 6,000 to be used for the relevancy classifier and 466 tweets were held-out for the severity classifier. To measure the performance, we use the Area Under the precision-recall Curve (AUC), which is a correctness score that summarizes the tradeoffs of minimizing the two types of errors (false positive vs false negative), with scores near 1 indicating better performance.

  • The relevancy classifier scored 0.85
  • The severity classifier using the CNN scored 0.65
  • The severity classifier using a Logistic Regression model, without embeddings, scored 0.54

Next, we evaluate how well this approach can be used to forecast CVSS ratings. In this evaluation, all tweets must occur a minimum of five days ahead of CVSS scores. The severity forecast score for a CVE is defined as the maximum severity score among the tweets which are relevant and associated with the CVE. Table 2 shows the results of three models: randomly guessing the severity, modeling based on the volume of tweets covering a CVE, and the ML-based approach described earlier in the post. The scoring metric in Table 2 is precision at top K using our logistic regression model. For example, where K=100, this is a way for us to identify what percent of the 100 most severe vulnerabilities were correctly predicted. The random model would have predicted 59, while our model predicted 78 of the top 100 and all ten of the most severe vulnerabilities.
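To make the metric concrete: precision at top K is the fraction of the K vulnerabilities a model ranks as most severe that are actually among the K most severe. At K=100, correctly identifying 78 of the 100 most severe vulnerabilities gives a precision of 78/100 = 0.78, compared with roughly 0.59 for random guessing.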


Table 2: Comparison of random simulated predictions, a model based just on quantitative features like “likes”, and the results of our model

Exploit Forecasting Results

We also measured the practical ability of our model to identify the exploitability of a CVE in the wild, since this is one of the motivating factors for tracking. To do this, we collected severe vulnerabilities that have known exploits by their presence in the following data sources:

  • Symantec Antivirus signatures
  • Symantec Intrusion Prevention System signatures
  • ExploitDB catalog

The dataset for exploit forecasting was comprised of 377,468 tweets gathered from January 2016 to November 2017. Of the 1,409 CVEs used in our forecasting evaluation, 134 publicly weaponized vulnerabilities were found across all three data sources.

Using CVEs from the aforementioned sources as ground truth, we find our CVE classification model is more predictive of detecting operationalized exploits from the vulnerabilities than CVSS. Table 3 shows precision scores illustrating that seven of the top ten most severe CVEs and 21 of the top 100 vulnerabilities were found to have been exploited in the wild. Compare that to one of the top ten and 16 of the top 100 when using the CVSS score itself. The recall scores show the percentage of our 134 weaponized vulnerabilities found in our top K examples. In our top ten vulnerabilities, seven were found to be in the 134 (5.2%), while the CVSS scoring’s top ten included only one (0.7%) CVE being exploited.
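Put another way, recall here is the share of the 134 known-weaponized CVEs captured within a model’s top K predictions: 7/134 ≈ 5.2% for our model’s top ten versus 1/134 ≈ 0.7% for the CVSS top ten.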


Table 3: Precision and recall scores for the top 10, 50 and 100 vulnerabilities when comparing CVSS scoring, our simplistic volume model and our NLP model

Conclusion

Preventing vulnerabilities is critical to an organization’s information security posture, as it effectively mitigates some cyber security breaches. In our work, we found that social media content that pre-dates CVE scoring releases can be effectively used by machine learning models to forecast vulnerability scores and prioritize vulnerabilities days before they are made available. Our approach incorporates a novel social sentiment component, which CVE scores do not, and it allows scores to better predict real-world exploitation of vulnerabilities. Finally, our approach allows for a more practical prioritization of software vulnerabilities effectively indicating the few that are likely to be weaponized by attackers. NIST has acknowledged that the current CVSS methodology is insufficient. The current process of scoring CVSS is expected to be replaced by ML-based solutions by October 2019, with limited human involvement. However, there is no indication of utilizing a social component in the scoring effort.

This work was led by researchers at Ohio State under the IARPA CAUSE program, with support from Leidos and FireEye. This work was originally presented at NAACL in June 2019; our paper describes the work in more detail, and it was also covered by Wired.