Category Archives: Analytics

How Our Threat Analytics Multi-Region Data Lake on AWS Stores More, Slashes Costs

Data is the lifeblood of digital businesses, and a key competitive advantage. The question is: how can you store your data cost-efficiently, access it quickly, while abiding by privacy laws?

At Imperva, we wanted to store our data for long-term access. Databases would’ve cost too much in disk and memory, especially since we didn’t know much it would grow, how long we would keep it, and which data we would actually access in the future. The only thing we did know? That new business cases for our data would emerge.

That’s why we deployed a data lake. It turned out to be the right decision, allowing us to store 1,000 times more data than before, even while slashing costs.

What is a data lake?

A data lake is a repository of files stored in a distributed system. Information is stored in its native form, with little or no processing. You simply store the data in its native formats, such as JSON, XML, CSV, or text.

Analytics queries can be run against both data lakes and databases. In a database you create a schema, plan your queries, and add indices to improve performance. In a data lake, it’s different — you simply store the data and it’s query-ready.

Some file formats are better than others, of course. Apache Parquet allows you to store records in a compressed columnar file. The compression saves disk space and IO, while the columnar format allows the query engine to scan only the relevant columns. This reduces query time and costs.

Using a distributed file system lets you store more data at a lower cost. Whether you use Hadoop HDFS, AWS S3, or Azure Storage, the benefits include:

  • Data replication and availability
  • Options to save more money – for example, AWS S3 has different storage options with different costs
  • Retention policy – decide how long you want to keep your data before it’s automatically deleted

No wonder experts such as Adrian Cockcroft, VP of cloud architecture strategy at Amazon Web Services, said this week that “cloud data lakes are the future.”

Analytic queries: data lake versus database

Let’s examine the capabilities, advantages and disadvantages of a data lake versus a database.

The data

A data lake supports structured and unstructured data and everything in-between. All data is collected and immediately ready for analysis. Data can be transformed to improve user experience and performance. For example, fields can be extracted from a data lake and data can be aggregated.

A database contains only structured and transformed data. It is impossible to add data without declaring tables, relations and indices. You have to plan ahead and transform the data according to your schema.

Figure 1: Data Lake versus Database

The Users

Most users in a typical organization are operational, using applications and data in predefined and repetitive ways. A database is usually ideal for these users. Data is structured and optimized for these predefined use-cases. Reports can be generated, and filters can be applied according to the application’s design.

Advanced users, by contrast, may go beyond an application to the data source and use custom tools to process the data. They may also bring in data from outside the organization.

The last group are the data experts, who do deep analysis on the data. They need the raw data, and their requirements change all the time.

Data lakes support all of these users, but especially advanced and expert users, due to the agility and flexibility of a data lake.

Figure 2: Typical user distribution inside an organization

Query engine(s)

In a database, the query engine is internal and is impossible to change. In a data lake, the query engine is external, letting users choose based on their needs. For example, you can choose Presto for SQL-based analytics and Spark for machine learning.

Figure 3: A data lake may have multiple external query engines. A database has a single internal query engine.

Support of new business use-case

Database changes may be complex. Data should be analyzed and formatted, while schema has to be created before data can be inserted. If you have a busy development team, users can wait months or a year to see the new data in their application.

Few businesses can wait this long. Data lakes solve this by letting users go beyond the structure to explore data. If this proves fruitful, than a formal schema can be applied. You get to results quickly, and fail fast. This agility lets organizations quickly improve their use cases, better know their data, and react fast to changes.

Figure 4: Support of new business use-case

Data lake structure

Here’s how data may flow inside a data lake.

Figure 5: Data lake structure and flow

In this example, CSV files are added to the data lake to a “current day” folder. This folder is the daily partition which allows querying a day’s data using a filter like day = ‘2018-1-1’. Partitions are the most efficient way to filter data.

The data under tables/events is an aggregated, sorted and formatted version of the CSV data. It uses the parquet format to improve query performance and for compression. It also has an additional “type” partition, because most queries work only on a single event type. Each file has millions of records inside, with metadata for efficiency. For example, you can know the count, min and max values for all of the columns without scanning the file.

This events table data has been added to the data lake after the raw data has been validated and analyzed.

Here is a simplified example of CSV to Parquet conversion:

Figure 6: Example for conversion of CSV to Parquet

Parquet files normally hold large number of records, and can be divided internally into “row groups” which have their own metadata. Repeating values improves compression and the columnar structure allows scanning only the relevant columns. The CSV data can be queried at any time, but it is not as efficient as querying the data under the tables/events data.

Flow and Architecture


Imperva’s data lake uses Amazon Web Services (AWS). Below shows the flow and services we used to build it.

Figure 7: Architecture and flow

Adding data (ETL – Extract -> Transform -> Load)

  • We use Kafka, which is a producer-consumer distributed streaming platform. Data is added to Kafka, and later read by a microservice which create raw Parquet files in S3.
  • Another microservice uses AWS Athena to hourly or daily process the data – filter, partition, and sort and aggregate it into new Parquet files
  • This flow is done on each of the AWS regions we support

Figure 8: SQL to Parquet flow example

Technical details:

  • Each partition creation is done by one or more Athena:
  • Each query result with one more more Parquet files
  • ETL microservices run on a Kubernetes cluster per region. They are developed and deployed using our development pipeline.


  • Different microservices consume the aggregated data using Athena API through boto3 Python library
  • Day to day queries are done using SQL client like DBeaver with Athena JDBC driver. Athena AWS management console is also used for SQL queries
  • Apache Spark engine is used to run spark queries, including machine learning using the spark-ml Apache Zeppelin is used as a client to run scripts and display visualization. Both Spark and Zeppelin are installed as part of AWS EMR service.

Multi-region queries

Data privacy regulations such as GDPR add a twist, especially since we store data in multiple regions. There are two ways to perform multi-region queries:

  • Single query engine based in one of the regions
  • Query engine per region – get results per region and perform an aggregation

With a single query engine you can run SQL on data from multiple regions, BUT data is transferred between regions, which means you pay both in performance and cost.

With a query engine per region you have to aggregate the results, which may not be a simple task.

With AWS Athena – both options are available, since you don’t need to manage your own query engine.

Threat Analytics Data Lake – before and after

Before the data lake, we had several database solutions – relational and big data. The relational database couldn’t scale, forcing us to delete data or drop old tables. Eventually, we did analytics on a much smaller part of the data than we wanted.

With the big data solutions, the cost was high. We needed dedicated servers, and disks for storage and queries. That’s overkill: we don’t need server access 24/7, as daily batch queries work fine. We also did not have strong SQL capabilities, and found ourselves deleting data because we did not to pay for more servers.

With our data lake, we get better analytics by:

  • Storing more data (billions of records processed daily!), which is used by our queries
  • Using SQL capabilities on a large amount of data using Athena
  • Using multiple query engines with different capabilities, like Spark for machine learning
  • Allowing queries on multiple regions for an average, acceptable response time of just 3 seconds

In addition we also got the following improvements:

  • Huge cost reductions in storage and compute
  • Reduced server maintenance

In conclusion – a data lake worked for us. AWS services made it easier for us to get the results we wanted at an incredibly low cost. It could work for you, depending on factors such as the amount of data, its format, use cases, platform and more. We suggest learning your requirements and do a proof-of-concept with real data to find out!

The post How Our Threat Analytics Multi-Region Data Lake on AWS Stores More, Slashes Costs appeared first on Blog.

How to Deploy a Graylog SIEM Server in AWS and Integrate with Imperva Cloud WAF

Security Information and Event Management (SIEM) products provide real-time analysis of security alerts generated by security solutions such as Imperva Cloud Web Application Firewall (WAF). Many organizations implement a SIEM solution to bring visibility of all security events from various solutions and to have the ability to search them or create their own dashboard.

Note that a simpler alternative to SIEM is Imperva Attack Analytics, which reduces the burden of integrating a SIEM logs solution and provides a condensed view of all security events into comprehensive narratives of events rated by severity. A demo of Imperva Attack Analytics is available here.

This article will take you step-by-step through the process of deploying a Graylog server that can ingest Imperva SIEM logs and let you review your data. They are:

  • Step 1: Deploy a new Ubuntu server on AWS
  • Step 2: Install java, Mongodb, elasticsearch
  • Step 3: Install Graylog
  • Step 4: Configure the SFTP server on the AWS server
  • Step 5: Start pushing SIEM logs from Imperva Incapsula

The steps apply to the following scenario:

  • Deployment as a stand-alone EC2 on AWS
  • Installation from scratch, from a clean Ubuntu machine (not a graylog AMI in AWS)
  • Single server setup, where the logs are located in the same server as Graylog
  • Push of the logs from Imperva using SFTP

Most of the steps below also apply to any setup or cloud platforms besides AWS. Note that in AWS, a Graylog AMI image does exist, but only with Ubuntu 14 at the time of writing. Also, I will publish future blogs on how to parse your Imperva SIEM logs and how to create a dashboard to read the logs.

Step 1: Deploy an Ubuntu Server on AWS

As a first step, let’s deploy an Ubuntu machine in AWS with the 4GB RAM required to deploy Graylog.

  1. Sign in to the AWS console and click on EC2
  2. Launch an instance and select.
  3. Select Ubuntu server 16.04, with no other software pre-installed.

It is recommended to use Ubuntu 16.04 and above, as some repo are already pre-included such as MongoDB and Java openjdk-8-jre, which simplifies the installation process. The command lines below apply for Ubuntu 16.04 (systemctl command, for instance, is not applicable for Ubuntu 14).

4. Select the Ubuntu server with 4GB RAM.

4GB is the minimum for Graylog, but you might consider more RAM depending on the volume of the data that you plan to gather.

5. Optional: increase the disk storage.

 Since we will be collecting logs, we will need more storage than the default space. The storage volume will depend a lot on the site traffic and the type of logs you will retrieve (all traffic logs or only security events logs).

Note that you will likely require much more than 40GB. If you are deploying on AWS, you can easily increase the capacity of your EC2 server anytime.

6. Select an existing key pair so you can connect to your AWS server via SSH later.

If you do not have an existing SSH key pair in your AWS account, you can create it using the ssh-keygen tool, which is part of the standard openSSH installation or using puttygen on Windows. Here’s a guide to creating and uploading your SSH key pairs.

7. Give your EC2 server a clear name and identify its public DNS and IPv4 addresses.

8. Configure the server security group in AWS.

Let’s make sure that port 9000 in particular is open. You might need to open other ports if logs are forwarded from another log collector, such as port 514 or 5044.

It is best practice that you open port 22 only from Cloud WAF IP (this link) or from your IP. Prevent from opening port 22 to the world.

You can also consider locking the UI access to your public IP only.

9. SSH to your AWS server with the Ubuntu user, after uploading your key in Putty and putting the AWS public DNS entry.

10. Update your Ubuntu system to the latest versions and updates.

sudo apt-get update

sudo apt-get upgrade

Select “y” when prompted or the default options offered.

Step 2: Install Java, MongoDB and Elasticsearch

11. Install additional packages including Java JDK.

sudo apt-get install apt-transport-https openjdk-8-jre-headless uuid-runtime pwgen

Check that Java is properly installed by running:

java -version

And check the version installed. If all is working properly, you should see a response like:12. Install MongoDB. Graylog uses MongoDB to store the Graylog configuration data

MongoDB is included in the repos of Ubuntu 16.04 and works with Graylog 2.3 and above.

sudo apt-get install mongodb-server

Start mongoDB and make sure it starts with the server:

sudo systemctl start mongod

sudo systemctl enable mongod

And we can check that it is properly running by:

sudo systemctl status mongod

13. Install and configure Elasticsearch

Graylog 2.5.x can be used with Elasticsearch 5.x. You can find more instructions in the Elasticsearch installation guide:

wget -qO – | sudo apt-key add –

echo “deb stable main” | sudo tee -a /etc/apt/sources.list.d/elastic-5.x.list

sudo apt-get update && sudo apt-get install elasticsearch

Now modify the Elasticsearch configuration file located at /etc/elasticsearch/elasticsearch.yml and set the cluster name to graylog.

sudo nano /etc/elasticsearch/elasticsearch.yml

Additionally you need to uncomment (remove the # as first character) the line: graylog

Now, you can start Elasticsearch with the following commands:

sudo systemctl daemon-reload
sudo systemctl enable elasticsearch.service
sudo systemctl restart elasticsearch.service

By running sudo systemctl status elasticsearch.service you should see Elasticsearch up and running as below:

Step 3: Install Graylog

14. We can now install Graylog repository and Graylog itself with the following commands:

sudo dpkg -i graylog-2.5-repository_latest.deb

sudo apt-get update && sudo apt-get install graylog-server

15. Configure Graylog

First, create a password of at least 64 characters by running the following command:

pwgen -N 1 -s 96

And copy the result referenced below as password

Let’s create its sha256 checksum as required in the Graylog configuration file:

echo -n password | sha256sum

Now you can open the Graylog configuration file:

sudo nano /etc/graylog/server/server.conf

And replace password_secret and root_password_sha2 with the values you created above.

The configuration file should look as below (replace with your own generated password):

Now replace the following entries with your AWS CNAME that was given when creating your EC2 instance. Note that also, depending on your setup, you can replace the alias below with your internal IP.


web:16. Optional: Configure HTTPS for the Graylog web interface

Although not mandatory, it is recommended that you configure https to your Graylog server.

Please find the steps to setup https in the following link:

17. Start the Graylog service and enable it on system startup

Run the following commands to restart Graylog and enforce it on the server startup:

sudo systemctl daemon-reload
sudo systemctl enable graylog-server.service
sudo systemctl start graylog-server.service

Now we can check that Graylog has properly started:

sudo systemctl status graylog-server.service18. Login to the Graylog console

You should now be able to login to the console.

If the page is not loading at all, check if you have properly configured the security group of your instance and that port 9000 is open.

You can login with username ‘admin’ and the password you set as your secret password.

Step 4: Configure SFTP on your server and Imperva Cloud WAF SFTP push

19. Create a new user and its group

Let’s create a directory where the logs will be sent to Incapsula to send logs.

sudo.adduser incapsula

incapsula is the user name created in this example. You can replace it to the name of your choice. You will be prompted to choose a password.

Let’s create a new group:

sudo groupadd incapsulagroup

And associate the incapsula user to this group

sudo usermod -a -G incapsulagroup incapsula  

20. Let’s create a directory where the logs will be sent to

In this example, we will send all log to /home/incapsula/logs

cd /home

sudo mkdir incapsula

cd incapsula

sudo mkdir logs

21. Now let’s set strict permissions restrictions to that folder

For security purposes, we want to restrict access of this user strictly to the folder where the logs will be sent. The home and incapsula folders can be owned by root while logs will be owned by our newly created user.

sudo chmod 755 /home/incapsula

sudo chown root:root /home/incapsula

Now let’s assign our new user (incapsula in our example) as the owner of the logs directory:

sudo chown -R incapsula:incapsulagroup /home/incapsula/logs

The folder is now owned by incapsula and belongs to incapsulagroup.

And you can see that the incapsula folder is restricted to root, so the newly created incapsula user can only access the /home/incapsula/logs folder, to send its logs.

22. Now let’s configure open-ssh SFTP server and set the appropriate security restrictions.

sudo nano /etc/ssh/sshd_config

Comment out this section:

#Subsystem sftp /usr/lib/openssh/sftp-server

And add this line right below:

subsystem sftp internal-sftp

Change the authentication to allow password authentication so Incapsula can send logs using username / password authentication:

PasswordAuthentication yes

And add the following lines at the bottom of the document:

match group incapsulagroup

chrootDirectory /home/incapsula

X11Forwarding no

AllowTcpForwarding no
ForceCommand internal-sftp
PasswordAuthentication yes

Save the file and exit.

Let’s now restart the SSH server:

sudo service sshd restart

23. Now let’s check that we can send files using SFTP

For that, let’s open use Filezilla and try to upload a file. If everything worked properly, you should be able to:

  • Connect successfully
  • See the logs folder, and be unable to navigate
  • Copy a file to the remote server

Step 5: Push the logs from Imperva Incapsula to the Graylog SFTP folder

24. Configure the Logs in Imperva Cloud WAF SIEM logs tab

  • Log into your account.
  • On the sidebar, click Logs > Log Setup
    • Make sure you have SIEM logs license enabled.
  • Select the SFTP option
  • In the host section enter your public facing AWS hostname. Note that your security group should be open to Incapsula IPs as described in the Security Group section earlier.
  • Update the path to log
  • For the first testing, let’s disable encryption and compression.
  • Select CEF as log format
  • Click Save

See below an example of the settings. Click Test Connection and ensure it is successful. Click Save.

25. Make sure the logs are enabled for the relevant sites as below

You can select either security logs or all access logs on a site-per-site basis.

Selecting All Logs will retrieve all access logs, while Security Logs will push only logs where security events were raised.

  • Note that selecting All Logs will have a significant impact on the volume of logs.

You can find more details on the various settings of the SIEM logs integration in Imperva documentation in this link.

26. Verify that logs are getting pushed from Incapsula servers to your FTP folder


The first logs might take some time to reach your server, depending on the volume of traffic on the site, in particular for a site with little traffic. Generate some traffic and events.

27. Enhance performance and security

To improve the security and performance of your SIEM integration project, you can consider enforcing https in Graylog. You can find a guide to configure https on Graylog here.

 That’s it! In my next blogs, we will describe how to start collecting and parsing Imperva and Incapsula logs using Graylog and how to create your first dashboard.

If you have suggestions for improvements or updates in any of the steps, please share with the community in the comments below.

The post How to Deploy a Graylog SIEM Server in AWS and Integrate with Imperva Cloud WAF appeared first on Blog.

How Imperva’s New Attack Crowdsourcing Secures Your Business’s Applications

Attacks on applications can be divided into two types: targeted attacks and “spray and pray” attacks. Targeted attacks require planning and usually include a reconnaissance phase, where attackers learn all they can about the target organization’s IT stack and application layers. Targeted application attacks are vastly outnumbered by spray and pray attacks. The perpetrators of spray and pray attacks are less discriminating about their victims. Their goal is to find and steal anything that can be leveraged or sold on the dark web. Sometimes spray and pray attacks are used for reconnaissance, and later develop into a targeted attack.

One famous wave of spray and pray attacks took place against Drupal, the popular open-source content management system (CMS). In March 2018, Drupal reported a highly critical vulnerability (CVE-2018-7600) that earned the nickname, Drupalgeddon 2. This vulnerability enables an attacker to run arbitrary code on common Drupal versions, affecting millions of websites. Tools exploiting this weakness became widely available, which caused the number of attacks on Drupal sites to explode.

The ability to identify spray and pray attacks is an important insight for security personnel. It can help them prioritize which attacks to investigate, evaluate the true risk to their application, and/or identify a sniffing attack that could be a precursor to a more serious targeted one.

Identifying Spray and Pray Attacks in Attack Analytics

Attack Analytics, launched in May 2018, aims to crush the maddening pace of alerts that security teams receive. For security analysts unable to triage this alert avalanche, Attack Analytics condenses thousands upon thousands of alerts into a handful of relevant, investigate-able incidents. Powered by artificial intelligence, Attack Analytics automates what would take a team of security analysts days to investigate and cuts that investigation time down to a matter of minutes.

We recently updated Attack Analytics to provide a list of spray and pray attacks that may hit your business as part of a larger campaign. We researched these attacks using crowdsourced attack data gathered with permission from our customers. This insight is now presented in our Attack Analytics dashboard, as can be seen in the red circled portion of Figure 1 below.

Figure 1: Attack Analytics Dashboard

Clicking on the Similar Incidents Insights section shows more detail on the related attacks (Figure 2). An alternative way to get the list of spray and pray incidents potentially affecting the user is to login to the console and use the “How common” filter.

Figure 2: Attack Analytics Many Customers Filter


A closer view of the incidents will tell you the common attributes of the attack affecting other users (Figure 3).

Figure 3: Attack Analytics Incident Insights

How Our Algorithm Works

The algorithm that identifies spray and pray attacks examines incidents across Attack Analytics customers. When there are similar incidents across a large number of customers in a close amount of time, we identify this as a likely spray and pray attack originating from the same source. Determining the similarity of incidents requires domain knowledge, and is based on a combination of factors, such as:

  • The attack source: Network source (IP/Subnet), Geographic location
  • The attack target: URL, Host, Parameters
  • The attack time: Duration, Frequency
  • The attack type: Triggered rule
  • The attack tool: Tool name, type & parameters

In some spray and pray attacks, the origin of the attack is the most valuable piece of information connecting multiple incidents. When it is a distributed attack, the origin of the attack is not relevant, while other factors are relevant. In many cases, a spray and pray attack will be aimed at the same group of URLs.

Another significant common factor is the attack type, in particular, a similar set of rules that were violated in the Web Application Firewall (WAF). Sometimes, the same tools are observed, or the tools belong to the same type of attacks. The time element is also key, especially the duration of the attack or the frequency.

Results and Findings

The Attack Analytics algorithm is designed to identify groups of cross-account incidents. Each group has a set of common features that ties the incidents together. When we reviewed the results and the characteristics of various groupings, we discovered interesting patterns. First, most attacks (83.3%) were common among customers (Figure 4). Second, most attacks (67.4%) belong to groups with single source, meaning the attack came from the same IP address. Third, Bad Bot attacks still have a significant presence (41.1%). In 14.8% of the attacks, a common resource (like a URL) is attacked.

Figure 4: Spray & Pray Incidents Spread

Here’s an interesting example – a spray and pray attack from a single IP that attacked 1,368 customers in the same 3 consecutive days with the same vulnerability scanner, LTX71. We’ve also seen Bad Bots illegally accessing resources, attacking from the same subnet located in Illinois using a Trustwave vulnerability scanner. These bots performed a URLs scan on our customers resources – an attack which was blocked by our Web Application Firewall (WAF). Another attack involved a German IP trying to access the same WordPress-created system files  on more than 50 different customers with a cURL. And the list goes on.

Focusing on single-source spray and pray incidents has shown that these attacks affect a significant percentage of our customers. For example, in Figure 5 we see that the leading attack came from one Ukrainian IP that hit at least 18.49% of our customers. Almost every day, one malicious IP would attack a significant percentage of our customers.

Figure 5: Single Source Spray & Pray Accounts Affected

More Actionable Insights Coming

Identifying spray and pray attacks is a great example of using the intelligence from Imperva’s customer community to create insights that will help speed up your security investigations. Spray and pray attacks are not the only way of adding insights from community knowledge. Using machine-learning algorithms combined with domain knowledge, we plan to add more security insights like these to our Attack Analytics dashboard in the near future.

The post How Imperva’s New Attack Crowdsourcing Secures Your Business’s Applications appeared first on Blog.