Category Archives: AWS

Imperva Cloud WAF and Graylog, Part II: How to Collect and Ingest SIEM Logs

This guide gives step-by-step guidance on how to collect and parse Imperva Cloud Web Application Firewall (WAF, formerly Incapsula) logs into the Graylog SIEM tool. Read Part I to learn how to set up a Graylog server in AWS and integrate with Imperva Cloud WAF.

This guide assumes:

  • You have a clean Graylog server up and running, as described in my earlier blog article
  • You are pushing (or pulling) the Cloud WAF SIEM logs into a folder within the Graylog server
  • You are not collecting the logs yet

Important! The steps below apply for the following scenario:

  • Deployment as a stand-alone EC2 in AWS
  • Single-server setup, with the logs located on the same server as Graylog
  • The logs are pushed to the server uncompressed and unencrypted

Although this blog was created for a deployment on AWS, most of the steps below apply. Other setups (other clouds, on-premises) will require a few networking changes from the guide below.

This article will detail all the steps to configure the log collector and parser in few major steps:

  • Step 1: Install the sidecar collector package for Graylog
  • Step 2: Configure a new log collector in Graylog
  • Step 3: Creating a log Input & extractor with Incapsula content pack for Graylog (the json with the parsing rules)

Step 1: Install the Sidecar Collector Package

  1. Install the Graylog sidecar collector

Let’s first download the appropriate package. Identify the right sidecar collector package suited for our deployment in Github:

Since we are deploying Graylog 2.5.x, the corresponding sidecar collector version is 0.1.x.

Go ahead and install the relevant package.

We are running a 64bits server and .deb work best with Debian/Ubuntu machines.

Run the following commands:

  1. cd /tmp   #or any directory you would like to use

Download the right package:

2. curl -L -O

  1. Install the package:

sudo dpkg -i collector-sidecar_0.1.7-1_amd64.deb  

2. Configure sidecar collector

cd /etc/graylog/collector-sidecar

sudo nano collector_sidecar.yml

Now let’s change the server URL to the local server IP (local IP, not AWS public IP). And let’s add incapsula-logs to the tags:

And let’s install and start graylog-collector-sidecar service with the following commands:

sudo graylog-collector-sidecar -service installsudo systemctl start collector-sidecar 

Step 2: Configure a New Log Collector in Graylog

3. Add a collector in Graylog

Follow the steps below to add a collector:

  • System > Collectors
  • Click Manage Configurations
  • Then click Create configuration

Let’s name it and then click on the newly-created collector:

4. Add Incapsula-logs as a tag

5. Configure the Output and then the Input of the collectors

Click on Create Output and configure the collector as below:

Now click on Create Input and configure the collector as below:

Step 3: Creating Log Inputs and Extractors with Incapsula (now named Imperva Cloud Web Application Firewall) Content Pack for Graylog

6. Let’s now launch a new Input as below:

And configure the Beats collector inputs as required:

The TLS details are not mandatory at this stage as we will work with unencrypted SIEM logs for this blog.

7. Download the Incapsula SIEM package for Graylog in Github

Reach the following link.

Retrieve the json configuration of the package. It includes all the Imperva Cloud Web Application Firewall (formerly Incapsula) parsing rules of event fields which will allows an easy import within Graylog along with clear naming.

Extract the content_pack.json file and import it as an extractor in Graylog.

Go to System/ Content packs and import the json file you just downloaded:

The content pack will use our legacy name and we can now apply the new content pack as below:

The new content pack will be displayed in Graylog System / Input menu from which you can extract its content by clicking “Manage extractors”:

And you can now import it in our predefined extractors  to the input we previously configured.

Paste the content of Incapsula content pack extractor:

If all works as expected, you should get the confirmation as below:

You should now see that the headers that will be parsed by Graylog have been successfully imported and will have appropriate naming as can be seen in a screenshot below:

Field extractors for Incapsula (or Imperva Cloud WAF) events in Graylog SIEM:

8. Restart the sidecar collector service

Once all is configured you can restart the sidecar service with the following command in the server command line:

sudo systemctl restart collector-sidecar

We can also enforce sidecar collector to run at the server startup:

sudo systemctl enable collector-sidecar 

Let’s check that the collector is active and the service is properly running:

The collector service should now appear as active in Graylog:

9. Check that you see messages and logs

Click on the Search bar. After a few minutes you should start to see the logs displayed.

Give it 10-15 minutes before troubleshooting if you don’t see messages displayed immediately.

You can see on the left panel that the filters and retrieved headers are in line with Imperva. Client_IP are retrieved from the Incap-Client-IP and are the real client IPs, Client App are the client classification detected by Imperva Cloud WAF etc…

The various headers are explained in the following headers that you can see:

10. Congratulations!

Congratulations, we are now successfully exporting, collecting and parsing Imperva Cloud WAF/Incapsula SIEM logs!

In the next article, we will review the imported Imperva Cloud WAF dashboard template.

If you have suggestions for improvements or updates in any of the steps, please share with the community in the comments below.

The post Imperva Cloud WAF and Graylog, Part II: How to Collect and Ingest SIEM Logs appeared first on Blog.

How Our Threat Analytics Multi-Region Data Lake on AWS Stores More, Slashes Costs

Data is the lifeblood of digital businesses, and a key competitive advantage. The question is: how can you store your data cost-efficiently, access it quickly, while abiding by privacy laws?

At Imperva, we wanted to store our data for long-term access. Databases would’ve cost too much in disk and memory, especially since we didn’t know much it would grow, how long we would keep it, and which data we would actually access in the future. The only thing we did know? That new business cases for our data would emerge.

That’s why we deployed a data lake. It turned out to be the right decision, allowing us to store 1,000 times more data than before, even while slashing costs.

What is a data lake?

A data lake is a repository of files stored in a distributed system. Information is stored in its native form, with little or no processing. You simply store the data in its native formats, such as JSON, XML, CSV, or text.

Analytics queries can be run against both data lakes and databases. In a database you create a schema, plan your queries, and add indices to improve performance. In a data lake, it’s different — you simply store the data and it’s query-ready.

Some file formats are better than others, of course. Apache Parquet allows you to store records in a compressed columnar file. The compression saves disk space and IO, while the columnar format allows the query engine to scan only the relevant columns. This reduces query time and costs.

Using a distributed file system lets you store more data at a lower cost. Whether you use Hadoop HDFS, AWS S3, or Azure Storage, the benefits include:

  • Data replication and availability
  • Options to save more money – for example, AWS S3 has different storage options with different costs
  • Retention policy – decide how long you want to keep your data before it’s automatically deleted

No wonder experts such as Adrian Cockcroft, VP of cloud architecture strategy at Amazon Web Services, said this week that “cloud data lakes are the future.”

Analytic queries: data lake versus database

Let’s examine the capabilities, advantages and disadvantages of a data lake versus a database.

The data

A data lake supports structured and unstructured data and everything in-between. All data is collected and immediately ready for analysis. Data can be transformed to improve user experience and performance. For example, fields can be extracted from a data lake and data can be aggregated.

A database contains only structured and transformed data. It is impossible to add data without declaring tables, relations and indices. You have to plan ahead and transform the data according to your schema.

Figure 1: Data Lake versus Database

The Users

Most users in a typical organization are operational, using applications and data in predefined and repetitive ways. A database is usually ideal for these users. Data is structured and optimized for these predefined use-cases. Reports can be generated, and filters can be applied according to the application’s design.

Advanced users, by contrast, may go beyond an application to the data source and use custom tools to process the data. They may also bring in data from outside the organization.

The last group are the data experts, who do deep analysis on the data. They need the raw data, and their requirements change all the time.

Data lakes support all of these users, but especially advanced and expert users, due to the agility and flexibility of a data lake.

Figure 2: Typical user distribution inside an organization

Query engine(s)

In a database, the query engine is internal and is impossible to change. In a data lake, the query engine is external, letting users choose based on their needs. For example, you can choose Presto for SQL-based analytics and Spark for machine learning.

Figure 3: A data lake may have multiple external query engines. A database has a single internal query engine.

Support of new business use-case

Database changes may be complex. Data should be analyzed and formatted, while schema has to be created before data can be inserted. If you have a busy development team, users can wait months or a year to see the new data in their application.

Few businesses can wait this long. Data lakes solve this by letting users go beyond the structure to explore data. If this proves fruitful, than a formal schema can be applied. You get to results quickly, and fail fast. This agility lets organizations quickly improve their use cases, better know their data, and react fast to changes.

Figure 4: Support of new business use-case

Data lake structure

Here’s how data may flow inside a data lake.

Figure 5: Data lake structure and flow

In this example, CSV files are added to the data lake to a “current day” folder. This folder is the daily partition which allows querying a day’s data using a filter like day = ‘2018-1-1’. Partitions are the most efficient way to filter data.

The data under tables/events is an aggregated, sorted and formatted version of the CSV data. It uses the parquet format to improve query performance and for compression. It also has an additional “type” partition, because most queries work only on a single event type. Each file has millions of records inside, with metadata for efficiency. For example, you can know the count, min and max values for all of the columns without scanning the file.

This events table data has been added to the data lake after the raw data has been validated and analyzed.

Here is a simplified example of CSV to Parquet conversion:

Figure 6: Example for conversion of CSV to Parquet

Parquet files normally hold large number of records, and can be divided internally into “row groups” which have their own metadata. Repeating values improves compression and the columnar structure allows scanning only the relevant columns. The CSV data can be queried at any time, but it is not as efficient as querying the data under the tables/events data.

Flow and Architecture


Imperva’s data lake uses Amazon Web Services (AWS). Below shows the flow and services we used to build it.

Figure 7: Architecture and flow

Adding data (ETL – Extract -> Transform -> Load)

  • We use Kafka, which is a producer-consumer distributed streaming platform. Data is added to Kafka, and later read by a microservice which create raw Parquet files in S3.
  • Another microservice uses AWS Athena to hourly or daily process the data – filter, partition, and sort and aggregate it into new Parquet files
  • This flow is done on each of the AWS regions we support

Figure 8: SQL to Parquet flow example

Technical details:

  • Each partition creation is done by one or more Athena:
  • Each query result with one more more Parquet files
  • ETL microservices run on a Kubernetes cluster per region. They are developed and deployed using our development pipeline.


  • Different microservices consume the aggregated data using Athena API through boto3 Python library
  • Day to day queries are done using SQL client like DBeaver with Athena JDBC driver. Athena AWS management console is also used for SQL queries
  • Apache Spark engine is used to run spark queries, including machine learning using the spark-ml Apache Zeppelin is used as a client to run scripts and display visualization. Both Spark and Zeppelin are installed as part of AWS EMR service.

Multi-region queries

Data privacy regulations such as GDPR add a twist, especially since we store data in multiple regions. There are two ways to perform multi-region queries:

  • Single query engine based in one of the regions
  • Query engine per region – get results per region and perform an aggregation

With a single query engine you can run SQL on data from multiple regions, BUT data is transferred between regions, which means you pay both in performance and cost.

With a query engine per region you have to aggregate the results, which may not be a simple task.

With AWS Athena – both options are available, since you don’t need to manage your own query engine.

Threat Analytics Data Lake – before and after

Before the data lake, we had several database solutions – relational and big data. The relational database couldn’t scale, forcing us to delete data or drop old tables. Eventually, we did analytics on a much smaller part of the data than we wanted.

With the big data solutions, the cost was high. We needed dedicated servers, and disks for storage and queries. That’s overkill: we don’t need server access 24/7, as daily batch queries work fine. We also did not have strong SQL capabilities, and found ourselves deleting data because we did not to pay for more servers.

With our data lake, we get better analytics by:

  • Storing more data (billions of records processed daily!), which is used by our queries
  • Using SQL capabilities on a large amount of data using Athena
  • Using multiple query engines with different capabilities, like Spark for machine learning
  • Allowing queries on multiple regions for an average, acceptable response time of just 3 seconds

In addition we also got the following improvements:

  • Huge cost reductions in storage and compute
  • Reduced server maintenance

In conclusion – a data lake worked for us. AWS services made it easier for us to get the results we wanted at an incredibly low cost. It could work for you, depending on factors such as the amount of data, its format, use cases, platform and more. We suggest learning your requirements and do a proof-of-concept with real data to find out!

The post How Our Threat Analytics Multi-Region Data Lake on AWS Stores More, Slashes Costs appeared first on Blog.

Imperva Makes Major Expansion in Application Security

When Imperva announced in 2018 it would acquire the application security solution provider Prevoty, a company I co-founded with Julien Bellanger, I knew it would be a win-win for our industry. Prevoty’s flagship product, Autonomous Application Protection, is the most mature, market-tested runtime application self-protection (RASP) solution (as proof, Prevoty was just named a Silver Winner in the Cybersecurity Excellence Awards). Together, Imperva and Prevoty are creating a consolidated, comprehensive platform for application and data security.

More importantly, this acquisition is a big win for our customers. The combination of Imperva and Autonomous Application Protection extends customers’ visibility into how applications behave and how users interact with sensitive information. With this expanded view across their business assets, customers will have deeper insights to understand and mitigate security risk at the edge, application, and database.

In parallel with product integrations, our teams of security innovators are coming together. I am delighted to join the Imperva team as CTO and to lead a highly accomplished group to radically transform the way our industry thinks about application and data security. In the coming horizon, we will boost data visibility throughout the stack, translate billions of data points into actionable insights, and intelligently automate responses that protect businesses. In fact, we just released two new features that deliver on those goals: Network Activity Protection and Weak Cryptography Protection. Learn more about these at and also in my interview with eWeek.

Network Activity Protection provides organizations with the ability to monitor and prevent unauthorized outbound network communications originating from within their applications, APIs, and microservices — a blind spot for organizations that are undergoing a digital transformation. Organizations now have a clear view into the various endpoints with which their applications communicate.

The new Weak Cryptography Protection feature offers the ability to monitor and protect against the use of specific weak hashing algorithms (including SHA-1, MD5) and cryptographic ciphers (including AES, 3DES/DES, RC4). Applications that leverage Autonomous Application Protection can now monitor and force compliant cryptographic practices.  

Imperva is leading the world’s fight to keep data and applications safe from cyber criminals. Organizations that deploy Imperva will not have to choose between innovation and protecting their customers. The future of application and data security will be smarter,simpler, and we are leading the way there.

Imperva will be at the RSA Conference March 4-8 in San Francisco. Stop by Booth 527 in the South Expo and learn about the New Imperva from me (I’ll be there Tuesday-Thursday) and other executives! We’ve revamped our suite of security solutions under a new license called FlexProtect that makes it simpler for organizations to deploy our security products and services to deliver the agility they need as they digitally transform their businesses.

Start your day or enjoy an afternoon pick-me-up by grabbing a coffee in our booth Tuesday through Thursday 10-2 pm, while:

  • See a demo of our latest products in the areas of cloud app and data security and data risk analytics
  • Learn more about how our suite of security solutions works in AWS environments

Imperva will also be at the AWS booth (1227 in the South Expo hall). There, you can:

  • Hear how one of our cloud customers, an U.S.-based non-profit with nearly 40 million members, uses AAP to detect and mitigate potential application attacks, Tuesday, March 5th from 3:30 – 4:00 pm in the AWS booth
  • See a demo of how our solutions work in cloud environments, Tuesday, March 5th 3:30-5 pm and Wednesday, March 6th, 11:30-2 pm

Finally – we will be participating in the webinar “Cyber Security Battles: How to Prepare and Win” at RSA. It will be first broadcast at 9:30 am on March 6th and feature George McGregor, vice-president of product marketing at Imperva, in a Q&A discussion with executives from several other vendors as they discuss the possibility of a cyber battle between AI systems, which experts predict might be on the horizon in the next three to five years. Register and watch for free!

The post Imperva Makes Major Expansion in Application Security appeared first on Blog.

AWS EC2 instance userData

In the effort to get me blogging again I'll be doing a few short posts to get the juices flowing (hopefully).

Today I learned about the userData instance attribute for AWS EC2.

In general I thought metadata was only things you can hit from WITHIN the instance via the metadata url:

However, if you read the link above there is an option to add metadata at boot time. 

You can also use instance metadata to access user data that you specified when launching your instance. For example, you can specify parameters for configuring your instance, or attach a simple script. 

That's interesting right?!?!  so if you have some AWS creds the easiest way to check for this (after you enumerate instance IDs) is with the aws cli.

$ aws ec2 describe-instance-attribute --attribute userData --instance-id i-0XXXXXXXX

An error occurred (InvalidInstanceID.NotFound) when calling the DescribeInstanceAttribute operation: The instance ID 'i-0XXXXXXXX' does not exist

ah crap, you need the region...

$ aws ec2 describe-instance-attribute --attribute userData --instance-id i-0XXXXXXXX --region us-west-1
    "InstanceId": "i-0XXXXXXXX",
    "UserData": {
        "Value": "bm90IHRvZGF5IElTSVMgOi0p"}

anyway that can get tedious especially if the org has a ton of things running.  This is precisely the reason @cktricky and I built weirdAAL.  Surely no one would be sticking creds into things at boot time via shell scripts :-)

The module loops trough all the regions and any instances it finds and queries for the userData attribute.  Hurray for automation.

That module is in the current version of weirdAAL. Enjoy.