Category Archives: unstructured data

How Our Threat Analytics Multi-Region Data Lake on AWS Stores More, Slashes Costs

Data is the lifeblood of digital businesses, and a key competitive advantage. The question is: how can you store your data cost-efficiently, access it quickly, while abiding by privacy laws?

At Imperva, we wanted to store our data for long-term access. Databases would’ve cost too much in disk and memory, especially since we didn’t know much it would grow, how long we would keep it, and which data we would actually access in the future. The only thing we did know? That new business cases for our data would emerge.

That’s why we deployed a data lake. It turned out to be the right decision, allowing us to store 1,000 times more data than before, even while slashing costs.

What is a data lake?

A data lake is a repository of files stored in a distributed system. Information is stored in its native form, with little or no processing. You simply store the data in its native formats, such as JSON, XML, CSV, or text.

Analytics queries can be run against both data lakes and databases. In a database you create a schema, plan your queries, and add indices to improve performance. In a data lake, it’s different — you simply store the data and it’s query-ready.

Some file formats are better than others, of course. Apache Parquet allows you to store records in a compressed columnar file. The compression saves disk space and IO, while the columnar format allows the query engine to scan only the relevant columns. This reduces query time and costs.

Using a distributed file system lets you store more data at a lower cost. Whether you use Hadoop HDFS, AWS S3, or Azure Storage, the benefits include:

  • Data replication and availability
  • Options to save more money – for example, AWS S3 has different storage options with different costs
  • Retention policy – decide how long you want to keep your data before it’s automatically deleted

No wonder experts such as Adrian Cockcroft, VP of cloud architecture strategy at Amazon Web Services, said this week that “cloud data lakes are the future.”

Analytic queries: data lake versus database

Let’s examine the capabilities, advantages and disadvantages of a data lake versus a database.

The data

A data lake supports structured and unstructured data and everything in-between. All data is collected and immediately ready for analysis. Data can be transformed to improve user experience and performance. For example, fields can be extracted from a data lake and data can be aggregated.

A database contains only structured and transformed data. It is impossible to add data without declaring tables, relations and indices. You have to plan ahead and transform the data according to your schema.

Figure 1: Data Lake versus Database

The Users

Most users in a typical organization are operational, using applications and data in predefined and repetitive ways. A database is usually ideal for these users. Data is structured and optimized for these predefined use-cases. Reports can be generated, and filters can be applied according to the application’s design.

Advanced users, by contrast, may go beyond an application to the data source and use custom tools to process the data. They may also bring in data from outside the organization.

The last group are the data experts, who do deep analysis on the data. They need the raw data, and their requirements change all the time.

Data lakes support all of these users, but especially advanced and expert users, due to the agility and flexibility of a data lake.

Figure 2: Typical user distribution inside an organization

Query engine(s)

In a database, the query engine is internal and is impossible to change. In a data lake, the query engine is external, letting users choose based on their needs. For example, you can choose Presto for SQL-based analytics and Spark for machine learning.

Figure 3: A data lake may have multiple external query engines. A database has a single internal query engine.

Support of new business use-case

Database changes may be complex. Data should be analyzed and formatted, while schema has to be created before data can be inserted. If you have a busy development team, users can wait months or a year to see the new data in their application.

Few businesses can wait this long. Data lakes solve this by letting users go beyond the structure to explore data. If this proves fruitful, than a formal schema can be applied. You get to results quickly, and fail fast. This agility lets organizations quickly improve their use cases, better know their data, and react fast to changes.

Figure 4: Support of new business use-case

Data lake structure

Here’s how data may flow inside a data lake.

Figure 5: Data lake structure and flow

In this example, CSV files are added to the data lake to a “current day” folder. This folder is the daily partition which allows querying a day’s data using a filter like day = ‘2018-1-1’. Partitions are the most efficient way to filter data.

The data under tables/events is an aggregated, sorted and formatted version of the CSV data. It uses the parquet format to improve query performance and for compression. It also has an additional “type” partition, because most queries work only on a single event type. Each file has millions of records inside, with metadata for efficiency. For example, you can know the count, min and max values for all of the columns without scanning the file.

This events table data has been added to the data lake after the raw data has been validated and analyzed.

Here is a simplified example of CSV to Parquet conversion:

Figure 6: Example for conversion of CSV to Parquet

Parquet files normally hold large number of records, and can be divided internally into “row groups” which have their own metadata. Repeating values improves compression and the columnar structure allows scanning only the relevant columns. The CSV data can be queried at any time, but it is not as efficient as querying the data under the tables/events data.

Flow and Architecture

General

Imperva’s data lake uses Amazon Web Services (AWS). Below shows the flow and services we used to build it.

Figure 7: Architecture and flow

Adding data (ETL – Extract -> Transform -> Load)

  • We use Kafka, which is a producer-consumer distributed streaming platform. Data is added to Kafka, and later read by a microservice which create raw Parquet files in S3.
  • Another microservice uses AWS Athena to hourly or daily process the data – filter, partition, and sort and aggregate it into new Parquet files
  • This flow is done on each of the AWS regions we support

Figure 8: SQL to Parquet flow example

Technical details:

  • Each partition creation is done by one or more Athena:
  • Each query result with one more more Parquet files
  • ETL microservices run on a Kubernetes cluster per region. They are developed and deployed using our development pipeline.

Uses:

  • Different microservices consume the aggregated data using Athena API through boto3 Python library
  • Day to day queries are done using SQL client like DBeaver with Athena JDBC driver. Athena AWS management console is also used for SQL queries
  • Apache Spark engine is used to run spark queries, including machine learning using the spark-ml Apache Zeppelin is used as a client to run scripts and display visualization. Both Spark and Zeppelin are installed as part of AWS EMR service.

Multi-region queries

Data privacy regulations such as GDPR add a twist, especially since we store data in multiple regions. There are two ways to perform multi-region queries:

  • Single query engine based in one of the regions
  • Query engine per region – get results per region and perform an aggregation

With a single query engine you can run SQL on data from multiple regions, BUT data is transferred between regions, which means you pay both in performance and cost.

With a query engine per region you have to aggregate the results, which may not be a simple task.

With AWS Athena – both options are available, since you don’t need to manage your own query engine.

Threat Analytics Data Lake – before and after

Before the data lake, we had several database solutions – relational and big data. The relational database couldn’t scale, forcing us to delete data or drop old tables. Eventually, we did analytics on a much smaller part of the data than we wanted.

With the big data solutions, the cost was high. We needed dedicated servers, and disks for storage and queries. That’s overkill: we don’t need server access 24/7, as daily batch queries work fine. We also did not have strong SQL capabilities, and found ourselves deleting data because we did not to pay for more servers.

With our data lake, we get better analytics by:

  • Storing more data (billions of records processed daily!), which is used by our queries
  • Using SQL capabilities on a large amount of data using Athena
  • Using multiple query engines with different capabilities, like Spark for machine learning
  • Allowing queries on multiple regions for an average, acceptable response time of just 3 seconds

In addition we also got the following improvements:

  • Huge cost reductions in storage and compute
  • Reduced server maintenance

In conclusion – a data lake worked for us. AWS services made it easier for us to get the results we wanted at an incredibly low cost. It could work for you, depending on factors such as the amount of data, its format, use cases, platform and more. We suggest learning your requirements and do a proof-of-concept with real data to find out!

The post How Our Threat Analytics Multi-Region Data Lake on AWS Stores More, Slashes Costs appeared first on Blog.

Deep Data Governance

One of the first things to catch my eye this week at RSA was a press release by STEALTHbits on their latest Data Governance release. They're a long time player in DG and as a former employee, I know them fairly well. And where they're taking DG is pretty interesting.

The company has recently merged its enterprise Data (files/folders) Access Governance technology with its DLP-like ability to locate sensitive information. The combined solution enables you to locate servers, identify file shares, assess share and folder permissions, lock down access, review file content to identify sensitive information, monitor activity to look for suspicious activity, and provide an audit trail of access to high-risk content.

The STEALTHbits solution is pragmatic because you can tune where it looks, how deep it crawls, where you want content scanning, where you want monitoring, etc. I believe the solution is unique in the market and a number of IAM vendors agree having chosen STEALTHbits as a partner of choice for gathering Data Governance information into their Enterprise Access Governance solutions.

Learn more at the STEALTHbits website.

Game-Changing Sensitive Data Discovery

I've tried not to let my blog become a place where I push products made by my employer. It just doesn't feel right and I'd probably lose some portion of my audience. But I'm making an exception today because I think we have something really compelling to offer. Would you believe me if I said we have game-changing DLP data discovery?

How about a data discovery solution that costs zero to install? No infrastructure and no licensing. How about a solution that you can point at specific locations and choose specific criteria to look for? And get results back in minutes. How about a solution that profiles file shares according to risk so you can target your scans according to need. And if you find sensitive content, you can choose to unlock the details by using credits which are bundle-priced.

Game Changing. Not because it's the first or only solution that can find sensitive data (credit card info, national ID numbers, health information, financial docs, etc.) but because it's so accessible. Because you can find those answers minutes after downloading. And you can get a sense for your problem before you pay a dime. There's even free credits to let you test the waters for a while.

But don't take our word for it. Here are a few of my favorite quotes from early adopters: 
“You seem to have some pretty smart people there, because this stuff really works like magic!”

"StealthSEEK is a million times better than [competitor]."

"We're scanning a million files per day with no noticeable performance impacts."

"I love this thing."

StealthSEEK has already found numerous examples of system credentials, health information, financial docs, and other sensitive information that weren't known about.

If I've piqued your interest, give StealthSEEK a chance to find sensitive data in your environment. I'd love to hear what you think. If you can give me an interesting use-case, I can probably smuggle you a few extra free credits. Let me know.



Unstructured Data into Identity & Access Governance

I've written before about the gap in identity and access management solutions related to unstructured data.

When I define unstructured data to people in the Identity Management space, I think the key distinguishing characteristic is that there is no entitlement store with which an IAM or IAG solution can connect to gather entitlement information. 

On File Systems, for example, the entitlements are distributed across shares & folders, inherited through the file tree structure, applied through group memberships that may be many levels deep, and there's no common security model to make sense of it.

STEALTHbits has the best scanner in the industry (I've seen it go head-to-head in POC's) to gather users, groups, and permissions across unstructured data environments and the most flexible ability to perform analysis that (1) uncovers high-risk conditions (such as open file shares, unused permissions, admin snooping, and more), (2) identifies content owners, and (3) makes it very simple to consume information on entitlements (by user, by group, or by resource).

It's a gap in the identity management landscape and it's beginning to show up on customer agendas. Let us know if we can help. Now, here's a pretty picture:

STEALTHbits adds unstructured data into IAM and IAG solutions.

Data Protection ROI

I came across a couple of interesting articles today related to ROI around data protection. I recently wrote a whitepaper for STEALTHbits on the Cost Justification of Data Access Governance. It's often top of mind for security practitioners who know they need help but have trouble justifying the acquisition and implementation costs of related solutions. Here's today's links:

KuppingerCole -
The value of information – the reason for information security

Verizon Business Security -
Ask the Data: Do “hacktivists” do it differently?

Visit the STEALTHbits site for information on Access Governance related to unstructured data and to track down the paper on cost justification.