Category Archives: database

Security Lapse Exposed Sensitive Customer Records In Gearbest Data Breach

Here is another report of a massive data leak from an online retailer. The Chinese e-commerce firm Gearbest inadvertently exposed […]

Security Lapse Exposed Sensitive Customer Records In Gearbest Data Breach on Latest Hacking News.

How Our Threat Analytics Multi-Region Data Lake on AWS Stores More, Slashes Costs

Data is the lifeblood of digital businesses, and a key competitive advantage. The question is: how can you store your data cost-efficiently and access it quickly, while abiding by privacy laws?

At Imperva, we wanted to store our data for long-term access. Databases would’ve cost too much in disk and memory, especially since we didn’t know how much it would grow, how long we would keep it, and which data we would actually access in the future. The only thing we did know? That new business cases for our data would emerge.

That’s why we deployed a data lake. It turned out to be the right decision, allowing us to store 1,000 times more data than before, even while slashing costs.

What is a data lake?

A data lake is a repository of files stored in a distributed system. Information is kept in its native form, with little or no processing; you simply store the data in formats such as JSON, XML, CSV, or text.

Analytics queries can be run against both data lakes and databases. In a database you create a schema, plan your queries, and add indices to improve performance. In a data lake, it’s different — you simply store the data and it’s query-ready.

Some file formats are better than others, of course. Apache Parquet allows you to store records in a compressed columnar file. The compression saves disk space and IO, while the columnar format allows the query engine to scan only the relevant columns. This reduces query time and costs.

Using a distributed file system lets you store more data at a lower cost. Whether you use Hadoop HDFS, AWS S3, or Azure Storage, the benefits include:

  • Data replication and availability
  • Options to save more money – for example, AWS S3 has different storage options with different costs
  • Retention policy – decide how long you want to keep your data before it’s automatically deleted (see the sketch below)
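Here is a minimal boto3 sketch of such a lifecycle rule, moving data to a cheaper storage class after 30 days and deleting it after a year. The bucket name, prefix and retention periods are made up for the example, not taken from our setup.

    import boto3

    s3 = boto3.client("s3")

    # Hypothetical rule: transition raw files to a cheaper storage class after
    # 30 days, then delete them after a year (an automatic retention policy).
    s3.put_bucket_lifecycle_configuration(
        Bucket="example-data-lake",                      # made-up bucket name
        LifecycleConfiguration={
            "Rules": [
                {
                    "ID": "raw-data-retention",
                    "Filter": {"Prefix": "raw/"},        # made-up prefix
                    "Status": "Enabled",
                    "Transitions": [
                        {"Days": 30, "StorageClass": "STANDARD_IA"}
                    ],
                    "Expiration": {"Days": 365},
                }
            ]
        },
    )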

No wonder experts such as Adrian Cockcroft, VP of cloud architecture strategy at Amazon Web Services, said this week that “cloud data lakes are the future.”

Analytic queries: data lake versus database

Let’s examine the capabilities, advantages and disadvantages of a data lake versus a database.

The data

A data lake supports structured and unstructured data and everything in-between. All data is collected and immediately ready for analysis. Data can be transformed to improve user experience and performance. For example, fields can be extracted from a data lake and data can be aggregated.

A database contains only structured and transformed data. It is impossible to add data without declaring tables, relations and indices. You have to plan ahead and transform the data according to your schema.

Figure 1: Data Lake versus Database

The Users

Most users in a typical organization are operational, using applications and data in predefined and repetitive ways. A database is usually ideal for these users. Data is structured and optimized for these predefined use-cases. Reports can be generated, and filters can be applied according to the application’s design.

Advanced users, by contrast, may go beyond an application to the data source and use custom tools to process the data. They may also bring in data from outside the organization.

The last group are the data experts, who do deep analysis on the data. They need the raw data, and their requirements change all the time.

Data lakes support all of these users, but especially advanced and expert users, due to the agility and flexibility of a data lake.

Figure 2: Typical user distribution inside an organization

Query engine(s)

In a database, the query engine is internal and is impossible to change. In a data lake, the query engine is external, letting users choose based on their needs. For example, you can choose Presto for SQL-based analytics and Spark for machine learning.

Figure 3: A data lake may have multiple external query engines. A database has a single internal query engine.
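For example, the same Parquet files that Athena queries with SQL can be read by Spark for machine-learning work. Here is a minimal PySpark sketch; the S3 path and column names are invented for illustration.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("events-exploration").getOrCreate()

    # Read the same Parquet data the SQL engine queries, with a different engine.
    events = spark.read.parquet("s3://example-data-lake/tables/events/")   # made-up path

    # Example: count events per type for a single daily partition.
    events.filter(events.day == "2018-01-01").groupBy("type").count().show()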

Support for new business use cases

Database changes can be complex. The data must be analyzed and formatted, and a schema has to be created before any data can be inserted. If the development team is busy, users can wait months, or even a year, to see new data in their application.

Few businesses can wait this long. Data lakes solve this by letting users go beyond the structure to explore the data. If this proves fruitful, then a formal schema can be applied. You get results quickly, and fail fast. This agility lets organizations quickly improve their use cases, better understand their data, and react fast to changes.

Figure 4: Support for new business use cases

Data lake structure

Here’s how data may flow inside a data lake.

Figure 5: Data lake structure and flow

In this example, CSV files are added to the data lake under a “current day” folder. This folder is the daily partition, which allows querying a day’s data with a filter like day = ‘2018-1-1’. Partitions are the most efficient way to filter data.

The data under tables/events is an aggregated, sorted and formatted version of the CSV data. It uses the Parquet format to improve query performance and compression. It also has an additional “type” partition, because most queries work on only a single event type. Each file holds millions of records, with metadata for efficiency. For example, you can read the count and the min and max values of every column without scanning the file.

This events table data has been added to the data lake after the raw data has been validated and analyzed.

Here is a simplified example of CSV to Parquet conversion:

Figure 6: Example for conversion of CSV to Parquet
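In code, the conversion sketched in Figure 6 might look like the following simplified pandas/pyarrow example; the file name and column are invented, and this is not the ETL code we run in production.

    import pandas as pd

    # Read the raw CSV records.
    events = pd.read_csv("events-2018-01-01.csv")        # made-up file name

    # Sorting groups repeated values together, which helps compression.
    events = events.sort_values("type")

    # Write a compressed, columnar Parquet file.
    events.to_parquet(
        "events-2018-01-01.parquet",
        engine="pyarrow",
        compression="snappy",
        index=False,
    )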

Parquet files normally hold a large number of records and can be divided internally into “row groups”, which have their own metadata. Repeated values improve compression, and the columnar structure allows scanning only the relevant columns. The CSV data can still be queried at any time, but doing so is not as efficient as querying the data under tables/events.

Flow and Architecture

General

Imperva’s data lake runs on Amazon Web Services (AWS). The diagram below shows the flow and the services we used to build it.

Figure 7: Architecture and flow

Adding data (ETL – Extract -> Transform -> Load)

  • We use Kafka, a producer-consumer distributed streaming platform. Data is added to Kafka and later read by a microservice that creates raw Parquet files in S3.
  • Another microservice uses AWS Athena to process the data hourly or daily – filtering, partitioning, sorting and aggregating it into new Parquet files (see the sketch after Figure 8)
  • This flow runs in each of the AWS regions we support

Figure 8: SQL to Parquet flow example
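Here is the sketch mentioned above: a CREATE TABLE AS SELECT (CTAS) statement submitted through the Athena API reads the raw data and writes new, partitioned Parquet files. The table, column, database and bucket names are invented for illustration.

    import boto3

    athena = boto3.client("athena", region_name="us-east-1")

    # CTAS: aggregate one day of raw events into new, partitioned Parquet files.
    ctas_query = """
    CREATE TABLE events_2018_01_01
    WITH (
        format = 'PARQUET',
        parquet_compression = 'SNAPPY',
        external_location = 's3://example-data-lake/tables/events/day=2018-01-01/',
        partitioned_by = ARRAY['type']
    ) AS
    SELECT account_id, url, COUNT(*) AS hits, type
    FROM raw_events
    WHERE day = '2018-01-01'
    GROUP BY account_id, url, type
    """

    athena.start_query_execution(
        QueryString=ctas_query,
        QueryExecutionContext={"Database": "threat_analytics"},        # made-up database
        ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
    )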

Technical details:

  • Each partition is created by one or more Athena queries
  • Each query results in one or more Parquet files
  • ETL microservices run on a Kubernetes cluster per region. They are developed and deployed using our development pipeline.

Uses:

  • Different microservices consume the aggregated data using the Athena API through the boto3 Python library (a minimal sketch follows this list)
  • Day-to-day queries are done using a SQL client such as DBeaver with the Athena JDBC driver; the Athena AWS Management Console is also used for SQL queries
  • The Apache Spark engine is used to run Spark queries, including machine learning with spark-ml. Apache Zeppelin is used as a client to run scripts and display visualizations. Both Spark and Zeppelin are installed as part of the AWS EMR service.
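Here is that sketch of the boto3 consumption path; the database, table and S3 output location are invented, and the day/type filters hit the partition columns so the scan stays small.

    import time
    import boto3

    athena = boto3.client("athena", region_name="us-east-1")

    # Filtering on the partition columns (day, type) avoids scanning the whole table.
    query = """
    SELECT COUNT(*) AS attacks
    FROM events
    WHERE day = '2018-01-01' AND type = 'sql_injection'
    """

    execution = athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": "threat_analytics"},        # made-up names
        ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
    )
    query_id = execution["QueryExecutionId"]

    # Poll until the query finishes, then read the result set.
    while True:
        state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
        if state not in ("QUEUED", "RUNNING"):
            break
        time.sleep(1)

    if state == "SUCCEEDED":
        rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
        print(rows[1]["Data"][0]["VarCharValue"])        # rows[0] is the header row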

Multi-region queries

Data privacy regulations such as GDPR add a twist, especially since we store data in multiple regions. There are two ways to perform multi-region queries:

  • Single query engine based in one of the regions
  • Query engine per region – get results per region and perform an aggregation

With a single query engine you can run SQL across data from multiple regions, but data is transferred between regions, which costs you in both performance and money.

With a query engine per region you have to aggregate the results, which may not be a simple task.

With AWS Athena, both options are available, since you don’t need to manage your own query engine.
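A rough sketch of the second option: run the same query with a regional Athena client in each region, then aggregate the results locally. The regions, database and table names are invented, and error handling is omitted.

    import time
    import boto3

    REGIONS = ["us-east-1", "eu-west-1", "ap-southeast-1"]            # made-up region list
    QUERY = "SELECT COUNT(*) FROM events WHERE day = '2018-01-01'"

    def regional_count(region):
        """Run the count in one region and return it; the data never leaves that region."""
        athena = boto3.client("athena", region_name=region)
        query_id = athena.start_query_execution(
            QueryString=QUERY,
            QueryExecutionContext={"Database": "threat_analytics"},
            ResultConfiguration={"OutputLocation": f"s3://example-athena-results-{region}/"},
        )["QueryExecutionId"]
        while True:
            state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
            if state not in ("QUEUED", "RUNNING"):
                break
            time.sleep(1)
        rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
        return int(rows[1]["Data"][0]["VarCharValue"])                # rows[0] is the header

    # Aggregate the per-region results into one global figure.
    total = sum(regional_count(region) for region in REGIONS)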

Threat Analytics Data Lake – before and after

Before the data lake, we had several database solutions – relational and big data. The relational database couldn’t scale, forcing us to delete data or drop old tables. Eventually, we did analytics on a much smaller part of the data than we wanted.

With the big data solutions, the cost was high. We needed dedicated servers and disks for storage and queries. That was overkill: we don’t need server access 24/7, as daily batch queries work fine. We also did not have strong SQL capabilities, and found ourselves deleting data because we did not want to pay for more servers.

With our data lake, we get better analytics by:

  • Storing more data (billions of records processed daily!), which is used by our queries
  • Using SQL capabilities on a large amount of data using Athena
  • Using multiple query engines with different capabilities, like Spark for machine learning
  • Allowing queries across multiple regions, with an acceptable average response time of just 3 seconds

In addition, we got the following improvements:

  • Huge cost reductions in storage and compute
  • Reduced server maintenance

In conclusion – a data lake worked for us. AWS services made it easier for us to get the results we wanted at an incredibly low cost. It could work for you, depending on factors such as the amount of data, its format, your use cases, your platform and more. We suggest learning your requirements and doing a proof-of-concept with real data to find out!

The post How Our Threat Analytics Multi-Region Data Lake on AWS Stores More, Slashes Costs appeared first on Blog.

“BreedReady” database of 1.8m Chinese women surfaced online

By Waqas

Another day, another data breach; this time, Victor Gevers, a Dutch security researcher from GDI Foundation, has discovered a publicly exposed database containing a massive trove of highly sensitive data of millions of Chinese women. According to Gevers, the database, which is open for anyone to access, contains personal details of 1.8 million women from China […]

This is a post from HackRead.com Read the original post: “BreedReady” database of 1.8m Chinese women surfaced online

NBlog Feb 22 – classification versus tagging



I'm not happy with the idea of 'levels' in many contexts, including information classification schemes. The term 'level' implies a stepped progression in one dimension. Information risk and security is more nuanced or fine-grained than that, and multidimensional too.
The problems with 'levels' include:
  • Boundary/borderline cases, when decisions about which level is appropriate are arbitrary but the implications can be significant; 
  • Dynamics - something that is a medium level right now may turn into a high or a low at some future point, perhaps when a certain event occurs;
  • Context e.g. determining the sensitivity of information for deliberate internal distribution is not the same as for unauthorized access, especially external leakage and legal discovery (think: internal email); 
  • Dependencies and linkages e.g. an individual data point has more value as part of a time sequence or data set ... 
  • ... and aggregation e.g. a structured and systematic compilation of public information aggregated from various sources can be sensitive; 
  • Differing perspectives, biases and prejudices, plus limited knowledge, misunderstandings, plain mistakes and secret agendas of those who classify stuff, almost inevitably bringing an element of subjectivity to the process despite the appearance of objectivity; 
  • And the implicit "We've classified it and [maybe] done something about securing it ... so we're done here. Next!". It's dismissive. 
The complexities are pretty obvious if you think about it, especially if you have been through the pain of developing and implementing a practical classification scheme. Take a blood pressure reading, for instance, or an annual report or a system security log. How would you classify them? Whatever your answer, I'm sure I can think of situations where those classifications are inappropriate. We might agree on the classification for a particular situation, hence a specific level or label might be appropriate right there, but information and situations are constantly changing, in general, hence in the real world the classification can be misleading and unhelpful. And if you insist on narrowing the classification criteria, we're moving away from the main advantage of classification which is to apply broadly similar risk treatments to each level. Ultimately, every item needs its own unique classification, so why bother?

Another issue with classification schemes is that they over-emphasize one aspect or feature of information - almost always that's confidentiality. What about integrity, availability, utility, value and so forth? I prefer a conceptually different approach using several tags or parameters rather than single classification 'levels'. A given item of information, or perhaps a collection of related items, might usefully be measured and tagged according to several parameters such as:
  • Sensitivity, confidentiality or privacy expectations; 
  • Source e.g. was it generated internally, found on the web, or supplied by a third party?; 
  • Trustworthiness, credibility and authenticity - could it have been faked?; 
  • Accuracy and precision, which matter for some applications, quite a lot really;
  • Criticality for the business, safety, stakeholders, the world ...; 
  • Timeliness or freshness, age and history, hinting at the information lifecycle; 
  • Extent of distribution, whether known and authorized or not; 
  • Utility and value to various parties - not just the current or authorized possessors; 
  • Probability and impact of various incidents i.e. the information risks; 
  • Etc. 
The tags or parameters required depend on what needs to be done. If we're determining access rights, for instance, access-related tags are more relevant than the others. If we're worried about fraud and deception, those integrity aspects are of interest. In other words, there's no need to attempt to fully assess and tag or measure everything, right now: a more pragmatic approach (measuring and tagging whatever is needed for the job in hand) works fine.
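As a toy illustration of the point (the parameter names and values are made up, nothing more), an information item might carry several independent tags, and a given task reads only the ones it needs:

    # Toy example only: several independent tags per information item,
    # instead of a single classification level.
    blood_pressure_reading = {
        "sensitivity": "personal health data",
        "source": "internal monitoring device",
        "accuracy": "calibrated, +/- 5 mmHg",
        "criticality": "safety-relevant",
        "freshness": "measured 2019-02-21",
        "distribution": "care team only",
    }

    def may_email_externally(item):
        """An access/distribution decision reads only the access-related tags."""
        return (item["distribution"] != "care team only"
                and item["sensitivity"] != "personal health data")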

Within each parameter, you might consider the different tags or labels to represent levels but I'm more concerned with the broader concept of taking into account a number of relevant parameters in parallel, not just sensitivity or whatever. 

All that complexity can be hidden within Gary's Little World, handled internally within the information risk and security function and related colleagues. Beyond that, in the wider organization, things get messy in practice but, generally speaking, people working routinely with information "just know" how important/valuable it is, what's important about it, and so on. They may express it in all sorts of ways (not just in words!), and that's fine. They may need a little guidance here and there but I'm not keen on classification as a method for managing information risk. It's too crude for me, except perhaps as a basic starting point. More useful is the process of getting people to think about this stuff and do whatever is appropriate under the circumstances. It's one of those situations where the journey is more valuable than the destination. The analysis generates understanding and insight, which are more important than the 'level'.