Invisible Threats, Visible Solutions: Integrating AWS Macie and Tiger Data Fabric for Ultimate Security

Discovering and handling sensitive data in a data lake or analytics environment can be challenging. It involves overcoming technical complexities in data processing and dealing with the associated costs of resources and computing. Identifying sensitive information at the entry point of the data pipeline, ideally during data ingestion, can help overcome these challenges to some extent. This proactive approach allows organizations to fortify their defenses against potential breaches and unauthorized access.

According to AWS, Amazon Macie is “a data security service that uses machine learning (ML) and pattern matching to discover and help protect sensitive data”, such as personally identifiable information (PII), payment card data, and Amazon Web Services (AWS) credentials. At Tiger Analytics, we’ve integrated these capabilities into the pipelines of our proprietary Data Fabric solution, Tiger Data Fabric.

The Tiger Data Fabric is a self-service, low/no-code data management platform that facilitates seamless data integration, efficient data ingestion, robust data quality checks, data standardization, and effective data provisioning. Its user-centric, UI-driven approach demystifies data handling, enabling professionals with diverse technical proficiencies to interact with and manage their data resources effortlessly.

Leveraging Salient Features for Enhanced Security

The Tiger Data Fabric-AWS Macie integration offers a robust solution to enhance data security measures, including:

  • Data Discovery: The solution, with the help of Macie, discovers and locates sensitive data within the active data pipeline.
  • Data Protection: The design pattern isolates the sensitive data in a secure location with restricted access.
  • Customized Actions: The solution gives flexibility to design (customize) the actions to be taken when sensitive data is identified. For instance, the discovered sensitive data can be encrypted, redacted, pseudonymized, or even dropped from the pipeline with necessary approvals from the data owners.
  • Alerts and Notification: Data owners receive alerts when any sensitive data is detected, allowing them to take the required actions in response.

Tiger Data Fabric has many data engineering capabilities and has recently been enhanced to include sensitive data scans at the data ingestion step of the pipeline. Source data present on the S3 landing zone path is scanned for sensitive information, and the results are captured and stored at a separate path in the S3 bucket.
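To make this concrete, here is a minimal boto3 sketch of how such a scan can be kicked off programmatically, roughly what the triggerMacieJob step described later does. The account ID is a placeholder, the bucket and prefix reuse the example path shown further below, and identifier selection and error handling are left out.

```python
import boto3

macie = boto3.client("macie2", region_name="us-east-2")

# Illustrative values - in the pipeline these come from the framework's metadata store.
ACCOUNT_ID = "123456789012"
SCAN_BUCKET = "zdf-fmwrk-macie-scan-zn-us-east-2"
SCAN_PREFIX = "data/src_sys_id=100/data_asset_id=100000/"

response = macie.create_classification_job(
    name="sensitive-scan-src-100-asset-100000",
    jobType="ONE_TIME",
    s3JobDefinition={
        # Limit the job to the scanning bucket...
        "bucketDefinitions": [{"accountId": ACCOUNT_ID, "buckets": [SCAN_BUCKET]}],
        # ...and scope it to the asset's landing path so other objects are not scanned.
        "scoping": {
            "includes": {
                "and": [
                    {
                        "simpleScopeTerm": {
                            "key": "OBJECT_KEY",
                            "comparator": "STARTS_WITH",
                            "values": [SCAN_PREFIX],
                        }
                    }
                ]
            }
        },
    },
    # The real job also narrows managed/custom data identifiers to the types
    # the organization cares about (credit cards, SSNs, phone numbers, etc.).
)
print("Macie classification job id:", response["jobId"])
```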

By integrating AWS Macie with the Tiger Data Fabric, we’re able to:

  • Automate the discovery of sensitive data.
  • Discover a variety of sensitive data types.
  • Evaluate and monitor data for security and access control.
  • Review and analyze findings.

For data engineers looking to integrate sensitive data management into their data pipelines, here’s a walkthrough of how we, at Tiger Analytics, implement these components for maximum value:

  • S3 Buckets store data at various stages of processing: a raw data bucket for uploading objects into the data pipeline, a scanning bucket where objects are scanned for sensitive data, a manual review bucket that holds objects in which sensitive data was discovered, and a scanned data bucket for starting the next ingestion step of the data pipeline.
  • Lambda and Step Functions execute the critical tasks of running sensitive data scans and managing workflows. Step Functions coordinate Lambda functions to manage business logic and execute the steps mentioned below:
    • triggerMacieJob: This Lambda function creates a Macie sensitive data discovery job on the designated S3 bucket during the scan stage.
    • pollWait: This Step Function waits for a specific state to be reached, ensuring the job runs smoothly.
    • checkJobStatus: This Lambda function checks the status of the Macie scan job.
    • isJobComplete: This Step Function uses a Choice state to determine if the job has finished. If it has, it triggers additional steps to be executed.
    • waitForJobToComplete: This Step Function employs a Choice state to wait for the job to complete and prevent the next action from running before the scan is finished.
    • UpdateCatalog: This Lambda function updates the catalog table in the backend Data Fabric database and ensures that all job statuses are accurately reflected there.
  • A Macie scan job scans the specified S3 bucket for sensitive data. The process of creating the Macie job involves multiple steps, allowing us to choose data identifiers, either through custom configurations or standard options:
    • We create a one-time Macie job through the triggerMacieJob Lambda function.
    • We provide the complete S3 bucket path for sensitive data buckets to filter out the scan and avoid unnecessary scanning on other buckets.
    • While creating the job, Macie provides a provision to select data identifiers for sensitive data. In the Tiger Data Fabric, we have automated the selection of custom identifiers for the scan, including CREDIT_CARD_NUMBER, DRIVERS_LICENSE, PHONE_NUMBER, USA_PASSPORT_NUMBER, and USA_SOCIAL_SECURITY_NUMBER.

      The findings can be viewed on the AWS console and filtered by S3 bucket. We use Glue jobs to parse the results and route the data to the manual review and raw buckets. The Macie job execution time is around 4-5 minutes. After scanning, if there are fewer than 1000 sensitive records, they are moved to the quarantine bucket.

  • The parsing of Macie results is handled by a Glue job, implemented as a Python script. This script is responsible for extracting and organizing information from the Macie scanned results bucket.
    • In the parser job, we retrieve the severity level (High, Medium, or Low) assigned by AWS Macie during the one-time job scan.
    • In the Macie scanning bucket, we created separate folders for each source system and data asset, registered through Tiger Data Fabric UI.

      For example: zdf-fmwrk-macie-scan-zn-us-east-2/data/src_sys_id=100/data_asset_id=100000/20231026115848

      The parser job checks the severity and the report at the specified path. If sensitive data is detected, the data is moved to the quarantine bucket. We format this data into Parquet and process it using Spark DataFrames.

    • On inspecting the Parquet file, the sensitive data is clearly visible in the SSN and phone number columns.
    • Once sensitive data is found, the same file is moved to the quarantine bucket.

      If no sensitive records are found, the data is moved to the raw zone, from where it is sent onward to the data lake.
  • Airflow operators come in handy for orchestrating the entire pipeline, whether we integrate native AWS security services with Amazon MWAA or run custom Airflow on EC2 or EKS. A simplified DAG wiring these operators together is sketched after this list.
    • GlueJobOperator: Executes all the Glue jobs pre and post-Macie scan.
    • StepFunctionStartExecutionOperator: Starts the execution of the Step Function.
    • StepFunctionExecutionSensor: Waits for the Step Function execution to be completed.
    • StepFunctionGetExecutionOutputOperator: Gets the output from the Step Function execution.
  • IAM Policies grant the necessary permissions for the AWS Lambda functions to access AWS resources that are part of the application. Also, access to the Macie review bucket is managed using standard IAM policies and best practices.
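As promised above, here is a minimal, hypothetical DAG sketch showing how these operators can be wired together; the state machine ARN, job names, and inputs are illustrative, the import paths assume a recent version of the Amazon provider package, and the DAGs generated by the actual framework are considerably richer.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.glue import GlueJobOperator
from airflow.providers.amazon.aws.operators.step_function import (
    StepFunctionGetExecutionOutputOperator,
    StepFunctionStartExecutionOperator,
)
from airflow.providers.amazon.aws.sensors.step_function import StepFunctionExecutionSensor

# Illustrative names - the real values are configured per data asset.
STATE_MACHINE_ARN = "arn:aws:states:us-east-2:123456789012:stateMachine:macie-scan"
PARSER_GLUE_JOB = "macie-results-parser"

with DAG(
    dag_id="asset_100000_sensitive_scan",
    start_date=datetime(2024, 1, 1),
    schedule=None,  # event-driven asset; a time-driven asset would use a CRON expression
    catchup=False,
) as dag:
    # Start the Step Functions state machine that runs the Macie scan.
    start_scan = StepFunctionStartExecutionOperator(
        task_id="start_macie_scan",
        state_machine_arn=STATE_MACHINE_ARN,
        state_machine_input={"src_sys_id": 100, "data_asset_id": 100000},
    )

    # Block until the state machine execution has completed.
    wait_for_scan = StepFunctionExecutionSensor(
        task_id="wait_for_macie_scan",
        execution_arn=start_scan.output,
    )

    # Retrieve the execution output (e.g., scan status) for downstream use.
    get_scan_output = StepFunctionGetExecutionOutputOperator(
        task_id="get_scan_output",
        execution_arn=start_scan.output,
    )

    # Parse the Macie findings and route files to the quarantine or raw buckets.
    parse_results = GlueJobOperator(
        task_id="parse_macie_results",
        job_name=PARSER_GLUE_JOB,
        script_args={"--src_sys_id": "100", "--data_asset_id": "100000"},
    )

    start_scan >> wait_for_scan >> get_scan_output >> parse_results
```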

Things to Keep in Mind for Effective Implementation

Based on our experience integrating AWS Macie with the Tiger Data Fabric, here are some points to keep in mind for an effective integration:

  • Macie’s primary objective is sensitive data discovery. It acts as a background process that keeps scanning S3 buckets/objects and generates reports that various users can consume and act on. But if the requirement is to string it into a pipeline and automate actions based on those reports, a custom process must be created.
  • Macie stops reporting the location of sensitive data after 1000 occurrences of the same detection type; however, this quota can be increased by submitting a request to AWS. Keep in mind that in our use case, where Macie scans are integrated into the pipeline, each job is dynamically created to scan the dataset. If the sensitive data occurrences per detection type exceed 1000, we move the entire file to the quarantine zone.
  • For data elements that Macie doesn’t consider sensitive, custom data identifiers help a lot. They can be defined via regular expressions, and their severity can also be customized. Organizations whose data governance teams deem certain internal data sensitive can use this feature; a hypothetical example is sketched after this list.
  • Macie also provides an allow list, which helps ignore data elements that Macie would otherwise tag as sensitive by default.
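As an illustration of the custom data identifier point above, the sketch below registers a hypothetical identifier for an internal employee-ID format; the name, regex, and keywords are made up for the example and are not part of our deployment.

```python
import boto3

macie = boto3.client("macie2", region_name="us-east-2")

# Hypothetical example: an internal employee ID format (e.g., EMP-1234567) that the
# governance team treats as sensitive even though Macie's managed identifiers would not.
response = macie.create_custom_data_identifier(
    name="internal-employee-id",
    description="Internal employee identifier classified as sensitive by data governance",
    regex=r"EMP-\d{7}",
    keywords=["employee id", "emp id"],  # nearby keywords reduce false positives
    maximumMatchDistance=50,             # characters to search around a keyword match
)
print("Custom data identifier id:", response["customDataIdentifierId"])
```

The returned identifier ID can then be attached to the classification jobs the pipeline creates.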

The AWS Macie-Tiger Data Fabric integration seamlessly enhances automated data pipelines, addressing the challenges associated with unintended exposure of sensitive information in data lakes. By incorporating customizations such as regular expressions for data sensitivity and suppression rules within the data fabrics they work on, data engineers gain greater control over managing and safeguarding sensitive data.

Armed with the provided insights, they can easily adapt the use cases and explanations to align with their unique workflows and specific requirements.

How to Design your own Data Lake Framework in AWS

“Data is a precious thing and will last longer than the systems themselves.” – Tim Berners-Lee, inventor of the World Wide Web

Organizations spend a lot of time and effort building pipelines to consume and publish data coming from disparate sources within their Data Lake. In large data initiatives, most of that time and effort is consumed by data ingestion development.

What’s more, with an increasing number of businesses migrating to the cloud, breaking down data silos and enhancing the discoverability of data environments have become business priorities.

While the Data Lake is the heart of data operations, capabilities like data security, data quality, and a metadata store should be carefully tied into the ecosystem.

Properties of an Enterprise Data Lake solution

In a large-scale organization, the Data Lake should possess these characteristics:

  • Data Ingestion – the ability to consume structured, semi-structured, and unstructured data
  • Supports push (batch and streaming systems) and pull (DBs, APIs, etc.) mechanisms
  • Data security through sensitive data masking, tokenization, or redaction
  • Natively available rules through the Data Quality framework to filter impurities
  • Metadata Store, Data Dictionary for data discoverability and auditing capability
  • Data standardization for common data format

A common reusable framework is needed to reduce the time and effort in collecting and ingesting data. At Tiger Analytics, we are solving these problems by building a scalable platform within AWS using AWS-native services and open-source tools. We’ve adopted a modular design and a loosely coupled multi-layered architecture. Each layer provides a distinct capability and communicates with the others via APIs, messages, and events. The platform abstracts complex backend processes and provides a simple, easy-to-use UI for stakeholders:

  • Self-service UI to quickly configure data workflows
  • Configuration-based backend processing
  • AWS cloud native and open-source technologies
  • Data Provenance: data quality, data masking, lineage, recovery and replay audit trail, logging, notification

Before exploring the architecture, let’s understand a few logical components referenced in the blog.

  • Sources are individual entities that are registered with the framework. They align with systems that own one or more data assets. The system could be a database, a vendor, or a social media website. Entities registered within the framework store various system properties. For instance, if it is a database, then DB Type, DB URL, host, port, username, etc.
  • Assets are the entries within the framework. They hold the properties of individual files from various sources. Metadata of source files includes column names, data types, security classifications, DQ rules, data obfuscation properties, etc. A hypothetical registration payload is sketched after this list.
  • Targets organize data as per enterprise needs. There are various domains/sub-domains to store the data assets. Based on the subject area of the data, the files can be stored in their specific domains.
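To make these components concrete, here is a hypothetical registration payload showing the kind of properties captured for a source and one of its assets; every field name and value below is illustrative rather than the framework’s actual schema.

```python
# Hypothetical registration payloads illustrating the kinds of properties captured
# for a source and an asset; field names and values are examples only.
source = {
    "src_sys_name": "retail_orders_db",
    "src_sys_type": "database",
    "db_type": "postgresql",
    "db_url": "jdbc:postgresql://orders.example.com:5432/retail",
    "host": "orders.example.com",
    "port": 5432,
    "username": "ingest_svc",  # the credential itself would live in a secrets store
}

asset = {
    "asset_name": "orders",
    "file_format": "csv",
    "columns": [
        {"name": "order_id", "type": "bigint", "dq_rules": ["primary_key"]},
        {"name": "phone", "type": "string", "classification": "sensitive", "obfuscation": "tokenize"},
        {"name": "order_amt", "type": "decimal(12,2)", "dq_rules": ["non_negative"]},
    ],
    "target_domain": "sales",  # the domain/sub-domain the target organizes data into
}
```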

The Design Outline

With the demands to manage large volumes of data increasing year on year, our data fabric was designed to be modular, multi-layered, customizable, and flexible enough to suit individual needs and use cases. Whether it is a large banking organization with millions of transactions per day and a strong focus on data security or a start-up that needs clean data to extract business insights, the platform can help everyone.

Following the same modular and multi-layered design principle, we, at Tiger, have put together the architecture with the provision to swap out components or tools if needed. Keeping in mind that the world of technology is ever-changing and volatile, we’ve built flexibility into the system.

  • UI Portal provides a user-friendly self-service interface to set up and configure sources, targets, and data assets. These elements drive the data consumption from the source to Data Lake. These self-service applications allow the federation of data ingestion. Here, data owners manage and support their data assets. Teams can easily onboard data assets without building individual pipelines. The interface is built using ReactJS with Material-UI, for high-quality front-end graphics. The portal is hosted on an AWS EC2 instance, for resizable compute capacity and scalability.
  • API Layer is a set of APIs that invoke various functionalities, including CRUD operations and AWS service setup. These APIs create the source, asset, and target entities. The layer supports both synchronous and asynchronous APIs. API Gateway and Lambda functions form the base of this component. Moreover, DynamoDB captures requested events for audit and support purposes.
  • Config and Metadata DB is the data repository to capture the control and configuration information. It holds the framework together through a complex data model which reduces data redundancy and provides quick query retrieval. The framework uses AWS RDS PostgreSQL which natively implements Multiversion Concurrency Control (MVCC). It provides point-in-time consistent views without read locks, hence avoiding contentions.
  • Orchestration Layer strings together various tasks with dependencies and relationships within a data pipeline. These pipelines are built on Apache Airflow. Every data asset has its own pipeline, thereby providing more granularity and control over individual flows. Individual DAGs are created through an automated process called the DAG-Generator, a Python-based program tied to the API that registers data assets. Every time a new asset is registered, the DAG-Generator creates a DAG based on the configurations and uploads it to the Airflow server. These DAGs may be time-driven or event-driven, based on the source system. A simplified DAG-Generator sketch follows this list.
  • Execution Layer is the final layer where the magic happens. It comprises various individual Python-based programs running within AWS Glue jobs. We cover this in more detail in the following section.
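As a simplified illustration of the DAG-Generator idea mentioned under the Orchestration Layer, the sketch below renders a DAG file from an asset configuration; the template, fields, and paths are hypothetical and far simpler than the production generator.

```python
from pathlib import Path

# Hypothetical asset configuration, as it might be read from the config/metadata DB.
asset_cfg = {
    "dag_id": "asset_100000_ingest",
    "schedule": "0 2 * * *",  # time-driven asset; an event-driven asset would use None
    "glue_job": "ingest-asset-100000",
}

DAG_TEMPLATE = '''
from datetime import datetime
from airflow import DAG
from airflow.providers.amazon.aws.operators.glue import GlueJobOperator

with DAG(
    dag_id="{dag_id}",
    start_date=datetime(2024, 1, 1),
    schedule="{schedule}",
    catchup=False,
) as dag:
    ingest = GlueJobOperator(task_id="ingest", job_name="{glue_job}")
'''


def generate_dag(cfg: dict, dags_folder: str = "dags") -> Path:
    """Render a DAG file from the asset configuration and drop it into the DAGs folder."""
    Path(dags_folder).mkdir(parents=True, exist_ok=True)
    dag_path = Path(dags_folder) / f"{cfg['dag_id']}.py"
    dag_path.write_text(DAG_TEMPLATE.format(**cfg))
    return dag_path


print("Generated:", generate_dag(asset_cfg))
```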

Data Pipeline (Execution Layer)

A data pipeline is a set of tools and processes that automates the flow of data from the onboarded source to the target repository.

Figure 3: Concept Model – Execution Layer
  • Data Ingestion

    Several patterns affect the way we consume/ingest data. They vary depending on the source systems and consumption frequency. For instance, ingesting data from a database requires additional capabilities compared to consuming a file dropped by a third-party vendor.

    Figure 4: Data Ingestion Quadrant

    The Data Ingestion Quadrant is our base outline for defining consumption patterns. Depending on the properties of the data asset, the framework has the intelligence to use the appropriate pipeline for processing. To achieve this, we have separate S3 buckets for time-driven and event-driven sources. A driver Lambda function externally invokes the event-driven Airflow DAGs, while CRON expressions within the DAGs invoke the time-driven schedules.

    These capabilities consume different file formats like CSV, JSON, XML, parquet, etc. Connector libraries are used to pull data from various databases like MySQL, Postgres, Oracle, and so on.

  • Data Quality

    Data is the core component of any business operation, and Data Quality (DQ) in any enterprise system determines its success. The data platform requires a robust DQ framework that ensures quality data in enterprise repositories.

    For this framework, we use Deequ, an open-source library from AWS built on top of Apache Spark; its Python interface is called PyDeequ. Deequ provides data profiling capabilities, suggests DQ rules, and executes several checks (a minimal PyDeequ sketch is included at the end of this section). We have divided DQ checks into two categories:

    Default Checks are the DQ rules that automatically apply to the attributes. For instance, Length Check, Datatype Check, and Primary Key Check. These data asset properties are defined while registering in the system.

    Advanced Checks are the additional DQ rules. They are applied to various attributes based on the user’s needs. The user defines these checks and stores them in the metadata.

    The DQ framework pulls these checks from the metadata store and identifies the default checks through data asset properties. Eventually, it constructs a bulk check module for execution. DQ results are stored in the backend database, and the logs are stored in an S3 bucket for detailed analysis. A DQ summary available in the UI provides additional transparency to business users.

  • Data Obfuscation/Masking

    Data masking is the capability of dealing with sensitive information. While registering a data asset, the framework has a provision to enable tokenization on sensitive columns. The Data Masking task uses an internal algorithm and a key (associated with the framework and stored in AWS Secrets Manager). It tokenizes those specific columns before storing them in the Data Lake. These attributes can be detokenized through user-defined functions, which require additional key access to control attempts by unauthorized users.

    The framework also supports other forms of irreversible data obfuscation, such as Redaction and Data Perturbation.

  • Data Standardization

Data standardization brings data into a common format, which allows the data to be accessed using a common set of tools and libraries. The framework executes standardization operations for data consistency and can therefore:

  1. Standardize target column names.
  2. Support file conversion to parquet format.
  3. Remove leading zeroes from integer/decimal columns.
  4. Standardize target column datatypes.
  5. Add partitioning column.
  6. Remove leading and trailing white spaces from string columns.
  7. Support date format standardization.
  8. Add control columns to target data sets.
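To illustrate how default checks can be expressed with PyDeequ (referenced under Data Quality above), here is a minimal sketch; the DataFrame, column names, and chosen checks are illustrative, whereas the actual framework assembles the check list dynamically from asset metadata.

```python
import os

os.environ.setdefault("SPARK_VERSION", "3.3")  # lets pydeequ pick a matching Deequ jar

import pydeequ
from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationResult, VerificationSuite
from pyspark.sql import SparkSession

# Spark session with the Deequ jar on the classpath.
spark = (
    SparkSession.builder
    .config("spark.jars.packages", pydeequ.deequ_maven_coord)
    .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
    .getOrCreate()
)

# Illustrative data set; in the pipeline this is the ingested data asset.
df = spark.createDataFrame(
    [(1, "A123", 120.0), (2, "B456", 75.5), (3, None, 10.0)],
    ["order_id", "order_code", "order_amt"],
)

# Default checks derived from asset properties (primary key, null, length checks),
# plus one user-defined advanced check.
check = (
    Check(spark, CheckLevel.Error, "default_checks")
    .isUnique("order_id")                                      # primary key check
    .isComplete("order_code")                                  # null check (fails on the None row)
    .hasMaxLength("order_code", lambda length: length <= 10)   # length check
    .isNonNegative("order_amt")                                # advanced, user-defined check
)

result = VerificationSuite(spark).onData(df).addCheck(check).run()

# The framework stores these results in the backend DB and S3; here we just display them.
VerificationResult.checkResultsAsDataFrame(spark, result).show(truncate=False)
```

The completeness check fails on the null order_code value, and that failure surfaces in the results DataFrame, the same kind of output that feeds the DQ summary shown in the UI.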

Through this blog, we’ve shared insights on our generic architecture to build a Data Lake within the AWS ecosystem. While we can keep adding more capabilities to solve real-world problems, this is just a glimpse of data challenges that can be addressed efficiently through layered and modular design. You can use these learnings to put together the outline of a design that works for your use case while following the same core principles.
