A Practical Guide to Setting Up Your Data Lakehouse across AWS, Azure, GCP and Snowflake

Explore the evolution from Enterprise Data Warehouses to Data Lakehouses on AWS, Azure, GCP, and Snowflake. This comparative analysis outlines the key implementation stages, helping organizations leverage modern, cloud-based Lakehouse setups for enhanced BI and ML operations.

Today, most large enterprises collect huge amounts of data in various forms – structured, semi-structured, and unstructured. While Enterprise Data Warehouses are ACID compliant and well-suited to BI use cases, the kind of data collected today for ML use cases requires much more flexibility in data structure and much more scalability in data volume than these warehouses can currently provide.

The initial solution to this issue came with advancements in cloud computing and the advent of data lakes in the 2010s. Data lakes are built on top of cloud-based object storage such as AWS S3 buckets and Azure Blob/ADLS. They are flexible and scalable, but many of the warehouse’s original benefits, such as ACID compliance, are lost.

Figure: The evolution of the Data Lakehouse

The Data Lakehouse has evolved to address this gap. It combines the best of Data Lakes and Data Warehouses: the ability to store all forms of data at scale while still providing the benefits of structure and ACID guarantees.

Figure: The Data Lakehouse pattern

At Tiger Analytics, we’ve worked across cloud platforms like AWS, Azure, GCP, and Snowflake to build Lakehouses for our clients. Here’s a comparative study of how to implement a Lakehouse pattern across the major platforms available.

Questions you may have:

All the major cloud platforms in the market today can be used to implement a Lakehouse pattern. The first question that comes to mind is: how do they compare, i.e., what services are available across the different stages of the data flow? The next: given that many organizations are moving toward a multi-cloud setup, how do we set up Lakehouses across platforms?

Finding the right Lakehouse architecture

Given that the underlying cloud platform may vary, let’s look at a platform-agnostic architecture pattern, followed by details of platform-specific implementations. The pattern detailed below enables data ingestion from a variety of data sources and supports all forms of data like structured (tables, CSV files, etc.), semi-structured (JSON, YAML, etc.), and unstructured (blobs, images, audio, video, PDFs, etc.).

It comprises four stages based on the data flow:

1. Data Ingestion – To ingest the data from the data sources to a Data Lake storage:

  • This can be done using cloud-native services like AWS DMS, Azure DMS, and GCP Storage Transfer Service, or third-party tools like Airbyte and Fivetran.

2. Data Lake Storage – Provide durable storage for all forms of data:

  • Options depend on the Cloud service providers. For example – S3 on AWS, ADLS/Blob on Azure, and Google Cloud storage on GCP, etc.

3. Data Transformation – Transform the data stored in the data lake:

  • Languages like Python, Spark/PySpark, and SQL can be used based on requirements (a minimal platform-agnostic sketch follows this list).
  • Data can also be loaded to the data warehouse without performing any transformations.
  • Services like AWS Glue Studio and GCP Dataflow provide workflow-based/no-code options to load the data from the data lake to the data warehouse.
  • Transformed data can then be stored in the data warehouse/data marts.

4. Cloud Data Warehouse – Provide OLAP capabilities with support for Massively Parallel Processing (MPP) and columnar data storage:

  • Complex data aggregations are made possible with support for SQL querying.
  • Support for structured and semi-structured data formats.
  • Curated data is stored and made available for consumption by BI applications/data scientists/data engineers using role-based (RBAC) or attribute-based access controls (ABAC).

Figure: Stages of the Data Lakehouse pattern
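
To make the Data Transformation stage above concrete, here is a minimal, platform-agnostic PySpark sketch: it reads raw files from lake storage, applies light cleansing, and writes a partitioned, columnar copy to a curated zone that a cloud warehouse can load or query externally. The bucket paths and column names are placeholders, not references to any specific deployment.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("lakehouse-transform").getOrCreate()

# Read raw data from the lake's raw zone (s3:// on AWS, abfss:// on Azure, gs:// on GCP).
raw = spark.read.option("header", True).csv("s3://<raw-zone>/orders/")

# Light transformations: type casting, deduplication, and a derived partition column.
curated = (
    raw.withColumn("order_ts", F.to_timestamp("order_ts"))
       .withColumn("order_amount", F.col("order_amount").cast("decimal(18,2)"))
       .withColumn("order_date", F.to_date("order_ts"))
       .dropDuplicates(["order_id"])
)

# Persist as partitioned Parquet in the curated zone; the warehouse
# (Redshift/Synapse/BigQuery/Snowflake) can then load it or expose it as an external table.
curated.write.mode("overwrite").partitionBy("order_date").parquet("s3://<curated-zone>/orders/")
```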

Let’s now get into the details of how this pattern can be implemented on different cloud platforms. Specifically, we’ll look at how a Lakehouse can be set up by migrating from an on-premise data warehouse or other data sources by leveraging cloud-based warehouses like AWS Redshift, Azure Synapse, GCP BigQuery, or Snowflake.

How can a Lakehouse be implemented on different cloud platforms?

Lakehouse design using AWS

We deployed an AWS-based Data Lakehouse for a US-based capital markets company, with a Lakehouse pattern consisting of:

  • Python APIs for Data Ingestion
  • S3 for the Data Lake
  • Glue for ETL
  • Redshift as a Data Warehouse

Here’s how you can implement a Lakehouse design using AWS

Figure: Lakehouse design using AWS

1. Data Ingestion:
  • AWS Database Migration Service to migrate data from the on-premise data warehouse to the cloud.
  • AWS Snow family/Transfer family to load data from other sources.
  • AWS Kinesis (Streams, Firehose, Data Analytics) for real-time streaming.
  • Third-party tools like Fivetran can also be used for moving data.
2. Data Lake:
  • AWS S3 as the data lake to store all forms of data.
  • AWS Lake Formation is helpful for creating secure, managed data lakes within a short span of time.
3. Data Transformation:
  • Spark-based transformations can be done using EMR or Glue.
  • No-code/workflow-based transformations are done using AWS Glue Studio/Glue DataBrew. Third-party tools like dbt can also be used.
4. Data Warehouse:
  • AWS Redshift as the data warehouse, which supports both structured data (table formats) and semi-structured data (the SUPER datatype); a short querying sketch follows below.
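
As an illustration of the semi-structured support mentioned above, the sketch below uses the Redshift Data API (boto3) to run a PartiQL-style query over a SUPER column. The cluster identifier, database, table, and JSON structure are placeholders, assuming a table where raw JSON events have already been loaded into a SUPER column.

```python
import boto3

client = boto3.client("redshift-data", region_name="us-east-1")

# PartiQL-style dot/array navigation over a SUPER column holding raw JSON events.
sql = """
    SELECT e.payload.order_id,
           e.payload.items[0].sku
    FROM clickstream_events AS e
    WHERE e.payload.channel = 'web'
    LIMIT 10;
"""

resp = client.execute_statement(
    ClusterIdentifier="my-redshift-cluster",  # placeholder cluster name
    Database="analytics",
    DbUser="etl_user",
    Sql=sql,
)

# The call is asynchronous: poll describe_statement and fetch rows with get_statement_result.
print(resp["Id"])
```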

Lakehouse design using Azure

One of our clients was an APAC-based automotive company. After evaluating their requirements, we deployed a Lakehouse pattern consisting of:

  • ADF for Data Ingestion
  • ADLS Gen 2 for the Data Lake
  • Databricks for ETL
  • Synapse as a Data Warehouse

Let’s look at how we can deploy a Lakehouse design using Azure

Figure: Lakehouse design using Azure

1. Data Ingestion:
  • Azure Database Migration Service and SQL Server Migration Assistant (SSMA) to migrate the data from an on-premise data warehouse to the cloud.
  • Azure Data Factory (ADF) and Azure Data Box can be used for loading data from other data sources.
  • Azure Stream Analytics for real-time streaming.
  • Third-party tools like Fivetran can also be used for moving data.
2. Data Lake:
  • ADLS as the data lake to store all forms of data.
3. Data Transformation:
  • Spark-based transformations can be done using Azure Databricks.
  • Azure Synapse itself supports various transformation options using Data Explorer/Spark/Serverless pools.
  • Third-party tools like Fivetran can also be used for moving data.
4. Data Warehouse:
  • Azure Synapse as the data warehouse.
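
A sketch of the Databricks-based transformation step on Azure is shown below: raw JSON is read from an ADLS Gen2 container, cleaned, and written as Delta into a curated container, which Synapse (serverless SQL or Spark pools) can then query or load. The storage account, container names, and columns are placeholders, and the code assumes it runs on a Databricks cluster where the `spark` session and ADLS credentials are already configured.

```python
from pyspark.sql import functions as F

# spark is provided by the Databricks runtime; ADLS access is assumed to be configured
# (e.g., via a service principal or credential passthrough).
raw_path = "abfss://raw@<storageaccount>.dfs.core.windows.net/telemetry/"
curated_path = "abfss://curated@<storageaccount>.dfs.core.windows.net/telemetry/"

raw = spark.read.json(raw_path)

curated = (
    raw.withColumn("event_ts", F.to_timestamp("event_ts"))
       .withColumn("event_date", F.to_date("event_ts"))
       .filter(F.col("vehicle_id").isNotNull())
)

# Delta output in the curated zone; Synapse can query these files or load them into tables.
curated.write.format("delta").mode("overwrite").partitionBy("event_date").save(curated_path)
```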

Lakehouse design using GCP

Our US-based retail client needed to manage large volumes of data. We deployed a GCP-based Data Lakehouse with a Lakehouse pattern consisting of:

  • Pub/Sub for Data Ingestion
  • GCS for the Data Lake
  • Dataproc for ETL
  • BigQuery as a Data Warehouse

Here’s how you can deploy a Lakehouse design using GCP

Figure: Lakehouse design using GCP

1. Data Ingestion:
  • BigQuery Data Transfer Service to migrate the data from an on-premise data warehouse to the cloud.
  • GCP Storage Transfer Service can be used for loading data from other sources.
  • Pub/Sub for real-time streaming.
  • Third-party tools like Fivetran can also be used for moving data.
2. Data Lake:
  • Google Cloud Storage as the data lake to store all forms of data.
3. Data Transformation:
  • Spark-based transformations can be done using Dataproc.
  • No-code/workflow-based transformations are done using Dataflow. Third-party tools like dbt can also be used.
4. Data Warehouse:
  • GCP BigQuery as the data warehouse, which supports both structured and semi-structured data; a short Dataproc-to-BigQuery sketch follows below.
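
Below is a sketch of the Dataproc transformation step writing curated data to BigQuery via the spark-bigquery connector. It assumes the connector jar is available on the Dataproc cluster; the bucket, project, and dataset names are placeholders.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("gcs-to-bigquery").getOrCreate()

# Read raw files from the GCS data lake.
raw = spark.read.option("header", True).csv("gs://<raw-bucket>/sales/")

curated = (
    raw.withColumn("sale_amount", F.col("sale_amount").cast("double"))
       .withColumn("sale_date", F.to_date("sale_date"))
)

# Write to BigQuery through the spark-bigquery connector (indirect write via a staging bucket).
(curated.write.format("bigquery")
    .option("table", "<project>.<dataset>.sales_curated")
    .option("temporaryGcsBucket", "<staging-bucket>")
    .mode("overwrite")
    .save())
```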

Lakehouse design using Snowflake:

We successfully deployed a Snowflake-based Data Lakehouse for our US-based real estate and supply chain logistics client, with a Lakehouse pattern consisting of:

  • AWS native services for Data Ingestion
  • AWS S3 for the Data Lake
  • Snowpark for ETL
  • Snowflake as a Data Warehouse

Here’s how you can deploy a Lakehouse design using Snowflake

Figure: Lakehouse design using Snowflake

Implementing a Lakehouse on Snowflake is somewhat unique in that the underlying cloud platform can be any of the big three (AWS, Azure, or GCP), with Snowflake running on top of it.

1. Data Ingestion:
  • Cloud-native migration services can be used to land the data in the respective cloud storage.
  • Third-party tools like Airbyte and Fivetran can also be used to ingest the data.
2. Data Lake:
  • Depends on the cloud platform: AWS – S3, Azure – ADLS and GCP – GCS.
  • Data can also be directly loaded onto Snowflake. However, a data lake storage is needed to store unstructured data.
  • Directory tables can be used to catalog the staged files in cloud storage.
3. Data Transformation:
  • Spark-based transformations can be done using Snowpark, with support for Python. Snowflake also natively supports some transformations while using the COPY command.
  • Snowpipe supports transformations on streaming data as well.
  • Third-party tools like dbt can also be leveraged.
4. Data Warehouse:
  • Snowflake as the data warehouse, which supports both structured data (table formats) and semi-structured data (the VARIANT datatype). Other options like internal/external stages can also be utilized to reference the data stored on cloud-based storage systems.
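
The sketch below shows a Snowpark-based version of the transformation step: it reads staged CSV files from an external stage that points at the data lake, applies a couple of transformations, and saves a curated table in Snowflake. The connection parameters, stage, schema, and column names are placeholders.

```python
from snowflake.snowpark import Session
from snowflake.snowpark import functions as F
from snowflake.snowpark.types import StructType, StructField, StringType, DoubleType

# Placeholder connection parameters; in practice these come from a secrets store.
connection_parameters = {
    "account": "<account>", "user": "<user>", "password": "<password>",
    "warehouse": "<warehouse>", "database": "<database>", "schema": "RAW",
}
session = Session.builder.configs(connection_parameters).create()

schema = StructType([
    StructField("ORDER_ID", StringType()),
    StructField("ORDER_AMOUNT", DoubleType()),
    StructField("ORDER_DATE", StringType()),
])

# Read CSV files from an external stage that points at the S3/ADLS/GCS data lake.
raw = session.read.schema(schema).option("skip_header", 1).csv("@RAW.S3_STAGE/orders/")

curated = (
    raw.with_column("ORDER_DATE", F.to_date("ORDER_DATE"))
       .filter(F.col("ORDER_ID").is_not_null())
)

curated.write.mode("overwrite").save_as_table("CURATED.ORDERS")
```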

Integrating enterprise data into a modern storage architecture is key to realizing value from BI and ML use cases. At Tiger Analytics, we have seen that implementing the architecture detailed above has streamlined data access and storage for our clients. Using this blueprint, you can migrate from your legacy data warehouse onto a cloud-based lakehouse setup.

How to Design your own Data Lake Framework in AWS

Learn how you can efficiently build a Data Lakehouse with Tiger Data Fabric’s reusable framework. We leverage AWS’s native services and open-source tools in a modular, multi-layered architecture. Explore our insights and core principles to tailor a solution for your unique data challenges.

“Data is a precious thing and will last longer than the systems themselves.” – Tim Berners-Lee, inventor of the World Wide Web

Organizations spend a lot of time and effort building pipelines to consume and publish data coming from disparate sources within their Data Lake. In fact, most of the time and effort in large data initiatives is consumed by data ingestion development.

What’s more, with an increasing number of businesses migrating to the cloud, breaking down data silos and enhancing the discoverability of data environments have become business priorities.

While the Data Lake is the heart of data operations, capabilities like data security, data quality, and a metadata store should be carefully tied into the ecosystem.

Properties of an Enterprise Data Lake solution

In a large-scale organization, the Data Lake should possess these characteristics:

  • Data Ingestion- the ability to consume structured, semi-structured, and unstructured data
  • Supports push (batch and streaming systems) and pull (DBs, APIs, etc.) mechanisms
  • Data security through sensitive data masking, tokenization, or redaction
  • Natively available rules through the Data Quality framework to filter impurities
  • Metadata Store, Data Dictionary for data discoverability and auditing capability
  • Data standardization for common data format

A common reusable framework is needed to reduce the time and effort in collecting and ingesting data. At Tiger Analytics, we are solving these problems by building a scalable platform within AWS using AWS’s native services and open-source tools. We’ve adopted a modular design and a loosely coupled, multi-layered architecture. Each layer provides a distinctive capability, and the layers communicate with each other via APIs, messages, and events. The platform abstracts complex processes in the backend and provides a simple, easy-to-use UI for the stakeholders.

  • Self-service UI to quickly configure data workflows
  • Configuration-based backend processing
  • AWS cloud native and open-source technologies
  • Data Provenance: data quality, data masking, lineage, recovery and replay audit trail, logging, notification

Before exploring the architecture, let’s understand a few logical components referenced in the blog.

  • Sources are individual entities that are registered with the framework. They align with systems that own one or more data assets. The system could be a database, a vendor, or a social media website. Entities registered within the framework store various system properties. For instance, if it is a database, then DB Type, DB URL, host, port, username, etc.
  • Assets are the entries within the framework. They hold the properties of individual files from various sources. Metadata of source files includes column names, data types, security classifications, DQ rules, data obfuscation properties, etc.
  • Targets organize data as per enterprise needs. There are various domains/sub-domains to store the data assets. Based on the subject area of the data, the files can be stored in their specific domains. (A hypothetical registration payload illustrating these three components follows below.)
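
To make the Source/Asset/Target split concrete, here is a hypothetical registration payload. The field names and structure are illustrative only; they are not the framework’s actual schema.

```python
# Hypothetical asset-registration payload; field names are illustrative, not the real schema.
asset_registration = {
    "source": {
        "name": "sales_db",
        "type": "database",
        "properties": {"db_type": "postgres", "host": "<host>", "port": 5432, "username": "<user>"},
    },
    "asset": {
        "name": "orders",
        "file_format": "csv",
        "columns": [
            {"name": "order_id", "datatype": "string", "classification": "public"},
            {"name": "customer_email", "datatype": "string", "classification": "pii",
             "obfuscation": "tokenize"},
        ],
        "dq_rules": [{"column": "order_id", "check": "primary_key"}],
        "trigger": "event",  # or "time", with a CRON expression for scheduled runs
    },
    "target": {"domain": "sales", "sub_domain": "orders", "zone": "curated"},
}
```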

The Design Outline

With the demands to manage large volumes of data increasing year on year, our data fabric was designed to be modular, multi-layered, customizable, and flexible enough to suit individual needs and use cases. Whether it is a large banking organization with millions of transactions per day and a strong focus on data security or a start-up that needs clean data to extract business insights, the platform can help everyone.

Following the same modular and multi-layered design principle, we at Tiger have put together the architecture with the provision to swap out components or tools if needed. Keeping in mind that the world of technology is ever-changing and volatile, we’ve built flexibility into the system.

  • UI Portal provides a user-friendly self-service interface to set up and configure sources, targets, and data assets. These elements drive the data consumption from the source to the Data Lake. These self-service applications allow the federation of data ingestion: data owners manage and support their own data assets, and teams can easily onboard data assets without building individual pipelines. The interface is built using ReactJS with Material-UI for high-quality front-end graphics. The portal is hosted on an AWS EC2 instance for resizable compute capacity and scalability.
  • API Layer is a set of APIs that invoke various functionalities, including CRUD operations and AWS service setup. These APIs create the source, asset, and target entities. The layer supports both synchronous and asynchronous APIs. API Gateway and Lambda functions provide the base of this component. Moreover, DynamoDB captures requested events for audit and support purposes.
  • Config and Metadata DB is the data repository to capture the control and configuration information. It holds the framework together through a complex data model which reduces data redundancy and provides quick query retrieval. The framework uses AWS RDS PostgreSQL which natively implements Multiversion Concurrency Control (MVCC). It provides point-in-time consistent views without read locks, hence avoiding contentions.
  • Orchestration Layer strings together various tasks with dependencies and relationships within a data pipeline. These pipelines are built on Apache Airflow. Every data asset has its own pipeline, thereby providing more granularity and control over individual flows. Individual DAGs are created through an automated process called the DAG-Generator, a Python-based program tied to the API that registers data assets. Every time a new asset is registered, the DAG-Generator creates a DAG based on the configurations, which is then uploaded to the Airflow server. These DAGs may be time-driven or event-driven, based on the source system (a simplified generator sketch follows this list).
  • Execution Layer is the final layer where the magic happens. It comprises various individual Python-based programs within AWS Glue jobs. We will be seeing more about this in the following section.
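
The sketch below is a simplified, illustrative DAG-Generator: it renders one Airflow DAG file per registered asset from its configuration. The template, task names, and Glue job names are assumptions for illustration; the actual generator’s template and task set are internal to the framework (the GlueJobOperator import shown is from the Amazon provider package for Airflow 2.x).

```python
# Illustrative config-driven DAG generation; template and job names are placeholders.
DAG_TEMPLATE = '''
from datetime import datetime
from airflow import DAG
from airflow.providers.amazon.aws.operators.glue import GlueJobOperator

with DAG(
    dag_id="{dag_id}",
    schedule_interval={schedule!r},  # CRON string for time-driven assets, None for event-driven
    start_date=datetime(2022, 1, 1),
    catchup=False,
) as dag:
    ingest = GlueJobOperator(task_id="ingest", job_name="{ingest_job}")
    data_quality = GlueJobOperator(task_id="data_quality", job_name="{dq_job}")
    ingest >> data_quality
'''


def generate_dag(asset_config: dict, output_dir: str = "/opt/airflow/dags") -> str:
    """Render a DAG file for a registered asset and return the file path."""
    dag_id = f"asset_{asset_config['asset']['name']}"
    dag_code = DAG_TEMPLATE.format(
        dag_id=dag_id,
        schedule=asset_config["asset"].get("cron"),  # None => externally triggered (event-driven)
        ingest_job="ingestion_job",
        dq_job="data_quality_job",
    )
    path = f"{output_dir}/{dag_id}.py"
    with open(path, "w") as f:
        f.write(dag_code)
    return path
```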

Data Pipeline (Execution Layer)

A data pipeline is a set of tools and processes that automate the data flow from the source to a target repository. The data pipeline moves data from the onboarded source to the target system.

Figure 3: Concept Model – Execution Layer
  • Data Ingestion
  • Several patterns affect the way we consume/ingest data. They vary depending on the source systems and consuming frequency. For instance, ingesting data from a database requires additional capabilities compared to consuming a file dropped by a third-party vendor.

    Figure 4: Data Ingestion Quadrant

    The Data Ingestion Quadrant is our base outline to define consuming patterns. Depending on the properties of the data asset, the framework has the intelligence to use the appropriate pipeline for processing. To achieve this, we have individual S3 buckets for time-driven and event-driven sources. A driver Lambda function externally triggers the event-driven Airflow DAGs, while CRON expressions within the DAGs drive the time-driven schedules.

    These capabilities consume different file formats like CSV, JSON, XML, Parquet, etc. Connector libraries are used to pull data from various databases like MySQL, Postgres, Oracle, and so on (a minimal pull-based example follows below).
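
As a minimal illustration of the pull-based pattern, the sketch below uses Spark’s JDBC reader to extract a table from a relational source into the raw S3 zone. The URL, credentials, and table name are placeholders that would normally come from the source/asset configuration, and the appropriate JDBC driver is assumed to be on the classpath.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-ingest").getOrCreate()

# Pull a table from the source database; credentials would normally be fetched
# from AWS Secrets Manager rather than hard-coded.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://<host>:5432/<db>")
      .option("dbtable", "public.orders")
      .option("user", "<user>")
      .option("password", "<password>")
      .load())

# Land the extract in the raw zone of the data lake.
df.write.mode("overwrite").parquet("s3://<raw-bucket>/sales_db/orders/")
```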

  • Data Quality
  • Data is the core component of any business operation. Data Quality (DQ) in any enterprise system determines its success. The data platform requires a robust DQ framework that promises quality data in enterprise repositories.

    For this framework, we have the AWS open-source library called DEEQU. It is a library built on top of Apache Spark. Its Python interface is called PyDeequ. DEEQU provides data profiling capabilities, suggests DQ rules, and executes several checks. We have divided DQ checks into two categories:

    Default Checks are the DQ rules that automatically apply to the attributes. For instance, Length Check, Datatype Check, and Primary Key Check. These data asset properties are defined while registering in the system.

    Advanced Checks are the additional DQ rules. They are applied to various attributes based on the user’s needs. The user defines these checks and stores them in the metadata.

    The DQ framework pulls these checks from the metadata store and identifies the default checks through data asset properties. Eventually, it constructs a bulk check module for data execution. DQ results are stored in the backend database, and the logs are stored in the S3 bucket for detailed analysis. A DQ summary available in the UI provides additional transparency to business users. (A PyDeequ-based sketch of such checks follows below.)
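
The sketch below expresses a few "default checks" (completeness, uniqueness, and a length-style check) with PyDeequ, the Python interface to Deequ mentioned above. Column names are placeholders, `df` and `spark` are assumed to be an existing DataFrame and Spark session, and PyDeequ requires the Deequ jar on the classpath and the SPARK_VERSION environment variable to be set.

```python
from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationSuite, VerificationResult

# Default checks assembled from asset metadata: null check, primary-key check, length check.
check = (Check(spark, CheckLevel.Error, "default_checks")
         .isComplete("order_id")                                      # null/completeness check
         .isUnique("order_id")                                        # primary-key style check
         .hasMaxLength("country_code", lambda length: length <= 2))   # length check

result = VerificationSuite(spark).onData(df).addCheck(check).run()

# Persist or inspect the results; here we simply render them as a DataFrame.
result_df = VerificationResult.checkResultsAsDataFrame(spark, result)
result_df.show(truncate=False)
```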

  • Data Obfuscation/Masking
  • Data masking is the capability to protect sensitive information. While registering a data asset, the framework has a provision to enable tokenization on sensitive columns. The Data Masking task uses an internal algorithm and a key (associated with the framework and stored in AWS Secrets Manager). It tokenizes those specific columns before storing them in the Data Lake. These attributes can be detokenized through user-defined functions, and detokenization requires additional key access, which blocks attempts by unauthorized users.

    The framework also supports other forms of irreversible data obfuscation, such as redaction and data perturbation. (An illustrative tokenization sketch follows below.)
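
Below is an illustrative keyed-tokenization sketch using a PySpark UDF, with the key fetched from AWS Secrets Manager. It is a stand-in for the framework’s internal, unpublished algorithm; note that a plain HMAC like this is one-way, whereas the framework’s tokenization can be reversed with the right key access. The secret name, its JSON layout, and the column are placeholders, and `df` is assumed to be an existing DataFrame.

```python
import hashlib
import hmac
import json

import boto3
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Fetch the masking key from AWS Secrets Manager (secret name and layout are placeholders).
secret = boto3.client("secretsmanager").get_secret_value(SecretId="<masking-key-secret>")
key = json.loads(secret["SecretString"])["token_key"].encode("utf-8")


def tokenize(value):
    # Keyed, deterministic digest so the same input always maps to the same token.
    if value is None:
        return None
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


tokenize_udf = F.udf(tokenize, StringType())

# Replace the sensitive column with its tokenized form before landing in the Data Lake.
masked_df = df.withColumn("customer_email", tokenize_udf(F.col("customer_email")))
```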

  • Data Standardization

Data standardization brings data into a common format, allowing data to be accessed using a common set of tools and libraries. The framework executes standardization operations for data consistency (a short sketch follows the list below). The framework can therefore:

  1. Standardize target column names.
  2. Support file conversion to parquet format.
  3. Remove leading zeroes from integer/decimal columns.
  4. Standardize target column datatypes.
  5. Add partitioning column.
  6. Remove leading and trailing white spaces from string columns.
  7. Support date format standardization.
  8. Add control columns to target data sets.
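
The sketch below implements a few of the operations listed above (column renaming, datatype casting, whitespace trimming, partition and control columns, and Parquet output). The column names, mappings, and target path are placeholders, and `df` is assumed to be an existing DataFrame.

```python
from pyspark.sql import functions as F

std_df = df

# 1 & 4: standardize target column names and datatypes.
for old_name, (new_name, new_type) in {
    "OrderId": ("order_id", "string"),
    "OrderAmt": ("order_amount", "decimal(18,2)"),
}.items():
    std_df = (std_df.withColumnRenamed(old_name, new_name)
                    .withColumn(new_name, F.col(new_name).cast(new_type)))

# 6: remove leading and trailing whitespace from string columns.
for col_name, col_type in std_df.dtypes:
    if col_type == "string":
        std_df = std_df.withColumn(col_name, F.trim(F.col(col_name)))

# 5 & 8: add a partitioning column and control columns.
std_df = (std_df.withColumn("load_date", F.current_date())
                .withColumn("ingestion_run_id", F.lit("<run-id>")))

# 2: persist in Parquet format, partitioned by load date.
std_df.write.mode("overwrite").partitionBy("load_date").parquet("s3://<lake-bucket>/curated/orders/")
```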

Through this blog, we’ve shared insights on our generic architecture to build a Data Lake within the AWS ecosystem. While we can keep adding more capabilities to solve real-world problems, this is just a glimpse of data challenges that can be addressed efficiently through layered and modular design. You can use these learnings to put together the outline of a design that works for your use case while following the same core principles.

Unlocking the Potential of Modern Data Lakes: Trends in Data Democratization, Self-Service, and Platform Observability

Learn how self-service management, intelligent data catalogs, and robust observability are transforming data democratization. Walk through the crucial steps and cutting-edge solutions driving modern data platforms towards greater adoption and democratization.

Modern Data Lake

Data Lake solutions started emerging out of technology innovations such as Big Data, but have been propelled to a greater extent by the cloud. The prevalence of the Data Lake can be attributed to the better speed of data retrieval it brings compared to Data Warehouses, the elimination of a significant amount of modeling effort, the unlocking of advanced analytics capabilities for an enterprise, and the storage and compute scalability it provides to handle different kinds of workloads and enable data-driven decisions.

Data Democratization is one of the key outcomes sought from data platforms today. The need to bring reliable and trusted data in a self-service manner to end-users such as data analysts, data scientists, and business users is the top priority of any data platform. This blog discusses the key trends we see with our clients and in the industry that are helping data lakes cater to wider audiences and increase their adoption among consumers.

Figure: Business value increases over time, along with a significant reduction in TCO

Self-Service Management of Modern Data Lake

“…self-service is basically enabling all types of users (IT or business), to easily manage and govern the data on the lake themselves – in a low/no-code manner”

Building robust, scalable data pipelines is the first step in leveraging data to its full potential. However, it is the incorporation of automation and self-serving capabilities that really helps one achieve that goal. It will also help in democratizing the data, platform, and analytics capabilities to all types of users and reducing the burden on IT teams significantly, so they can focus on high-value tasks.

Building Self-Service Capabilities

Self-serving capabilities are built on top of robust, scalable data pipelines. Any data lake implementation will involve building various reusable frameworks and components for acquiring data from the storage systems (components that can understand and infer the schema, check data quality, implement certain functionalities when bringing in or transforming the data), and loading it into the target zone.

Data pipelines are built using these reusable components and frameworks. These pipelines ingest, wrangle, transform, and egress the data. Stopping at this point would rob an organization of the opportunity to leverage the data to its full potential.

To maximize the result, APIs (following a microservices architecture) for data and platform management are created to perform CRUD operations and support monitoring. They can also be used to schedule and trigger pipelines, discover and manage datasets, and handle cluster management, security, and user management. Once the APIs are set, you can build a web UI-based interface that orchestrates all these operations and helps any user navigate it, bring in the data, transform the data, send out the data, or manage the pipelines.

Figure: Building self-service capabilities
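
As a purely hypothetical illustration of such a platform-management API, the sketch below exposes a "trigger pipeline" endpoint with FastAPI and forwards the request to Airflow’s stable REST API (POST /api/v1/dags/{dag_id}/dagRuns). The endpoint path, Airflow URL, and basic-auth credentials are placeholders, Airflow is used only as an example orchestrator, and the basic-auth backend is assumed to be enabled.

```python
import requests
from fastapi import FastAPI, HTTPException

app = FastAPI()
AIRFLOW_URL = "http://<airflow-host>:8080/api/v1"  # placeholder


@app.post("/pipelines/{dag_id}/runs")
def trigger_pipeline(dag_id: str):
    """Trigger a named pipeline by creating a DAG run through Airflow's REST API."""
    resp = requests.post(
        f"{AIRFLOW_URL}/dags/{dag_id}/dagRuns",
        json={"conf": {}},
        auth=("<user>", "<password>"),  # placeholder credentials
        timeout=30,
    )
    if resp.status_code >= 400:
        raise HTTPException(status_code=resp.status_code, detail=resp.text)
    return resp.json()
```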

Tiger has also taken self-service on the Data Lake to another level by building a virtual assistant that interacts with the user to perform the above-mentioned tasks.

Data Catalog Solution

Another common trend we see in modern data lakes is the increasing adoption of next-gen Data Catalog Solutions. A Data Catalog Solution comes in handy when we’re dealing with huge volumes of data and multiple data sets. It can extract and understand technical metadata from different datasets, link them together, understand their health, reliability, and usage patterns, and help any consumer, whether a data scientist, a data analyst, or a business analyst, with insight generation.

Data Catalogs have been around for quite some time, but they are now becoming far more intelligent. It is no longer just about bringing in the technical metadata.

Data Catalog Implementation

Knowledge graphs and powerful search technologies are some of the vital parts of building a data catalog. A knowledge graph solution can bring in dataset information such as schema, data quality, profiling statistics, PII flags, and classification. It can also figure out who owns a particular dataset, which users consume it (derived from various logs), and which departments those users belong to.

This knowledge graph can be used to carry out search and filter operations, graph queries, recommendations, and visual explorations.

Figure: Knowledge Graph
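
A toy version of such a graph, built with networkx, is sketched below: datasets, owners, and consumers become nodes, relationships become edges, and simple relationship queries fall out naturally. The node names and attributes are invented for illustration; a production catalog would typically sit on a graph database and a search index.

```python
import networkx as nx

g = nx.DiGraph()

# Nodes: datasets and users, with catalog attributes hanging off them.
g.add_node("dataset:sales_orders", type="dataset", pii=False, dq_score=0.97)
g.add_node("user:asha", type="user", department="Finance")
g.add_node("user:ravi", type="user", department="Marketing")

# Edges: ownership and consumption relationships.
g.add_edge("user:asha", "dataset:sales_orders", relation="owns")
g.add_edge("user:ravi", "dataset:sales_orders", relation="consumes")

# Example graph query: who consumes datasets owned by someone in Finance?
finance_owned = [
    dataset for owner, dataset, attrs in g.edges(data=True)
    if attrs["relation"] == "owns" and g.nodes[owner]["department"] == "Finance"
]
consumers = [
    user for user, dataset, attrs in g.edges(data=True)
    if attrs["relation"] == "consumes" and dataset in finance_owned
]
print(consumers)  # ['user:ravi']
```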

Data and Platform Observability

Tiger looks at Observability in three different stages:

1. Basic Data Health Monitoring
2. Advanced Data Health Monitoring with Predictions
3. Extending to Platform Observability

Basic Data Health Monitoring

Identifying critical data elements (CDEs) and monitoring them is the most basic aspect of data health monitoring. We configure rule-based data checks against these CDEs, capture the results periodically, and provide visibility through dashboards. Issues are also tracked through ticketing systems and then fixed at the source as far as possible. This process constitutes the first stage in ensuring Data Observability.

The key capabilities required to achieve this level of maturity include rule-based check configuration, scheduled execution, results capture, dashboards, and issue tracking.

Advanced Data Health Monitoring with Predictions

Most of the enterprise clients we work with have reached the Basic Data Health Monitoring stage and are looking to progress further. The observability ecosystem needs to be enhanced with some important capabilities that help move from a reactive response to a more proactive one, and Artificial Intelligence and Machine Learning are the latest technologies being leveraged to this end. Some of the key capabilities include measuring data drift and schema drift, automatically classifying incoming information with AI/ML, automatically detecting PII and processing it appropriately, and automatically assigning security entitlements based on similar elements in the data platform. These capabilities elevate the health of the data to the next level while also giving early warnings when data patterns are changing. (A minimal drift-detection sketch follows below.)
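
As a minimal sketch of one of these capabilities, the example below flags possible data drift by comparing a numeric column’s latest batch against a reference sample with a two-sample Kolmogorov-Smirnov test from scipy. The file names, column, and threshold are placeholders; a production setup would track many columns and raise alerts through the observability stack.

```python
import numpy as np
from scipy.stats import ks_2samp

# Placeholder inputs: a baseline sample captured earlier and the latest batch's values.
reference = np.load("reference_order_amounts.npy")
current = np.load("current_order_amounts.npy")

# Two-sample KS test: a small p-value suggests the distributions differ, i.e. possible drift.
stat, p_value = ks_2samp(reference, current)

if p_value < 0.01:  # placeholder threshold
    print(f"Possible data drift detected (KS statistic={stat:.3f}, p={p_value:.4f})")
else:
    print("No significant drift detected")
```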

Extending to Platform Observability

The end goal of Observability Solutions is to deliver reliable data to consumers in a timely fashion. This goal can be achieved only when we move beyond data observability into the actual platform that delivers the data itself. This platform has to be modern and state of the art so that it can deliver the data in a timely manner while also allowing engineers and administrators to understand and debug issues when things are not going well. The following are some of the key capabilities to think about to improve platform-level observability:

  • Monitoring Data Flows & Environment: Ability to monitor job performance degradation, server health, and historical resource utilization trends in real-time
  • Monitor Performance: Understanding how data flows from one system to another and looking for bottlenecks in a visual manner would be very helpful in complex data processing environments
  • Monitor Data Security: Query logs, access patterns, security tools, etc need to be monitored in order to ensure there is no misuse of data
  • Analyze Workloads: Automatically detecting issues and constraints in large data workloads that make them slow and building tools for Root Cause Analysis
  • Predict Issues, Delays, and Find Resolutions: Comparing historical performance to current operational efficiency, in terms of speed and resource usage, to predict issues and offer solutions
  • Optimize Data Delivery: Building tools into the system that continuously adjust resource allocation based on data-volume spike predictions and thus optimizing TCO

Conclusion

The Modern Data Lake environment is driven by the value of data democratization. It is important to make data management and insight gathering accessible to end-users of all kinds. Self-service aided by Intelligent Data Catalogs is the most promising solution for effective data democratization. Moreover, enabling trust in the data for data consumers is of utmost importance. The capabilities discussed, such as Data and Platform Observability, give users real under-the-hood control over onboarding, processing, and delivering data to different consumers. Companies are striving to create end-to-end observability solutions that enable data-driven decisions today, and these will be the solutions that take data platforms to the next level of adoption and democratization.

We spoke more on this at the DES22 event.
