A Practical Guide to Setting Up Your Data Lakehouse across AWS, Azure, GCP and Snowflake

Explore the evolution from Enterprise Data Warehouses to Data Lakehouses on AWS, Azure, GCP, and Snowflake. This comparative analysis outlines the key implementation stages, helping organizations leverage modern, cloud-based Lakehouse setups for enhanced BI and ML operations.

Today, most large enterprises collect huge amounts of data in various forms – structured, semi-structured, and unstructured. While Enterprise Data Warehouses are ACID compliant and well-suited to BI use cases, the kind of data collected today for ML use cases requires far more flexibility in data structure and far more scalability in data volume than warehouses can currently provide.

The initial solution to this issue came with advancements in cloud computing and the advent of data lakes in the late 2010s. Data lakes are built on top of cloud-based storage such as AWS S3 buckets and Azure Blob/ADLS. They are flexible and scalable, but many of the warehouse's benefits, such as ACID compliance, are lost.

Figure: Evolution of the Data Lakehouse

The Data Lakehouse has evolved to address this gap. It combines the best of Data Lakes and Data Warehouses – the ability to store all forms of data at scale while retaining the structure and ACID guarantees of a warehouse.

Figure: The Data Lakehouse

At Tiger Analytics, we’ve worked across platforms like AWS, Azure, GCP, and Snowflake to build Lakehouses for our clients. Here’s a comparative study of how to implement a Lakehouse pattern across the major available platforms.

Questions you may have:

All the major cloud platforms in the market today can be used to implement a Lakehouse pattern. The first question that comes to mind is: how do they compare, i.e., what services are available across the different stages of the data flow? The next: given that many organizations are moving toward a multi-cloud setup, how do we set up Lakehouses across platforms?

Finding the right Lakehouse architecture

Given that the underlying cloud platform may vary, let’s look at a platform-agnostic architecture pattern, followed by details of platform-specific implementations. The pattern detailed below enables data ingestion from a variety of data sources and supports all forms of data like structured (tables, CSV files, etc.), semi-structured (JSON, YAML, etc.), and unstructured (blobs, images, audio, video, PDFs, etc.).

It comprises four stages based on the data flow:

1. Data Ingestion – To ingest the data from the data sources to a Data Lake storage:

  • This can be done using cloud-native services like AWS DMS, Azure DMS, and GCP Data Transfer service, or third-party tools like Airbyte and Fivetran.

2. Data Lake Storage – Provide durable storage for all forms of data:

  • Options depend on the cloud service provider – for example, S3 on AWS, ADLS/Blob on Azure, and Google Cloud Storage on GCP.

3. Data Transformation – Transform the data stored in the data lake:

  • Languages like Python, Spark, PySpark, and SQL can be used, based on requirements (a minimal PySpark sketch follows this list).
  • Data can also be loaded into the data warehouse without performing any transformations.
  • Services like AWS Glue Studio and GCP Dataflow provide workflow-based/no-code options to load data from the data lake into the data warehouse.
  • Transformed data can then be stored in the data warehouse/data marts.

4. Cloud Data Warehouse – Provide OLAP capabilities with support for Massively Parallel Processing (MPP) and columnar data storage:

  • Complex data aggregations are made possible with support for SQL querying.
  • Support for structured and semi-structured data formats.
  • Curated data is stored and made available for consumption by BI applications/data scientists/data engineers using role-based (RBAC) or attribute-based access controls (ABAC).
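As an illustration of the Data Transformation stage, here is a minimal, platform-agnostic PySpark sketch that reads raw files landed in the data lake, applies some basic cleansing, and writes a curated dataset back to the lake. The storage paths and column names are placeholder assumptions – adapt them to your own layout.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lake-curation").getOrCreate()

# Raw zone: files landed by the ingestion layer (paths are placeholders)
orders_raw = spark.read.option("header", True).csv("s3a://my-lake/raw/orders/")

# Basic cleansing and typing before the data moves toward the warehouse
orders_curated = (
    orders_raw
    .withColumn("order_amount", F.col("order_amount").cast("double"))
    .withColumn("order_date", F.to_date("order_date", "yyyy-MM-dd"))
    .dropDuplicates(["order_id"])
)

# Curated zone, stored as Parquet and partitioned for downstream consumption
orders_curated.write.mode("overwrite").partitionBy("order_date").parquet("s3a://my-lake/curated/orders/")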

Figure: Stages of the Data Lakehouse

Let’s now get into the details of how this pattern can be implemented on different cloud platforms. Specifically, we’ll look at how a Lakehouse can be set up by migrating from an on-premise data warehouse or other data sources to cloud-based warehouses like AWS Redshift, Azure Synapse, GCP BigQuery, or Snowflake.

How can a Lakehouse be implemented on different cloud platforms?

Lakehouse design using AWS

We deployed an AWS-based Data Lakehouse for a US-based Capital company with a Lakehouse pattern consisting of:

  • Python APIs for Data Ingestion
  • S3 for the Data Lake
  • Glue for ETL
  • Redshift as a Data Warehouse

Here’s how you can implement a Lakehouse design using AWS:

Figure: Lakehouse on AWS

1. Data Ingestion:
  • AWS Database Migration Service to migrate data from the on-premise data warehouse to the cloud.
  • AWS Snow family/Transfer family to load data from other sources.
  • AWS Kinesis (Streams, Firehose, Data Analytics) for real-time streaming.
  • Third-party tools like Fivetran can also be used for moving data.
2. Data Lake:
  • AWS S3 as the data lake to store all forms of data.
  • AWS Lake Formation helps create secure, managed data lakes quickly.
3. Data Transformation:
  • Spark-based transformations can be done using EMR or Glue (a minimal Glue sketch follows this list).
  • No-code/workflow-based transformations can be done using AWS Glue Studio/Glue DataBrew. Third-party tools like dbt can also be used.
4. Data Warehouse:
  • AWS Redshift as the data warehouse which supports both structured (table formats) and semi-structured data (SUPER datatype).
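To make the Glue-based transformation path concrete, here is a minimal sketch of a Glue PySpark job that reads Parquet data from the S3 data lake and loads it into Redshift. The bucket path, Glue connection name, and table names are assumptions, and the job presumes a Glue connection to the Redshift cluster has already been configured.

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read curated Parquet files from the S3 data lake (path is a placeholder)
sales_dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-lake/curated/sales/"]},
    format="parquet",
)

# Load into Redshift through the pre-configured Glue connection; the temp dir
# backs the COPY/UNLOAD that Glue performs under the hood
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=sales_dyf,
    catalog_connection="redshift-connection",  # assumed Glue connection name
    connection_options={"dbtable": "analytics.sales", "database": "dev"},
    redshift_tmp_dir="s3://my-lake/tmp/redshift/",
)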

Lakehouse design using Azure

One of our clients was an APAC-based automotive company. After evaluating their requirements, we deployed a Lakehouse pattern consisting of:

  • ADF for Data Ingestion
  • ADLS Gen 2 for the Data Lake
  • Databricks for ETL
  • Synapse as a Data Warehouse

Let’s look at how we can deploy a Lakehouse design using Azure:

Figure: Lakehouse on Azure

1. Data Ingestion:
  • Azure Database Migration Service and SQL Server Migration Assistant (SSMA) to migrate the data from an on-premise data warehouse to the cloud.
  • Azure Data Factory (ADF) and Azure Data Box can be used for loading data from other data sources.
  • Azure Stream Analytics for real-time streaming.
  • Third-party tools like Fivetran can also be used for moving data.
2. Data Lake:
  • ADLS as the data lake to store all forms of data.
3. Data Transformation:
  • Spark-based transformations can be done using Azure Databricks (a minimal Databricks sketch follows this list).
  • Azure Synapse itself supports various transformation options using Data Explorer/Spark/serverless SQL pools.
  • Third-party tools like Fivetran can also be used for moving data.
4. Data Warehouse:
  • Azure Synapse as the data warehouse.
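As an illustration of the Databricks-to-Synapse path, here is a minimal PySpark sketch that reads curated data from ADLS Gen2 and writes it to a Synapse dedicated SQL pool via the built-in Synapse (sqldw) connector. The storage account, JDBC URL, and table names are placeholders, and the cluster is assumed to already hold credentials for both ADLS and the staging location.

from pyspark.sql import SparkSession

# On Databricks a SparkSession already exists; getOrCreate() simply reuses it
spark = SparkSession.builder.getOrCreate()

# Read curated Parquet data from ADLS Gen2 (path is a placeholder)
sales_df = spark.read.parquet("abfss://curated@mydatalake.dfs.core.windows.net/sales/")

# Write to Azure Synapse; the connector stages data in ADLS (tempDir) and loads
# it into the dedicated SQL pool using PolyBase/COPY
(sales_df.write
    .format("com.databricks.spark.sqldw")
    .option("url", "jdbc:sqlserver://myworkspace.sql.azuresynapse.net:1433;database=dwh")  # placeholder
    .option("forwardSparkAzureStorageCredentials", "true")
    .option("dbTable", "dbo.sales")
    .option("tempDir", "abfss://staging@mydatalake.dfs.core.windows.net/tmp/")
    .mode("overwrite")
    .save())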

Lakehouse design using GCP

Our US-based retail client needed to manage large volumes of data. We deployed a GCP-based Data Lakehouse with a Lakehouse pattern consisting of:

  • Pub/Sub for Data Ingestion
  • GCS for the Data Lake
  • Dataproc for ETL
  • BigQuery as a Data Warehouse

Here’s how you can deploy a Lakehouse design using GCP:

Figure: Lakehouse on GCP

1. Data Ingestion:
  • BigQuery Data Transfer Service to migrate the data from an on-premise data warehouse to the cloud.
  • GCP Data Transfer service can be used for loading data from other sources.
  • Pub/Sub for real-time streaming.
  • Third-party tools like Fivetran can also be used for moving data.
2. Data Lake:
  • Google Cloud Storage as the data lake to store all forms of data.
3. Data Transformation:
  • Spark-based transformations can be done using Dataproc (a minimal Dataproc sketch follows this list).
  • No-code/workflow-based transformations can be done using Dataflow. Third-party tools like dbt can also be used.
4. Data Warehouse:
  • GCP BigQuery as the data warehouse which supports both structured and semi-structured data.
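Here is the equivalent sketch on GCP: a Dataproc PySpark job that reads curated data from GCS and writes it to BigQuery using the spark-bigquery connector (assumed to be available on the cluster). The bucket, project, dataset, and table names are placeholders.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gcs-to-bigquery").getOrCreate()

# Read curated Parquet data from the GCS data lake (path is a placeholder)
sales_df = spark.read.parquet("gs://my-lake/curated/sales/")

# Write to BigQuery; the connector stages data through a temporary GCS bucket
(sales_df.write
    .format("bigquery")
    .option("table", "my_project.analytics.sales")    # placeholder project.dataset.table
    .option("temporaryGcsBucket", "my-lake-staging")  # placeholder staging bucket
    .mode("overwrite")
    .save())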

Lakehouse design using Snowflake

We successfully deployed a Snowflake-based Data Lakehouse for our US-based real estate and supply chain logistics client, with a Lakehouse pattern consisting of:

  • AWS native services for Data Ingestion
  • AWS S3 for the Data Lake
  • Snowpark for ETL
  • Snowflake as a Data Warehouse

Here’s how you can deploy a Lakehouse design using Snowflake:

Figure: Lakehouse on Snowflake

Implementing a Lakehouse on Snowflake is somewhat unique in that the underlying cloud platform can be any of the big three (AWS, Azure, or GCP), with Snowflake running on top of it.

1. Data Ingestion:
  • Cloud-native migration services can be used to land the data in the respective cloud storage.
  • Third-party tools like Airbyte and Fivetran can also be used to ingest the data.
2. Data Lake:
  • The choice depends on the underlying cloud platform: S3 on AWS, ADLS on Azure, and GCS on GCP.
  • Data can also be loaded directly into Snowflake. However, data lake storage is still needed for unstructured data.
  • Directory tables can be used to catalog the staged files in cloud storage.
3. Data Transformation:
  • Spark-style transformations can be done using Snowpark, with support for Python (a minimal Snowpark sketch follows this list). Snowflake also natively supports some transformations within the COPY command.
  • Snowpipe supports transformations on streaming/continuously loaded data as well.
  • Third-party tools like dbt can also be leveraged.
4. Data Warehouse:
  • Snowflake as the data warehouse, which supports both structured (table formats) and semi-structured data (VARIANT datatype). Internal/external stages can also be used to reference data stored on cloud-based storage systems.
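To illustrate, here is a minimal Snowpark for Python sketch that copies staged Parquet files from an external stage into a raw table and then builds a curated table with a simple transformation. The connection parameters, stage, and table names are placeholders, and the raw table is assumed to already exist with matching columns.

from snowflake.snowpark import Session
from snowflake.snowpark.functions import col

# Connection parameters are placeholders; source them from a secrets store in practice
session = Session.builder.configs({
    "account": "<account>",
    "user": "<user>",
    "password": "<password>",
    "warehouse": "ANALYTICS_WH",
    "database": "ANALYTICS",
    "schema": "CURATED",
}).create()

# Load staged Parquet files from an external stage (over S3/ADLS/GCS) into a raw table
session.sql("""
    COPY INTO RAW_SALES
    FROM @LAKE_STAGE/sales/
    FILE_FORMAT = (TYPE = PARQUET)
    MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
""").collect()

# Snowpark DataFrame transformation, persisted as a curated table
raw_sales = session.table("RAW_SALES")
curated = raw_sales.filter(col("ORDER_AMOUNT") > 0).drop_duplicates("ORDER_ID")
curated.write.mode("overwrite").save_as_table("CURATED_SALES")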

Integrating enterprise data into a modern storage architecture is key to realizing value from BI and ML use cases. At Tiger Analytics, we have seen that implementing the architecture detailed above has streamlined data access and storage for our clients. Using this blueprint, you can migrate from your legacy data warehouse onto a cloud-based lakehouse setup.

Building Data Engineering Solutions: A Step-by-Step Guide with AWS

In this article, we delve into the intricacies of an AWS-based Analytics pipeline, so you can apply the same design thinking to similar challenges you might encounter and streamline your data workflows.

Introduction:

Many small to midsize companies use Analytics to understand business activity, lower their costs, and increase their reach. Some of these companies intend to build and maintain an Analytics pipeline but change their minds when they see how much money and technical know-how it takes. For any enterprise, data is an asset, and sharing that asset with external players risks the company’s market advantage. To extract maximum value from intelligence harvesting, enterprises need to build and maintain their own data warehouses and surrounding infrastructure.

The Analytics field is buzzing with talk of Machine Learning applications, which have complex requirements like storing and processing unstructured streaming data. But instead of pushing straight toward advanced analytics, companies can extract a lot of value simply by using good reporting infrastructure, because a lot of SME activity is still at the batch-data level. From an infrastructure point of view, cloud players like Amazon Web Services (AWS) and Microsoft Azure have taken away much of the complexity, enabling companies to implement an accurate, robust reporting infrastructure (more or less) independently and economically. This article is about a specific lightweight implementation of Data Engineering using AWS, which would be a good fit for an SME. By the time you finish reading this, you will:

1) Understand the basics of a simple Data Engineering pipeline
2) Know the details of a specific kind of AWS-based Analytics pipeline
3) Apply this design thinking to a similar problem you may come across

Analytics Data Pipeline:

SMEs have their business activity data stored in different places. Getting it all together so that a broad picture of the business’s health emerges is one of the big challenges in analytics. Gathering data from sources, storing it in a structured and accurate manner, then using that data to create reports and visualizations can give SMEs relatively large gains. From a process standpoint, this is what it might look like:

Figure 1: Simple Data Pipeline

But from a business activity effort standpoint, it’s more like:

Figure 2: Business Activity involved in a Data Pipeline

Here’s what’s interesting: although the first two components of the process consume most time and effort, when you look at it from a value chain standpoint, value is realized in the Analyze component.

Figure 3: Analytics Value Chain

The curiously inverse relationship between effort and value keeps SMEs wondering whether they will realize the returns they expect on their investment while keeping costs in check. Analytics today might seem to be all about Machine Learning and cutting-edge technology, but SMEs can realize a lot of value with relatively simple analytics like:

1) Time series graph on business activity for leadership
2) Bar graph visualization for sales growth over the years
3) For the Sales team: a refreshed, filterable dashboard showing the top ten clients over a chosen time period
4) For the Operations team: an email blast every morning at eight depicting business activity expense over a chosen time period

Many strategic challenges that SMEs face, like business reorganization, controlling operating costs, and crisis management, require accurate data to solve. Having an Analytics data pipeline in the cloud allows enterprises to take cost-optimized, data-driven decisions. These can include both strategic decision-making for the C-Suite and business-as-usual metrics for the Operations and Sales teams, allowing executives to track their progress. In a nutshell, an Analytics data pipeline makes company information accessible to executives. This is valuable in itself because it enables metrics monitoring (along with derived benefits like forecasting and predictions). There you have it, folks: a convincing case for SMEs to experiment with building an in-house Analytics pipeline.

Mechanics of the pipeline:

Before we get into vendors and the value they bring, here’s something for you to think about: there are as many ways to build an Analytics pipeline as there are stars in the sky. The challenge here is to create a data pipeline that is hosted on a secure cloud infrastructure. It’s important to use cloud-native compute and storage components so that the infrastructure is easy to build and operate for an SME.

Usually, source data for SMEs are in the following formats:

1) Payment information stored in Excel
2) Business activity information coming in as API
3) Third-party interaction exported as a .CSV to a location like S3

Using AWS as a platform enables SMEs to leverage the serverless compute of AWS Lambda when ingesting the source data into an Aurora Postgres RDBMS. Lambda supports many programming runtimes, including Python, a widely used language. Back in 2016-17, Lambda’s maximum runtime was capped at five minutes, which was not nearly enough for ETL. Two years later, the limit was increased to 15 minutes. This is still too little time to execute most ETL jobs, but it is enough for the batch data ingestion requirements of SMEs.

Lambda is usually hosted within a private subnet in the enterprise Virtual Private Cloud (VPC), but it can communicate with third-party source systems through a Network Address Translation (NAT) gateway and an Internet Gateway (IG). Python’s libraries (like Pandas) make tabular data quick and easy to process. Once processed, the output dataframe from Lambda is stored in a table in the Aurora Postgres database. The Aurora prefix denotes the AWS flavor of the Postgres database offering. It makes sense to choose a vanilla relational database because most of the data is in Excel-type rows-and-columns format anyway, and reporting engines like Tableau and other BI tools work well with RDBMS engines.
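A minimal sketch of such a Lambda handler is shown below: it reads a CSV file from S3 with Pandas and appends it to a staging table in the Aurora Postgres database. The bucket, key, table, and connection details are placeholder assumptions; in practice the credentials would come from AWS Secrets Manager or environment variables, and pandas/SQLAlchemy/psycopg2 would be packaged as a Lambda layer.

import os
import boto3
import pandas as pd
from sqlalchemy import create_engine

s3 = boto3.client("s3")

def lambda_handler(event, context):
    # Placeholders: in a real setup the bucket/key could come from the triggering event
    bucket = os.environ.get("SOURCE_BUCKET", "sme-source-data")
    key = os.environ.get("SOURCE_KEY", "payments/payments_latest.csv")

    # Read the source file from S3 into a Pandas dataframe
    obj = s3.get_object(Bucket=bucket, Key=key)
    df = pd.read_csv(obj["Body"])

    # Light clean-up before load
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]

    # Connection details supplied via environment variables (e.g., from Secrets Manager)
    engine = create_engine(
        f"postgresql+psycopg2://{os.environ['DB_USER']}:{os.environ['DB_PASS']}"
        f"@{os.environ['DB_HOST']}:5432/{os.environ['DB_NAME']}"
    )
    df.to_sql("payments_raw", engine, schema="staging", if_exists="append", index=False)

    return {"rows_loaded": len(df)}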

Mapping the components to the process outlined in Figure 1, we get:

Figure 4: Revisiting Analytics pipeline

AWS Architecture:

Let’s take a deeper look into AWS architecture.

Figure 5: AWS-based batch data processing architecture using Serverless Lambda function and RDS database

Figure 5 adds more details to the AWS aspects of a Data Engineering pipeline. Operating on AWS requires companies to share security responsibilities such as:

1) Hosting AWS components within a VPC
2) Identifying public and private subnets
3) Ensuring IG and NAT Gateways can allow components hosted within private subnets to communicate with the internet
4) Provisioning the Database as publicly not accessible
5) Setting aside a dedicated EC2 to route web traffic to this publicly inaccessible database
6) Provisioning security groups for the EC2 instance in the public subnet, the Lambda in the private subnet, and the Database in the DB subnet
7) Provisioning subnets for app and DB tier in two different Availability Zones (AZ) to ensure (a) DB tier provisioning requirements are met, and (b) Lambda doesn’t run out of IPs when triggered

Running the pipeline:

New data is ingested by timed invocations of Lambda using CloudWatch rules. CloudWatch monitors AWS resources and invokes services at set times using cron expressions. In effect, CloudWatch can play the role of a SQL Server Job agent, triggering Lambda on a schedule. This accommodates activities with different frequencies like:

1) Refreshing sales activity (daily)
2) Operating Costs information (weekly)
3) Payment activity (biweekly)
4) Tax information (monthly)

Once the respective source file or data-refresh frequency is known, CloudWatch can trigger the Lambda function that runs the corresponding Python script (which takes data from the source, does the necessary transformations, and loads it onto a table with a known structure).
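As a sketch, the scheduled trigger can be wired up with a few boto3 calls (or, equivalently, in the console or via infrastructure-as-code). The rule name, cron schedule, and Lambda ARN below are placeholders; note that the Lambda also needs a resource-based permission allowing events.amazonaws.com to invoke it.

import boto3

events = boto3.client("events")

# Fire every day at 08:00 UTC (the cron expression is a placeholder for your own schedule)
events.put_rule(
    Name="daily-sales-refresh",
    ScheduleExpression="cron(0 8 * * ? *)",
    State="ENABLED",
)

# Point the rule at the ingestion Lambda (ARN is a placeholder)
events.put_targets(
    Rule="daily-sales-refresh",
    Targets=[{
        "Id": "sales-ingestion-lambda",
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:sales-ingestion",
    }],
)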

Moving on to Postgres: its Materialized View and SQL stored procedure features (which allow further processing) can also be invoked using a combination of Lambda and CloudWatch. This workflow is helpful for propagating freshly refreshed base data into denormalized, wide tables that store company-wide sales and operations information.
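The refresh step can itself be a small Lambda triggered by CloudWatch after the base loads complete. A sketch using psycopg2, with placeholder view and procedure names and environment-variable credentials, might look like this:

import os
import psycopg2

def lambda_handler(event, context):
    conn = psycopg2.connect(
        host=os.environ["DB_HOST"],
        dbname=os.environ["DB_NAME"],
        user=os.environ["DB_USER"],
        password=os.environ["DB_PASS"],
    )
    # Refresh denormalized, wide reporting views once the base tables are loaded
    with conn, conn.cursor() as cur:
        cur.execute("REFRESH MATERIALIZED VIEW reporting.sales_summary;")
        cur.execute("REFRESH MATERIALIZED VIEW reporting.operations_summary;")
        cur.execute("CALL reporting.build_aggregates();")  # placeholder stored procedure
    conn.close()
    return {"status": "refreshed"}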

Figure 6: An example of data flow for building aggregate metrics

Once respective views are refreshed with the latest data, we can connect to the Database using a BI tool for reporting and analysis. It’s important to remember that because we are operating on the AWS ecosystem, the Database must be provisioned as publicly inaccessible and be hosted within a private subnet. Users should only be able to reach it through a web proxy, like nginx or httpd, that is set up on an EC2 on the public subnet to route traffic within the VPC.

Figure 7: BI Connection flow to DB

Access to data can be controlled at the Database level (by granting or denying access to a specific schema) and at the connection level (by whitelisting specific IPs to allow connections and denying connect access by default).

Accuracy is the name of the game:

So you have a really secure and robust AWS architecture, well-tested Python code for Lambda executions, and a not-so-cheap BI tool subscription. Are you all set? Not really. You might just miss the bus if inaccuracy creeps into the tables during data refresh. A dashboard is only as good as the accuracy of the numbers it displays. Take extra care to ensure that the schema tables you design include the metadata columns required to identify inaccurate and duplicate data.
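One lightweight way to do this, sketched below, is to stamp metadata columns onto every dataframe during ingestion – the source file, a load timestamp, and a row hash – so that duplicate or stale records can be flagged later. Column and function names are illustrative.

import hashlib
import pandas as pd

def add_load_metadata(df: pd.DataFrame, source_file: str) -> pd.DataFrame:
    """Attach lineage/dedup metadata before the dataframe is written to the database."""
    # Hash the business columns first so the hash stays stable across reloads
    df["row_hash"] = df.astype(str).apply(
        lambda row: hashlib.md5("|".join(row).encode("utf-8")).hexdigest(), axis=1
    )
    df["source_file"] = source_file
    df["load_ts"] = pd.Timestamp.utcnow()
    return df

# Example usage inside the ingestion Lambda:
#   df = add_load_metadata(df, key)
# Duplicates can then be flagged in SQL by grouping on row_hash.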

Conclusion:

In this article, we took a narrow-angle approach to a specific Data Engineering example. We saw the Effort vs Return spectrum in the Analytics value chain and the value that can be harvested by taking advantage of the available Cloud options. We noted the value in empowering C-suite leaders and company executives with descriptive interactive dashboards.

We looked at building a specific AWS cloud-based Data Engineering pipeline that is relatively uncomplicated and can be implemented by SMEs. We went over the architecture and its different components and briefly touched on the elements of running a pipeline and finally, on the importance of accuracy in reporting and analysis.

Although we saw one specific implementation in this article, the broader point is that getting value out of an in-house Analytics pipeline is easier than it used to be, say, a decade ago. With open-source and cloud tools here to make the journey easy, it doesn’t take long to explore and exploit the value hidden in data.

