Data Engineering Archives - Tiger Analytics

What is Data Observability Used For? (27 Sep 2024)

Learn how Data Observability can enhance your business by detecting crucial data anomalies early. Explore its applications in improving data quality and model reliability, and discover Tiger Analytics' solution. Understand why this technology is attracting major investments and how it can enhance your operational efficiency and reduce costs.

Imagine you’re managing a department that handles account openings in a bank. All services seem fine, and the infrastructure seems to be working smoothly. But one day, it becomes clear that no new account has been opened in the last 24 hours. On investigation, you find that this is because one of the microservices involved in the account opening process is taking a very long time to respond.

In such a case, the data analyst examining the problem could use traces with triggers based on processing time, but there should be an easier way to spot anomalies.
Traditional monitoring involves recording the performance of the infrastructure and applications. Data observability, by contrast, lets you track your data flows and find faults in them (and may even extend to business processes). While traditional tools analyze infrastructure and applications using metrics, logs, and traces, data observability applies data analysis in a broader sense.

So, how do we tackle the case of no new accounts being created in 24 hours? The analyst could keep adding traces with time-based triggers, but data observability offers an easier way: detecting the anomaly in the data itself.

Consider another scenario: a machine learning model is used to predict future events, such as the volume of future sales, using regularly updated historical data. Because the input data may not always be of perfect quality, the model can sometimes produce inaccurate forecasts. These inaccuracies can lead to either excess inventory for the retailer or, worse, out-of-stock situations when there is consumer demand.

Classifying and Addressing Unplanned Events

The point of data observability is to identify so-called data downtime. Data downtime refers to a sudden, unplanned event in your business, infrastructure, or code that leads to an abrupt change in the data. In other words, data observability is the process of finding anomalies in data.

How can you classify these events?

  • Exceeding a given metric value or an abnormal jump in a given metric. This type is the simplest. Imagine that you add 80-120 clients every day (a confidence interval with some probability), and one day you add only 20. Something may have caused the sudden drop, and it is worth looking into (a minimal detection sketch follows this list).
  • Abrupt change in data structure. Take the client example again: everything was fine until, one day, the contact information field began to receive empty values. Perhaps something has broken in your data pipeline, and it is better to check.
  • The occurrence of a certain condition or deviation from it. Just as GPS coordinates should not show a truck in the ocean, banking transactions should not suddenly appear in unexpected locations or in unusual amounts that deviate significantly from the norm.
  • Statistical anomalies. During a routine check, the bank's analysts notice that on a particular day, the average ATM withdrawal per customer spiked to $500, which is significantly higher than the historical average.
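As a simple illustration of the first type of check, here is a minimal Python sketch (the numbers are made up for the example) that flags a daily metric falling outside its usual range using a z-score; real observability tools use more sophisticated methods that account for seasonality and trends:

import statistics

def detect_metric_anomaly(history, today, z_threshold=3.0):
    """Flag today's value if it falls outside mean +/- z_threshold * stdev of history."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    z = (today - mean) / stdev if stdev else 0.0
    return abs(z) > z_threshold, z

# Roughly 80-120 new clients per day, then suddenly only 20
daily_new_clients = [95, 102, 88, 110, 97, 105, 91, 99, 108, 93]
is_anomaly, z = detect_metric_anomaly(daily_new_clients, today=20)
print(is_anomaly, round(z, 1))  # True, with a large negative z-score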

On the one hand, it seems that there is nothing new in this approach of classifying abnormal events and taking the necessary remedial action. But on the other hand, previously there were no comprehensive and specialized tools for these tasks.

Data Observability is Essential for Ensuring Fresh, Accurate, and Smooth Data Flow

Data observability serves as a checkup for your systems. It lets you ensure your data is fresh, accurate, and flowing smoothly, helping you catch potential problems early on.

The following maps each persona to their typical WHY questions, the observability use case that answers them, and the resulting business outcome:

Persona: Business User
  • WHY questions: WHY are data quality metrics in Amber/Red? WHY is my dataset/report not accurate? WHY do I see a sudden demand for my product, and what is the root cause?
  • Observability use case: Data Quality, Anomaly Detection, and RCA
  • Business outcome: Improve the quality of insights; boost trust and confidence in decision making

Persona: Data Engineers/Data Reliability Engineers
  • WHY questions: WHY is there data downtime? WHY did the pipeline fail? WHY is there an SLA breach in data freshness?
  • Observability use case: Data Pipeline Observability, Troubleshooting, and RCA
  • Business outcome: Better productivity; faster MTTR; enhanced pipeline efficiency; intelligent triaging

Persona: Data Scientists
  • WHY questions: WHY are the model predictions not accurate?
  • Observability use case: Data Quality Model
  • Business outcome: Improve model reliability

Tiger Analytics’ Continuous Observability Solution

Our solution provides continuous monitoring and alerting of potential issues (gathered from various sources) before a customer or operations team reports them. It consists of a set of tools, patterns, and practices to build data observability components for your big data workloads on a cloud platform and reduce data downtime.

[Image: select examples of our experience in data observability and quality]

Tools and Technology

[Diagram: Tiger Analytics data observability tools and technology]

Tiger Analytics Data Observability is a set of tools, patterns, and best practices to:

  • Ingest MELT (Metrics, Events, Logs, Traces) data
  • Enrich and store MELT data to derive insights on event and log correlations, data anomalies, pipeline failures, and performance metrics
  • Configure data quality rules using a self-service UI
  • Monitor operational metrics like data quality, pipeline health, and SLAs
  • Alert the business team when there is data downtime (a minimal sketch of such a check follows the lists below)
  • Perform root cause analysis
  • Fix broken pipelines and data quality issues

Which will help:

  • Minimize data downtime using automated data quality checks
  • Discover data problems before they impact the business KPIs
  • Accelerate Troubleshooting and Root Cause Analysis
  • Boost productivity and reduce operational cost
  • Improve Operational Excellence, QoS, Uptime
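To make the idea of a data downtime alert concrete, here is a minimal, hypothetical Python sketch of a freshness (SLA) check; the SLA value and the alert routing are assumptions, and a real deployment would publish the alert to a monitoring or incident tool rather than print it:

from datetime import datetime, timedelta, timezone

def freshness_alert(last_arrival, sla_minutes=60):
    """Return an alert message if no new data has landed within the SLA window."""
    lag = datetime.now(timezone.utc) - last_arrival
    if lag > timedelta(minutes=sla_minutes):
        return (f"DATA DOWNTIME: last arrival {last_arrival.isoformat()} is "
                f"{int(lag.total_seconds() // 60)} min old (SLA {sla_minutes} min)")
    return None

# Example: the last file landed three hours ago, so the check raises an alert
alert = freshness_alert(datetime.now(timezone.utc) - timedelta(hours=3))
if alert:
    print(alert)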

Data observability and Generative AI (GenAI) can play crucial roles in enhancing data-driven decision-making and machine learning (ML) model performance.

Data observability primes the pump by instilling confidence through high-quality, always-available data, which forms the foundation for any data-driven initiative, while GenAI helps realize what is achievable with that data, opening up new avenues to simulate, generate, and innovate. Organizations can use both to improve their data capabilities, decision-making processes, and innovation across different areas.

It is no surprise, then, that vendors in the data observability space have attracted significant investment: Monte Carlo, a company that produces a data monitoring tool, raised $135 million, Observe raised $112 million, and Acceldata raised $100 million.

To summarize

Data observability is an approach to identifying anomalies in business processes and in the operation of applications and infrastructure, allowing users to respond quickly to emerging incidents. It lets you ensure your data is fresh, accurate, and flowing smoothly, helping you catch potential problems early on.

And if there is no particular novelty in the underlying technology, there is certainly novelty in the approach, tools, and terminology, which make it easier to convince investors and clients. The next few years will show how successful the new players in this market will be.

References

https://www.oreilly.com/library/view/data-observability-for/9781804616024/
https://www.oreilly.com/library/view/data-quality-fundamentals/9781098112035/

In Digital, We Trust: A Deep Dive into Modern Data Privacy Practices (12 Jul 2023)

Explore the interplay between data utilization and privacy in fostering digital trust. Uncover key measures like Data Classification and Encryption, and compare encryption practices on AWS and GCP. Real-world scenarios illustrate applied privacy considerations in tech-driven exchanges.
How can leaders make better, more trustworthy decisions regarding technology? According to the World Economic Forum, that’s where Digital Trust can help steer both companies and customers toward a win-win outcome. 

“Privacy serves as a requirement to respect individuals’ rights regarding their personal information and a check on organizational momentum towards processing personal data autonomously and without restriction. A focus on this goal ensures that organizations can unlock the benefits and value of data while protecting individuals from the harms of privacy loss. It effectuates inclusive, ethical, and responsible data use – or digital dignity – by ensuring that personal data is collected and processed for a legitimate purpose(s) (e.g., consent, contractual necessity, public interest, etc.)”

 
The challenge with today's technology is gathering insights that support better decisions while still complying with privacy regulations. With privacy regulations enforcing principles like 'Right to be Forgotten,' 'Privacy by Design,' and 'Data Portability,' a checkbox approach to Data Privacy is not sustainable. A paradigm shift in the privacy mindset is necessary to mitigate the risks arising from the use of disruptive technologies.

Organizations need to keep in mind the following aspects of the Data privacy approach while building privacy considerations into their software or services:

Data Classification – Categorizing data as confidential, sensitive, sensitive but requiring re-use, sensitive with no re-use, etc., based on a data classification methodology.

Anonymization – A process of data de-identification that produces data where individual records cannot be linked back to the original, as they do not include the translation variables required to do so. This is an irreversible process.

Re-Identification – The process of re-identifying de-identified data using the referential values, applying the same methods used for de-identification.

Format Preserving Encryption – The process of transforming data in such a way that the output is in the same format as the input using cipher keys and algorithms. 

De-Identification – The process of removing or obscuring any personally identifiable information from individual records in a way that minimizes the risk of unintended disclosure of the identity of individuals and information about them. This process can be reversible: it may retain the translation variables required to link the data back to the original using different mechanisms.

Encryption – The process of transforming data using cipher keys and algorithms to make it unreadable cipher text to anyone except those possessing a key.  Restoring the data needs both algorithm and cipher keys. 

The Data Privacy Approach – From Theory to Implementation

At Tiger Analytics, we built a globally scaled Data and Analytics platform for a US-based Life Sciences organization, with data discovery and classification as a core component, using AWS Macie, which classifies data based on content, regex, file extension, and PII classifiers. Sensitive data elements were then protected using techniques such as (a simplified sketch follows this list):

  • Masking sensitive data by partially or fully replacing characters with symbols, such as an asterisk (*) or hash (#).
  • Replacing each instance of sensitive data with a token, or surrogate, string.
  • Encrypting and replacing sensitive data using a randomly generated or pre-determined key.
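As a purely illustrative sketch of the first two techniques, the Python below shows simple masking and tokenization; the values and the salt are made up, and a production system would use a vetted tokenization or format-preserving encryption product rather than a hand-rolled hash:

import hashlib

def mask(value, keep_last=4, symbol="*"):
    """Masking: replace all but the last few characters with a symbol."""
    return symbol * max(len(value) - keep_last, 0) + value[-keep_last:]

def tokenize(value, salt="<per-BU-secret-salt>"):
    """Tokenization: replace the value with a deterministic surrogate string."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:16]

print(mask("+15550101234"))      # ********1234
print(tokenize("+15550101234"))  # a 16-character surrogate token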

 

By leveraging best-in-class encryption and masking solutions, we were able to protect sensitive data elements across our clients' hyperscaler/cloud-native environments and those of their customers.

Deep diving into De-Identification

Different De-identification methods are described in the table below with an example:

Understanding the Encryption Approach

Data Privacy Encryption is achieved through leveraging a new encryption technique, FPE (Format Preserving Encryption), which preserves the format of the sensitive data fields while providing an advanced encryption standard level of encryption strength.

In a typical scenario, data from the source systems will land on the landing zone over a secure channel, and data encryption, masking, and other security measures will be applied, depending on data structure and other library integrations.

Data encryption involves the installation and configuration of 3rd party encryption and key management solution on the Cloud (AWS/GCP) platform. Encryption keys are stored and managed within the Key Management Server, and only authorized users/resources are granted access to the same.

In the case of structured data, sensitive data (specific PII / PHI attributes) will be encrypted without altering the original format. This will be done using the 3rd party Format Preserving Encryption (FPE) solution to preserve business value and referential integrity across distributed data sets when the data moves from un-trusted to a trusted zone at the landing layer.  In the case of unstructured data, either the entire file or specific PII/PHI data will be encrypted during the transition from un-trusted to trusted zone on a case-by-case basis, based on the requirements.

Users requiring access to sensitive data (PII/PHI) will be made part of the relevant IAM roles, user credentials will be validated against the IAM, and the data will be transparently decrypted using the keys. Validated users will be able to access the sensitive data in the clear, whereas users who do not have the necessary privileges will see the data only in an encrypted format.

Data Encryption on AWS and GCP – How they Differ

Whenever data is written to the storage platform, AWS will apply encryption on it, and conversely, when the data is read from the storage, decryption will happen transparently. In addition to encryption of specific PII/PHI data, AWS native transparent encryption & KMS features shall be leveraged to protect the data in the AWS cloud. These features will provide protection to data stored in any potential storage mechanism, such as S3, Kinesis, Redshift, Dynamo DB, etc. 
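As a hedged illustration of combining explicit KMS encryption with S3 server-side encryption, the Python (boto3) sketch below uses hypothetical key aliases, bucket, and object names, and assumes AWS credentials are already configured:

import boto3

kms = boto3.client("kms")
s3 = boto3.client("s3")

# Explicitly encrypt a small sensitive value with a customer-managed key
encrypted = kms.encrypt(KeyId="alias/<data-lake-key>", Plaintext=b"+1-555-010-1234")
restored = kms.decrypt(CiphertextBlob=encrypted["CiphertextBlob"])["Plaintext"]
assert restored == b"+1-555-010-1234"

# Request server-side encryption with the same KMS key when landing a file in S3
s3.put_object(
    Bucket="<landing-bucket>",
    Key="landing/customers.csv",
    Body=b"id,phone\n1,+1-555-010-1234\n",
    ServerSideEncryption="aws:kms",
    SSEKMSKeyId="alias/<data-lake-key>",
)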

Google adds differential privacy to Google SQL for BigQuery, building on the open-source differential privacy library that is used by Ads Data Hub and the COVID-19 Community Mobility Reports. Differential privacy is an anonymization technique that limits the personal information that is revealed by an output. Differential privacy is commonly used to allow inferences and to share data while preventing someone from learning information about an entity in that dataset.

With BigQuery differential privacy, we can (a hedged query example follows this list):
  • Anonymize results with individual-record privacy.
  • Anonymize results without copying or moving your data, including data from AWS and Azure with BigQuery Omni.
  • Anonymize results that are sent to Dataform pipelines so that they can be consumed by other applications.
  • Anonymize results that are sent to Apache Spark stored procedures.
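For instance, a hedged sketch of running a differentially private aggregation from Python; the project, dataset, and column names are hypothetical, and the differential-privacy clause follows the GoogleSQL syntax documented for BigQuery, so verify the exact options against the current documentation:

from google.cloud import bigquery

client = bigquery.Client()  # assumes application-default credentials

sql = """
SELECT
  WITH DIFFERENTIAL_PRIVACY
    OPTIONS (epsilon = 1.0, delta = 1e-5, privacy_unit_column = customer_id)
  country,
  AVG(withdrawal_amount) AS avg_withdrawal
FROM `my_project.banking.atm_withdrawals`
GROUP BY country
"""

for row in client.query(sql).result():
    print(row)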

Let’s take the example of a person opening a bank account on a web portal. They have to fill in their age, telephone number, and country. Let’s look at how their Data Privacy can be protected while gathering the necessary information.  

Points to note:
  • The age field is sensitive, and its actual value is usually not required for any analysis/processing by downstream systems.
  • The telephone number is sensitive in nature, and its value in the same format (not actual) is required by the data analytics platform for further analysis. The actual value of the telephone number is required only on-premise.
  • Country value is not classified as sensitive; however, its value is encrypted while sending the data to the cloud.
  • Data is transmitted between two systems on-premise, on cloud and on-premise to cloud or cloud to on-premise over HTTPS.
  • BU-specific encryption keys are used for encryption while moving to the cloud. Data gets decrypted on the premise using the BU-specific keys.
  • On-Cloud data at rest is implemented using the cloud provider’s key by applying the techniques of transparent DB encryption, Volume encryption, or Disk encryption.
  • Required governance controls at process (for example, approvals for access), people (for example, trainings, background checks, etc.), and technical tools (for example, authentication and access control) are created on-premise and as well as cloud.
  • Based on the data classification, the age value gets anonymized. This is a one-way process.
  • The Telephone number gets De-identified using the de-identification method or algorithm by preserving the referential value.
  • The country value is encrypted using a format-preserving encryption algorithm and a BU-specific encryption key.
  • Then age (anonymized), telephone number (de-identified), and country (encrypted) will be sent to Data Analytics Platform on cloud over HTTPS.
  • The data then gets stored in the cloud platform in an encrypted format using the cloud provider’s server-side encryption keys.
  • The data at rest on cloud is always in an encrypted format by using the cloud provider’s features like Transparent DB encryption, Volume Encryption, and/or Disk encryption techniques.
  • If the data needs to be processed by the Analytics platform, it is first decrypted using the cloud provider's key and then processed in full.
  • Once processing is completed, the data again gets encrypted and stored on Cloud.
  • The decryption and re-identification techniques are applied on-premise to retrieve the original values to be consumed by other applications such as call center, ESB, etc.

Final thoughts

“Digital trust is individuals' expectation that digital technologies and services – and the organizations providing them – will protect all stakeholders' interests and uphold societal expectations and values.” Ensuring the right privacy considerations, transparent communication, and intent will go a long way in building a mutually trustworthy exchange between organizations and individuals.

Sources:

The Digital Trust report: https://initiatives.weforum.org/digital-trust/about

https://fpf.org/blog/a-visual-guide-to-practical-data-de-identification/ 

https://docs.aws.amazon.com/whitepapers/latest/logical-separation/encrypting-data-at-rest-and-in-transit.html

https://cloud.google.com/blog/products/data-analytics/introducing-bigquery-differential-privacy-with-tumult-labs?utm_source=twitter&utm_medium=unpaidsoc&utm_campaign=fy23q2-googlecloudtech-blog-data-in_feed-no-brand-global&utm_content=-&utm_term=- 

https://github.com/priyankavergadia/GCPSketchnote 

A Comprehensive Guide: Optimizing Azure Databricks Operations with Unity Catalog (21 Jun 2023)

Learn how Unity Catalog in Azure Databricks simplifies data management, enabling centralized metadata control, streamlined access management, and enhanced data governance for optimized operations.
For data engineers and admins, making sure that all operations run smoothly is a priority.

That's where Unity Catalog can help ensure that stored information is managed correctly, especially for those working with Azure Databricks. Unity Catalog (UC) is a powerful metadata management and governance layer for the Databricks Lakehouse. It provides a centralized location to manage metadata for the data stored in Delta Lake and helps simplify data management by providing a unified view of data across different data sources and formats.

Before Unity Catalog, every ADB (Azure Databricks) workspace had its own metastore, user management, and access controls, which led to duplication of efforts when maintaining consistency across all workspaces. To overcome these challenges, Databricks developed Unity Catalog, a unified governance solution for data and AI assets on the Lakehouse. Unity Catalog provides centralized access control, auditing, lineage, and data discovery capabilities across Azure Databricks workspaces.

At Tiger Analytics, we've enabled Unity Catalog for clients on new Databricks deployments and upgraded existing Hive metastores to Unity Catalog so they can leverage all the benefits it provides.


Making the Most of Unity Catalog’s Key Features

Centralized metadata and user management

Unity Catalog provides a centralized metadata layer to enable sharing data objects such as catalogs/ schema/ tables across multiple workspaces. It introduces two new built-in admin roles (Account Admins and Metastore Admins) to manage key features. 

  • Account Admin: manages account-level resources like metastore, assigns metastore to workspaces, and assigns principals to the workspace.
  • Metastore Admin: manages metastore objects and grants identities access to securable objects (catalog/ schema/ tables/ views).

Centralized data access controls
Unity Catalog permits the use of Standard SQL-based commands to provide access to data objects.

GRANT USE CATALOG ON CATALOG <catalog_name> TO <group_name>;

GRANT USE SCHEMA ON SCHEMA <catalog_name>.<schema_name> TO <group_name>;

GRANT SELECT ON <catalog_name>.<schema_name>.<table_name> TO <group_name>;

Data lineage and data access auditing

Unity Catalog automatically captures user-level audit logs that record access to user data. It also captures lineage data that tracks how data assets are created and used across all languages and personas.

Data search and discovery

Unity Catalog lets you tag and document data assets and provides a search interface to help data consumers find data.

Delta Sharing

Unity Catalog allows users in Databricks to share data securely outside the organization, which can be managed, governed, audited, and tracked.


Managing Users and Access Control

  • Account admins can sync users (groups) to workspaces from Azure Active Directory (Azure AD) tenant to Azure Databricks account using a SCIM provisioning connector.
  • Azure Databricks recommends using account-level SCIM provisioning to create, update, and delete all users (groups) from the account.


Unity Catalog Objects


Metastore 

A metastore is the top-level container of objects in Unity Catalog. It stores data assets (tables and views) and the permissions that govern access to them. The UC metastore is mapped to an ADLS container, which stores the Unity Catalog metastore's metadata and managed tables. You can only create one UC metastore per region, and each workspace can be attached to only one UC metastore at any point in time. Unity Catalog has a three-tier structure (catalog.schema.table/view) for referencing objects.
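As a quick sketch of the three-level namespace, the snippet below runs Unity Catalog SQL from a notebook on a UC-enabled cluster; the catalog, schema, and table names are hypothetical, and spark is the session Databricks provides:

# Three-level namespace: <catalog>.<schema>.<table>
spark.sql("CREATE CATALOG IF NOT EXISTS sales_catalog")
spark.sql("CREATE SCHEMA IF NOT EXISTS sales_catalog.bronze")
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_catalog.bronze.orders (
        order_id BIGINT,
        order_ts TIMESTAMP,
        amount   DOUBLE
    )
""")
spark.sql("SELECT COUNT(*) AS order_count FROM sales_catalog.bronze.orders").show()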

External Location and Storage Credential

  • A storage credential, created either as a managed identity or a service principal, provides access to the underlying ADLS path.
  • The storage credential (managed identity/service principal) should be authorized to the external storage account location by granting an IAM role at the storage account level.
  • An external location is an object that combines a cloud storage path with a storage credential to authorize access to that cloud storage path (a hedged sketch follows this list).
  • Each cloud storage path can be associated with only one external location. If you attempt to create a second external location that references the same path, the command fails.
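For illustration, assuming a storage credential named adls_cred already exists (created by an account or metastore admin), an external location might be declared and granted like this; the names and the ADLS path are placeholders, and the syntax follows the Unity Catalog SQL reference:

# Declare an external location on top of an existing storage credential
spark.sql("""
    CREATE EXTERNAL LOCATION IF NOT EXISTS raw_landing
    URL 'abfss://raw@<storage-account>.dfs.core.windows.net/landing'
    WITH (STORAGE CREDENTIAL adls_cred)
""")

# Allow a group to read files directly from that location
spark.sql("GRANT READ FILES ON EXTERNAL LOCATION raw_landing TO `data_engineers`")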


Managed and External Tables

  • Unity Catalog manages the lifecycle of managed tables. This means that if you drop a managed table, both its metadata and its data are dropped.
  • By default, the UC metastore ADLS container (the root storage location) stores the managed tables' data as well, but you can override this default location at the catalog or schema level. Managed tables are in Delta format only.
  • External tables are tables whose data is stored outside of the managed storage location specified for the metastore, catalog, or schema. Dropping them only deletes the table's metadata (see the sketch after this list).
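A brief, hypothetical sketch of the difference; the table names and the ADLS path are placeholders:

# Managed table: Unity Catalog owns both metadata and data (Delta format);
# DROP TABLE removes the underlying data files as well.
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_catalog.bronze.customers_managed (
        customer_id BIGINT,
        name        STRING
    )
""")

# External table: the data stays at the external path; DROP TABLE removes only metadata.
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_catalog.bronze.customers_external (
        customer_id BIGINT,
        name        STRING
    )
    USING DELTA
    LOCATION 'abfss://raw@<storage-account>.dfs.core.windows.net/landing/customers'
""")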

How to Create a UC Metastore and Link Workspaces

This flow diagram explains the sub-tasks needed to create a Metastore.

[Flow diagram: sub-tasks to create a UC metastore and link workspaces]

Step 1: Create an ADLS storage account and container

This Storage account container will store Unity Catalog metastore’s metadata and managed tables.

Step 2: Create an access connector for Databricks 

Create Access Connector for Azure Databricks, and when deployment is done, make a note of Resource ID.

Step 3: Provide RBAC to access the connector

Add role assignment: Storage Blob Data Contributor to the managed identity (access connector) in Step#2

Step 4: Create metastore and assign workspaces


Once a UC metastore has been attached to a workspace, it will be visible under the workspace Data tab:

[Screenshot: Unity Catalog metastore shown in the workspace Data tab]

If Unity Catalog is enabled for an existing workspace that had tables stored under the hive_metastore catalog, those existing tables can be upgraded using the SYNC command or the UI, or they can still be accessed as hive_metastore.<schema_name>.<table_name>.
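For example, a hedged sketch of upgrading a single Hive metastore table with SYNC; the catalog, schema, and table names are placeholders, and you should check the SYNC documentation for the options supported on your runtime:

# Upgrade one external Hive metastore table into Unity Catalog
spark.sql("""
    SYNC TABLE sales_catalog.bronze.orders
    FROM hive_metastore.bronze.orders
""").show()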

Enabling Unity Catalog as part of a lakehouse architecture provides a centralized metadata layer for enterprise-level governance without sacrificing the ability to manage and share data effectively. It also helps in planning workspace deployments with platform limits in mind, eliminating the risk of being unable to share data or govern the project. With Unity Catalog, we can overcome the limitations and constraints of the existing Hive metastore, enabling better collaboration and leveraging the power of data according to specific business needs.

Enabling Cross Platform Data Observability in Lakehouse Environment (13 Jun 2023)

Dive into data observability and its pivotal role in enterprise data ecosystems. Explore its implementation in a Lakehouse environment using Azure Databricks and Purview, and discover how this integration fosters seamless data management, enriched data lineage, and quality monitoring, empowering informed decision-making and optimized data utilization.
Imagine a world where organizations effortlessly unlock their data ecosystem’s full potential as data lineage, cataloging, and quality seamlessly flow across platforms. As we rely more and more on data, the technology for uncovering valuable insights has grown increasingly nuanced and complex. While we’ve made significant progress in collecting, storing, aggregating, and visualizing data to meet the needs of modern data teams, one crucial factor defines the success of enterprise-level data platforms — Data observability.

Data observability is often conflated with data monitoring, and it’s easy to see why. The two concepts are interconnected, blurring the lines between them. However, data monitoring is the first step towards achieving true observability; it acts as a subset of observability.

Some of the industry-level challenges are:

  • The proliferation of data sources with varying tools and technologies involved in a typical data pipeline diminishes the visibility of the health of IT applications.
  • Data is consumed in various forms, making it harder for data owners to understand the data lineage.
  • The complexity of debugging pipeline failures poses major hurdles with a multi-cloud data services infrastructure.
  • Nonlinearity between creation, curation, and usage of data makes data lineage tough to track.

What is Data Observability?

To grasp the concept of data observability, the first step is to understand what it entails. Data observability examines the health of enterprise data environments by focusing on:

  • Design Lineage: Providing contextual data observations such as the job’s name, code location, version from Git, environment (Dev/QA/Prod), and data source metadata like location and schema.
  • Operational Lineage: Generating synchronous data observations by computing metrics like size, null values, min/max, cardinality, and more custom measures like skew, correlation, and data quality validation. It also includes usage attributes such as infrastructure and resource information.
  • Tracing and Continuous Validation: Generating data observations with continuously validated data points and sources for efficient tracing. It involves business thresholds, the absence of skewed categories, input and output data tracking, lineage, and event tracing.

Implementing Data Observability in Your Lakehouse Environment

Benefits of Implementation

  • Capturing critical metadata: Observability solutions capture essential design, operational, and runtime metadata, including data quality assertions.
  • Seamless integration: Technical metadata, such as pipeline jobs, runs, datasets, and quality assertions, can seamlessly integrate into your enterprise data governance tool.
  • End-to-end data lineage: Gain insights into the versions of pipelines, datasets, and more by establishing comprehensive data lineage across various cloud services.

Essential Elements of Data Observability Events

  • Job details: Name, owner, version, description, input dependencies, and output artifacts.
  • Run information: Immutable version of the job, event type, code version, input dataset, and output dataset.
  • Dataset information: Name, owner, schema, version, description, data source, and current version.
  • Dataset versions: Immutable versions of datasets.
  • Quality facets: Data quality rules, results, and other relevant quality facets.

Implementation Process

A robust and holistic approach to data observability requires a centralized interface to the data. End-to-end data observability therefore consists of implementing the four layers below in any Lakehouse environment.

  • Observability Agent: A listener setup that depends on the sources/data platform.
  • Data Reconciliation API: An endpoint for transforming the event to fit into the data model.
  • Metadata Repository: A data model created in a relational database.
  • Data Traceability Layer: A web-based interface or existing data governance tool.

[Diagram: data observability implementation process]

By implementing these four layers and incorporating the essential elements of data observability, organizations can achieve improved visibility, traceability, and governance over their data in a Lakehouse environment.

The core data model for data observability prioritizes immutability and timely processing of datasets, which are treated as first-class values generated by job runs. Each job run is associated with a versioned code and produces one or more immutable versioned outputs. Changes to datasets are captured at various stages during job execution.

 

[Diagram: data observability data model and structure]

Technical Architecture: Observability in a Modern Platform

The depicted technical architecture exemplifies the implementation of a data observability layer for a modern medallion-based data platform. It helps enable a data observability layer for an enterprise-level data platform that collects and correlates metrics.

[Diagram: data observability technical architecture]

In data observability, this robust architecture effectively captures and analyzes critical metadata. Let’s dive into the technical components that make this architecture shine and understand their roles in the observability layer.

OpenLineage Agent

This observability agent bridges data sources, processing frameworks, and the observability layer. Its mission is to communicate seamlessly, ingesting custom facets to enhance event understanding. The OpenLineage agent’s compatibility with a wide range of data sources, processing frameworks, and orchestration tools makes it remarkable. In addition, it offers the flexibility needed to accommodate the diverse technological landscape of modern data environments.

OpenLineage API Server

As the conduit for custom events, the OpenLineage API Server allows ingesting these events into the metadata repository.

Metadata Repository

The metadata repository is at the heart of the observability layer. This data model, carefully crafted within a relational data store, captures essential information such as jobs, datasets, and runs.

Databricks

Azure Databricks offers a powerful data processing engine with various types of clusters. Setting up the OpenLineage agent on a Databricks cluster makes it possible to capture dataset lineage-tracking events from the data processing jobs triggered in the workspace.

Azure Data Factory

With its powerful data pipeline orchestration capabilities, Azure Data Factory (ADF) takes center stage. ADF enables the smooth flow of data pipeline orchestration events, seamlessly sending them to the OpenLineage API. ADF seamlessly integrates with the observability layer, further enhancing data lineage tracking.

Great Expectations

Quality is of paramount importance in any data-driven ecosystem. Great Expectations ensures that quality facets are seamlessly integrated into each dataset version. Also, by adding custom facets through the OpenLineage API, Great Expectations fortifies the observability layer with powerful data quality monitoring and validation capabilities.

EventHub

As the intricate events generated by the OpenLineage component must be seamlessly integrated into the Apache Atlas API from Purview, EventHub takes center stage. As an intermediate queue, EventHub diligently parses and prepares these events for further processing, ensuring smooth and efficient communication between the observability layer and Purview.

Function API

To facilitate this parsing and preparation process, Azure Functions come into play. Purpose-built functions are created to handle the OpenLineage events and transform them into Atlas-supported events. These functions ensure compatibility and coherence between the observability layer and Purview, enabling seamless data flow.

Purview

Finally, we have Purview, the ultimate destination for all lineage and catalog events. Purview’s user interface becomes the go-to hub for tracking and monitoring the rich lineage and catalog events captured by the observability layer. With Purview, users can gain comprehensive insights, make informed decisions, and unlock the full potential of their data ecosystem.

Making Observability Effective on the ADB Platform

At Tiger Analytics, we’ve worked with a varied roster of clients, across sectors to help them achieve better data observability. So, we crafted an efficient solution that bridges the gap between Spark operations in Azure Databricks and Azure Purview. It transfers crucial observability events, enabling holistic data management. This helps organizations thrive with transparent, informed decisions and comprehensive data utilization.

The relationship is simple. Azure Purview and Azure Databricks complement each other. Azure Databricks offers powerful data processing and collaboration, while Azure Purview helps manage and govern data assets. Integrating them allows you to leverage Purview’s data cataloging capabilities to discover, understand, and access data assets within your Databricks workspace.


How did we implement the solution? Let’s dive in and find out.

Step 1: Setting up the Environment: We began by configuring the Azure Databricks environment, ensuring the right runtime version was in place. To capture observability events, we attached the OpenLineage jar to the cluster, laying a solid foundation for the journey ahead.

Step 2: Cluster Configuration: Smooth communication between Azure Databricks and Azure Purview was crucial. To achieve this, we configured the Spark settings at the cluster level, creating a bridge between the two platforms. By specifying the OpenLineage host, namespace, custom app name, version, and extra listeners, we solidified this connection.

Sample code snippet:

spark.openlineage.host https://<<<host-ip>>>:5000
spark.openlineage.namespace <<<namespace>>>
spark.app.name <<<custom-app-name>>>
spark.openlineage.version v1
spark.extraListeners io.openlineage.spark.agent.OpenLineageSparkListener

Step 3: Spark at Work: With Spark’s power, the OpenLineage listeners came into action, capturing the Spark logical plan. This provided us with a comprehensive view of data operations within the cluster.

Step 4: Enter the Service Account: This account, created using a service principal, took center stage in authenticating the Azure Functions app and Azure Purview. Armed with owner/contributor access, this service account became the key to a seamless connection.

Step 5: Azure Purview Unleashed: To unlock the full potential of Azure Purview, we created an Azure Purview service. Within Purview Studio, we assigned the roles of data curator, data source administrator, and collection admin to the service account. This granted the necessary permissions for a thrilling data management adventure.

Step 6: Seamless Deployment: Leveraging the deployment JSON provided in the OpenLineage GitHub repository, we embarked on a smooth AZ deployment. This process created essential services such as storage accounts, blob services, server farms, and websites, laying the foundation for a robust data lineage and cataloging experience –

Microsoft.Storage/storageAccounts
Microsoft.Storage/storageAccounts/blobServices/containers
Microsoft.Web/serverfarms
Microsoft.Web/sites
olToPurviewMappings
Microsoft.EventHub/namespaces
Microsoft.EventHub/namespaces/eventhubs
Microsoft.KeyVault/vaults
Microsoft.KeyVault/vaults/secrets

Step 7: Access Granted: An authentication token was added, granting seamless access to the Purview API. This opened the door to a treasure trove of data insights, empowering us to delve deeper into the observability journey.

Step 8: Spark and Azure Functions United: In the final step, we seamlessly integrated Azure Databricks with Azure Functions. By adding the Azure Function App URL and key to the Spark properties, a strong connection was established. This enabled the capture of observability events during Spark operations, effortlessly transferring them to Azure Purview, resulting in a highly effective data lineage.

Sample code snippet:

spark.openlineage.host https://<functions-app-name>.azurewebsites.net
spark.openlineage.url.param.code <function-app-host-key>

By following the steps outlined above, our team successfully provided a comprehensive and highly effective data lineage and observability solution. By linking parent and child job IDs between Azure Data Factory (ADF) and Databricks, this breakthrough solution enabled correlation among cross-platform data pipeline executions. As a result, the client could leverage accurate insights that flowed effortlessly. This empowered them to make informed decisions, ensure data quality, and unleash the true power of data.


Extending OpenLineage Capabilities Across Data Components

Enterprises require data observability across multiple platforms. OpenLineage, a powerful data observability solution, offers out-of-the-box integration with various data sources and processing frameworks. However, what if you want to extend its capabilities to cover other data platform components? Let’s explore two simple methodologies to seamlessly integrate OpenLineage with additional data platforms, enabling comprehensive observability across your entire data ecosystem:

1. Custom events through the OpenLineage API: Maintain a custom function that generates OpenLineage-supported JSON for observability events, and call it with the required parameters wherever an event needs to be logged (a hedged sketch follows this list).

2. Leveraging API provided by target governance portal: Another option is to leverage the APIs provided by target governance portals. These portals often offer APIs for Data observability event consumption. By utilizing these APIs, you can extend OpenLineage’s solution to integrate with other data platforms. For example, Azure Purview has an API enabled with Apache Atlas for event ingestion. You can use Python packages such as PyApacheAtlas to create observability events in the format supported by the target API.
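To illustrate the first option, here is a minimal, hypothetical Python sketch that builds an OpenLineage RunEvent and posts it to a Marquez-style /api/v1/lineage endpoint; the host, namespace, job, and dataset names are placeholders:

import uuid
from datetime import datetime, timezone

import requests

OPENLINEAGE_URL = "https://<host>:5000/api/v1/lineage"  # collector endpoint (placeholder host)

def emit_openlineage_event(job_name, event_type, inputs=None, outputs=None,
                           namespace="custom-platform"):
    """Build a minimal OpenLineage RunEvent and post it to the collector."""
    event = {
        "eventType": event_type,  # START, COMPLETE, FAIL, ...
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "producer": "https://example.com/custom-observability-agent",
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": namespace, "name": job_name},
        "inputs": inputs or [],
        "outputs": outputs or [],
    }
    requests.post(OPENLINEAGE_URL, json=event, timeout=10).raise_for_status()

emit_openlineage_event(
    "nightly_customer_load", "COMPLETE",
    outputs=[{"namespace": "adls", "name": "bronze.customers"}],
)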

Data observability continues to be an important marker in evaluating the health of enterprise data environments. It provides organizations with a consolidated source of technical metadata, including data lineage, execution information, data quality attributes, dataset and schema changes generated by diverse data pipelines, and operational runtime metadata. This helps operational teams conduct precise RCA.

As various data consumption tools are in demand, along with the increasing use of multi-cloud data platforms, the data observability layer should be platform-agnostic and effortlessly adapt to available data sources and computing frameworks.

Sources:

https://www.usgs.gov/faqs/what-are-differences-between-data-dataset-and-database#:~:text=A%20dataset%20is%20a%20structured,data%20stored%20as%20multiple%20datasets

https://learn.microsoft.com/en-us/azure/purview/register-scan-azure-databricks 

https://learn.microsoft.com/en-us/samples/microsoft/purview-adb-lineage-solution-accelerator/azure-databricks-to-purview-lineage-connector/ 

https://learn.microsoft.com/en-us/azure/databricks/introduction/ 

https://www.linkedin.com/pulse/what-microsoft-azure-purview-peter-krolczyk/

Unleash the Full Potential of Data Processing: A Roadmap to Leveraging Databricks (07 Jun 2023)

Efficient data processing is vital for organizations in today's data-driven landscape. The data ingestion service Databricks Auto Loader streamlines the complex data loading process, saving time and resources. Learn how Tiger Analytics used Databricks to manage massive file influx and enable near real-time processing, enhancing data quality and accelerating decision-making.
Scenario # 1:

Thousands of files flood the data lake every day. These files are dumped in parallel by the source system, resulting in a massive influx of data.

Scenario #2:

There’s a continuous influx of incremental data from a transactional table in SAP. Every 15 minutes, a massive file containing millions of records has to be extracted and sent to the data lake landing zone. This critical dataset is essential for the business, but the sheer volume of data and the complexity of the transactional system poses significant challenges.

How would you tackle these situations?

In today’s data-driven world, organizations heavily rely on efficient data processing to extract valuable insights. Streamlined data processing directly impacts decision-making – enabling them to unlock hidden patterns, optimize operations, and drive innovation. But often, as businesses keep growing, they are faced with the uphill task of managing data velocity, variety, and volume.

Can data ingestion services help simplify the data loading process, so that business teams can focus on analyzing the data rather than managing the intricate loading process? 

Leveraging Databricks to elevate data processing

The process of collecting, transforming, and loading data into a data lake can be complex and time-consuming. At Tiger Analytics, we’ve used Databricks Auto Loader across our clients and various use cases to make the data ingestion process hassle-free.

Here’s how we tackled the two problem statements for our clients:

Scenario 1: Multiple File Ingestion Based on Control File Trigger

Thousands of files flooded our client’s data lake daily. The source system would dump them in parallel, resulting in a massive influx of data. Then, to indicate the completion of the extraction process, the source system dropped a control file named ‘finish.ctrl’. The primary challenge was to trigger the ingestion process based on this control file and efficiently load all the files dropped by the source system.

The challenges:

  • Large number of files: The daily extract consisted of a staggering 10,000 to 20,000 text files, making manual processing impractical and time-consuming.
  • Volume of records: Each file contained hundreds of thousands of records, further complicating the data processing task.
  • Timely refresh of silver and gold layers: The business required their Business Intelligence (BI) reports to be refreshed within an hour, necessitating a streamlined and efficient data ingestion process.
  • Duplicate file loading: In cases where the extraction process failed at the source, the entire process would start again, resulting in the redundant loading of previously processed files.

How we effectively used Databricks to streamline the ingestion process:

We used Databricks Auto Loader to automate the detection and ingestion of thousands of files. As a result, the team saw increased efficiency, improved data quality, and faster data processing times, transforming their entire data ingestion process.

The implementation involved the following steps:

  1. Setting up a data factory orchestrator: The team leveraged Azure Data Factory as an orchestrator to trigger a Databricks notebook based on the event trigger. Specifically, they configured the event trigger to be activated when the source system dropped the control file ‘finish.ctrl’.
  2. Configuring the Auto Loader notebook: The team coded a Databricks notebook to run Auto Loader with the trigger once option. This configuration ensured that the notebook would run once, ingesting all the files into the bronze table before the stream automatically terminated.

Sample code snippet:

# Read all new files from the landing path with Auto Loader
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", <file_format>)
      .schema(<schema>)
      .load(<input_path>))

# Append to the bronze Delta table once and terminate (trigger once)
query = (df.writeStream
         .format("delta")
         .trigger(once=True)
         .outputMode("append")
         .option("checkpointLocation", <CheckpointLocation>)
         .option("path", <path>)
         .table(<table_name>))

query.awaitTermination()

Business impact:

  • Increased efficiency: Manually processing thousands of files became a thing of the past. The client saved significant time and valuable resources by automating the data ingestion process.
  • Improved data quality: Ingesting data into the data lake using Databricks Delta Lake ensured enhanced data quality and consistency. This, in turn, mitigated the risk of data errors and inconsistencies.
  • Faster data processing: With the automation of data ingestion and improved data quality, the client could achieve lightning-fast file processing times. Files that previously took hours to process were now handled within minutes, empowering the team to make data-driven decisions swiftly.

Scenario 2: Streamlining the Data Ingestion Pipeline

Our client was managing a continuous influx of incremental data from a transactional table in SAP. Every 15 minutes, a massive file containing millions of records had to be extracted and sent to the data lake landing zone. While this critical dataset was essential for their business, the sheer volume of data and the complexity of the transactional system posed huge challenges.

The challenges:

  • Managing a large volume of data: The transactional system generated millions of transactions per hour, resulting in an overwhelming volume of data that needed to be ingested, processed, and analyzed.
  • Ordered file processing: It was crucial to process the incremental files in the correct order to maintain data consistency and accuracy with the source system.
  • Near real-time data processing: Due to the critical nature of the data, the business required immediate ingestion of the files as soon as they arrived in the landing zone, enabling near real-time data processing.

Using Databricks to enable efficient incremental file processing

The team strategically decided to implement Databricks Auto Loader streaming. This feature allowed them to process new data files incrementally and effectively as they arrived in the cloud storage.

The implementation involved the following steps:

  1. Leveraging file notification and queue services: The team configured Auto Loader to use the file notification service and queue service, which subscribed to file events from the input directory. This setup ensured that new data files were promptly detected and processed.
  2. Custom role creation for the service principal: To enable the file notification service, the team created a custom role for the service principal. This role encompassed the necessary permissions to create the queue and event subscription required for seamless file notification.

Sample code snippet:

"permissions": [
    {
        "actions": [
            "Microsoft.EventGrid/eventSubscriptions/write",
            "Microsoft.EventGrid/eventSubscriptions/read",
            "Microsoft.EventGrid/eventSubscriptions/delete",
            "Microsoft.EventGrid/locations/eventSubscriptions/read",
            "Microsoft.Storage/storageAccounts/read",
            "Microsoft.Storage/storageAccounts/write",
            "Microsoft.Storage/storageAccounts/queueServices/read",
            "Microsoft.Storage/storageAccounts/queueServices/write",
            "Microsoft.Storage/storageAccounts/queueServices/queues/write",
            "Microsoft.Storage/storageAccounts/queueServices/queues/read",
            "Microsoft.Storage/storageAccounts/queueServices/queues/delete"
        ],
        "notActions": [],
        "dataActions": [
            "Microsoft.Storage/storageAccounts/queueServices/queues/messages/delete",
            "Microsoft.Storage/storageAccounts/queueServices/queues/messages/read",
            "Microsoft.Storage/storageAccounts/queueServices/queues/messages/write",
            "Microsoft.Storage/storageAccounts/queueServices/queues/messages/process/action"
        ],
        "notDataActions": []
    }
]

# Auto Loader in file notification mode: subscribe to storage events via a queue
input_df = (spark.readStream
            .format("cloudFiles")
            .option("cloudFiles.format", <file_format>)
            .option("cloudFiles.useNotifications", "true")
            .option("cloudFiles.resourceGroup", <resource_group_name>)
            .option("cloudFiles.subscriptionId", <subscription_id>)
            .option("cloudFiles.tenantId", <tenant_id>)
            .option("cloudFiles.clientId", <service_principle_client_id>)
            .option("cloudFiles.clientSecret", <service_principle_secret>)
            .option("cloudFiles.maxFilesPerTrigger", 1)
            .schema(<schema>)
            .load(<path>))

# Process each micro-batch with a custom function and write to the target Delta table
(input_df.writeStream
 .format("delta")
 .foreachBatch(<function_for_data_processing>)
 .outputMode("update")
 .option("checkpointLocation", <checkpoint_location>)
 .start())

Business impact:

  • Automated data discovery and loading: Auto Loader automated the process of identifying new data files as they arrived in the data lake and automatically loaded the data into the target tables. This eliminated the manual effort required for managing the data loading process.
  • Enhanced focus on data analysis: The client could shift from managing the loading process to analyzing the data by streamlining the data ingestion process. Hence, they derived valuable insights and could make informed business decisions promptly.

Making Databricks Auto Loader Work for You

If you’re using Databricks to manage data ingestion, keep these things in mind so that you can create maximum value for your clients:

  • Data discovery: Since Databricks Auto Loader automatically detects new data files as they arrive in the data lake, it eliminates the need for manual scanning, saving time and ensuring no data goes unnoticed.
  • Automatic schema inference: The Auto Loader can automatically infer the schema of incoming files based on the file format and structure, and it supports schema changes: you can choose to drop new columns, fail on change, or rescue new columns and store them separately (a hedged sketch follows this list). This facilitates smooth data ingestion without delays during schema changes, and there's no need to define the schema manually, making the loading process more seamless and less error-prone.
  • Parallel processing: Databricks Auto Loader is designed to load data into target tables in parallel. This will come in handy when you need to handle large volumes of data efficiently.
  • Delta Lake integration: Databricks Auto Loader seamlessly integrates with Delta Lake, an open-source data storage and management layer optimized for data processing and analytics workloads. You can therefore leverage Delta Lake's features like ACID transactions, versioning, time travel, and more.
  • Efficient job restarts: The Auto Loader stores metadata about the processed data in RocksDB as key-value pairs, enabling seamless job restarts without the need to log failures in the check-point location.
  • Spark structured streaming: The Auto Loader leverages Spark structured streaming for immediate data processing, providing real-time insights.
  • Flexible file identification: The Auto Loader provides two options for identifying new files – directory listing and file notification. The directory list mode allows the quick launching of an Auto Loader stream without additional permissions. At the same time, file notification and queue services eliminate the need for directory listing in cases of large input directories or unordered file volumes.
  • Batch workloads compatibility: While the Auto Loader excels in streaming and processing hundreds of files, it can also be used for batch workloads. This eliminates the need for running continuous clusters. In addition, with check-pointing, you can start and stop streams efficiently. The Auto Loader can also be scheduled for regular batch loads using the trigger once option, leveraging all its features.
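A hedged sketch of Auto Loader with schema tracking and evolution; the paths and table name are placeholders, and the option values follow the documented cloudFiles settings, so verify them against your Databricks runtime version:

# Incremental ingestion with schema inference, evolution, and a rescued-data column
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")
      .option("cloudFiles.schemaLocation", "<schema-tracking-path>")   # where inferred schemas are stored
      .option("cloudFiles.schemaEvolutionMode", "addNewColumns")       # or "rescue" / "failOnNewColumns" / "none"
      .option("cloudFiles.inferColumnTypes", "true")
      .load("<landing-zone-path>"))

(df.writeStream
   .format("delta")
   .option("checkpointLocation", "<checkpoint-path>")
   .option("mergeSchema", "true")   # let the Delta sink accept newly added columns
   .trigger(availableNow=True)      # batch-style incremental run
   .table("<bronze_table>"))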
Data ingestion and processing are crucial milestones in the Data Management journey. While organizations can generate vast amounts of data, it's important to ingest and process that data correctly for accurate insights. With services like Databricks, the data-loading process becomes simpler and more efficient, improving output accuracy and empowering organizations to make data-driven decisions.

    A Practical Guide to Setting Up Your Data Lakehouse across AWS, Azure, GCP and Snowflake (09 May 2023)

    Explore the evolution from Enterprise Data Warehouses to Data Lakehouses on AWS, Azure, GCP, and Snowflake. This comparative analysis outlines key implementation stages, aiding organizations in leveraging modern, cloud-based Lakehouse setups for enhanced BI and ML operations.
    Today, most large enterprises collect huge amounts of data in various forms – structured, semi-structured, and unstructured. While Enterprise Data Warehouses are ACID compliant and well-suited to BI use cases, the kind of data collected today for ML use cases requires much more flexibility in data structure and much more scalability in data volume than they can currently provide.

    The initial solution to this issue came with advancements in cloud computing and the advent of data lakes in the late 2010s. Lakes are built on top of cloud-based storage such as AWS S3 buckets and Azure Blob/ADLS. They are flexible and scalable, but many of the original benefits, such as ACID compliance, are lost.

    data lakehouse evolution

    Data Lakehouse has evolved to address this gap. A Data Lakehouse combines the best of Data Lakes and Data Warehouses – it has the ability to store data at scale while providing the benefits of structure.

    data lakehouse

    At Tiger Analytics, we’ve worked across cloud platforms like AWS, GCP, Snowflake, and more to build a Lakehouse for our clients. Here’s a comparative study of how to implement a Lakehouse pattern across major available platforms.

    Questions you may have:

    All the major cloud platforms in the market today can be used to implement a Lakehouse pattern. The first question that comes to mind is: how do they compare, i.e., what services are available across the different stages of the data flow? The next: given that many organizations are moving toward a multi-cloud setup, how do we set up Lakehouses across platforms?

    Finding the right Lakehouse architecture

    Given that the underlying cloud platform may vary, let’s look at a platform-agnostic architecture pattern, followed by details of platform-specific implementations. The pattern detailed below enables data ingestion from a variety of data sources and supports all forms of data like structured (tables, CSV files, etc.), semi-structured (JSON, YAML, etc.), and unstructured (blobs, images, audio, video, PDFs, etc.).

    It comprises four stages based on the data flow:

    1. Data Ingestion – To ingest the data from the data sources to a Data Lake storage:

    • This can be done using cloud-native services like AWS DMS, Azure DMS, GCP Data Transfer service or third-party tools like Airbyte or Fivetran, etc.

    2. Data Lake Storage – Provide durable storage for all forms of data:

    • Options depend on the Cloud service providers. For example – S3 on AWS, ADLS/Blob on Azure, and Google Cloud storage on GCP, etc.

    3. Data Transformation – Transform the data stored in the data lake:

    • Languages and frameworks like Python, Spark (PySpark), and SQL can be used based on requirements (a minimal PySpark sketch follows this list).
    • Data can also be loaded to the data warehouse without performing any transformations.
    • Services like AWS Glue Studio and GCP Dataflow provide workflow-based/no-code options to load the data from the data lake to the data warehouse.
    • Transformed data can then be stored in the data warehouse/data marts.

    4. Cloud Data Warehouse – Provision OLAP capabilities with support for Massively Parallel Processing (MPP) and columnar data storage:

    • Complex data aggregations are made possible with support for SQL querying.
    • Support for structured and semi-structured data formats.
    • Curated data is stored and made available for consumption by BI applications/data scientists/data engineers using role-based (RBAC) or attribute-based access controls (ABAC).
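    As a simple, platform-agnostic illustration of the transformation stage, the PySpark sketch below reads raw data from the lake, applies a few clean-up steps, and writes a curated layer back to lake storage, from where the warehouse of your choice can bulk-load it. The paths and column names are hypothetical.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("lake_to_warehouse").getOrCreate()

# Hypothetical lake paths and column names, for illustration only.
raw = spark.read.json("s3://example-lake/raw/orders/")

curated = (
    raw.dropDuplicates(["order_id"])                      # basic de-duplication
       .withColumn("order_date", F.to_date("order_ts"))   # derive a partition-friendly date column
       .filter(F.col("order_amount") > 0)                 # drop obviously invalid records
)

# Write the curated layer back to the lake; Redshift, Synapse, BigQuery, or Snowflake
# can then bulk-load these Parquet files using their native load mechanisms.
curated.write.mode("overwrite").partitionBy("order_date").parquet("s3://example-lake/curated/orders/")
```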

    stages of data lakehouse

    Let’s now get into the details of how this pattern can be implemented on different cloud platforms. Specifically, we’ll look at how a Lakehouse can be set up by migrating from an on-premise data warehouse or other data sources by leveraging cloud-based warehouses like AWS Redshift, Azure Synapse, GCP BigQuery, or Snowflake.

    How can a Lakehouse be implemented on different cloud platforms?

    Lakehouse design using AWS

    We deployed an AWS-based Data Lakehouse for a US-based Capital company with a Lakehouse pattern consisting of:

    • Python APIs for Data Ingestion
    • S3 for the Data Lake
    • Glue for ETL
    • Redshift as a Data Warehouse

    Here’s how you can implement a Lakehouse design using AWS

    lakehouse using AWS

    1. Data Ingestion:
    • AWS Database Migration Service to migrate data from the on-premise data warehouse to the cloud.
    • AWS Snow family/Transfer family to load data from other sources.
    • AWS Kinesis (Streams, Firehose, Data Analytics) for real-time streaming.
    • Third-party tools like Fivetran can also be used for moving data.
    2. Data Lake:
    • AWS S3 as the data lake to store all forms of data.
    • AWS Lake Formation is helpful in creating secured and managed data lakes within a short span of time.
    3. Data Transformation:
    • Spark-based transformations can be done using EMR or Glue.
    • No-code/workflow-based transformations can be done using Glue Studio/Glue DataBrew. Third-party tools like DBT can also be used.
    4. Data Warehouse:
    • AWS Redshift as the data warehouse which supports both structured (table formats) and semi-structured data (SUPER datatype).

    Lakehouse design using Azure

    One of our clients was an APAC-based automotive company. After evaluating their requirements, we deployed a Lakehouse pattern consisting of:

    • ADF for Data Ingestion
    • ADLS Gen 2 for the Data Lake
    • Databricks for ETL
    • Synapse as a Data Warehouse

    Let’s look at how we can deploy a Lakehouse design using Azure

    lakehouse using Azure

    1. Data Ingestion:
    • Azure Database Migration Service and SQL Server Migration Assistant (SSMA) to migrate the data from an on-premise data warehouse to the cloud.
    • Azure Data Factory (ADF) and Azure Data Box can be used for loading data from other data sources.
    • Azure stream analytics for real-time streaming.
    • Third-party tools like Fivetran can also be used for moving data.
    2. Data Lake:
    • ADLS as the data lake to store all forms of data.
    3. Data Transformation:
    • Spark-based transformations can be done using Azure Databricks.
    • Azure Synapse itself supports various transformation options using Data Explorer/Spark/Serverless pools.
    • Third-party tools like Fivetran can also be used for moving data.
    4. Data Warehouse:
    • Azure Synapse as the data warehouse.

    Lakehouse design using GCP

    Our US-based retail client needed to manage large volumes of data. We deployed a GCP-based Data Lakehouse with a Lakehouse pattern consisting of:

    • Pub/sub for Data Ingestion
    • GCS for the Data Lake
    • Data Proc for ETL
    • BigQuery as a Data Warehouse

    Here’s how you can deploy a Lakehouse design using GCP

     lakehouse using GCP

     

    1. Data Ingestion:
    • BigQuery data transfer service to migrate the data from an on-premise data warehouse to the cloud.
    • GCP Data Transfer service can be used for loading data from other sources.
    • Pub/Sub for real-time streaming.
    • Third-party tools like Fivetran can also be used for moving data.
    2. Data Lake:
    • Google Cloud Storage as the data lake to store all forms of data.
    3. Data Transformation:
    • Spark-based transformations can be done using Dataproc.
    • No-code/workflow-based transformations can be done using Dataflow. Third-party tools like DBT can also be used.
    4. Data Warehouse:
    • GCP BigQuery as the data warehouse which supports both structured and semi-structured data.

    Lakehouse design using Snowflake:

    We successfully deployed a Snowflake-based Data Lakehouse for our US-based Real estate and supply chain logistics client with a Lakehouse pattern consisting of:

    • AWS native services for Data Ingestion
    • AWS S3 for the Data Lake
    • Snowpark for ETL
    • Snowflake as a Data Warehouse

    Here’s how you can deploy a Lakehouse design using Snowflake

    lakehouse using snowflake

    Implementing a lakehouse on Snowflake is quite unique as the underlying cloud platform could be any of the big 3 (AWS, Azure, GCP), and Snowflake can run on top of it.

    1. Data Ingestion:
    • Cloud-native migration services can be used to store the data on the respective cloud storage.
    • Third-party tools like Airbyte and Fivetran can also be used to ingest the data.
    2. Data Lake:
    • Depends on the cloud platform: AWS – S3, Azure – ADLS and GCP – GCS.
    • Data can also be loaded directly onto Snowflake. However, data lake storage is still needed to store unstructured data.
    • Dictionary tables can be used to catalog the staged files in cloud storage.
    3. Data Transformation:
    • Spark-based transformations can be done using Snowpark, which supports Python (a minimal Snowpark sketch follows this list). Snowflake also natively supports some transformations while using the COPY command.
    • Snowpipe supports transformations on streaming data as well.
    • Third-party tools like DBT can also be leveraged.
    4. Data Warehouse:
    • Snowflake as the data warehouse, which supports both structured (table formats) and semi-structured data (VARIANT datatype). Other options like internal/external stages can also be utilized to reference the data stored on cloud-based storage systems.
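    For reference, a minimal Snowpark (Python) sketch of the transformation step might look like the following. The connection parameters, stage name, and table names are hypothetical placeholders, and the example only hints at Snowpark’s DataFrame API.

```python
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col

# Hypothetical connection parameters; in practice these would come from a secrets store.
connection_parameters = {
    "account": "<account>", "user": "<user>", "password": "<password>",
    "role": "TRANSFORMER", "warehouse": "ETL_WH", "database": "LAKEHOUSE", "schema": "CURATED",
}
session = Session.builder.configs(connection_parameters).create()

# Read staged Parquet files from cloud storage via a (hypothetical) external stage,
# apply transformations, and persist the result as a Snowflake table.
orders = session.read.parquet("@landing_stage/orders/")
curated = orders.filter(col("ORDER_AMOUNT") > 0).drop_duplicates("ORDER_ID")
curated.write.mode("overwrite").save_as_table("CURATED.ORDERS")
```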

    Integrating enterprise data into a modern storage architecture is key to realizing value from BI and ML use cases. At Tiger Analytics, we have seen that implementing the architecture detailed above has streamlined data access and storage for our clients. Using this blueprint, you can migrate from your legacy data warehouse onto a cloud-based lakehouse setup.

    The post A Practical Guide to Setting Up Your Data Lakehouse across AWS, Azure, GCP and Snowflake appeared first on Tiger Analytics.

    ]]>
    https://www.tigeranalytics.com/perspectives/blog/a-practical-guide-to-setting-up-your-data-lakehouse-across-aws-azure-gcp-and-snowflake/feed/ 312
    How to Design your own Data Lake Framework in AWS https://www.tigeranalytics.com/perspectives/blog/how-to-design-your-own-data-lake-framework-in-aws/ https://www.tigeranalytics.com/perspectives/blog/how-to-design-your-own-data-lake-framework-in-aws/#comments Mon, 29 Aug 2022 17:09:37 +0000 https://www.tigeranalytics.com/?p=9211 Learn how you can efficiently build a Data Lakehouse with Tiger Data Fabric's reusable framework. We leverage AWS's native services and open-source tools in a modular, multi-layered architecture. Explore our insights and core principles to tailor a solution for your unique data challenges.

    The post How to Design your own Data Lake Framework in AWS appeared first on Tiger Analytics.

    ]]>
    “Data is a precious thing and will last longer than the systems themselves.”– Tim Berners-Lee, inventor of the World Wide Web

    Organizations spend a lot of time and effort building pipelines to consume and publish data coming from disparate sources within their Data Lake. Most of the time and effort in large data initiatives are consumed in data ingestion development.

    What’s more, with an increasing number of businesses migrating to the cloud, factors like breaking data silos and enhancing data discoverability of data environments have become a business priority.

    While Data Lake is the heart of data operations, one should carefully tie capabilities like data security, data quality, metadata-store, etc within the ecosystem.

    Properties of an Enterprise Data Lake solution

    In a large-scale organization, the Data Lake should possess these characteristics:

    • Data Ingestion- the ability to consume structured, semi-structured, and unstructured data
    • Supports push (batch and streaming systems) and pull (DBs, APIs, etc.) mechanisms
    • Data security through sensitive data masking, tokenization, or redaction
    • Natively available rules through the Data Quality framework to filter impurities
    • Metadata Store, Data Dictionary for data discoverability and auditing capability
    • Data standardization for common data format

    A common reusable framework is needed to reduce the time and effort in collecting and ingesting data. At Tiger Analytics, we are solving these problems by building a scalable platform within AWS using AWS’s native services and open-source tools. We’ve adopted a modular design and loosely coupled multi-layered architecture. Each layer provides a distinctive capability and communicates with the others via APIs, messages, and events. The platform abstracts complex processes in the backend and provides a simple, easy-to-use UI for the stakeholders.

    • Self-service UI to quickly configure data workflows
    • Configuration-based backend processing
    • AWS cloud native and open-source technologies
    • Data Provenance: data quality, data masking, lineage, recovery and replay audit trail, logging, notification

    Before exploring the architecture, let’s understand a few logical components referenced in the blog.

    • Sources are individual entities that are registered with the framework. They align with systems that own one or more data assets. The system could be a database, a vendor, or a social media website. Entities registered within the framework store various system properties. For instance, if it is a database, then DB Type, DB URL, host, port, username, etc.
    • Assets are the entries within the framework. They hold the properties of individual files from various sources. Metadata of source files include column names, data types, security classifications, DQ rules, data obfuscation properties, etc.
    • Targets organize data as per enterprise needs. There are various domains/sub-domains to store the data assets. Based on the subject area of the data, the files can be stored in their specific domains.

    The Design Outline

    With the demands to manage large volumes of data increasing year on year, our data fabric was designed to be modular, multi-layered, customizable, and flexible enough to suit individual needs and use cases. Whether it is a large banking organization with millions of transactions per day and a strong focus on data security or a start-up that needs clean data to extract business insights, the platform can help everyone.

    Following the same modular and multi-layered design principle, we, at Tiger, have put together the architecture with the provision of swapping out components or tools if needed. Keeping in mind that the world of technology is ever-changing and volatile we’ve built flexibility into the system.

    • UI Portal provides a user-friendly self-service interface to set up and configure sources, targets, and data assets. These elements drive the data consumption from the source to Data Lake. These self-service applications allow the federation of data ingestion. Here, data owners manage and support their data assets. Teams can easily onboard data assets without building individual pipelines. The interface is built using ReactJS with Material-UI, for high-quality front-end graphics. The portal is hosted on an AWS EC2 instance, for resizable compute capacity and scalability.
    • API Layer is a set of APIs which invokes various functionalities, including CRUD operations and AWS service setup. These APIs create the source, asset, and target entities. The layer supports both synchronous and asynchronous APIs. API Gateway and Lambda functions provide the base of this component. Moreover, DynamoDB captures events requested for audit and support purposes.
    • Config and Metadata DB is the data repository to capture the control and configuration information. It holds the framework together through a complex data model which reduces data redundancy and provides quick query retrieval. The framework uses AWS RDS PostgreSQL which natively implements Multiversion Concurrency Control (MVCC). It provides point-in-time consistent views without read locks, hence avoiding contentions.
    • Orchestration Layer strings together various tasks with dependencies and relationships within a data pipeline. These pipelines are built on Apache Airflow. Every data asset has its own pipeline, thereby providing more granularity and control over individual flows. Individual DAGs are created through an automated process called the DAG-Generator. It is a Python-based program tied to the API that registers data assets. Every time a new asset is registered, the DAG-Generator creates a DAG based on the configurations and uploads it to the Airflow server. These DAGs may be time-driven or event-driven, based on the source system (a minimal generated-DAG sketch follows this list).
    • Execution Layer is the final layer where the magic happens. It comprises various individual python-based programs within AWS Glue jobs. We will be seeing more about this in the following section.
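    To illustrate what the DAG-Generator produces for the Orchestration Layer, here is a minimal sketch of a generated, time-driven Airflow DAG that calls a Glue job in the Execution Layer. The asset configuration dictionary, DAG id, and job name are hypothetical stand-ins for values read from the Config and Metadata DB.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.glue import GlueJobOperator

# Hypothetical asset configuration; in the framework this comes from the metadata DB.
asset = {"asset_id": "sales_orders", "schedule": "0 2 * * *", "glue_job": "ingest_sales_orders"}

with DAG(
    dag_id=f"ingest_{asset['asset_id']}",
    start_date=datetime(2022, 1, 1),
    schedule_interval=asset["schedule"],   # time-driven; event-driven DAGs are triggered externally instead
    catchup=False,
) as dag:
    run_ingestion = GlueJobOperator(
        task_id="run_glue_ingestion",
        job_name=asset["glue_job"],        # existing Glue job in the Execution Layer
    )
```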

    Data Pipeline (Execution Layer)

    A data pipeline is a set of tools and processes that automate the data flow from the source to a target repository. The data pipeline moves data from the onboarded source to the target system.

    Figure 3: Concept Model – Execution Layer
    • Data Ingestion
    • Several patterns affect the way we consume/ingest data. They vary depending on the source systems and consuming frequency. For instance, ingesting data from a database requires additional capabilities compared to consuming a file dropped by a third-party vendor.

      Figure 4: Data Ingestion Quadrant

      The Data Ingestion Quadrant is our base outline to define consuming patterns. Depending on the properties of the data asset, the framework has the intelligence to use the appropriate pipeline for processing. To achieve this, we have individual S3 buckets for time-driven and event-driven sources. A driver Lambda function externally invokes the event-driven Airflow DAGs, while CRON expressions within the DAGs drive the time-based schedules.

      These capabilities consume different file formats like CSV, JSON, XML, parquet, etc. Connector libraries are used to pull data from various databases like MySQL, Postgres, Oracle, and so on.

    • Data Quality
    • Data is the core component of any business operation. Data Quality (DQ) in any enterprise system determines its success. The data platform requires a robust DQ framework that promises quality data in enterprise repositories.

      For this framework, we use the AWS open-source library DEEQU, which is built on top of Apache Spark; its Python interface is called PyDeequ. DEEQU provides data profiling capabilities, suggests DQ rules, and executes several checks (a minimal PyDeequ sketch follows this section). We have divided DQ checks into two categories:

      Default Checks are the DQ rules that automatically apply to the attributes. For instance, Length Check, Datatype Check, and Primary Key Check. These data asset properties are defined while registering in the system.

      Advanced Checks are the additional DQ rules. They are applied to various attributes based on the user’s needs. The user defines these checks and stores them in the metadata.

      The DQ framework pulls these checks from the metadata store, and it identifies the default checks through data asset properties. Eventually, it constructs a bulk check module for data execution. DQ Results are stored in the backend database. The logs are stored in the S3 bucket for detailed analysis. DQ summary available in the UI provides additional transparency to business users.
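      As a minimal sketch of how such checks could be expressed with PyDeequ, consider the snippet below. The column names and rules are illustrative only, `spark` is an existing SparkSession, `df` is the data asset being validated, and PyDeequ additionally needs the Deequ jar on the Spark classpath and the SPARK_VERSION environment variable set.

```python
from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationSuite, VerificationResult

# Default checks of the kind described above (illustrative columns and thresholds).
default_checks = (
    Check(spark, CheckLevel.Error, "default_checks")
    .isComplete("customer_id")                                   # completeness/null check
    .isUnique("customer_id")                                     # primary key check
    .hasMaxLength("country_code", lambda length: length <= 2)    # length check
)

result = (
    VerificationSuite(spark)
    .onData(df)
    .addCheck(default_checks)
    .run()
)

# The framework stores these results in its backend DB and S3; here we just inspect them.
results_df = VerificationResult.checkResultsAsDataFrame(spark, result)
results_df.show(truncate=False)
```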

    • Data Obfuscation/Masking
    • Data masking is the capability of dealing with sensitive information. While registering a data asset, the framework has a provision to enable tokenization on sensitive columns. The Data Masking task uses an internal algorithm and a key (associated with the framework and stored in AWS Secrets Manager) to tokenize those specific columns before storing them in the Data Lake. These attributes can be detokenized through user-defined functions; detokenization also requires additional key access, to prevent attempts by unauthorized users (a simplified sketch follows this section).

      The framework also supports other forms of irreversible data obfuscation, such as Redaction and Data Perturbation.
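      The heavily simplified sketch below masks a sensitive column with a keyed hash (HMAC) using a key fetched from AWS Secrets Manager. Note that a keyed hash is irreversible, whereas the framework’s tokenization is reversible through its own algorithm; the secret name, DataFrame, and column names are hypothetical.

```python
import hashlib
import hmac
import json

import boto3
from pyspark.sql import functions as F

def get_masking_key(secret_name: str) -> bytes:
    """Fetch the masking key from AWS Secrets Manager (secret name is hypothetical)."""
    client = boto3.client("secretsmanager")
    secret = client.get_secret_value(SecretId=secret_name)
    return json.loads(secret["SecretString"])["masking_key"].encode("utf-8")

key = get_masking_key("datalake/masking-key")

@F.udf("string")
def mask(value):
    # Irreversible keyed hash for illustration; the framework's tokenization is reversible.
    if value is None:
        return None
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()

masked_df = df.withColumn("ssn", mask(F.col("ssn")))  # `df` and "ssn" are illustrative
```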

    • Data Standardization

    Data standardization brings data into a common format, allowing data access using a common set of tools and libraries. The framework executes standardized operations for data consistency (a minimal PySpark sketch follows this list). The framework can, therefore:

    1. Standardize target column names.
    2. Support file conversion to parquet format.
    3. Remove leading zeroes from integer/decimal columns.
    4. Standardize target column datatypes.
    5. Add partitioning column.
    6. Remove leading and trailing white spaces from string columns.
    7. Support date format standardization.
    8. Add control columns to target data sets.
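    A minimal PySpark sketch of a few of these standardization steps is shown below; the input DataFrame, partition column, and target path are hypothetical.

```python
from pyspark.sql import functions as F

def standardize(df, partition_col="load_date"):
    """Apply a subset of the standardization rules listed above (illustrative only)."""
    # Standardize target column names (lower-case, snake_case).
    for name in df.columns:
        df = df.withColumnRenamed(name, name.strip().lower().replace(" ", "_"))
    # Remove leading and trailing white spaces from string columns.
    for name, dtype in df.dtypes:
        if dtype == "string":
            df = df.withColumn(name, F.trim(F.col(name)))
    # Add a partitioning/control column.
    return df.withColumn(partition_col, F.current_date())

# `raw_df` is the ingested DataFrame; persist in Parquet, partitioned by the control column.
standardize(raw_df).write.mode("overwrite").partitionBy("load_date").parquet(
    "s3://example-lake/standardized/orders/"
)
```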

    Through this blog, we’ve shared insights on our generic architecture to build a Data Lake within the AWS ecosystem. While we can keep adding more capabilities to solve real-world problems, this is just a glimpse of data challenges that can be addressed efficiently through layered and modular design. You can use these learnings to put together the outline of a design that works for your use case while following the same core principles.

    The post How to Design your own Data Lake Framework in AWS appeared first on Tiger Analytics.

    ]]>
    https://www.tigeranalytics.com/perspectives/blog/how-to-design-your-own-data-lake-framework-in-aws/feed/ 1
    Unlocking the Potential of Modern Data Lakes: Trends in Data Democratization, Self-Service, and Platform Observability https://www.tigeranalytics.com/perspectives/blog/unlocking-the-potential-of-modern-data-lakes-trends-in-data-democratization-self-service-and-platform-observability/ Wed, 22 Jun 2022 12:53:21 +0000 https://www.tigeranalytics.com/?p=7825 Learn how self-service management, intelligent data catalogs, and robust observability are transforming data democratization. Walk through the crucial steps and cutting-edge solutions driving modern data platforms towards greater adoption and democratization.

    The post Unlocking the Potential of Modern Data Lakes: Trends in Data Democratization, Self-Service, and Platform Observability appeared first on Tiger Analytics.

    ]]>
    Modern Data Lake

    Data Lake solutions started emerging out of technology innovations such as Big Data but have been propelled by Cloud to a greater extent. The prevalence of the Data Lake can be attributed to its ability to bring better speed to data retrieval compared to Data Warehouses, the elimination of a significant amount of modeling effort, unlocking advanced analytics capabilities for an enterprise, and bringing in storage and compute scalability to handle different kinds of workloads and enable data-driven decisions.

    Data Democratization is one of the key outcomes that is sought after with the data platforms today. The need to bring reliable and trusted data in a self-service manner to end-users such as data analysts, data scientists, and business users, is the top priority of any data platform. This blog discusses the key trends we see with our clients and in the industry that are being created to aid data lakes cater to wider audiences and to increase their adoption with consumers.

    Business Value Increased over time along with significant reduction in TCO

    Self-Service Management of Modern Data Lake

    “…self-service is basically enabling all types of users (IT or business), to easily manage and govern the data on the lake themselves – in a low/no-code manner”

    Building robust, scalable data pipelines is the first step in leveraging data to its full potential. However, it is the incorporation of automation and self-serving capabilities that really helps one achieve that goal. It will also help in democratizing the data, platform, and analytics capabilities to all types of users and reducing the burden on IT teams significantly, so they can focus on high-value tasks.

    Building Self-Service Capabilities

    Self-serving capabilities are built on top of robust, scalable data pipelines. Any data lake implementation will involve building various reusable frameworks and components for acquiring data from the storage systems (components that can understand and infer the schema, check data quality, implement certain functionalities when bringing in or transforming the data), and loading it into the target zone.

    Data pipelines are built using these reusable components and frameworks. These pipelines ingest, wrangle, transform and perform the egress of the data. Stopping at this point would rob an organization of the opportunity to leverage the data to its full potential.

    In order to maximize the result, APIs (following microservices architecture) for data and platform management are created to perform CRUD operations and for monitoring purposes. They can also be used to schedule and trigger pipelines, discover and manage datasets, cluster management, security, and user management. Once the APIs are set, you can build a web UI-based interface that can orchestrate all these operations and help any user to navigate it, bring in the data, transform the data, send out the data, or manage the pipelines.

    Building Self Service Capabilities

    Tiger has also taken self-servicing on Data Lake to another level by building a virtual assistant that interacts with the user to perform the above-mentioned tasks.

    Data Catalog Solution

    Another common trend we see in modern data lakes is the increasing adoption of next-gen Data Catalog Solutions. A Data Catalog Solution comes in handy when we’re dealing with huge volumes of data and multiple data sets. It can extract and understand technical metadata from different datasets, link them together, understand their health, reliability, and usage patterns, and help any consumer, whether it is a data scientist or a data analyst, or a business analyst, with insight generation.

    Data Catalogs have been around for quite some time, but now they are becoming more intelligent and information-rich. It is no longer just about bringing in the technical metadata.

    Data Catalog Implementation

    Some of the vital parts of building a data catalog are using knowledge graphs and powerful search technologies. A knowledge graph solution can bring in information about a dataset such as its schema, data quality, profiling statistics, PII, and classification. It can also figure out who owns a particular dataset, which users consume the dataset (derived from various logs), and which department each person belongs to.

    This knowledge graph can be used to carry out search and filter operations, graph queries, recommendations, and visual explorations.
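    As a toy illustration of the idea (not a production catalog), the sketch below uses the networkx library to link a dataset to its owner and lineage and then answers a simple graph query; every node name and attribute is made up.

```python
import networkx as nx

# Build a tiny metadata knowledge graph; all names and attributes are illustrative.
graph = nx.DiGraph()
graph.add_node("sales_orders", kind="dataset", contains_pii=True, dq_score=0.97)
graph.add_node("customer_dim", kind="dataset", contains_pii=True, dq_score=0.99)
graph.add_node("jane.doe", kind="user", department="Finance")

graph.add_edge("jane.doe", "sales_orders", relation="owns")
graph.add_edge("sales_orders", "customer_dim", relation="derived_from")

# Simple graph query: which datasets are owned by someone in Finance?
finance_datasets = [
    target
    for source, target in graph.edges()
    if graph.nodes[source].get("department") == "Finance"
    and graph[source][target]["relation"] == "owns"
]
print(finance_datasets)  # ['sales_orders']
```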

    Knowledge Graph

    Data and Platform Observability

    Tiger looks at Observability in three different stages:

    1. Basic Data Health Monitoring
    2. Advanced Data Health Monitoring with Predictions
    3. Extending to Platform Observability

    Basic Data Health Monitoring

    Identifying critical data elements (CDE) and monitoring them is the most basic aspect of data health monitoring. We configure rule-based data checks against these CDEs, capture the results on a periodic basis, and provide visibility through dashboards. These issues are also tracked through ticketing systems and then fixed at the source as much as possible. This process constitutes the first stage in ensuring Data Observability.

    The key capabilities that are required to achieve this level of maturity are shown below

    Advanced Data Health Monitoring with Predictions

    Most of the enterprise clients we work with have reached the Basic Data Health Monitoring stage and are looking to progress further. The observability ecosystem needs to be enhanced with some important capabilities that help move from a reactive response to a more proactive one. Artificial Intelligence and Machine Learning are the latest technologies being leveraged to this end. Some of the key capabilities include measuring data drift and schema drift, classifying incoming information automatically with AI/ML, detecting PII automatically and processing it appropriately, assigning security entitlements automatically based on similar elements in the data platform, etc. These capabilities will elevate the health of the data to the next level while also giving early warnings when data patterns are changing.
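    One simple, commonly used way to flag drift on a numeric column is a two-sample Kolmogorov-Smirnov test between a baseline window and the latest window. The sketch below shows that approach; the threshold and column handling are illustrative assumptions, not the specific method used by any particular platform.

```python
import pandas as pd
from scipy.stats import ks_2samp

def numeric_drift_detected(baseline: pd.Series, current: pd.Series, p_threshold: float = 0.05) -> bool:
    """Flag drift when the KS test rejects 'same distribution' at the given p-value threshold."""
    statistic, p_value = ks_2samp(baseline.dropna(), current.dropna())
    return p_value < p_threshold

# Usage sketch: compare last month's values (baseline) with this week's values (current).
# drifted = numeric_drift_detected(baseline_df["order_amount"], current_df["order_amount"])
```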

    Extending to Platform Observability

    The end goal of Observability Solutions is to deliver reliable data to consumers in a timely fashion. This goal can be achieved only when we move beyond data observability into the actual platform that delivers the data itself. This platform has to be modern and state of the art so that it can deliver the data in a timely manner while also allowing engineers and administrators to understand and debug if things are not going well. Following are some of the key capabilities that we need to think about to improve the platform-level observability.

    • Monitoring Data Flows & Environment: Ability to monitor job performance degradation, server health, and historical resource utilization trends in real-time
    • Monitor Performance: Understanding how data flows from one system to another and looking for bottlenecks in a visual manner would be very helpful in complex data processing environments
    • Monitor Data Security: Query logs, access patterns, security tools, etc need to be monitored in order to ensure there is no misuse of data
    • Analyze Workloads: Automatically detecting issues and constraints in large data workloads that make them slow and building tools for Root Cause Analysis
    • Predict Issues, Delays, and Find Resolutions: Comparing historical performance to current operational efficiency, in terms of speed and resource usage, to predict issues and offer solutions
    • Optimize Data Delivery: Building tools into the system that continuously adjust resource allocation based on data-volume spike predictions and thus optimizing TCO

    Conclusion

    The Modern Data Lake environment is driven by the value of data democratization. It is important to make data management and insight gathering accessible to end-users of all kinds. Self Service aided by Intelligent Data Catalogs is the most promising solution for effective data democratization. Moreover, enabling trust in the data for data consumers is of utmost importance. The capabilities discussed, such as Data and Platform Observability, give users real under-the-hood control over onboarding, processing, and delivering data to different consumers. Companies are striving to create end-to-end observability solutions that enable data-driven decisions today, and these will be the solutions that take data platforms to the next level of adoption and democratization.

    We spoke more on this at the DES22 event. To catch our full talk, click here.

    The post Unlocking the Potential of Modern Data Lakes: Trends in Data Democratization, Self-Service, and Platform Observability appeared first on Tiger Analytics.

    ]]>
    Revolutionizing Business Intelligence: Trends, Tools, and Success Stories Unveiled by Tiger’s BI Framework https://www.tigeranalytics.com/perspectives/blog/revolutionizing-business-intelligence-trends-tools-and-success-stories-unveiled-by-tigers-bi-framework/ Wed, 01 Jun 2022 18:35:00 +0000 https://www.tigeranalytics.com/?p=7780 Uncover Modern BI’s impact with real-world cases. Learn how embedded BI resolves scattered stacks, harnessing Big Data for insights. Explore Tiger’s BI Framework, Dashboard Program, and Metadata Extractor, enabling data democratization for transformative solutions.

    The post Revolutionizing Business Intelligence: Trends, Tools, and Success Stories Unveiled by Tiger’s BI Framework appeared first on Tiger Analytics.

    ]]>
    Recently, at the DES22 Summit held virtually on the 30th of April 2022, we spoke in-depth about modern BI, its trends, and its applications across various sectors. We characterized modern BI according to the need for obtaining insights in the shortest possible period of time. The multiple streams of data in the current atmosphere and the sheer speed at which this large volume of data has to be obtained and processed demand cutting-edge solutions. These solutions ideally serve the consumers by providing them with the competitive edge required to adapt to the ever-changing conditions of the market.

    More businesses are expected to lean into modern BI solutions to revolutionize their analytics strategy, making it easier for the business and the stakeholders to act in real-time.

    In this article, we shall look at the trends, tools, and drivers of Modern BI through the lens of two Industry Use Cases.

    A Single Source of Truth From a Scattered BI Stack

    One of the most notable partners of Tiger’s is a large logistics company based out of the US. Geospatial information is a large part of their operations. However, their biggest issue was the fact that their BI stack was scattered across a multitude of applications.

    Geospatial Analytics’ market share is expected to reach $209.47 billion by 2030.
    It is used in a variety of sectors like agriculture, utilities, healthcare, insurance, real estate, supply chain optimization, etc.
    Deployment of 5G network services is expected to create new opportunities for geospatial analytics vendors.

    This situation comes with a number of drawbacks. A scattered BI stack means that data needs to be compiled for presentation and insight gathering, which requires a technically proficient team working on it. That is not an ideal state of any company’s data infrastructure, given the speed at which businesses need to operate with their data to make decisions. Easy adaptation by non-technical users is a key marker for a democratized and efficient BI environment. Therefore, a unified user-facing application was the appropriate solution for this project.

    Embedded BI’s goal is to integrate data analytics, insights, and visualization within an application with clarity and ease of access. It was the key to integrating the client’s multi-BI environment into a single platform.

    Tiger needed to build an integrated BI stack for a unified customer experience and eliminate data redundancies with a single source of truth. There were many sources of data within the organization since they had multiple departments. Using Tiger’s solution, all of their data were presented in a single, centralized application called the ‘Digital Dashboard.’ BI reports were embedded in the application with the use of an API interface.

    All departments of the company had access to the application to obtain insights from a single source of truth, which enabled better network coverage and agile performance.

    Big Data, Precise Action

    Big Data has been a huge talking point in the last decade, but there has been very little conversation on modern BI handling Big Data. The last few years have seen a massive increase in the production of data. Legacy BI tools have very limited capability to handle this huge volume of data, and it impacts the user’s ability to derive actionable insights. However, modern BI tools have evolved to handle petabytes of data and provide data insights with ease.

    One of the other esteemed partners of Tiger is a huge media house in India. They wanted to place precisely targeted ads in order to increase their ad revenue. They also wanted insights into viewership behavior for smarter programming. The data input they had at their disposal were user data, demographic data, and data from social media.

    Considering the variety of data and data sources, we needed a BI tool that could handle the large volume of data and prepare reports. The solution was to integrate TBs of heterogeneous sources onto the Data Lake and consume reports in live/direct query mode through native Spark-based BI connectors to get close to real-time insights. This enabled personalized recommendations for ad placements and programming based on user activity and demographic inputs, and resulted in the procurement of a great share of $5 billion in brand advertising dollars.

    Tiger’s BI Framework

    Tiger has a prebuilt framework to keep pace with the evolution of BI. It consists of code repositories, context templates, tech requirement templates, chart accommodations, and visuals that meet the requirement of any modern BI need. It also consists of a number of templates for embedded BI, Big Data, and geospatial BI.

    The Dashboarding Program Guide helps in a comprehensive discovery of business requirements, and the development and deployment of impactful dashboards that can steer the users to success.

    Dashboard Program Methodology

    Metadata Extractor

    The Metadata Extractor has been a part of traditional BI tools, where they had their own external metadata store, mostly in an XML format. However, it is not efficient for engineers and architects to go deep into metadata and get information. Tiger has gone a step ahead and built a Metadata Extractor tool to simplify and automate the process to collect metadata information of various reports and dashboards. The necessity of this tool is dictated by the prevalence of data redundancy in modern BI systems, which is quite common given that the abundance of data is only increasing by the day.

    Conclusion

    Data democratization is key to reaping the benefits of technological evolution, which is constantly putting technologies that are capable of gathering petabytes of data in the hands of users across the world. Modern BI tools need to be time- and cost-efficient, extensive in terms of data processing yet precise in their synthesis, and most importantly, accessible to the non-technical user. Tiger is at the forefront of crafting such tools that give its partners the ability to arrive at prompt, actionable insights, with its extensive modern BI frameworks and an enriched, secure data culture in place.

    To catch our full talk, click here.

    The post Revolutionizing Business Intelligence: Trends, Tools, and Success Stories Unveiled by Tiger’s BI Framework appeared first on Tiger Analytics.

    ]]>
    Managing Parallel Threads: Techniques with Apache NiFi https://www.tigeranalytics.com/perspectives/blog/managing-parallel-threads-techniques-with-apache-nifi/ Thu, 24 Oct 2019 16:33:39 +0000 https://www.tigeranalytics.com/blog/managing-parallel-threads-techniques-with-apache-nifi/ Control parallel thread execution via Apache NiFi for efficient data flow management using new methods to optimize performance, handle concurrent tasks, and ensure system stability. Be equipped to enhance your data processing workflows with advanced threading strategies.

    The post Managing Parallel Threads: Techniques with Apache NiFi appeared first on Tiger Analytics.

    ]]>
    As a data engineering enthusiast, you must be aware that Apache NiFi is designed to automate the data flow between multiple software systems. NiFi makes it possible to quickly understand the various dataflow operations that would otherwise take a significant amount of time.

    In this blog, we address a specific problem encountered while working with NiFi. NiFi has a feature to control the number of concurrent threads at an individual processor level, but there is no direct approach to controlling the number of concurrent threads at a processor group level. We provide an approach to help resolve this challenge.

    A Look at the Existing Concurrency Feature in NiFi

    As mentioned, NiFi provides concurrency at an individual processor level. This option, called “Concurrent Tasks”, is available for most processors in the Scheduling tab. At the same time, there are certain single-threaded processors that do not allow concurrency.

    Concurrency set to 5 on Processor Level

    This option allows the processor to run concurrent threads by using system resources at a higher priority when compared to other processors. In addition to this processor-level concurrency setting, NiFi has global maximum timer-driven and event-driven thread settings. Their default values are 10 and 5 respectively. They control the maximum number of threads NiFi can request from the server to fulfill concurrent task requests from NiFi processor components. These global values can be adjusted in controller settings (located via the hamburger menu in the upper right corner of the NiFi UI).

    Controller Settings

    NiFi sets the Max Timer Thread Counts relatively low to support operating on commodity hardware. This default setting can limit performance when there is a very large and high volume data flow that must perform a lot of concurrent processing. The general guidance for setting this value is two to four times the number of cores available to the hardware on which the NiFi service is running.

    NOTE: Thread Count applied within the NiFi UI is applied to every node in a NiFi cluster. The cluster UI can be used to see how the total active threads are being used per node.

    Custom Approach to Set the Concurrency at Processor-group Level

    To customize and control the number of concurrent threads to be executed within a processor group, use the NiFi Wait and Notify processors. Using notify signals created by the Notify processor, the number of threads to be executed within a processor group can be controlled.

    Here is a sample use case: create a NiFi flow that processes input flow files in batches. For example, process five inputs at a time within a processor group and always keep those five concurrent threads active until any of them completes. As soon as one or more flow files complete their processing, those freed slots should be used for the queued input flow files. To explain further, if there are 40 tables to be processed and only five threads can run in parallel within the actual processor group, initially 5 tables run concurrently, taking 5 threads from the system resources, and the remaining 35 tables get a chance only when any of the previously running threads completes.

    Now, to set the concurrency, design two processor groups PG-1 and PG-2,

    Thread Controlling Processor Group (PG-1): This is the core controller that manages the number of concurrent threads. It decides how many concurrent threads can run within the processor group PG-2.

    Actual Processor Group (PG-2): This is the processor group that performs the functionality that we want to parallelize. For example, it can be a cleansing/transformation operation that runs on all input tables.

    Mechanism to control the concurrency

    Code base (XML file) for this NiFi template is available in GitHub — https://github.com/karthikeyan9475/Control-Number-of-Threads-In-Processor-Group-Level-in-NiFi

    How does this Work?

    As mentioned, this functionality is achieved using Wait and Notify NiFi processors. PG-1 controls the number of flow files that get into PG-2 using Gates (Wait processor). This is nothing but signals created by Notify. In this NiFi template, there are three Gates and you will see how they work below.

    PG-1 triggers 3 flow files via the input port, and each of them performs certain actions.

    1st flow file: It triggers the Notify processor to open Gate-1 (a Wait processor), allows 5 input flow files through (configurable), and then triggers the Notify processor to close Gate-1.

    2nd flow file: It sends 5 input flow files to Gate-1, which was opened in the previous step, and they proceed to PG-2.

    3rd flow file: It sends any remaining input flow files to Gate-2 (a Wait processor), which acts as a queue. Gate-2 gets notified whenever any of the previously triggered flow files completes its processing in PG-2, and it releases the remaining input files one at a time for each Notify signal it receives.

    Initially, 5 flow files will be released from Gate-1, leaving 35 flow files queued in Gate-2. When one or more flow files complete, a notification signal is sent to Gate-2, and each Notify signal releases one flow file from Gate-2. This way, only 5 concurrent threads are ever running within PG-2.

    Screenshot of NiFi Flow that Shows Controls for Concurrency:

    Complete flow to control the threads inside the actual processor group

    Screenshot that Shows Concurrent Threads on Top of PG-2:

    Concurrency in the actual processor group

    Handling Signals Created by Last Few Inputs

    Once the last few input flow files (36, 37, 38, 39, and 40 in the above example) get processed by PG-2, they still trigger signals to Gate-2 (which is a queue) even though there are no further input files to be processed. These additional signals can lead to additional runs within PG-2 when a new set of inputs arrives. This is avoided using a special Neutralization gate that bypasses all these additional flow files and prevents them from getting into Gate-2.

    Enhancing this Solution to Multiple Applications

    The above example addressed just one requirement, i.e., processing the various input tables that are received concurrently. What if, at an enterprise level, this cleansing process starts being recognized and all source tables from various applications are required to be processed in this processor group PG-2?

    Let us say, App 1 is for processing sales tables and App 2 is for processing finance tables. How do we achieve 5 concurrent threads maintained within PG-2 for each application?

    All one would need is a small modification to the Wait/Notify Release Signal Identifier: make the signal names attribute-driven instead of hardcoded. The property supports the NiFi Expression Language, so by making the Release Signal Identifier attribute-driven, one can control threads within a processor group at an application-level granularity.

    Attribute driven signal identifier in Wait Processor Properties

    By default, the Expiration Duration is 10 minutes, which means queued flow files will wait for a Notify signal for up to 10 minutes before they expire. The Expiration Duration should be configured appropriately to avoid unwanted flow file expiration.

    Attribute-driven signal identifier in Notify Processor Properties

    Debug vs Actual Execution Mode:

    During development, there will be situations where signal files are simulated using GenerateFlowFile processors, and that can leave orphan signals waiting around when there are no input flow files. In a subsequent run, input flow files would then get processed using orphan signals left over from the debugging stage.

    Debugging

    Debugging

    In the above scenario, if any of the flow files go into the success queue, there are unused signals from the Notify processor that were mistakenly stored in the Wait processor during development or testing. In the above image, it is clear that 18 Notify signals were triggered by mistake. To make sure the Wait processor is working as expected, all the simulated flow files should go to the wait queue, not to the success queue. If this debugging step is skipped, it may lead to running a higher number of parallel threads inside the actual processor group than expected.

    Conclusion

    By using Wait/Notify processors, the number of concurrent threads can be controlled inside the Processor Group in NiFi. This will help one to build a complex data flow pipeline with controlled concurrent threads. As you will certainly know by now, Apache NiFi is a powerful tool for end-to-end data management. It can be used easily and quickly to build advanced applications with flexible architectures and advanced features.

    You can read about real-time NiFi Alerts and log processing here.

    The post Managing Parallel Threads: Techniques with Apache NiFi appeared first on Tiger Analytics.

    ]]>