Databricks Archives - Tiger Analytics

A Complete Guide to Enabling SAP Data Analytics on Azure Databricks

Uncover how SAP data analytics on Azure Databricks empowers organizations by optimizing data processing and analysis and offering a scalable solution for efficient decision-making.


Effectively enabling SAP data analytics on Azure Databricks empowers organizations with a powerful and scalable platform that seamlessly integrates with their existing SAP systems. They can efficiently process and analyze vast amounts of data, enabling faster insights and data-driven decision-making.

So, if you’re a senior technical decision-maker in your organization, choosing the proper strategy to consume and process SAP data with Azure Databricks is critical.

First, let’s explore the types of SAP data and objects from the source system. Then, let’s see how SAP BW data can be accessed and made available to Azure Databricks for further processing, analytics, and storage in a Databricks Lakehouse.

Why Enable SAP Data Analytics on Azure Databricks

While SAP provides its own data warehouse (SAP BW), there is still value in extracting operational SAP data from the SAP sources (S4/HANA or ECC) or from SAP BW into an Azure Databricks Lakehouse – integrating it with other ingested data that may not originate from SAP.

As a practical example, let’s consider a large and globally operating manufacturer that utilizes SAP for its production planning and supply chain management. It must integrate IoT machinery telemetry data stored outside the SAP system with supply chain data in SAP. This is a common use case, and we will show how to design a data integration solution for these challenges.

We will start with a more common pattern to access and ingest SAP BW data into the Lakehouse by using Azure Databricks and Azure Data Factory/Azure Synapse Pipelines.

This first part will also lay important foundations to better illustrate the material presented. For example, imagine a company like the manufacturer mentioned earlier, currently operating an SAP system on-premises with an SAP BW system. This company needs to copy some of its SAP BW objects to a Databricks Lakehouse in an Azure ADLS Gen2 account and use Databricks to process, model, and analyze the data.

This scenario of importing from SAP BW will likely raise many questions, but the main ones we would like to focus on are:

  • How should we copy this data from the SAP BW to the Lakehouse?
  • What services and connectivity are going to be required?
  • What connector options are available to us?
  • Can we use a reusable pattern that is easy to implement for similar needs?

The reality is that not every company has the same goals, needs, and skillsets, so there isn’t one solution that fits all needs and customers.

SAP Data Connection and Extraction: What to Consider

There are many ways, tools, and approaches to connect to and extract SAP data – pulling or pushing it from its different layers (database tables, the ABAP layer, CDS Views, IDocs, etc.) into the Databricks Lakehouse. For instance, we can use SAP tooling such as SAP SLT, other SAP tooling, or third-party tooling.

It makes sense for customers already in Azure to use the different sets of connectors provided in Azure Data Factory/ Azure Synapse to connect from Azure to the respective SAP BW instance running on-premises or in some other data center.

Depending on the different ways and tools of connecting, SAP licensing also plays an important role and needs to be considered. Also, we highly recommend an incremental stage and load approach, especially for large tables or tables with many updates.

It may be obvious, but in our experience, each organization has a unique set of needs and requirements when it comes to accessing SAP data, and an approach that works for one organization may not be a good fit for another given its needs, requirements, and available skills. Some important factors to ensure smooth operations are licensing, table size, acceptable ingestion latency, expected operational cost, ExpressRoute throughput, and the size and number of nodes of the integration runtime.

Let us look at the available connectors in ADF/ Synapse and when to use each.

How to Use the Right SAP Connector for Databricks and ADF/ Synapse


As of May 2023, ADF and Synapse provide seven different types of connectors for SAP-type sources. The image below shows all of them, filtered by the term “SAP” in the linked services dialog in the pipeline designer.

Different types of connectors for SAP-type sources

Now, we would like to extract from SAP BW, and we can use options 1, 2, 3, and 7 for that task. As mentioned earlier, for larger tables or tables with many updates, we recommend options 1, 7, or 3. The SAP CDC connector (option 3) is still in public preview, and we recommend not using it in production until it is labeled “generally available.” However, the SAP CDC connector can be utilized in lower environments (Dev, QA, etc.). Option 2 will not be fast enough and will likely take too much time to move the data.

There is certainly more to understand about these ADF connectors, about approaches that connect Azure Databricks directly to the relevant SAP sources, and about data-push approaches initiated by SAP tooling such as SAP SLT. For now, though, the pattern introduced here – using connectors in ADF and Synapse – is very common and reliable.

SAP ODP: What You Should Know

So far, only the SAP CDC connector from above is fully integrated with SAP ODP (Operational Data Provisioning). As of May 2023, we may still use the other connectors, given that they are stable and have been in production for years. However, we recommend gradually planning for more use of ODP-based connectors, especially for greenfield developments or projects about to start within the next few months. So let us look more closely at SAP ODP and its place in the process.

SAP ODP
Image 2: SAP ODP Role and Architecture

As shown in Image 2, an SAP ODP-based connector acts as a broker between the different SAP data providers and SAP data consumers. An SAP data consumer like ADF, for instance, does not connect directly to an SAP data provider when connected via the SAP ODP connector. Instead, it connects to the SAP ODP service and sends it a request for data. The service then connects to the respective data provider to serve the requested data to the data consumer.

Depending on the configuration of the connection and the respective data provider, data will be served as an incremental or full load and stored in ODP internal queues until a consumer like ADF can fetch it from the queues and write it back out to the Databricks Lakehouse.

SAP ODP-based connectors are the newer approach to accessing SAP data, and as of May 2023, the SAP ODP-based CDC connector is in public preview. SAP ODP isn’t new within the SAP world and has been used internally by SAP for data migration for many years.

One major advantage is that the SAP ODP CDC connector provides a delta mechanism built into ODP, so developers no longer have to maintain their own watermarking logic when using this connector. But, given that it isn’t yet generally available, it should be applied with care and, at this stage, possibly just planned for.

Obviously, we also recommend testing the watermarking mechanism to ensure it fits your specific scenarios.

How to Put It All Together with Azure Databricks

Now, we are ready to connect all the parts and initially land the data into a landing zone within the Lakehouse. This process is shown in Image 3 below. Before that step, the SAP operational data has already been ingested from its applications and backend databases (HANA DB, SQL Server, or Oracle Server) into SAP BW.

Later, we will also look at these SAP BW ingestion jobs and how to migrate their logic to Azure Databricks to be able to refactor them and apply these transformations on top of the source tables imported from SAP ECC or S4/ Hana to the Databricks Lakehouse.

We need to set up an Integration Runtime for ADF to connect to SAP BW via the SAP BW Open Hub; ADF requires this service to connect to SAP BW. We also need to install the proper SAP drivers on the Integration Runtime machines. Typically, in a production environment, the Integration Runtime is installed on separate machines – hosted in Azure or on-premises – and uses at least two nodes for higher levels of availability.

Up to four nodes are possible in a cluster – an important consideration when setting up for large table loads and any performance SLAs.

Multiple integration runtimes will be required for organizations with very large SAP deployments and SAP data ingestion needs. We highly recommend careful capacity and workload planning to ensure the clusters are properly sized and that the connectivity between Azure and on-premises is sufficient, assuming SAP BW runs on-premises.

From the landing zone, which is just a copy of the original data with typically a batch ID added to the flow, we can utilize the power of Databricks Delta Lake and Databricks Notebooks containing code in one of many available languages (Spark SQL, Python, and Scala are the most common choices) to further process and model the data from the landing zone into the next zones – bronze, silver, and finally the gold layer.
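To make this concrete, here is a minimal PySpark sketch of the landing-to-bronze step – an illustration rather than a production framework. The storage paths, table name, and batch_id placeholder below are hypothetical, and the snippet assumes it runs in a Databricks notebook where spark is already available.

from pyspark.sql import functions as F

# Hypothetical placeholders for illustration only
landing_path = "abfss://landing@<storage_account>.dfs.core.windows.net/sap_bw/sales_orders"
bronze_table = "bronze.sap_bw_sales_orders"
batch_id = "<batch_id_passed_in_by_the_pipeline>"

# Read the files copied by ADF/Synapse into the landing zone
df = spark.read.format("parquet").load(landing_path)

# Append to the bronze Delta table, stamping each row with the batch ID
(df.withColumn("batch_id", F.lit(batch_id))
   .withColumn("ingest_ts", F.current_timestamp())
   .write.format("delta").mode("append")
   .saveAsTable(bronze_table))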

Please note that the flow described above is a slightly simplified view. Typically, the data ingestion is part of an overall data processing framework that provides metadata information on the tables and sources to be ingested and the type of ingestion desired, such as full or incremental. These implementation details aren’t in scope here, but please keep them in mind.
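As a rough illustration of what such framework metadata might look like – the table and column names below are purely hypothetical – a driver notebook could read a small configuration table and decide per object whether to run a full or incremental load:

from pyspark.sql import Row

# Hypothetical ingestion metadata maintained by the framework
ingestion_config = [
    Row(source_object="ZSALES_HDR", target_table="bronze.sap_sales_hdr",
        load_type="incremental", watermark_column="CHANGED_ON"),
    Row(source_object="ZPLANT", target_table="bronze.sap_plant",
        load_type="full", watermark_column=None),
]

spark.createDataFrame(ingestion_config).write.format("delta") \
    .mode("overwrite").saveAsTable("ops.ingestion_config")

# A driver notebook (or the orchestrator) can loop over the config and branch per load type
for row in spark.table("ops.ingestion_config").collect():
    if row["load_type"] == "incremental":
        print(f"Load {row['source_object']} incrementally using {row['watermark_column']}")
    else:
        print(f"Full reload of {row['source_object']}")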

For secure deployment in Azure, every Azure Service part of the solution should be configured to use private endpoints, and no traffic would ever traverse the public Internet. All the services must be correctly configured since this private endpoint configuration is not the default setup with PaaS Services in Azure.

Complete Dataflow from SAP BW to the Azure Lakehouse
Image 3: Complete Dataflow from SAP BW to the Azure Lakehouse

We recommend integrating data test steps typically executed before the data lands in the gold layer. Tiger’s Data Processing Framework (Tiger Data Fabric) provides a semi-automated process to utilize AI capabilities from Databricks to mature data quality through interaction with a data subject matter expert.
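As a simple, generic illustration of such a test step (this is not the Tiger Data Fabric itself, and the table and column names are hypothetical), a notebook could assert a few expectations on the silver data before promoting it:

# Minimal data quality gate before promotion to gold (illustrative only)
silver_df = spark.table("silver.sap_sales")

row_count = silver_df.count()
null_keys = silver_df.filter("order_id IS NULL").count()

if row_count == 0 or null_keys > 0:
    raise ValueError(f"Quality check failed: rows={row_count}, null order_id={null_keys}")

# Only when the checks pass does the data move on to the gold layer
(silver_df.groupBy("order_date").sum("net_amount")
    .write.format("delta").mode("overwrite")
    .saveAsTable("gold.daily_sales"))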

Typically, the Databricks Lakehouse has a clearly defined folder structure within the different layers, and plenty of documentation is available on how to set this up. At a higher level, the bronze and silver layers are typically organized by source systems or areas, in contrast to the gold layer, which is organized toward consumption aspects and models the data by expected downstream consumption needs.

Why Use Azure Databricks

1. Smooth integration of SAP data with data originating from other sources.

2. Databricks’ complete arsenal of advanced ML algorithms and strong data processing capabilities compared to the limited availability of advanced analytics in SAP BW.

3. Highly customizable data processing pipelines – easier to enrich the data from bronze and silver to the gold layer in the Lakehouse.

4. Significantly lower total cost of ownership.

5. Easy-to-scale for large data volume processing.

6. Industry-standard and mature for large-scale batch and low-latency event processing.

Undoubtedly, effectively leveraging Azure Databricks ensures that organizations can harness the power of SAP data analytics. We hope this article provided some insights on how you can enable SAP data analytics on Azure Databricks and integrate it with your SAP systems. 

After all, this integration can empower your organization to work with a robust and scalable platform that allows for efficient processing and analysis of large data volumes. 

Harmonizing Azure Databricks and Azure Synapse for Enhanced Analytics

Explore integrating Azure Databricks and Azure Synapse for advanced analytics. This guide covers selecting Azure services, unifying databases into a Lakehouse, large-scale data processing, and orchestrating ML training. Discover orchestrating pipelines and securely sharing business data for flexible, maintainable solutions.

Azure Databricks and Azure Synapse are powerful analytical services that complement each other. But choosing the best-fit analytical Azure services can be a make-or-break moment. When done right, it ensures fulfilling end-user experiences while balancing maintainability, cost-effectiveness, security, etc.

So, let’s find out the considerations to pick the right one so that it helps deliver a complete solution to bridge some serious, prevalent gaps in the world of data analytics.

Delivering Tomorrow’s Analytical Needs Today

Nowadays, organizations’ analytical needs are vast and quite demanding in many aspects. For organizations of any size, it is vital to invest in platforms that deliver to these needs and are open enough, secure, cost-effective, and extendible. Some of these needs may include complex, time-consuming tasks such as:

  • Integrating up to 500 single databases representing different source systems located in different domains and clouds into a single location, a Lakehouse, for instance.
  • Processing terabytes or petabytes of data in batches and more frequently in near real-time.
  • Training Machine Learning models on large datasets.
  • Quickly performing explorative data analysis with environments provisioned on the fly.
  • Quickly visualizing some business-related data and easily sharing it with a consumer community.
  • Executing and monitoring thousands of pipelines daily in a scalable and cost-effective manner.
  • Providing self-service capabilities around data governance.
  • Self-optimizing queries over time.
  • Integrating a comprehensive data quality processing layer into the Lakehouse processing.
  • Easily sharing critical business data securely with peers or partner employees.

Why Azure Databricks Is Great for Enterprise-Grade Data Solutions

According to Microsoft Docs, Azure Databricks is defined as a unified set of tools for building, deploying, sharing, and maintaining enterprise-grade data solutions at scale. The Azure Databricks Lakehouse Platform integrates cloud storage and security in your cloud account and manages and deploys cloud infrastructure on your behalf.

From a developer’s point of view, Azure Databricks makes it easy to write Python, Scala, and SQL code and execute this code on a cluster to process the data – with many different features. We recommend reviewing the “used for” section of the above link for further details.

Azure Databricks originates from Apache Spark but has many specific optimizations that the open-source version of Spark doesn’t provide. For example, the Photon engine can speed up processing by up to 30 % without code optimization or refactoring.

Initially, Azure Databricks was more geared toward data scientists and ML workloads. However, over time, Databricks added data engineering and general data analytics capabilities to the platform. It provides metadata management via Unity Catalog, which is part of the Databricks platform. Azure Databricks also provides a data-sharing feature that allows secure data sharing across company boundaries.

Azure Databricks is extremely effective for ML processing, with an enormous number of ML libraries built in, as well as for Data Engineering/Lakehouse processing. Languages such as Python, Scala, and SQL are popular among data professionals, and the platform provides many APIs to interact with data and process it into any desired output shape.

Azure Databricks provides Delta Live Tables for developers to generate ingestion and processing pipelines with significantly lower effort. So, it is a major platform bound to see wider adoption as an integral part of any large-scale analytical platform.

How Azure Synapse Speeds Up Data-Driven Time-To-Insights 

According to Microsoft Docs, Azure Synapse is defined as an enterprise analytics service that accelerates time-to-insight across data warehouses and big data systems. Azure Synapse brings together the following:

  • The best SQL technologies used in enterprise data warehousing.
  • Spark technologies used for big data.
  • Data Explorer for log and time series analytics.
  • Pipelines for data integration and ETL/ELT.
  • Deep integration with other Azure services such as Power BI, Cosmos DB & Azure ML.

Azure Synapse integrates several independent services like Azure Data Factory, SQL DW, Power BI, and others under one roof, called Synapse Studio. From a developer’s point of view, Azure Synapse Studio provides the means to write Synapse Pipelines and SQL scripts and execute this code on a cluster to process the data. It also easily integrates many other Azure Services into the development process.

Due to its deep integration with Azure, Azure Synapse effortlessly allows using other related Azure Services, such as Azure Cognitive Services and Cosmos DB. Architecturally, this is important since easy integration of capabilities is a critical criterion when considering platforms.

Azure Synapse shines in the areas of data and security integration. If existing workloads already use many other related Azure Services like Azure SQL, then integration is likely easier than other solutions. Synapse Pipelines can also act as an orchestration layer to invoke other compute solutions within Synapse or Azure Databricks.

This integration from Synapse Pipelines to invoke Databricks Notebooks will be a key area to review further in the next section.

It is vital to note that an integration runtime is required for Synapse Pipelines to access on-premises resources or resources behind a firewall. This integration runtime acts as an agent – enabling pipelines to access the data and copy it to a destination defined in the pipeline.

Azure Databricks and Azure Synapse: Better Together

As mentioned earlier (and shown in Image 1), Databricks Notebooks and the code they contain (Spark SQL, Python, or Scala) can be invoked through ADF/Synapse Pipelines and therefore orchestrated. This is where Databricks and Synapse work together well. Image 1 shows what a Synapse Pipeline that moves data from Bronze to Silver looks like.

When completed, it continues to process the data into the gold layer. This is just a basic pattern, and many more patterns can be implemented to increase the reuse and flexibility of the pipeline.

Synapse Pipeline invoking Databricks Notebooks
Image 1: Synapse Pipeline invoking Databricks Notebooks

For instance, we can use parameters supplied from a configuration database (Azure SQL or similar) and have Synapse Pipelines pass the parameters to the respective Databricks Notebooks. This enables parameterized execution of Notebooks – allowing for code reuse and reducing the time required to implement the solution.

Furthermore, the configuration database can supply source system connections and destinations such as databases or Databricks Lakehouses at runtime.
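For illustration, a minimal sketch of the Databricks side of this pattern is shown below. The widget names are hypothetical, and the assumption is that the Synapse/ADF Notebook activity passes them as base parameters at runtime.

# Inside the Databricks notebook invoked by the Synapse/ADF pipeline
dbutils.widgets.text("source_system", "")
dbutils.widgets.text("target_table", "")
dbutils.widgets.text("load_type", "full")

source_system = dbutils.widgets.get("source_system")
target_table = dbutils.widgets.get("target_table")
load_type = dbutils.widgets.get("load_type")

# ...ingestion/processing logic driven by these parameters...

# Return a status string that the pipeline activity can inspect
dbutils.notebook.exit(f"Loaded {target_table} from {source_system} ({load_type})")

The same notebook can then be reused across many tables simply by varying the parameters supplied from the configuration database.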

It is also possible to break down a large pipeline into multiple pieces and work with a Parent-Child pattern. The complete pipeline could, for example, consist of several Parent-Child patterns, one for each layer. Defining these structures at the beginning of the implementation is vital to a maintainable and cost-effective system in the long run. Further abstractions can be added to increase code reuse and to integrate a structured and effective testing framework.
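One possible way to sketch the Parent-Child pattern on the Databricks side (the notebook paths and parameters below are hypothetical) is a parent notebook that invokes one child notebook per layer:

# Parent notebook orchestrating one child notebook per layer
layers = [
    ("/pipelines/bronze_to_silver", {"source_system": "sap_bw"}),
    ("/pipelines/silver_to_gold", {"source_system": "sap_bw"}),
]

for notebook_path, params in layers:
    # Runs the child notebook and waits for its dbutils.notebook.exit() value
    result = dbutils.notebook.run(notebook_path, 3600, params)
    print(f"{notebook_path} finished with: {result}")

The same split can also be expressed purely in Synapse Pipelines, with a parent pipeline executing child pipelines per layer.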

While it is an additional effort to set up both Azure services (Databricks and Synapse), we recommend it as a good investment – especially for larger-scale analytical projects dealing with DE or ML-based workloads.

It also gives technical implementation teams options regarding the tooling and language they prefer for a given task, which typically has a positive impact on the timeline while reducing implementation risks.

Final Thoughts

You can easily take the ideas and concepts described here to build a metadata-driven data ingestion system based on Azure Databricks and Synapse.

These concepts can also be applied to ML workloads using Databricks with MLflow, Azure Synapse, and Azure ML.

Also, integrating the Databricks Lakehouse and Unity Catalog is another crucial consideration in designing these solutions.

We hope this article gave some necessary insights on the power of Azure Databricks and Azure Synapse – and how they can be used to deliver modularized, flexible, and maintainable data ingestion and processing solutions.

A Comprehensive Guide: Optimizing Azure Databricks Operations with Unity Catalog

Learn how Unity Catalog in Azure Databricks simplifies data management, enabling centralized metadata control, streamlined access management, and enhanced data governance for optimized operations.

For data engineers and admins, making sure that all operations run smoothly is a priority.

That’s where Unity Catalog can help ensure that stored information is managed correctly, especially for those working with Azure Databricks. Unity Catalog (UC) is a powerful metadata management system built into the Databricks Lakehouse platform. It provides a centralized location to help users manage the metadata on the data stored in Delta Lake. It also helps simplify data management by providing a unified view of data across different data sources and formats.

Before Unity Catalog, every ADB (Azure Databricks) workspace had its own metastore, user management, and access controls, which led to duplication of efforts when maintaining consistency across all workspaces. To overcome these challenges, Databricks developed Unity Catalog, a unified governance solution for data and AI assets on the Lakehouse. Unity Catalog provides centralized access control, auditing, lineage, and data discovery capabilities across Azure Databricks workspaces.

At Tiger Analytics, we’ve worked to enable Unity Catalog for clients with new Databricks deployments and to upgrade existing Hive metastores to Unity Catalog so they can leverage all the benefits Unity Catalog provides.

Databricks Unity Catalog

Making the Most of Unity Catalog’s Key Features

Centralized metadata and user management

Unity Catalog provides a centralized metadata layer to enable sharing data objects such as catalogs/ schema/ tables across multiple workspaces. It introduces two new built-in admin roles (Account Admins and Metastore Admins) to manage key features. 

  • Account Admin: manages account-level resources like metastore, assigns metastore to workspaces, and assigns principals to the workspace.
  • Metastore Admin: manages metastore objects and grants identities access to securable objects (catalog/ schema/ tables/ views).

Centralized data access controls
Unity Catalog permits the use of Standard SQL-based commands to provide access to data objects.

GRANT USE CATALOG ON CATALOG <catalog_name> TO <group_name>;

GRANT USE SCHEMA ON SCHEMA <catalog_name>.<schema_name> TO <group_name>;

GRANT SELECT ON <catalog_name>.<schema_name>.<table_name> TO <group_name>;

Data lineage and data access auditing

Unity Catalog automatically captures user-level audit logs that record access to user data. It also captures lineage data that tracks how data assets are created and used across all languages and personas.

Data search and discovery

Unity Catalog lets you tag and document data assets and provides a search interface to help data consumers find data.

Delta Sharing

Unity Catalog allows Databricks users to share data securely outside the organization, and this sharing can be managed, governed, audited, and tracked.
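As a hedged illustration of how this looks in practice (the share, recipient, and table names below are hypothetical, and the statements assume a metastore admin running them from a notebook), Delta Sharing objects can be created with SQL:

# Create a share, add a table to it, and grant it to an external recipient
spark.sql("CREATE SHARE IF NOT EXISTS partner_sales_share")
spark.sql("ALTER SHARE partner_sales_share ADD TABLE main.sales.orders")
spark.sql("CREATE RECIPIENT IF NOT EXISTS partner_corp")
spark.sql("GRANT SELECT ON SHARE partner_sales_share TO RECIPIENT partner_corp")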

Unity Catalog Delta Sharing

Unity Catalog Metastore

Managing Users and Access Control

  • Account admins can sync users and groups from the Azure Active Directory (Azure AD) tenant to the Azure Databricks account, and assign them to workspaces, using a SCIM provisioning connector.
  • Azure Databricks recommends using account-level SCIM provisioning to create, update, and delete all users (groups) from the account.

Unity Catalog Managing Users and Access Control

Unity Catalog Objects


Metastore 

A metastore is the top-level container of objects in Unity Catalog. It stores data assets (tables and views) and the permissions that govern access to them. The UC metastore is mapped to an ADLS container; this container stores the Unity Catalog metastore’s metadata and managed tables. You can only create one UC metastore per region, and each workspace can only be attached to one UC metastore at any point in time. Unity Catalog has a 3-tier structure (catalog.schema.table/view) for referencing objects.
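To illustrate the 3-tier namespace (the catalog, schema, and table names below are hypothetical), objects can be created and referenced with fully qualified names:

# Create a catalog, a schema, and a managed table, then query it by its full name
spark.sql("CREATE CATALOG IF NOT EXISTS sales_dev")
spark.sql("CREATE SCHEMA IF NOT EXISTS sales_dev.crm")
spark.sql("CREATE TABLE IF NOT EXISTS sales_dev.crm.customers (id INT, name STRING)")

spark.sql("SELECT * FROM sales_dev.crm.customers").show()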

External Location and Storage Credential

  • A storage credential, created either as a managed identity or a service principal, provides access to the underlying ADLS path.
  • The storage credential (managed identity/service principal) should be authorized to the external storage account location by assigning an IAM role at the storage account level.
  • An external location is an object that combines a cloud storage path with a storage credential to authorize access to that cloud storage path (see the sketch below the image).
  • Each cloud storage path can be associated with only one external location. If you attempt to create a second external location that references the same path, the command fails.

External Location and Storage Credential
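A minimal sketch of these objects is shown below. The names, storage account, and group are hypothetical, and the snippet assumes the storage credential (here a managed identity named sap_landing_mi) has already been registered in Unity Catalog.

# Define an external location on top of an existing storage credential
spark.sql("""
  CREATE EXTERNAL LOCATION IF NOT EXISTS sap_landing
  URL 'abfss://landing@<storage_account>.dfs.core.windows.net/sap'
  WITH (STORAGE CREDENTIAL sap_landing_mi)
""")

# Allow a group to read files under that location
spark.sql("GRANT READ FILES ON EXTERNAL LOCATION sap_landing TO `data_engineers`")

# An external table whose files live under that path; dropping it removes only metadata
spark.sql("""
  CREATE TABLE IF NOT EXISTS main.sap.orders_ext (order_id INT, amount DOUBLE)
  USING DELTA
  LOCATION 'abfss://landing@<storage_account>.dfs.core.windows.net/sap/orders'
""")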

Managed and External Tables

  • Unity Catalog manages the lifecycle of managed tables. This means that if you drop managed tables, both metadata and data are dropped.
  • By default, UC metastore ADLS container (the root storage location) will store the managed tables’ data as well, but you can override this default location at the catalog or schema level. Managed tables are in Delta format only.
  • External tables are tables whose data is stored outside of the managed storage location specified for the metastore, catalog, or schema. Dropping them will only delete the metadata of the table.

How to Create a UC Metastore and Link Workspaces

This flow diagram explains the sub-tasks needed to create a Metastore.

How to Create a UC Metastore and Link Workspaces

Step 1: Create an ADLS storage account and container

This Storage account container will store Unity Catalog metastore’s metadata and managed tables.

Step 2: Create an access connector for Databricks 

Create Access Connector for Azure Databricks, and when deployment is done, make a note of Resource ID.

Step 3: Provide RBAC to access the connector

Add the role assignment Storage Blob Data Contributor to the managed identity (access connector) created in Step 2.

Step 4: Create metastore and assign workspaces

Unity Catalog Metastore

Once a UC metastore has been attached to a workspace, it will be visible under the workspace Data tab:

Unity Catalog Workspace

If Unity Catalog is enabled for an existing workspace that had tables stored under the hive_metastore catalog, those existing tables can be upgraded using the SYNC command or the UI, or they can still be accessed using hive_metastore.<schema_name>.<table_name>.
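A hedged example of the SYNC path is shown below; the catalog, schema, and table names are placeholders, and SYNC is typically used for upgrading external Hive metastore tables.

# Upgrade a single Hive metastore table into Unity Catalog (names are placeholders)
spark.sql("SYNC TABLE main.sales.orders FROM hive_metastore.sales.orders").show(truncate=False)

# Preview upgrading a whole schema before actually doing it
spark.sql("SYNC SCHEMA main.sales FROM hive_metastore.sales DRY RUN").show(truncate=False)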

Enabling Unity Catalog as part of a Lakehouse architecture provides a centralized metadata layer for enterprise-level governance without sacrificing the ability to manage and share data effectively. It also helps in planning workspace deployments with platform limits in mind, eliminating the risk of not being able to share data and govern the project. With Unity Catalog, we can overcome the limitations and constraints of the existing Hive metastore, enabling better collaboration and leveraging the power of data according to specific business needs.

Enabling Cross Platform Data Observability in Lakehouse Environment

Dive into data observability and its pivotal role in enterprise data ecosystems. Explore its implementation in a Lakehouse environment using Azure Databricks and Purview, and discover how this integration fosters seamless data management, enriched data lineage, and quality monitoring, empowering informed decision-making and optimized data utilization.

Imagine a world where organizations effortlessly unlock their data ecosystem’s full potential as data lineage, cataloging, and quality seamlessly flow across platforms. As we rely more and more on data, the technology for uncovering valuable insights has grown increasingly nuanced and complex. While we’ve made significant progress in collecting, storing, aggregating, and visualizing data to meet the needs of modern data teams, one crucial factor defines the success of enterprise-level data platforms — Data observability.

Data observability is often conflated with data monitoring, and it’s easy to see why. The two concepts are interconnected, blurring the lines between them. However, data monitoring is the first step towards achieving true observability; it acts as a subset of observability.

Some of the industry-level challenges are:

  • The proliferation of data sources with varying tools and technologies involved in a typical data pipeline diminishes the visibility of the health of IT applications.
  • Data is consumed in various forms, making it harder for data owners to understand the data lineage.
  • The complexity of debugging pipeline failures poses major hurdles with a multi-cloud data services infrastructure.
  • Nonlinearity between creation, curation, and usage of data makes data lineage tough to track.

What is Data Observability?

To grasp the concept of data observability, the first step is to understand what it entails. Data observability focuses on examining the health of enterprise data environments by focusing on:

  • Design Lineage: Providing contextual data observations such as the job’s name, code location, version from Git, environment (Dev/QA/Prod), and data source metadata like location and schema.
  • Operational Lineage: Generating synchronous data observations by computing metrics like size, null values, min/max, cardinality, and more custom measures like skew, correlation, and data quality validation. It also includes usage attributes such as infrastructure and resource information.
  • Tracing and Continuous Validation: Generating data observations with continuously validated data points and sources for efficient tracing. It involves business thresholds, the absence of skewed categories, input and output data tracking, lineage, and event tracing.

Implementing Data Observability in Your Lakehouse Environment

Benefits of Implementation

  • Capturing critical metadata: Observability solutions capture essential design, operational, and runtime metadata, including data quality assertions.
  • Seamless integration: Technical metadata, such as pipeline jobs, runs, datasets, and quality assertions, can seamlessly integrate into your enterprise data governance tool.
  • End-to-end data lineage: Gain insights into the versions of pipelines, datasets, and more by establishing comprehensive data lineage across various cloud services.

Essential Elements of Data Observability Events

  • Job details: Name, owner, version, description, input dependencies, and output artifacts.
  • Run information: Immutable version of the job, event type, code version, input dataset, and output dataset.
  • Dataset information: Name, owner, schema, version, description, data source, and current version.
  • Dataset versions: Immutable versions of datasets.
  • Quality facets: Data quality rules, results, and other relevant quality facets.

Implementation Process

A robust and holistic approach to data observability requires a centralized interface to the data. So, end-to-end data observability consists of implementing the four layers below in any Lakehouse environment.

  • Observability Agent: A listener setup that depends on the sources/data platform.
  • Data Reconciliation API: An endpoint for transforming the event to fit into the data model.
  • Metadata Repository: A data model created in a relational database.
  • Data Traceability Layer: A web-based interface or existing data governance tool.

data observability implementation process

By implementing these four layers and incorporating the essential elements of data observability, organizations can achieve improved visibility, traceability, and governance over their data in a Lakehouse environment.

The core data model for data observability prioritizes immutability and timely processing of datasets, which are treated as first-class values generated by job runs. Each job run is associated with a versioned code and produces one or more immutable versioned outputs. Changes to datasets are captured at various stages during job execution.
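For illustration, a simplified run event in the spirit of the OpenLineage specification (the field values below are hypothetical, and the payload is heavily trimmed compared to a real event) could be posted to the reconciliation API like this:

import json
import uuid
from datetime import datetime, timezone

import requests

run_event = {
    "eventType": "COMPLETE",
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "run": {"runId": str(uuid.uuid4())},
    "job": {"namespace": "adb-workspace", "name": "bronze_to_silver_sales"},
    "inputs": [{"namespace": "abfss://bronze", "name": "sap_sales"}],
    "outputs": [{"namespace": "abfss://silver", "name": "sales_curated"}],
    "producer": "https://github.com/OpenLineage/OpenLineage",
}

# Hypothetical endpoint exposed by the data reconciliation API
requests.post(
    "https://<api-host>:5000/api/v1/lineage",
    data=json.dumps(run_event),
    headers={"Content-Type": "application/json"},
)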

 

data model

data model structure

Technical Architecture: Observability in a Modern Platform

The technical architecture depicted below exemplifies the implementation of a data observability layer for a modern medallion-based data platform, enabling an enterprise-level platform to collect and correlate observability metrics.

data observability technical architecture

In data observability, this robust architecture effectively captures and analyzes critical metadata. Let’s dive into the technical components that make this architecture shine and understand their roles in the observability layer.

OpenLineage Agent

This observability agent bridges data sources, processing frameworks, and the observability layer. Its mission is to communicate seamlessly, ingesting custom facets to enhance event understanding. The OpenLineage agent’s compatibility with a wide range of data sources, processing frameworks, and orchestration tools makes it remarkable. In addition, it offers the flexibility needed to accommodate the diverse technological landscape of modern data environments.

OpenLineage API Server

As the conduit for custom events, the OpenLineage API Server allows ingesting these events into the metadata repository.

Metadata Repository

The metadata repository is at the heart of the observability layer. This data model, carefully crafted within a relational data store, captures essential information such as jobs, datasets, and runs.

Databricks

Azure Databricks offers a powerful data processing engine with various types of clusters. Setting up the OpenLineage agent in a Databricks cluster makes it possible to capture dataset lineage tracking events based on the data processing jobs triggered in the workspace.

Azure Data Factory

With its powerful data pipeline orchestration capabilities, Azure Data Factory (ADF) takes center stage. ADF enables the smooth flow of data pipeline orchestration events, sending them to the OpenLineage API and integrating with the observability layer to further enhance data lineage tracking.

Great Expectations

Quality is of paramount importance in any data-driven ecosystem. Great Expectations ensures that quality facets are seamlessly integrated into each dataset version. Also, by adding custom facets through the OpenLineage API, Great Expectations fortifies the observability layer with powerful data quality monitoring and validation capabilities.

EventHub

As the intricate events generated by the OpenLineage component must be seamlessly integrated into the Apache Atlas API from Purview, EventHub takes center stage. As an intermediate queue, EventHub diligently parses and prepares these events for further processing, ensuring smooth and efficient communication between the observability layer and Purview.

Function API

To facilitate this parsing and preparation process, Azure Functions come into play. Purpose-built functions are created to handle the OpenLineage events and transform them into Atlas-supported events. These functions ensure compatibility and coherence between the observability layer and Purview, enabling seamless data flow.
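As a rough sketch of what such a function might look like (this is not the accelerator’s actual code, and the mapping shown is drastically simplified), an Event Hub-triggered Python function could parse the OpenLineage event and shape an Atlas-style payload:

import json
import logging

import azure.functions as func


def main(event: func.EventHubEvent) -> None:
    # Parse the OpenLineage event forwarded through Event Hub
    ol_event = json.loads(event.get_body().decode("utf-8"))

    job_name = ol_event.get("job", {}).get("name")
    outputs = [ds.get("name") for ds in ol_event.get("outputs", [])]

    # Shape a simplified Atlas/Purview-style payload (the real mapping is far richer)
    atlas_payload = {
        "typeName": "process",
        "attributes": {"name": job_name, "outputs": outputs},
    }
    logging.info("Prepared Atlas payload: %s", json.dumps(atlas_payload))
    # ...submit atlas_payload to the Purview Atlas API here...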

Purview

Finally, we have Purview, the ultimate destination for all lineage and catalog events. Purview’s user interface becomes the go-to hub for tracking and monitoring the rich lineage and catalog events captured by the observability layer. With Purview, users can gain comprehensive insights, make informed decisions, and unlock the full potential of their data ecosystem.

Making Observability Effective on the ADB Platform

At Tiger Analytics, we’ve worked with a varied roster of clients, across sectors to help them achieve better data observability. So, we crafted an efficient solution that bridges the gap between Spark operations in Azure Databricks and Azure Purview. It transfers crucial observability events, enabling holistic data management. This helps organizations thrive with transparent, informed decisions and comprehensive data utilization.

The relationship is simple. Azure Purview and Azure Databricks complement each other. Azure Databricks offers powerful data processing and collaboration, while Azure Purview helps manage and govern data assets. Integrating them allows you to leverage Purview’s data cataloging capabilities to discover, understand, and access data assets within your Databricks workspace.

ADB platform

How did we implement the solution? Let’s dive in and find out.

Step 1: Setting up the Environment: We began by configuring the Azure Databricks environment, ensuring the right runtime version was in place. To capture observability events, we attached the OpenLineage jar to the cluster, laying a solid foundation for the journey ahead.

Step 2: Cluster Configuration: Smooth communication between Azure Databricks and Azure Purview was crucial. To achieve this, we configured the Spark settings at the cluster level, creating a bridge between the two platforms. By specifying the OpenLineage host, namespace, custom app name, version, and extra listeners, we solidified this connection.

Sample code snippet:

spark.openlineage.host https://<<<host-ip>>>:5000
spark.openlineage.namespace <<<namespace>>>
spark.app.name <<<custom-app-name>>>
spark.openlineage.version v1
spark.extraListeners io.openlineage.spark.agent.OpenLineageSparkListener

Step 3: Spark at Work: With Spark’s power, the OpenLineage listeners came into action, capturing the Spark logical plan. This provided us with a comprehensive view of data operations within the cluster.

Step 4: Enter the Service Account: This account, created using a service principal, took center stage in authenticating the Azure Functions app and Azure Purview. Armed with owner/contributor access, this service account became the key to a seamless connection.

Step 5: Azure Purview Unleashed: To unlock the full potential of Azure Purview, we created an Azure Purview service. Within Purview Studio, we assigned the roles of data curator, data source administrator, and collection admin to the service account. This granted the necessary permissions for a thrilling data management adventure.

Step 6: Seamless Deployment: Leveraging the deployment JSON provided in the OpenLineage GitHub repository, we embarked on a smooth AZ deployment. This process created essential services such as storage accounts, blob services, server farms, and websites, laying the foundation for a robust data lineage and cataloging experience –

Microsoft.Storage/storageAccounts
Microsoft.Storage/storageAccounts/blobServices/containers
Microsoft.Web/serverfarms
Microsoft.Web/sites
olToPurviewMappings
Microsoft.EventHub/namespaces
Microsoft.EventHub/namespaces/eventhubs
Microsoft.KeyVault/vaults
Microsoft.KeyVault/vaults/secrets

Step 7: Access Granted: An authentication token was added, granting seamless access to the Purview API. This opened the door to a treasure trove of data insights, empowering us to delve deeper into the observability journey.

Step 8: Spark and Azure Functions United: In the final step, we seamlessly integrated Azure Databricks with Azure Functions. By adding the Azure Function App URL and key to the Spark properties, a strong connection was established. This enabled the capture of observability events during Spark operations, effortlessly transferring them to Azure Purview, resulting in a highly effective data lineage.

Sample code snippet:

spark.openlineage.host https://<functions-app-name>.azurewebsites.net
spark.openlineage.url.param.code <function-app-host-key>

By following the steps outlined above, our team successfully provided a comprehensive and highly effective data lineage and observability solution. By linking parent and child job IDs between Azure Data Factory (ADF) and Databricks, this breakthrough solution enabled correlation among cross-platform data pipeline executions. As a result, the client could leverage accurate insights that flowed effortlessly. This empowered them to make informed decisions, ensure data quality, and unleash the true power of data.

ADB implementation

Extending OpenLineage Capabilities Across Data Components

Enterprises require data observability across multiple platforms. OpenLineage, a powerful data observability solution, offers out-of-the-box integration with various data sources and processing frameworks. However, what if you want to extend its capabilities to cover other data platform components? Let’s explore two simple methodologies to seamlessly integrate OpenLineage with additional data platforms, enabling comprehensive observability across your entire data ecosystem:

1. Custom event through OpenLineage API: Maintain a custom function to generate OpenLineage-supported JSON with observability events and trigger the function with required parameters wherever the event needs to be logged.

2. Leveraging API provided by target governance portal: Another option is to leverage the APIs provided by target governance portals. These portals often offer APIs for Data observability event consumption. By utilizing these APIs, you can extend OpenLineage’s solution to integrate with other data platforms. For example, Azure Purview has an API enabled with Apache Atlas for event ingestion. You can use Python packages such as PyApacheAtlas to create observability events in the format supported by the target API.

Data observability continues to be an important marker in evaluating the health of enterprise data environments. It provides organizations with a consolidated source of technical metadata, including data lineage, execution information, data quality attributes, dataset and schema changes generated by diverse data pipelines, and operational runtime metadata. This helps operational teams conduct precise root cause analysis (RCA).

As various data consumption tools are in demand, along with the increasing use of multi-cloud data platforms, the data observability layer should be platform-agnostic and effortlessly adapt to available data sources and computing frameworks.

Sources:

https://www.usgs.gov/faqs/what-are-differences-between-data-dataset-and-database

https://learn.microsoft.com/en-us/azure/purview/register-scan-azure-databricks 

https://learn.microsoft.com/en-us/samples/microsoft/purview-adb-lineage-solution-accelerator/azure-databricks-to-purview-lineage-connector/ 

https://learn.microsoft.com/en-us/azure/databricks/introduction/ 

https://www.linkedin.com/pulse/what-microsoft-azure-purview-peter-krolczyk/

Unleash the Full Potential of Data Processing: A Roadmap to Leveraging Databricks

Efficient data processing is vital for organizations in today’s data-driven landscape. Databricks Auto Loader, a data ingestion service, streamlines the complex data loading process, saving time and resources. Learn how Tiger Analytics used Databricks to manage massive file influx and enable near real-time processing, enhancing data quality and accelerating decision-making.

Scenario # 1:

Thousands of files flood the data lake every day. These files are dumped in parallel by the source system, resulting in a massive influx of data.

Scenario #2:

There’s a continuous influx of incremental data from a transactional table in SAP. Every 15 minutes, a massive file containing millions of records has to be extracted and sent to the data lake landing zone. This critical dataset is essential for the business, but the sheer volume of data and the complexity of the transactional system poses significant challenges.

How would you tackle these situations?

In today’s data-driven world, organizations heavily rely on efficient data processing to extract valuable insights. Streamlined data processing directly impacts decision-making – enabling them to unlock hidden patterns, optimize operations, and drive innovation. But often, as businesses keep growing, they are faced with the uphill task of managing data velocity, variety, and volume.

Can data ingestion services help simplify the data loading process, so that business teams can focus on analyzing the data rather than managing the intricate loading process? 

Leveraging Databricks to elevate data processing

The process of collecting, transforming, and loading data into a data lake can be complex and time-consuming. At Tiger Analytics, we’ve used Databricks Auto Loader across our clients and various use cases to make the data ingestion process hassle-free.

Here’s how we tackled the two problem statements for our clients:

Scenario 1: Multiple File Ingestion Based on Control File Trigger

Thousands of files flooded our client’s data lake daily. The source system would dump them in parallel, resulting in a massive influx of data. Then, to indicate the completion of the extraction process, the source system dropped a control file named ‘finish.ctrl’. The primary challenge was to trigger the ingestion process based on this control file and efficiently load all the files dropped by the source system.

The challenges:

  • Large number of files: The daily extract consisted of a staggering 10,000 to 20,000 text files, making manual processing impractical and time-consuming.
  • Volume of records: Each file contained hundreds of thousands of records, further complicating the data processing task.
  • Timely refresh of silver and gold layers: The business required their Business Intelligence (BI) reports to be refreshed within an hour, necessitating a streamlined and efficient data ingestion process.
  • Duplicate file loading: In cases where the extraction process failed at the source, the entire process would start again, resulting in the redundant loading of previously processed files.

How we effectively used Databricks to streamline the ingestion process:

We used Databricks Auto Loader to automate the detection and ingestion of thousands of files. These efforts brought increased efficiency, improved data quality, and accelerated data processing times – revolutionizing the client’s entire data ingestion process.

The implementation involved the following steps:

  1. Setting up a data factory orchestrator: The team leveraged Azure Data Factory as an orchestrator to trigger a Databricks notebook based on the event trigger. Specifically, they configured the event trigger to be activated when the source system dropped the control file ‘finish.ctrl’.
  2. Configuring the Auto Loader notebook: The team coded a Databricks notebook to run Auto Loader with the trigger-once option. This configuration ensured that the notebook would run once, ingest all the files into the bronze table, and then terminate automatically.

Sample code snippet:

df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", <file_format>)
    .schema(<schema>)
    .load(<source_path>))  # landing-zone directory to ingest from

query = (df.writeStream
    .format("delta")
    .trigger(once=True)
    .outputMode("append")
    .option("checkpointLocation", <CheckpointLocation>)
    .option("path", <path>)
    .table(<table_name>))

query.awaitTermination()

Business impact:

  • Increased efficiency: Manually processing thousands of files became a thing of the past. The client saved significant time and valuable resources by automating the data ingestion process.
  • Improved data quality: Ingesting data into the data lake using Databricks Delta Lake ensured enhanced data quality and consistency. This, in turn, mitigated the risk of data errors and inconsistencies.
  • Faster data processing: With the automation of data ingestion and improved data quality, the client could achieve lightning-fast file processing times. Files that previously took hours to process were now handled within minutes, empowering the team to make data-driven decisions swiftly.

Scenario 2: Streamlining the Data Ingestion Pipeline

Our client had to manage a continuous influx of incremental data from a transactional table in SAP. Every 15 minutes, a massive file containing millions of records had to be extracted and sent to the data lake landing zone. While this critical dataset was essential for the business, the sheer volume of data and the complexity of the transactional system posed huge challenges.

The challenges:

  • Managing a large volume of data: The transactional system generated millions of transactions per hour, resulting in an overwhelming volume of data that needed to be ingested, processed, and analyzed.
  • Ordered file processing: It was crucial to process the incremental files in the correct order to maintain data consistency and accuracy with the source system.
  • Near real-time data processing: Due to the critical nature of the data, the business required immediate ingestion of the files as soon as they arrived in the landing zone, enabling near real-time data processing.

Using Databricks to enable efficient incremental file processing

The team strategically decided to implement Databricks Auto Loader streaming. This feature allowed them to process new data files incrementally and effectively as they arrived in the cloud storage.

The implementation involved the following steps:

  1. Leveraging file notification and queue services: The team configured Auto Loader to use the file notification service and queue service, which subscribed to file events from the input directory. This setup ensured that new data files were promptly detected and processed.
  2. Custom role creation for the service principal: To enable the file notification service, the team created a custom role for the service principal. This role encompassed the necessary permissions to create the queue and event subscription required for seamless file notification.

Sample code snippet:

"permissions": [
  {
    "actions": [
      "Microsoft.EventGrid/eventSubscriptions/write",
      "Microsoft.EventGrid/eventSubscriptions/read",
      "Microsoft.EventGrid/eventSubscriptions/delete",
      "Microsoft.EventGrid/locations/eventSubscriptions/read",
      "Microsoft.Storage/storageAccounts/read",
      "Microsoft.Storage/storageAccounts/write",
      "Microsoft.Storage/storageAccounts/queueServices/read",
      "Microsoft.Storage/storageAccounts/queueServices/write",
      "Microsoft.Storage/storageAccounts/queueServices/queues/write",
      "Microsoft.Storage/storageAccounts/queueServices/queues/read",
      "Microsoft.Storage/storageAccounts/queueServices/queues/delete"
    ],
    "notActions": [],
    "dataActions": [
      "Microsoft.Storage/storageAccounts/queueServices/queues/messages/delete",
      "Microsoft.Storage/storageAccounts/queueServices/queues/messages/read",
      "Microsoft.Storage/storageAccounts/queueServices/queues/messages/write",
      "Microsoft.Storage/storageAccounts/queueServices/queues/messages/process/action"
    ],
    "notDataActions": []
  }
]

df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", <file_format>)
    .option("cloudFiles.useNotifications", "true")
    .option("cloudFiles.resourceGroup", <resource_group_name>)
    .option("cloudFiles.subscriptionId", <subscription_id>)
    .option("cloudFiles.tenantId", <tenant_id>)
    .option("cloudFiles.clientId", <service_principal_client_id>)
    .option("cloudFiles.clientSecret", <service_principal_secret>)
    .option("cloudFiles.maxFilesPerTrigger", 1)
    .schema(<schema>)
    .load(<path>))

(df.writeStream
    .format("delta")
    .foreachBatch(<function_for_data_processing>)
    .outputMode("update")
    .option("checkpointLocation", <checkpoint_location>)
    .start())

Business impact:

  • Automated data discovery and loading: Auto Loader automated the process of identifying new data files as they arrived in the data lake and automatically loaded the data into the target tables. This eliminated the manual effort required for managing the data loading process.
  • Enhanced focus on data analysis: The client could shift from managing the loading process to analyzing the data by streamlining the data ingestion process. Hence, they derived valuable insights and could make informed business decisions promptly.

Making Databricks Auto Loader Work for You

If you’re using Databricks to manage data ingestion, keep these things in mind so that you can create maximum value for your clients:

  • Data discovery: Since Databricks Auto Loader automatically detects new data files as they arrive in the data lake, it eliminates the need for manual scanning thus saving time while ensuring no data goes unnoticed.
  • Automatic schema inference: The Auto Loader can automatically infer the schema of incoming files based on the file format and structure. It also supports changes in the schema: you can choose to drop new columns, fail on change, or rescue new columns and store them separately (see the sketch after this list). It facilitates smooth data ingestion without delays during schema changes, and there’s no need to define the schema manually, making the loading process more seamless and less error-prone.
  • Parallel processing: Databricks Auto Loader is designed to load data into target tables in parallel. This will come in handy when you need to handle large volumes of data efficiently.
  • Delta Lake integration: Databricks Auto Loader seamlessly integrates with Delta Lake – open-source data storage and management system optimized for data processing and analytics workloads. You can therefore access leverage Delta Lake’s unique features like ACID transactions, versioning, time travel, and more.
  • Efficient job restarts: The Auto Loader stores metadata about the processed data in RocksDB as key-value pairs, enabling seamless job restarts without the need to log failures in the check-point location.
  • Spark structured streaming: The Auto Loader leverages Spark structured streaming for immediate data processing, providing real-time insights.
  • Flexible file identification: The Auto Loader provides two options for identifying new files – directory listing and file notification. The directory list mode allows the quick launching of an Auto Loader stream without additional permissions. At the same time, file notification and queue services eliminate the need for directory listing in cases of large input directories or unordered file volumes.
  • Batch workloads compatibility: While the Auto Loader excels in streaming and processing hundreds of files, it can also be used for batch workloads. This eliminates the need for running continuous clusters. In addition, with check-pointing, you can start and stop streams efficiently. The Auto Loader can also be scheduled for regular batch loads using the trigger once option, leveraging all its features.
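As referenced in the schema inference point above, a hedged sketch of how those schema options can be wired up (the paths below are placeholders) looks like this:

# Letting Auto Loader infer and evolve the schema (paths are placeholders)
df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "<schema_tracking_path>")  # where inferred schemas are tracked
    .option("cloudFiles.schemaEvolutionMode", "rescue")  # or "addNewColumns", "failOnNewColumns", "none"
    .load("<input_path>"))
# In "rescue" mode, unexpected columns are captured in the _rescued_data column instead of failing the stream.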
Data ingestion and processing are crucial milestones in the Data Management journey. While organizations can generate vast amounts of data, it’s important to ingest and process that data correctly for accurate insights. With services like Databricks, the data-loading process becomes simpler and more efficient, improving output accuracy and empowering organizations to make data-driven decisions.
