The Data Leader’s Guide to Responsible AI: Why Strong Data Governance Is Key to Mitigating AI Risks
https://www.tigeranalytics.com/perspectives/blog/the-data-leaders-guide-to-responsible-ai-why-strong-data-governance-is-key-to-mitigating-ai-risks/ | Tue, 29 Apr 2025

AI has moved from science fiction to everyday reality, but its success hinges on strong data governance. In this blog, we explore why effective governance is crucial for AI, how data leaders can build effective data governance for AI, and practical steps for aligning data governance with AI initiatives, ensuring transparency, mitigating risks, and driving better outcomes.

In 1968, HAL 9000’s “I’m sorry, Dave. I’m afraid I can’t do that” marked the beginning of a new era in entertainment. As the years passed, films like 2004’s I, Robot and 2015’s Chappie continued to explore AI’s potential – from “One day they’ll have secrets… one day they’ll have dreams” to “I am consciousness. I am alive. I am Chappie.” While these fictional portrayals pushed the boundaries of our imagination, they also foreshadowed AI technologies such as self-driving cars, consumer personalization, and Generative AI that are shaping the world today.

Today, the rise of GenAI and copilots from various tool vendors and organizations has generated significant interest, driven by advancements in NLP, ML, computer vision, and other deep learning models. For CIOs, CDOs, and data leaders, this shift underscores a critical point: AI-powered technologies must be responsible, transparent, privacy-preserving, and free of bias to truly add business value.

Because AI and GenAI both run on data, organizations need the right data, with the right quality, trust, and compliance, available at the right time. Without strong data governance, they risk AI models that reinforce bias, misinterpret data, or fail to meet regulatory requirements. This underscores the importance of Data Governance as a critical discipline that serves as a guiding light.

Hence the metaphor: ‘The lighthouse remains a beacon amidst shifting tides.’ It captures the challenges faced by both data-driven and AI-driven enterprises. The landscape of data generation, usage, and transformation is constantly evolving, presenting new complexities for organizations to navigate. Data governance is not new, but with many a change in weather (data) patterns and the infusion of AI across industries, it has grown increasingly relevant, acting as the foundation on which AI can be governed and enabled.

[Image: AI and data governance]

At Tiger Analytics, we are constantly exploring new opportunities to optimize the way we work. Take, for example, enterprises where time-to-market is critical: product vendors have developed copilots using GenAI to serve them. We have also observed many initiatives among our Fortune 100 clients leveraging models and various AI elements to achieve a faster time-to-market or develop new offerings. Many of these projects are successful, scalable, and continue to drive efficiency. However, the inevitable question arises: How do we govern AI?

What are the biggest challenges in Data Governance – Answering key questions

Data governance is not just about compliance: it enhances data quality and trustworthiness, efficiency, and scalability, and produces better AI outcomes. Strong governance practices (process, operating model, roles and responsibilities) empower enterprises to unlock the full potential of their data assets.

Below are a few important questions that stakeholders across the enterprise, including CxOs, business leaders, Line of Business (LoB) owners, and data owners, are seeking to answer today. As organizations strive towards data literacy and ethical AI practices, these questions highlight the importance of implementing governance strategies that can support traditional data management and address emerging AI risks.

  • Who is in charge of the model or the data product that uses my model?
  • Who can control (modify/delete/archive) the dataset?
  • Who will decide how to control the data and make key decisions?
  • Who will decide what is to be controlled in the workflow, data product, or model that my data is part of?
  • What are the risks to the end outcome if intelligence is augmented without audits, controls, or quality assurance?
  • Are controls for AI different from current ones, or can existing ones be repurposed?
  • Which framework will guide me?
  • Is the enterprise data governance initiative flexible enough to accommodate my AI risks and related work?
  • My organization is working towards data literacy and data ethics; how can AI initiatives take advantage of that work?
  • Is user consent still valid in the new AI model, and how is it protected?
  • What are the privacy issues to be addressed?

Let’s consider an example. A forecasting model is designed to help predict seasonal sales for the launch of a new apparel range targeted at a specific customer segment within an existing market. Now, assume the data is to be sourced from your marketplace and there are ready-made data products that can be used. How do you check the health of the data before you run a simulation? What if you face challenges such as ownership disputes, metadata inconsistencies, or data quality issues? Is there a risk of privacy breaches if, for example, someone forgets to remove sample data from the dataset?
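
To make these questions concrete, here is a minimal, illustrative sketch of the kind of health checks a team might run before feeding a marketplace data product into a forecasting simulation. The dataset path, column names, owner metadata, and thresholds are hypothetical; in practice such rules would live in your governance and observability tooling rather than in ad hoc scripts.

import pandas as pd

# Hypothetical data product pulled from the internal marketplace
df = pd.read_parquet("apparel_sales_by_segment.parquet")  # placeholder path

issues = []

# 1. Ownership/metadata: does the product declare an owner? (normally read from the catalog)
metadata = {"owner": "", "schema_version": "1.3"}
if not metadata.get("owner"):
    issues.append("No data owner registered - raise with the data governance council")

# 2. Quality: completeness and freshness checks before the simulation runs
null_rate = df["customer_segment"].isna().mean()
if null_rate > 0.02:
    issues.append(f"customer_segment null rate {null_rate:.1%} exceeds the 2% threshold")

latest_load = pd.to_datetime(df["load_date"]).max()
if (pd.Timestamp.now() - latest_load).days > 7:
    issues.append("Data is more than a week old - the forecast may miss recent demand shifts")

# 3. Privacy: crude guard against sample/test records slipping into the product
if df["customer_id"].astype(str).str.startswith("TEST_").any():
    issues.append("Sample/test customer records found - possible privacy and accuracy risk")

print("OK to simulate" if not issues else "\n".join(issues))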

This is why Data Governance (including data management) and AI must work in tandem, even more so when we consider the risk of non-compliance, for which the impact is far greater. Any governance approach must be closely aligned with data governance practices and effectively integrated into daily operations. There are various ways in which the larger industry and we at Tiger Analytics are addressing this. In the next section, we take a look at the key factors that can serve as the foundation for AI governance within an enterprise.

Untangling the AI knot: How to create a data governance framework for AI

At Tiger Analytics, we’ve identified seven elements that are crucial in establishing a framework for Foundational Governance for AI – we call it HEal & INtERAcT. We believe a human-centric and transparent approach is essential in governing AI assets. As AI continues to evolve and integrate into various processes within an organization, governance must remain simple.

Rather than introducing entirely new frameworks, our approach focuses on accessible AI governance in which existing data governance operations are expanded to include new dimensions, roles, processes, and standards. This creates a seamless extension rather than a separate entity, thereby eliminating the complexities of managing AI risks in silos and untangling the “AI knot” through smooth integration.

[Image: AI and data governance]

The seven elements ensure AI governance remains transparent and aligns with the larger enterprise data governance strategy, influencing processes, policies, standards, and change management. For instance, Integrity and Trustworthiness reinforce the reliability of model outputs while safeguarding privacy, and Accountability and Responsibility establish clear ownership of AI-driven decisions, ensuring compliance and ethical oversight. As AI introduces new roles and responsibilities, governance frameworks are revised to cover emerging risks and complexities like cross-border data, global teams, mergers, and varying regulations.

In addition, the data lifecycle in any organization depends on data governance. AI cannot exist without enterprise data, and synthetic data can only mimic real data and its issues. High-quality, fit-for-purpose data is therefore essential to train AI models and GenAI for more accurate predictions and better content generation.

Getting started with AI governance

Here is how an enterprise can begin its AI governance journey:

  • Identify all the AI elements in use and list every application and area that uses them
  • Assess what your AIOps looks like and how it is currently governed
  • Identify key risks with input from stakeholders
  • Map those risks back to the governance principles
  • Define controls for the risks identified (a minimal sketch of such a risk-to-control inventory follows this list)
  • Align the framework with your larger data governance strategy:
    • Enable AI-specific processes
    • Set data standards for AI
    • Tweak data policies for AI
    • Include an AI glossary for cataloging and lineage, providing better context
    • Set up data observability for AI to enable proactive detection and better model output and performance
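
Below is a minimal, illustrative sketch of how the risk-to-principle-to-control mapping from the steps above could be captured as a simple inventory. The assets, risks, and controls shown are hypothetical examples rather than a prescribed taxonomy; in practice this inventory would live in your governance tooling alongside the data catalog.

from dataclasses import dataclass, field

@dataclass
class AIRiskEntry:
    """One row of a lightweight AI governance inventory."""
    asset: str                      # model, copilot, or data product
    owner: str                      # accountable person or team
    risk: str                       # risk raised by stakeholders
    principle: str                  # governance principle it maps to
    controls: list = field(default_factory=list)

inventory = [
    AIRiskEntry(
        asset="seasonal-sales-forecaster",
        owner="demand-planning",
        risk="Training data contains unconsented customer attributes",
        principle="Accountability and Responsibility",
        controls=["Consent check before each retraining run", "PII scan on new sources"],
    ),
    AIRiskEntry(
        asset="support-copilot",
        owner="customer-service-it",
        risk="Generated answers drift from approved policy wording",
        principle="Integrity and Trustworthiness",
        controls=[],  # a gap the governance council still needs to close
    ),
]

# Simple view: which assets currently lack any mapped controls?
uncovered = [e.asset for e in inventory if not e.controls]
print(f"{len(inventory)} assets tracked, {len(uncovered)} without controls: {uncovered}")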

Essentially, enterprise DG+AI principles (the framework), along with identification and mitigation strategies and risk controls, pave the way for efficient AI governance. Given the evolving nature of this space, there is no one-size-fits-all solution. Numerous principles exist, but expert guidance and consulting are essential to navigate this complexity and implement the right approach.

The Road Ahead

AI has moved from science fiction to everyday reality, shaping decisions, operations, and personalized customer experiences. The focus now is on ensuring it is transparent, ethical, and well-governed. For this, AI and data governance must work in tandem. From customer churn analysis and loss prevention to identifying the right business and technical metrics and managing consent and privacy under new AI regulations, AI can drive business value, but only when built on a foundation of strong data governance. A well-structured governance program ensures AI adoption is responsible and scalable, minimizing risks while maximizing impact. By applying the principles and addressing the key questions above, you can ensure a successful implementation, enabling your business to leverage AI for meaningful outcomes.

So while you ponder these insights, ’til next time: as the T-800 said, “I’ll be back!”

What is Data Observability Used For?
https://www.tigeranalytics.com/perspectives/blog/what-is-data-observability-used-for/ | Fri, 27 Sep 2024

Learn how Data Observability can enhance your business by detecting crucial data anomalies early. Explore its applications in improving data quality and model reliability, and discover Tiger Analytics’ solution. Understand why this technology is attracting major investments and how it can enhance your operational efficiency and reduce costs.

Imagine you’re managing a department that handles account openings in a bank. All services seem fine, and the infrastructure seems to be working smoothly. But one day, it becomes clear that no new account has been opened in the last 24 hours. On investigation, you find that this is because one of the microservices involved in the account opening process is taking a very long time to respond.

In such a case, the data analyst investigating the problem could use traces with triggers based on processing time. But there should be an easier way to spot anomalies like this.

Traditional monitoring records the performance of infrastructure and applications using metrics, logs, and traces. Data observability goes further: it tracks your data flows and finds faults in them (and may even extend to business processes), using data analysis in a broader sense.

So, how do we tackle the case of no new accounts in 24 hours? Instead of relying on an analyst to wire up time-based triggers on traces, data observability watches the data itself: a metric that normally shows a steady stream of new accounts suddenly flatlining is exactly the kind of anomaly it is designed to surface.

Here is another example. A machine learning model predicts future events, such as the volume of future sales, from regularly updated historical data. Because the input data is not always of perfect quality, the model can sometimes produce inaccurate forecasts. These inaccuracies can leave the retailer with excess inventory or, worse, out-of-stock situations when there is consumer demand.

Classifying and Addressing Unplanned Events

The point of data observability is to identify so-called data downtime: an unplanned event in your business, infrastructure, or code that leads to a sudden change in the data. In other words, data observability is the process of finding anomalies in data.

How can you classify these events?

  • Exceeding a given metric value, or an abnormal jump in a metric. This is the simplest type. Imagine you add 80-120 clients every day (a confidence interval with some probability), and one day you add only 20. Something may have caused the sudden drop, and it is worth looking into.
  • An abrupt change in data structure. Take the client example again: everything was fine until, one day, the contact information field began to receive empty values. Something may have broken in your data pipeline, and it is better to check.
  • The occurrence of a certain condition or deviation from it. Just as GPS coordinates should not show a truck in the ocean, banking transactions should not suddenly appear in unexpected locations or in unusual amounts that deviate significantly from the norm.
  • Statistical anomalies. During a routine check, the bank’s analysts notice that on a particular day the average ATM withdrawal per customer spiked to $500, significantly higher than the historical average. (A simple detection sketch for a few of these cases follows below.)
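
As referenced above, here is a minimal sketch of how a few of these checks could be expressed in code. The file names, column names, and the 80-120 daily-client band are taken from the examples above or assumed for illustration; a real observability tool would learn these bounds from history rather than hard-code them.

import pandas as pd

daily = pd.read_csv("daily_metrics.csv", parse_dates=["date"])  # hypothetical extract
clients = pd.read_csv("clients_today.csv")                      # hypothetical extract

alerts = []
today = daily.sort_values("date").iloc[-1]

# 1. Abnormal jump or drop in a metric: new clients normally land between 80 and 120 per day
if not 80 <= today["new_clients"] <= 120:
    alerts.append(f"{today['date'].date()}: new_clients={today['new_clients']} outside the expected 80-120 band")

# 2. Abrupt change in data structure: contact information suddenly arriving empty
empty_rate = clients["contact_info"].isna().mean()
if empty_rate > 0.05:
    alerts.append(f"contact_info empty for {empty_rate:.0%} of today's records - check the pipeline")

# 4. Statistical anomaly: flag days more than 3 standard deviations from the mean withdrawal
mean, std = daily["avg_atm_withdrawal"].mean(), daily["avg_atm_withdrawal"].std()
if abs(today["avg_atm_withdrawal"] - mean) > 3 * std:
    alerts.append(f"avg_atm_withdrawal={today['avg_atm_withdrawal']} deviates more than 3 sigma from the historical mean")

for a in alerts:
    print("DATA DOWNTIME:", a)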

On the one hand, there is nothing new about classifying abnormal events and taking the necessary remedial action. On the other hand, until recently there were no comprehensive, specialized tools for these tasks.

Data Observability is Essential for Ensuring Fresh, Accurate, and Smooth Data Flow

Data observability serves as a checkup for your systems. It lets you ensure your data is fresh, accurate, and flowing smoothly, helping you catch potential problems early on.

Persona: Business User
  Why questions:
  • WHY are data quality metrics in Amber/Red?
  • WHY is my dataset/report not accurate?
  • WHY do I see sudden demand for my product, and what is the root cause?
  Observability use case: Data Quality, Anomaly Detection, and RCA
  Business outcomes:
  • Improve the quality of insights
  • Boost trust and confidence in decision making

Persona: Data Engineers/Data Reliability Engineers
  Why questions:
  • WHY is there data downtime?
  • WHY did the pipeline fail?
  • WHY is there an SLA breach in data freshness?
  Observability use case: Data Pipeline Observability, Troubleshooting, and RCA
  Business outcomes:
  • Better productivity
  • Reduced MTTR
  • Enhanced pipeline efficiency
  • Intelligent triaging

Persona: Data Scientists
  Why questions:
  • WHY are the model predictions not accurate?
  Observability use case: Data Quality Model
  Business outcomes:
  • Improve model reliability

Tiger Analytics’ Continuous Observability Solution

The solution continuously monitors signals gathered from various sources and alerts on potential issues before a customer or operations team reports them. It consists of a set of tools, patterns, and practices to build data observability components for your big data workloads on a cloud platform and reduce data downtime.

Select examples of our experience in data observability and quality:

[Image: Client and use case examples]

Tools and Technology

[Image: Data Observability tools and technology]

Tiger Analytics Data Observability is a set of tools, patterns, and best practices to:

  • Ingest MELT (Metrics, Events, Logs, Traces) data
  • Enrich and store MELT data to derive insights on event and log correlations, data anomalies, pipeline failures, and performance metrics
  • Configure data quality rules using a self-service UI (a minimal rule-evaluation sketch follows these lists)
  • Monitor operational metrics like data quality, pipeline health, and SLAs
  • Alert the business team when there is data downtime
  • Perform root cause analysis
  • Fix broken pipelines and data quality issues

These capabilities help to:

  • Minimize data downtime using automated data quality checks
  • Discover data problems before they impact business KPIs
  • Accelerate troubleshooting and root cause analysis
  • Boost productivity and reduce operational cost
  • Improve operational excellence, QoS, and uptime
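
As referenced above, here is a minimal sketch of what configured data quality rules might look like once they leave a self-service UI and are evaluated against a dataset. The rule format, dataset, and thresholds are illustrative assumptions rather than the actual product schema.

import pandas as pd

# Hypothetical rules as they might be exported from a self-service UI
rules = [
    {"column": "account_id",   "check": "not_null",  "threshold": 0.0},
    {"column": "open_date",    "check": "freshness", "max_age_days": 1},
    {"column": "new_accounts", "check": "min_daily", "min_value": 1},
]

df = pd.read_parquet("accounts_daily.parquet")  # placeholder dataset

def evaluate(rule: dict, frame: pd.DataFrame):
    """Return an alert message if the rule is breached, else None."""
    if rule["check"] == "not_null":
        rate = frame[rule["column"]].isna().mean()
        return f"{rule['column']} null rate {rate:.1%}" if rate > rule["threshold"] else None
    if rule["check"] == "freshness":
        age = (pd.Timestamp.now() - pd.to_datetime(frame[rule["column"]]).max()).days
        return f"{rule['column']} is {age} days old" if age > rule["max_age_days"] else None
    if rule["check"] == "min_daily":
        latest = frame[rule["column"]].iloc[-1]
        return f"{rule['column']}={latest} below minimum" if latest < rule["min_value"] else None
    return None

alerts = [msg for rule in rules if (msg := evaluate(rule, df)) is not None]
for msg in alerts:
    print("DATA DOWNTIME ALERT:", msg)  # in practice, routed to the business team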

Data observability and Generative AI (GenAI) can play crucial roles in enhancing data-driven decision-making and machine learning (ML) model performance.

Data observability primes the pump by instilling confidence in high-quality, always-available data, which forms the foundation for any data-driven initiative, while GenAI expands what can be achieved on top of that foundation, opening up new avenues to simulate, generate, and innovate. Organizations can use both to improve their data capabilities, decision-making processes, and innovation across different areas.

Investor interest underscores this momentum: Monte Carlo, a company that produces a data monitoring tool, has raised $135 million, Observe has raised $112 million, and Acceldata has raised $100 million in the data observability space.

To summarize

Data Observability is an approach to identifying anomalies in business processes and in the operation of applications and infrastructure, allowing users to quickly respond to emerging incidents. It lets you ensure your data is fresh, accurate, and flowing smoothly, helping you catch potential problems early on.

Even if there is no particular novelty in the technology, there is certainly novelty in the approach, the tools, and the new terms, which makes it easier to convince investors and clients. The next few years will show how successful the new players in this market will be.


Enabling Cross Platform Data Observability in Lakehouse Environment
https://www.tigeranalytics.com/perspectives/blog/enabling-cross-platform-data-observability-in-lakehouse-environment/ | Tue, 13 Jun 2023

Dive into data observability and its pivotal role in enterprise data ecosystems. Explore its implementation in a Lakehouse environment using Azure Databricks and Purview, and discover how this integration fosters seamless data management, enriched data lineage, and quality monitoring, empowering informed decision-making and optimized data utilization.

Imagine a world where organizations effortlessly unlock their data ecosystem’s full potential as data lineage, cataloging, and quality seamlessly flow across platforms. As we rely more and more on data, the technology for uncovering valuable insights has grown increasingly nuanced and complex. While we’ve made significant progress in collecting, storing, aggregating, and visualizing data to meet the needs of modern data teams, one crucial factor defines the success of enterprise-level data platforms — Data observability.

Data observability is often conflated with data monitoring, and it’s easy to see why. The two concepts are interconnected, blurring the lines between them. However, data monitoring is the first step towards achieving true observability; it acts as a subset of observability.

Some of the industry-level challenges are:

  • The proliferation of data sources with varying tools and technologies involved in a typical data pipeline diminishes the visibility of the health of IT applications.
  • Data is consumed in various forms, making it harder for data owners to understand the data lineage.
  • The complexity of debugging pipeline failures poses major hurdles with a multi-cloud data services infrastructure.
  • Nonlinearity between creation, curation, and usage of data makes data lineage tough to track.

What is Data Observability?

To grasp the concept of data observability, the first step is to understand what it entails. Data observability examines the health of enterprise data environments by focusing on:

  • Design Lineage: Providing contextual data observations such as the job’s name, code location, version from Git, environment (Dev/QA/Prod), and data source metadata like location and schema.
  • Operational Lineage: Generating synchronous data observations by computing metrics like size, null values, min/max, cardinality, and more custom measures like skew, correlation, and data quality validation. It also includes usage attributes such as infrastructure and resource information.
  • Tracing and Continuous Validation: Generating data observations with continuously validated data points and sources for efficient tracing. It involves business thresholds, the absence of skewed categories, input and output data tracking, lineage, and event tracing.

Implementing Data Observability in Your Lakehouse Environment

Benefits of Implementation

  • Capturing critical metadata: Observability solutions capture essential design, operational, and runtime metadata, including data quality assertions.
  • Seamless integration: Technical metadata, such as pipeline jobs, runs, datasets, and quality assertions, can seamlessly integrate into your enterprise data governance tool.
  • End-to-end data lineage: Gain insights into the versions of pipelines, datasets, and more by establishing comprehensive data lineage across various cloud services.

Essential Elements of Data Observability Events

  • Job details: Name, owner, version, description, input dependencies, and output artifacts.
  • Run information: Immutable version of the job, event type, code version, input dataset, and output dataset.
  • Dataset information: Name, owner, schema, version, description, data source, and current version.
  • Dataset versions: Immutable versions of datasets.
  • Quality facets: Data quality rules, results, and other relevant quality facets (see the illustrative event sketch after this list).
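
The elements above map naturally onto an OpenLineage-style run event. The sketch below shows roughly what such an event could look like and how it might be posted to the lineage endpoint; the field values and the quality facet are illustrative, and the /api/v1/lineage path (used by common OpenLineage backends such as the Marquez reference server) may differ in your deployment.

import json
from datetime import datetime, timezone
import requests

# Illustrative OpenLineage-style run event; exact facet schemas vary by version
event = {
    "eventType": "COMPLETE",
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "producer": "https://example.com/my-observability-agent",  # placeholder
    "job": {"namespace": "lakehouse-dev", "name": "curate_sales_orders"},
    "run": {"runId": "3f1c2a9e-8d1b-4f6a-9c7e-2b5d4e8a1c90"},
    "inputs": [{"namespace": "adls", "name": "raw/sales_orders"}],
    "outputs": [{
        "namespace": "adls",
        "name": "curated/sales_orders",
        "facets": {
            # hypothetical quality facet summarizing rule results
            "dataQualityAssertions": {
                "assertions": [{"assertion": "not_null", "column": "order_id", "success": True}]
            }
        },
    }],
}

# Post the event to the observability API server
resp = requests.post(
    "http://localhost:5000/api/v1/lineage",
    data=json.dumps(event),
    headers={"Content-Type": "application/json"},
    timeout=10,
)
resp.raise_for_status()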

Implementation Process

A robust and holistic approach to data observability requires a centralized interface into the data. End-to-end data observability therefore consists of implementing the four layers below in any Lakehouse environment.

  • Observability Agent: A listener setup that depends on the sources/data platform.
  • Data Reconciliation API: An endpoint for transforming the event to fit into the data model.
  • Metadata Repository: A data model created in a relational database.
  • Data Traceability Layer: A web-based interface or existing data governance tool.

[Image: Data observability implementation process]

By implementing these four layers and incorporating the essential elements of data observability, organizations can achieve improved visibility, traceability, and governance over their data in a Lakehouse environment.

The core data model for data observability prioritizes immutability and timely processing of datasets, which are treated as first-class values generated by job runs. Each job run is associated with a versioned code and produces one or more immutable versioned outputs. Changes to datasets are captured at various stages during job execution.
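
As an illustration of that model, here is a minimal sketch of how such a metadata repository could be laid out in a relational store. The table and column names are hypothetical and simplified; the OpenLineage/Marquez reference implementation defines its own, richer schema.

import sqlite3

# Simplified, hypothetical metadata repository for observability events
conn = sqlite3.connect("observability_metadata.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS jobs (
    job_id       INTEGER PRIMARY KEY,
    namespace    TEXT NOT NULL,
    name         TEXT NOT NULL,
    owner        TEXT,
    code_version TEXT                           -- Git version of the pipeline code
);
CREATE TABLE IF NOT EXISTS runs (
    run_id      TEXT PRIMARY KEY,               -- immutable run identifier
    job_id      INTEGER REFERENCES jobs(job_id),
    event_type  TEXT,                           -- START / COMPLETE / FAIL
    started_at  TEXT,
    ended_at    TEXT
);
CREATE TABLE IF NOT EXISTS datasets (
    dataset_id  INTEGER PRIMARY KEY,
    namespace   TEXT NOT NULL,
    name        TEXT NOT NULL,
    schema_json TEXT
);
CREATE TABLE IF NOT EXISTS dataset_versions (
    version_id  INTEGER PRIMARY KEY,
    dataset_id  INTEGER REFERENCES datasets(dataset_id),
    run_id      TEXT REFERENCES runs(run_id),   -- the run that produced this immutable version
    created_at  TEXT,
    quality_facets_json TEXT                    -- data quality rules and results
);
""")
conn.commit()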

 

[Image: Data model]

[Image: Data model structure]

Technical Architecture: Observability in a Modern Platform

The technical architecture depicted below exemplifies the implementation of a data observability layer for a modern, medallion-based, enterprise-level data platform that collects and correlates metrics.

[Image: Data observability technical architecture]

In data observability, this robust architecture effectively captures and analyzes critical metadata. Let’s dive into the technical components that make this architecture shine and understand their roles in the observability layer.

OpenLineage Agent

This observability agent bridges data sources, processing frameworks, and the observability layer. Its mission is to communicate seamlessly, ingesting custom facets to enhance event understanding. The OpenLineage agent’s compatibility with a wide range of data sources, processing frameworks, and orchestration tools makes it remarkable. In addition, it offers the flexibility needed to accommodate the diverse technological landscape of modern data environments.

OpenLineage API Server

As the conduit for custom events, the OpenLineage API Server allows ingesting these events into the metadata repository.

Metadata Repository

The metadata repository is at the heart of the observability layer. This data model, carefully crafted within a relational data store, captures essential information such as jobs, datasets, and runs.

Databricks

Azure Databricks offers a powerful data processing engine with various types of clusters. Setting up the OpenLineage agent on a Databricks cluster enables capturing dataset lineage events from the data processing jobs triggered in the workspace.

Azure Data Factory

With its powerful data pipeline orchestration capabilities, Azure Data Factory (ADF) takes center stage. ADF sends data pipeline orchestration events to the OpenLineage API and integrates with the observability layer, further enhancing data lineage tracking.

Great Expectations

Quality is of paramount importance in any data-driven ecosystem. Great Expectations ensures that quality facets are seamlessly integrated into each dataset version. Also, by adding custom facets through the OpenLineage API, Great Expectations fortifies the observability layer with powerful data quality monitoring and validation capabilities.
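
As a rough illustration, the snippet below shows the kind of expectation that could back such a quality facet, using the classic Great Expectations dataset API. The exact API differs across Great Expectations versions, and the dataframe path and column names are placeholders.

import great_expectations as ge
import pandas as pd

# Placeholder curated output produced by the pipeline run
df = ge.from_pandas(pd.read_parquet("curated/sales_orders.parquet"))

# Expectations whose results would be attached to the dataset version as quality facets
df.expect_column_values_to_not_be_null("order_id")
df.expect_column_values_to_be_between("order_amount", min_value=0, max_value=100000)

results = df.validate()
print("Quality checks passed:", results.success)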

EventHub

As the intricate events generated by the OpenLineage component must be integrated into the Apache Atlas API from Purview, EventHub takes center stage. Acting as an intermediate queue, EventHub stages these events for further processing, ensuring smooth and efficient communication between the observability layer and Purview.
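
For illustration, forwarding an OpenLineage event into Event Hub might look like the sketch below, using the azure-eventhub Python SDK; the connection string, hub name, and event payload are placeholders.

import json
from azure.eventhub import EventHubProducerClient, EventData

# Placeholder connection details
producer = EventHubProducerClient.from_connection_string(
    conn_str="Endpoint=sb://<namespace>.servicebus.windows.net/;SharedAccessKeyName=...;SharedAccessKey=...",
    eventhub_name="openlineage-events",
)

lineage_event = {"eventType": "COMPLETE", "job": {"namespace": "lakehouse-dev", "name": "curate_sales_orders"}}

# Send the event as a single-item batch; the Functions app downstream parses and maps it for Purview
with producer:
    batch = producer.create_batch()
    batch.add(EventData(json.dumps(lineage_event)))
    producer.send_batch(batch)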

Function API

To facilitate this parsing and preparation process, Azure Functions come into play. Purpose-built functions are created to handle the OpenLineage events and transform them into Atlas-supported events. These functions ensure compatibility and coherence between the observability layer and Purview, enabling seamless data flow.
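
A skeleton of such a function, using the Python Azure Functions Event Hub trigger, might look like the following. The mapping logic shown is a stand-in; the actual OpenLineage-to-Atlas translation in the solution is considerably richer, and the type and attribute names are illustrative.

import json
import logging
import azure.functions as func

def main(event: func.EventHubEvent) -> None:
    """Event Hub-triggered function: map an OpenLineage event to Atlas-style entities."""
    ol_event = json.loads(event.get_body().decode("utf-8"))

    # Hypothetical, minimal mapping of each output dataset to an Atlas entity payload
    atlas_entities = [
        {
            "typeName": "azure_datalake_gen2_path",  # example Purview/Atlas type
            "attributes": {
                "qualifiedName": f"adls://{o['namespace']}/{o['name']}",
                "name": o["name"].split("/")[-1],
            },
        }
        for o in ol_event.get("outputs", [])
    ]

    logging.info("Prepared %d Atlas entities for Purview ingestion", len(atlas_entities))
    # The entities would then be pushed to the Purview Atlas endpoint (see the PyApacheAtlas sketch later)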

Purview

Finally, we have Purview, the ultimate destination for all lineage and catalog events. Purview’s user interface becomes the go-to hub for tracking and monitoring the rich lineage and catalog events captured by the observability layer. With Purview, users can gain comprehensive insights, make informed decisions, and unlock the full potential of their data ecosystem.

Making Observability Effective on the ADB Platform

At Tiger Analytics, we’ve worked with a varied roster of clients across sectors to help them achieve better data observability. So, we crafted an efficient solution that bridges the gap between Spark operations in Azure Databricks and Azure Purview. It transfers crucial observability events, enabling holistic data management. This helps organizations thrive with transparent, informed decisions and comprehensive data utilization.

The relationship is simple. Azure Purview and Azure Databricks complement each other. Azure Databricks offers powerful data processing and collaboration, while Azure Purview helps manage and govern data assets. Integrating them allows you to leverage Purview’s data cataloging capabilities to discover, understand, and access data assets within your Databricks workspace.

[Image: ADB platform]

How did we implement the solution? Let’s dive in and find out.

Step 1: Setting up the Environment: We began by configuring the Azure Databricks environment, ensuring the right runtime version was in place. To capture observability events, we attached the OpenLineage jar to the cluster, laying a solid foundation for the journey ahead.

Step 2: Cluster Configuration: Smooth communication between Azure Databricks and Azure Purview was crucial. To achieve this, we configured the Spark settings at the cluster level, creating a bridge between the two platforms. By specifying the OpenLineage host, namespace, custom app name, version, and extra listeners, we solidified this connection.

Sample code snippet:

spark.openlineage.host https://<<<host-ip>>>:5000
spark.openlineage.namespace <<<namespace>>>
spark.app.name <<<custom-app-name>>>
spark.openlineage.version v1
spark.extraListeners io.openlineage.spark.agent.OpenLineageSparkListener

Step 3: Spark at Work: With Spark’s power, the OpenLineage listeners came into action, capturing the Spark logical plan. This provided us with a comprehensive view of data operations within the cluster.

Step 4: Enter the Service Account: This account, created using a service principal, took center stage in authenticating the Azure Functions app and Azure Purview. Armed with owner/contributor access, this service account became the key to a seamless connection.

Step 5: Azure Purview Unleashed: To unlock the full potential of Azure Purview, we created an Azure Purview service. Within Purview Studio, we assigned the roles of data curator, data source administrator, and collection admin to the service account. This granted the necessary permissions for a thrilling data management adventure.

Step 6: Seamless Deployment: Leveraging the deployment JSON provided in the OpenLineage GitHub repository, we embarked on a smooth AZ deployment. This process created essential services such as storage accounts, blob services, server farms, and websites, laying the foundation for a robust data lineage and cataloging experience –

Microsoft.Storage/storageAccounts
Microsoft.Storage/storageAccounts/blobServices/containers
Microsoft.Web/serverfarms
Microsoft.Web/sites
olToPurviewMappings
Microsoft.EventHub/namespaces
Microsoft.EventHub/namespaces/eventhubs
Microsoft.KeyVault/vaults
Microsoft.KeyVault/vaults/secrets

Step 7: Access Granted: An authentication token was added, granting seamless access to the Purview API. This opened the door to a treasure trove of data insights, empowering us to delve deeper into the observability journey.

Step 8: Spark and Azure Functions United: In the final step, we seamlessly integrated Azure Databricks with Azure Functions. By adding the Azure Function App URL and key to the Spark properties, a strong connection was established. This enabled the capture of observability events during Spark operations, effortlessly transferring them to Azure Purview, resulting in a highly effective data lineage.

Sample code snippet:

spark.openlineage.host https://<functions-app-name>.azurewebsites.net
spark.openlineage.url.param.code <function-app-host-key>

By following the steps outlined above, our team successfully provided a comprehensive and highly effective data lineage and observability solution. By linking parent and child job IDs between Azure Data Factory (ADF) and Databricks, this breakthrough solution enabled correlation among cross-platform data pipeline executions. As a result, the client could leverage accurate insights that flowed effortlessly. This empowered them to make informed decisions, ensure data quality, and unleash the true power of data.

[Image: ADB implementation]

Extending OpenLineage Capabilities Across Data Components

Enterprises require data observability across multiple platforms. OpenLineage, a powerful data observability solution, offers out-of-the-box integration with various data sources and processing frameworks. However, what if you want to extend its capabilities to cover other data platform components? Let’s explore two simple methodologies to seamlessly integrate OpenLineage with additional data platforms, enabling comprehensive observability across your entire data ecosystem:

1. Custom event through OpenLineage API: Maintain a custom function to generate OpenLineage-supported JSON with observability events and trigger the function with required parameters wherever the event needs to be logged.

2. Leveraging the API provided by the target governance portal: Another option is to leverage the APIs provided by target governance portals, which often expose endpoints for data observability event consumption. By utilizing these APIs, you can extend OpenLineage’s solution to integrate with other data platforms. For example, Azure Purview has an API enabled with Apache Atlas for event ingestion, and you can use Python packages such as PyApacheAtlas to create observability events in the format supported by the target API (see the sketch below).
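
As a rough sketch of the second option, the snippet below registers a dataset entity in Purview with PyApacheAtlas. The credentials, account name, entity type, and qualified name are placeholders, and a production integration would also upload lineage (process) entities rather than a single asset.

from pyapacheatlas.auth import ServicePrincipalAuthentication
from pyapacheatlas.core import PurviewClient, AtlasEntity

# Placeholder service principal credentials (owner/contributor on the Purview account)
auth = ServicePrincipalAuthentication(
    tenant_id="<tenant-id>",
    client_id="<client-id>",
    client_secret="<client-secret>",
)
client = PurviewClient(account_name="<purview-account-name>", authentication=auth)

# Minimal dataset entity derived from an observability event
entity = AtlasEntity(
    name="sales_orders_curated",
    typeName="azure_datalake_gen2_path",  # example type; use one your collection supports
    qualified_name="https://<storage-account>.dfs.core.windows.net/curated/sales_orders",
    guid=-1000,                           # negative GUID requests creation of a new entity
)

results = client.upload_entities(batch=[entity])
print(results)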

Data observability continues to be an important marker in evaluating the health of enterprise data environments. It provides organizations with a consolidated source of technical metadata, including data lineage, execution information, data quality attributes, dataset and schema changes generated by diverse data pipelines, and operational runtime metadata. This helps operational teams conduct precise RCA.

As various data consumption tools are in demand, along with the increasing use of multi-cloud data platforms, the data observability layer should be platform-agnostic and effortlessly adapt to available data sources and computing frameworks.

