Data Quality Archives - Tiger Analytics

Building Trusted Data: A Comprehensive Guide to Tiger Analytics’ Snowflake Native Data Quality Framework https://www.tigeranalytics.com/perspectives/blog/building-trusted-data-a-comprehensive-guide-to-tiger-analytics-snowflake-native-data-quality-framework/ Fri, 24 Jan 2025 13:06:13 +0000 https://www.tigeranalytics.com/?post_type=blog&p=24215 Challenges in data quality are increasingly hindering organizations, with issues like poor integration, operational inefficiencies, and lost revenue opportunities. A 2024 report reveals that 67% of professionals don’t fully trust their data for decision-making. To tackle these problems, Tiger Analytics developed a Snowflake native Data Quality Framework, combining Snowpark, Great Expectations, and Streamlit. Explore how the framework ensures scalable, high-quality data for informed decision-making.

A 2024 report on data integrity trends and insights found that 50% of the 550 leading data and analytics professionals surveyed believed data quality is the number one issue impacting their organization’s data integration projects. And that’s not all. Poor data quality was also negatively affecting other initiatives meant to improve data integrity, with 67% saying they don’t completely trust the data used for decision-making. As expected, data quality is projected to be a top priority investment for 2025.

Trusted, high-quality data is essential to make informed decisions, deliver exceptional customer experiences, and stay competitive. However, maintaining quality is not quite so simple, especially as data volume grows. Data arrives from diverse sources, is processed through multiple systems, and serves a wide range of stakeholders, increasing the risk of errors and inconsistencies. Poor data quality can lead to significant challenges, including:

  • Operational Inefficiencies: Incorrect or incomplete data can disrupt workflows and increase costs.
  • Lost Revenue Opportunities: Decisions based on inaccurate data can result in missed business opportunities.
  • Compliance Risks: Regulatory requirements demand accurate and reliable data; failure to comply can result in penalties.
  • Eroded Trust: Poor data quality undermines confidence in data-driven insights, impacting decision-making and stakeholder trust.

Manual approaches to data quality are no longer sustainable in modern data environments. Organizations need a solution that operates at scale without compromising performance, integrates seamlessly into existing workflows and platforms, and provides actionable insights for continuous improvement.

This is where Tiger Analytics’ Snowflake Native Data Quality Framework comes into play, leveraging Snowflake’s unique capabilities to address these challenges effectively.

Tiger Analytics’ Snowflake Native Data Quality Framework – An Automated and Scalable Solution

At Tiger Analytics, we created a custom solution leveraging Snowpark, Great Expectations (GE), Snowflake Data Metric Functions, and Streamlit to redefine data quality processes. By designing this framework as Snowflake-native, we capitalize on the platform’s capabilities for seamless integration, scalability, and performance.

[Figure: Snowflake Native Data Quality Framework]

Snowflake’s native features offer significant advantages when building a Data Quality (DQ) framework, addressing the evolving needs of data management and governance. These built-in tools streamline processes, ensuring efficient monitoring, validation, and enhancement of data quality throughout the entire data lifecycle:

  • Efficient Processing with Snowpark:
    Snowpark lets users run complex validations and transformations directly within Snowflake. Its ability to execute Python, Java, or Scala workloads ensures that data remains in place, eliminating unnecessary movement and boosting performance.
  • Flexible and Predefined DQ Checks:
    The inclusion of Great Expectations and Snowflake Data Metric Functions enables a hybrid approach, combining open-source flexibility with Snowflake-native precision. This ensures that our framework can cater to both standard and custom business requirements.
  • Streamlined Front-End with Streamlit:
    Streamlit provides an interactive interface for configuring rules, schedules, and monitoring results, making it accessible to users of all skill levels.
  • Cost and Latency Benefits:
    By eliminating the need for external tools, containers, or additional compute resources, our framework minimizes latency and reduces costs. Every process is optimized to leverage Snowflake’s compute clusters for maximum efficiency.
  • Integration and Automation:
    Snowflake’s task scheduling, streams, and pipelines ensure seamless integration into existing workflows. This makes monitoring and rule execution effortless and highly automated.

Tiger Analytics’ Snowflake Native Data Quality Framework leverages Snowflake’s ecosystem to provide a scalable and reliable data quality solution that can adapt to the changing needs of modern businesses.
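To make the idea of validating data in place concrete, here is a minimal Snowpark (Python) sketch of the kind of checks the framework runs. The connection parameters, table, and column names are illustrative placeholders, and the actual framework drives such checks from configured rules rather than hard-coded logic.

# Minimal sketch: run simple data quality checks in place with Snowpark for Python.
# Connection parameters, table, and column names are illustrative placeholders.
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col, count, count_distinct

connection_parameters = {
    "account": "<account>", "user": "<user>", "password": "<password>",
    "role": "<role>", "warehouse": "<warehouse>",
    "database": "<database>", "schema": "<schema>",
}
session = Session.builder.configs(connection_parameters).create()

orders = session.table("ORDERS")  # hypothetical table

# Completeness: how many rows are missing a customer id?
null_customer_ids = orders.filter(col("CUSTOMER_ID").is_null()).count()

# Uniqueness: does ORDER_ID contain duplicates?
totals = orders.agg(
    count(col("ORDER_ID")).alias("ROW_COUNT"),
    count_distinct(col("ORDER_ID")).alias("DISTINCT_ORDER_IDS"),
).collect()[0]
duplicate_order_ids = totals["ROW_COUNT"] - totals["DISTINCT_ORDER_IDS"]

print(f"NULL customer ids: {null_customer_ids}, duplicate order ids: {duplicate_order_ids}")
session.close()

Because both checks execute as queries on Snowflake’s compute, no data leaves the platform, which is what keeps latency and egress costs down.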

Breaking Down the Tiger Analytics’ Snowflake Native Data Quality Framework

  1. Streamlit App: A Unified Interface for Data Quality

    The Streamlit app serves as a centralized front-end, integrating multiple components of the data quality framework. It allows users to configure rules and provides access to the profiler, recommendation engine, scheduling, and monitoring functionalities – all within one cohesive interface.

    This unified approach simplifies the management and execution of data quality processes, ensuring seamless operation and an improved user experience.

  2. Data Profiler

    The Data Profiler automatically inspects and analyzes datasets to identify anomalies, missing values, duplicates, and other data quality issues directly within Snowflake. It helps generate insights into the structure and health of the data, without requiring external tools.

    It also provides metrics on data distribution, uniqueness, and other characteristics to help identify potential data quality problems.

  3. DQ Rules Recommendation Engine

    The DQ Rules Recommendation Engine analyzes data patterns and profiles to suggest potential data quality rules based on profiling results, metadata, or historical data behavior. These recommendations can be automatically generated and adjusted for more accurate rule creation.

  4. DQ Engine

    The DQ Engine is the core of Tiger Analytics’ Snowflake Native Data Quality Framework. Built using Snowpark, Great Expectations, and Snowflake Data Metric Functions, it ensures efficient and scalable data quality checks directly within the Snowflake ecosystem. Key functionalities include:

    • Automated Expectation Suites:
      The engine automatically generates Great Expectations expectation suites based on the configured rules, minimizing manual effort in setting up data quality checks.
    • Snowpark Compute Execution:
      These expectation suites are executed using Snowpark’s compute capabilities, ensuring performance and scalability for even the largest datasets.
    • Results Storage and Accessibility:
      All validation results are stored in Snowflake tables, making them readily available for monitoring, dashboards, and further processing.
    • On-Demand Metric Execution:
      In addition to GE rules, the engine can execute Snowflake Data Metric Functions on demand, providing flexibility for ad hoc or predefined data quality assessments. This combination of automation, scalability, and seamless integration ensures that the DQ Engine is adaptable to diverse data quality needs.
  5. Scheduling Engine

    The Scheduling Engine automates the execution of DQ rules at specified intervals, such as on-demand, daily, or in sync with other data pipelines. By leveraging Snowflake tasks and streams, it ensures real-time or scheduled rule execution within the Snowflake ecosystem, enabling continuous data quality monitoring (a minimal scheduling sketch follows this list).

  6. Alerts and Notifications

    The framework integrates with Slack and Outlook to send real-time alerts and notifications about DQ issues. When a threshold is breached or an issue is detected, stakeholders are notified immediately, enabling swift resolution.

  7. NLP-Based DQ Insights

    Leveraging Snowflake Cortex, the NLP-powered app enables users to query DQ results using natural language, giving non-technical users straightforward access to valuable data quality insights. Users can ask questions such as the following and receive clear, actionable answers directly from the data (a minimal Cortex sketch follows these examples):

    • What are the current data quality issues?
    • Which rules are failing the most?
    • How has data quality improved over time?
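As a sketch of how such questions can be answered, the snippet below summarizes recent rule failures with Snowflake Cortex. It assumes an existing Snowpark session, a hypothetical DQ_RESULTS table, a Snowpark version that supports bind parameters in session.sql, and a Cortex model that is enabled in your account and region.

# Minimal sketch: summarize recent DQ failures in natural language with Snowflake Cortex.
# DQ_RESULTS is a hypothetical results table; replace the model name with one enabled
# in your Snowflake account/region.
recent_failures = session.sql("""
    SELECT rule_name, table_name, failed_records, run_date
    FROM DQ_RESULTS
    WHERE status = 'FAILED' AND run_date >= DATEADD(day, -7, CURRENT_DATE)
""").collect()

prompt = (
    "You are a data quality assistant. Summarize the following rule failures "
    "and suggest which tables need attention:\n"
    + "\n".join(str(row) for row in recent_failures)
)

summary = session.sql(
    "SELECT SNOWFLAKE.CORTEX.COMPLETE('mistral-large', ?) AS ANSWER",
    params=[prompt],
).collect()[0]["ANSWER"]
print(summary)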
  8. DQ Dashboards

    These dashboards offer a comprehensive view of DQ metrics, trends, and rule performance. Users can track data quality across datasets and monitor improvements over time. It also provides interactive visualizations to track data health. Drill-down capabilities provide in-depth insight into specific issues, allowing for more detailed analysis and understanding.

  9. Data Pipeline Integration

    The framework can be integrated with existing data pipelines, ensuring that DQ checks are part of the ETL/ELT process. These checks are automatically triggered as part of the data pipeline workflow, verifying data quality before downstream usage.
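To illustrate the on-demand metric execution described under the DQ Engine and the scheduling described under the Scheduling Engine, here is a minimal Snowpark (Python) sketch. It assumes an existing Snowpark session; the task, warehouse, stored procedure, table, and column names are placeholders, and the availability of the system data metric functions shown should be verified for your Snowflake edition.

# 1) On-demand metric execution with Snowflake system data metric functions.
null_count = session.sql(
    "SELECT SNOWFLAKE.CORE.NULL_COUNT(SELECT CUSTOMER_ID FROM ORDERS) AS METRIC"
).collect()[0]["METRIC"]
duplicate_count = session.sql(
    "SELECT SNOWFLAKE.CORE.DUPLICATE_COUNT(SELECT ORDER_ID FROM ORDERS) AS METRIC"
).collect()[0]["METRIC"]
print(f"null customer ids: {null_count}, duplicate order ids: {duplicate_count}")

# 2) Scheduled execution: a task that calls a (hypothetical) DQ stored procedure daily.
session.sql("""
    CREATE OR REPLACE TASK DQ_DAILY_RUN
      WAREHOUSE = DQ_WH
      SCHEDULE = 'USING CRON 0 6 * * * UTC'
    AS
      CALL RUN_DQ_CHECKS('ORDERS')
""").collect()
session.sql("ALTER TASK DQ_DAILY_RUN RESUME").collect()  # tasks are created suspended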

How the Framework Adds Value

As organizations rely more on data to guide strategies, ensuring the accuracy, consistency, and integrity of that data becomes a top priority. Tiger Analytics’ Snowflake Native Data Quality Framework addresses this need by providing a comprehensive, end-to-end solution that integrates seamlessly into your existing Snowflake environment. With customizable features and actionable insights, it empowers teams to act quickly and efficiently. Here are the key benefits explained:

  • End-to-End Solution: Everything from profiling to monitoring is integrated in one place.
  • Customizable: Flexibly configure rules, thresholds, and schedules to meet your specific business requirements.
  • Real-Time DQ Enforcement: Maintain data quality throughout the entire data lifecycle with real-time checks.
  • Seamless Integration: Fully native to Snowflake, integrates easily with existing data pipelines and workflows.
  • Actionable Insights: Provide clear, actionable insights to help users take corrective actions quickly.
  • Scalability: Leverages Snowflake’s compute power, allowing for easy scaling as data volume grows.
  • Minimal Latency: Ensures efficient processing and reduced delays by executing DQ checks directly within Snowflake.
  • User-Friendly: Intuitive interface for both technical and non-technical users, enabling broad organizational adoption.
  • Proactive Monitoring: Identify data quality issues before they affect downstream processes.
  • Cost-Efficiency: Reduces the need for external tools, minimizing costs and eliminating data movement overhead.

Next Steps

While the framework offers a wide range of features to address data quality needs, we are continuously looking for opportunities to enhance its functionality. We at Tiger Analytics are exploring additional improvements that will further streamline processes and increase flexibility. Some of the enhancements we are currently working on include:

  • AI-Driven Recommendations: Use machine learning to improve and refine DQ rule suggestions.
  • Anomaly Detection: Leverage AI to detect unusual patterns and data quality issues that may not be captured by traditional rules.
  • Advanced Visualizations: Enhance dashboards with predictive analytics and deeper trend insights.
  • Expanded Integration: Explore broader support for hybrid cloud and multi-database environments.

A streamlined data quality framework redefines how organizations ensure and monitor data quality. By leveraging Snowflake’s capabilities and tools like Snowpark, our Snowflake Native Data Quality Framework simplifies complex processes and delivers measurable value.

How to Simplify Data Profiling and Management with Snowpark and Streamlit https://www.tigeranalytics.com/perspectives/blog/how-to-simplify-data-profiling-and-management-with-snowpark-and-streamlit/ Thu, 10 Oct 2024 12:41:48 +0000 https://www.tigeranalytics.com/?post_type=blog&p=23845 Learn why data quality is one of the most overlooked aspects of data management. While all models need good quality data to generate useful insights and patterns, data quality is especially important. In this blog, we explore how data profiling can help you understand your data quality. Discover how Tiger Analytics leverages Snowpark and Streamlit to simplify data profiling and management.

The accuracy of the data-to-insights journey is underpinned by one of the most foundational yet often overlooked aspects of data management – Data Quality. While all models need good quality data to generate useful insights and patterns, data quality is especially important across industries like retail, healthcare, and finance. Inconsistent, missing, or duplicate data can impact critical operations, from customer segmentation to regulatory compliance, resulting in potential financial or reputational losses.

Let’s look at an example:

A large retail company relies on customer data from various sources, such as online orders, in-store purchases, and loyalty program interactions. Over time, inconsistencies and errors in the customer database, such as duplicate records, incorrect addresses, and missing contact details, impacted the company’s ability to deliver personalized marketing campaigns, segment customers accurately, and forecast demand.

Data Profiling Matters – Third-party or Native app? Understanding the options

Data profiling helps the organization understand the nature of the data to build the data models, and ensures data quality and consistency, enabling faster decision-making and more accurate insights.

  • Improves Data Accuracy: Identifies inconsistencies, errors, and missing values.
  • Supports Better Decision-Making: Ensures reliable data for predictive analytics.
  • Enhances Efficiency: Helps detect and remove redundant data, optimizing resources and storage.

For clients using Snowflake for data management purposes, traditional data profiling tools often require moving data outside of Snowflake, creating complexity, higher costs, and security risks.

  • Data Transfer Overhead: External tools may require data to be moved out of Snowflake, increasing latency and security risks.
  • Scalability Limitations: Third-party tools may struggle with large Snowflake datasets.
  • Cost and Performance: Increased egress costs and underutilization of Snowflake’s native capabilities.
  • Integration Complexity: Complex setup and potential incompatibility with Snowflake’s governance and security features.

At Tiger Analytics, our clients faced a similar problem statement. To address these issues, we developed a Snowflake Native App utilizing Snowpark and Streamlit to perform advanced data profiling and analysis within the Snowflake ecosystem. This solution leverages Snowflake’s virtual warehouses for scalable, serverless computational power, enabling efficient profiling without external infrastructure.

How Snowpark Makes Data Profiling Simple and Effective

Snowpark efficiently manages large datasets by chunking data into smaller pieces, ensuring smooth profiling tasks. We execute YData Profiler and custom Python functions directly within Snowflake, storing results like outlier detection and statistical analysis for historical tracking.

We also created stored procedures and UDFs with Snowpark to automate daily or incremental profiling jobs. The app tracks newly ingested data, using Snowflake’s Task Scheduler to run operations automatically. Additionally, profiling outputs integrate seamlessly into data pipelines, with alerts triggered when anomalies are detected, ensuring continuous data quality monitoring.
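As a rough illustration of this pattern, the sketch below registers a Snowpark stored procedure that profiles a table with ydata-profiling and writes the HTML report to a stage. It assumes an existing Snowpark session; the table, stage, and procedure names are placeholders, and ydata-profiling must be made available to the procedure in your environment (for example via the packages or imports options), since package availability varies by account.

# Minimal sketch: a Snowpark stored procedure that profiles a table and writes the
# HTML report to an internal stage. Object names are illustrative placeholders.
import io
from snowflake.snowpark import Session

def profile_table(session: Session, table_name: str, stage_path: str) -> str:
    from ydata_profiling import ProfileReport  # resolved inside the procedure's environment
    pdf = session.table(table_name).to_pandas()
    html = ProfileReport(pdf, title=f"Profile: {table_name}").to_html()
    # Upload the report to a stage so it can later be shared via a pre-signed URL.
    session.file.put_stream(
        io.BytesIO(html.encode("utf-8")),
        f"{stage_path}/{table_name}_profile.html",
        auto_compress=False,
        overwrite=True,
    )
    return f"Report written to {stage_path}/{table_name}_profile.html"

session.sproc.register(
    profile_table,
    name="PROFILE_TABLE",
    packages=["snowflake-snowpark-python", "pandas", "ydata-profiling"],
    is_permanent=True,
    stage_location="@dq_code_stage",
    replace=True,
)

A Snowflake task can then call PROFILE_TABLE on a schedule so newly ingested data is profiled automatically, which is the role the Task Scheduler plays in our app.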

By keeping operations within Snowflake, Snowpark reduces data transfer, lowering latency and enhancing performance. Its native integration ensures efficient, secure, and scalable data profiling.

Let’s look at the key features of the app, built leveraging Snowpark’s capabilities.

Building a Native Data Profiling App in Snowflake – Lessons learnt:

1. Comprehensive Data Profiling

At the core of the app’s profiling capabilities are the YData Profiler or custom-built profilers – Python libraries, integrated using Snowpark. These libraries allow users to profile data directly within Snowflake by leveraging its built-in compute resources.

Key features include:

  • Column Summary Statistics: Quickly review important statistics for columns of all datatypes, such as string, number, and date, to understand the data at a glance.
  • Data Completeness Checks: Identify missing values and assess the completeness of your datasets to ensure no critical information is overlooked.
  • Data Consistency Checks: Detect anomalies or inconsistent data points to ensure that your data is uniform and accurate across the board.
  • Pattern Recognition and Value Distribution: Analyze data patterns and value distributions to identify trends or detect unusual values that might indicate data quality issues.
  • Overall Data Quality Checks: Review the health of your dataset by identifying potential outliers, duplicates, or incomplete data points.

2. Snowflake Compute Efficiency

The app runs entirely within Snowflake’s virtual warehouse environment. No external servers or machines are needed, as the app fully utilizes Snowflake’s built-in computing power. This reduces infrastructure complexity while ensuring top-tier performance, allowing users to profile and manage even large datasets efficiently.

3. Flexible Profiling Options

The app allows users to conduct profiling in two distinct ways—either by examining entire tables or by focusing on specific columns. This flexibility ensures that users can tailor the profiling process to their exact needs, from broad overviews to highly targeted analyses.

4. Full Data Management Capabilities

In addition to profiling, the app supports essential data management tasks. Users can insert, update, and delete records within Snowflake directly from the app, providing an all-in-one tool for both profiling and managing data.

5. Streamlit-Powered UI for Seamless Interaction

The app is built using Streamlit, which provides a clean, easy-to-use user interface. The UI allows users to interact with the app’s profiling and data management features without needing deep technical expertise. HTML-based reports generated by the app can be easily shared with stakeholders, offering clear and comprehensive data insights.

6. Ease in Generating and Sharing Profiling Reports

    Once the data profiling is complete, the app generates a pre-signed URL that allows users to save and share the profiling reports. Here’s how it works (a minimal example follows this list):

  • Generating Pre-Signed URLs: The app creates a pre-signed URL to a file on a Snowflake stage using the stage name and relative file path. This URL provides access to the generated reports without requiring direct interaction with Snowflake’s internal storage.
  • Accessing Files: Users can access the files in the stage through several methods:
    • Navigate directly to the pre-signed URL in a web browser.
    • Retrieve the pre-signed URL within Snowsight by clicking on it in the results table.
    • Send the pre-signed URL in a request to the REST API for file support.
  • Handling External Stages: For files in external stages that reference Microsoft Azure cloud storage, querying the function fails if the container is accessed using a shared access signature (SAS) token. Instead, the GET_PRESIGNED_URL function requires Azure Active Directory authentication to create a user delegation SAS token, utilizing a storage integration object that stores a generated service principal.
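For internal stages, the call itself is a single query. Here is a minimal sketch, assuming an existing Snowpark session and placeholder stage and file names; the third argument is the link’s validity in seconds.

# Minimal sketch: create a shareable pre-signed URL for a report stored on a stage.
url = session.sql(
    "SELECT GET_PRESIGNED_URL(@profiling_reports, 'ORDERS_profile.html', 3600) AS REPORT_URL"
).collect()[0]["REPORT_URL"]
print(f"Share this link with stakeholders (valid for one hour): {url}")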

7. Different roles within an organization can utilize this app in various scenarios:

  • Data Analysts: Data analysts can use the app to profile datasets, identify inconsistencies, and understand data quality issues. They will analyze the patterns and relationships in the data and point out the necessary fixes to resolve any errors, such as missing values or outliers.
  • Data Stewards/Data Owners: After receiving insights from data analysts, data stewards or data owners can apply the suggested fixes to cleanse the data, ensuring it meets quality standards. They can make adjustments directly through the app by inserting, updating, or deleting records, ensuring the data is clean and accurate for downstream processes.

This collaborative approach between analysts and data stewards ensures that the data is high quality and reliable, supporting effective decision-making across the organization.

Final notes

Snowpark offers a novel approach to data profiling by bringing it into Snowflake’s native environment. This approach reduces complexity, enhances performance, and ensures security. Whether improving customer segmentation in retail, ensuring compliance in healthcare, or detecting fraud in finance, Snowflake Native Apps with Snowpark provides a timely solution for maintaining high data quality across industries.

For data engineers looking to address client pain points, this translates to:

  • Seamless Deployment: Easily deployable across teams or accounts, streamlining collaboration.
  • Dynamic UI: The Streamlit-powered UI provides an interactive dashboard, allowing users to profile data without extensive technical knowledge.
  • Flexibility: Supports profiling of both Snowflake tables and external files (e.g., CSV, JSON) in external stages like S3 or Azure Blob.

With upcoming features like AI-driven insights, anomaly detection, and hierarchical data modeling, Snowpark provides a powerful and flexible platform for maintaining data quality across industries, helping businesses make smarter decisions and drive better outcomes.

Invisible Threats, Visible Solutions: Integrating AWS Macie and Tiger Data Fabric for Ultimate Security https://www.tigeranalytics.com/perspectives/blog/invisible-threats-visible-solutions-integrating-aws-macie-and-tiger-data-fabric-for-ultimate-security/ Thu, 07 Mar 2024 07:03:07 +0000 https://www.tigeranalytics.com/?post_type=blog&p=20726 Data defenses are now fortified against potential breaches with the Tiger Data Fabric-AWS Macie integration, automating sensitive data discovery, evaluation, and protection in the data pipeline for enhanced security. Explore how to integrate AWS Macie into a data fabric.

Discovering and handling sensitive data in the data lake or analytics environment can be challenging. It involves overcoming technical complexities in data processing and dealing with the associated costs of resources and computing.  Identifying sensitive information at the entry point of the data pipeline, probably during data ingestion, can help overcome these challenges to some extent. This proactive approach allows organizations to fortify their defenses against potential breaches and unauthorized access.

According to AWS, Amazon Macie is “a data security service that uses machine learning (ML) and pattern matching to discover and help protect sensitive data”, such as personally identifiable information (PII), payment card data, and Amazon Web Services (AWS) credentials. At Tiger Analytics, we’ve integrated these features into the pipelines within our proprietary Data Fabric solution called Tiger Data Fabric.

The Tiger Data Fabric is a self-service, low/no-code data management platform that facilitates seamless data integration, efficient data ingestion, robust data quality checks, data standardization, and effective data provisioning. Its user-centric, UI-driven approach demystifies data handling, enabling professionals with diverse technical proficiencies to interact with and manage their data resources effortlessly.

Leveraging Salient Features for Enhanced Security

The Tiger Data Fabric-AWS Macie integration offers a robust solution to enhance data security measures, including:

  • Data Discovery: The solution, with the help of Macie, discovers and locates sensitive data within the active data pipeline.
  • Data Protection: The design pattern isolates the sensitive data in a secure location with restricted access.
  • Customized Actions: The solution gives flexibility to design (customize) the actions to be taken when sensitive data is identified. For instance, the discovered sensitive data can be encrypted, redacted, pseudonymized, or even dropped from the pipeline with necessary approvals from the data owners.
  • Alerts and Notification: Data owners receive alerts when any sensitive data is detected, allowing them to take the required actions in response.

Tiger Data Fabric has many data engineering capabilities and has been enhanced recently to include sensitive data scans at the data ingestion step of the pipeline. Source data present on the S3 landing zone path is scanned for sensitive information and results are captured and stored at another path in the S3 bucket.

By integrating AWS Macie with the Tiger Data Fabric, we’re able to:

  • Automate the discovery of sensitive data.
  • Discover a variety of sensitive data types.
  • Evaluate and monitor data for security and access control.
  • Review and analyze findings.

For data engineers looking to integrate “sensitive data management” into their data pipelines, here’s a walkthrough of how we, at Tiger Analytics, implement these components for maximum value:

  • S3 Buckets store data in various stages of processing: a raw data bucket for uploading objects into the data pipeline, a scanning bucket where objects are scanned for sensitive data, a manual review bucket that holds objects in which sensitive data was discovered, and a scanned data bucket for starting the next ingestion step of the data pipeline.
  • Lambda and Step Functions execute the critical tasks of running sensitive data scans and managing workflows. Step Functions coordinate Lambda functions to manage business logic and execute the steps mentioned below:
    • triggerMacieJob: This Lambda function creates a Macie sensitive data discovery job on the designated S3 bucket during the scan stage (a minimal boto3 sketch follows this list).
    • pollWait: This Step Function waits for a specific state to be reached, ensuring the job runs smoothly.
    • checkJobStatus: This Lambda function checks the status of the Macie scan job.
    • isJobComplete: This Step function uses a Choice state to determine if the job has finished. If it has, it triggers additional steps to be executed.
    • waitForJobToComplete: This Step function employs a Choice state to wait for the job to complete and prevent the next action from running before the scan is finished.
    • UpdateCatalog: This Lambda function updates the catalog table in the backend Data Fabric database, and ensures that all job statuses are accurately reflected in the database.
  • A Macie scan job scans the specified S3 bucket for sensitive data. The process of creating the Macie job involves multiple steps, allowing us to choose data identifiers, either through custom configurations or standard options:
    • We create a one-time Macie job through the triggerMacieJob Lambda function.
    • We provide the complete S3 bucket path for sensitive data buckets to filter out the scan and avoid unnecessary scanning on other buckets.
    • While creating the job, Macie provides a provision to select data identifiers for sensitive data. In the AWS Data Fabric, we have automated the selection of custom identifiers for the scan, including CREDIT_CARD_NUMBER, DRIVERS_LICENSE, PHONE_NUMBER, USA_PASSPORT_NUMBER, and USA_SOCIAL_SECURITY_NUMBER.

      The findings can be seen on the AWS console and filtered by S3 bucket. We employed Glue jobs to parse the results and route the data to the manual review and raw buckets. The Macie job execution time is around 4-5 minutes. After scanning, if there are fewer than 1,000 sensitive records, they are moved to the quarantine bucket.

  • The parsing of Macie results is handled by a Glue job, implemented as a Python script. This script is responsible for extracting and organizing information from the Macie scanned results bucket.
    • In the parser job, we retrieve the severity level (High, Medium, or Low) assigned by AWS Macie during the one-time job scan.
    • In the Macie scanning bucket, we created separate folders for each source system and data asset, registered through Tiger Data Fabric UI.

      For example: zdf-fmwrk-macie-scan-zn-us-east-2/data/src_sys_id=100/data_asset_id=100000/20231026115848

      The parser job checks for severity and the report in the specified path. If sensitive data is detected, it is moved to the quarantine bucket. We format this data into parquet and process it using Spark data frames.

    • If we inspect the parquet file, the sensitive data is clearly visible in the SSN and phone number columns.
    • Once sensitive data is found, the same file is moved to the quarantine bucket.

      If no sensitive records are found, the data is moved to the raw zone, from where it is sent on to the data lake.
  • Airflow operators come in handy for orchestrating the entire pipeline, whether we integrate native AWS security services with Amazon MWAA or implement custom airflow on EC2 or EKS.
    • GlueJobOperator: Executes all the Glue jobs pre and post-Macie scan.
    • StepFunctionStartExecutionOperator: Starts the execution of the Step Function.
    • StepFunctionExecutionSensor: Waits for the Step Function execution to be completed.
    • StepFunctionGetExecutionOutputOperator: Gets the output from the Step Function.
  • IAM Policies grant the necessary permissions for the AWS Lambda functions to access AWS resources that are part of the application. Also, access to the Macie review bucket is managed using standard IAM policies and best practices.
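To make the triggerMacieJob step more concrete, here is a minimal boto3 sketch of a Lambda handler that starts a one-time Macie classification job scoped to the scanning bucket. The account id, bucket name, and event shape are placeholders, the data identifier ids mirror the ones listed above and should be verified against the identifiers available in your account, and this is a sketch of the pattern rather than the Tiger Data Fabric production code.

# Minimal sketch of a triggerMacieJob-style Lambda handler.
import uuid
import boto3

macie = boto3.client("macie2")

def lambda_handler(event, context):
    bucket = event.get("scan_bucket", "zdf-fmwrk-macie-scan-zn-us-east-2")  # placeholder default
    account_id = event["account_id"]

    response = macie.create_classification_job(
        jobType="ONE_TIME",
        name=f"dq-sensitive-scan-{uuid.uuid4()}",
        clientToken=str(uuid.uuid4()),
        s3JobDefinition={
            "bucketDefinitions": [{"accountId": account_id, "buckets": [bucket]}],
        },
        # Restrict the scan to the detection types referenced above (verify exact ids).
        managedDataIdentifierSelector="INCLUDE",
        managedDataIdentifierIds=[
            "CREDIT_CARD_NUMBER",
            "DRIVERS_LICENSE",
            "PHONE_NUMBER",
            "USA_PASSPORT_NUMBER",
            "USA_SOCIAL_SECURITY_NUMBER",
        ],
    )
    # The job id is passed along the Step Function so checkJobStatus can poll it.
    return {"jobId": response["jobId"]}

The checkJobStatus step can then call describe_classification_job(jobId=...) and inspect jobStatus until the scan reaches COMPLETE.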

Things to Keep in Mind for Effective Implementation

Based on our experience integrating AWS Macie with the Tiger Data Fabric, here are some points to keep in mind for an effective integration:

  • Macie’s primary objective is sensitive data discovery. It acts as a background process that keeps scanning the S3 buckets/objects and generates reports that can be consumed by various users so that appropriate actions can be taken. But if the requirement is to string it into a pipeline and automate the action based on the reports, then a custom process must be created.
  • Macie stops reporting the location of sensitive data after 1000 occurrences of the same detection type. However, this quota can be increased by requesting AWS. It is important to keep in mind that in our use case, where Macie scans are integrated into the pipeline, each job is dynamically created to scan the dataset. If the sensitive data occurrences per detection type exceed 1000, we move the entire file to the quarantine zone.
  • For certain data elements that Macie doesn’t consider sensitive, custom data identifiers help a lot. They can be defined via regular expressions, and their sensitivity can also be customized. Organizations whose data governance teams internally deem certain data sensitive can use this feature (a minimal sketch follows this list).
  • Macie also provides an allow list, which helps in ignoring some of the data elements that Macie would otherwise tag as sensitive by default.
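For reference, here is a minimal boto3 sketch of defining such a custom data identifier; the name, regex, and keywords are illustrative placeholders for an internal id format.

# Minimal sketch: a custom data identifier for an internal id format Macie does not
# detect out of the box. Name, regex, and keywords are illustrative placeholders.
import boto3

macie = boto3.client("macie2")

response = macie.create_custom_data_identifier(
    name="internal-employee-id",
    description="Internal employee ids of the form EMP-1234567",
    regex=r"EMP-\d{7}",
    keywords=["employee", "emp id"],
    maximumMatchDistance=50,  # how close a keyword must be to a regex match
)
custom_identifier_id = response["customDataIdentifierId"]
# Pass this id via customDataIdentifierIds when creating the classification job.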

The AWS Macie – Tiger Data Fabric integration seamlessly enhances automated data pipelines, addressing the challenges associated with unintended exposure of sensitive information in data lakes. By incorporating customizations such as employing regular expressions for data sensitivity and establishing suppression rules within the data fabrics they are working on, data engineers gain enhanced control and capabilities over managing and safeguarding sensitive data.

Armed with the provided insights, they can easily adapt the use cases and explanations to align with their unique workflows and specific requirements.

How Tiger’s Data Quality Framework unlocks Improvements in Data Quality https://www.tigeranalytics.com/perspectives/blog/how-tigers-data-quality-framework-unlocks-improvements-in-data-quality/ Tue, 06 Dec 2022 15:42:31 +0000 https://www.tigeranalytics.com/?p=10309 Accurate data is crucial for informed decisions. Organizations must set clear data quality objectives, implement early data quality processes, and deploy IT solutions aligned with business goals to achieve this. Read how utilizing the Tiger Data Quality framework for automation can help enhance efficiency and eliminate manual data quality checks for better outcomes.

According to Gartner, poor data quality could cost organizations up to $12.9 million annually. It has been predicted that by this year, 70% of organizations will track data quality metrics to reduce operational risks and costs. In fact, at Tiger Analytics, we’ve seen our customers embark on their data observability journey by focusing on data quality checks retrofitted to their pre-existing data pipelines.

We’ve helped our customers leverage their homegrown data quality framework and integrate it with their existing data engineering pipelines to improve the quality of data. Here’s how we did it.

Investigating what contributes to poor data quality – A checklist.

In our experience, we’ve found that the common data quality issues are:

1. Duplicate data

Since the data is ingested from multiple data sources for downstream analytics, data duplication could easily take place.

2. Incomplete data

Key observations in a data set could be missed as the data goes through multiple levels of transformation before it can be used for analysis.

3. Incorrect data

Accurate data is vital for data analytics; incorrect data could be captured from the source due to wrong entries or human error.

4. Inconsistent data

During a database transaction, when the data is updated, it should reflect all the related data points, or else, data inconsistencies would occur.

5. Ambiguous data

Misleading column names, misspellings, and formatting errors could lead to ambiguous data.

The above data quality issues lead to data decay that significantly impacts the analysis and affects business operations, leading to a loss of revenue.

How to measure data quality

Here are a few questions we need to ask ourselves to determine the quality of our data.

[Figure: Data Quality Dimensions]

 Manually measuring data quality against several of the above dimensions in a big data landscape could be time-consuming and resource-intensive. Detecting data quality issues manually is costly and will have a serious impact on business operations, so a scalable and automated data quality framework becomes all the more necessary.

The need for Automated Data Quality Monitoring

According to several studies [2], data scientists spend 30 to 60% of their time fixing data quality issues in the data preparation phase. This affects the productivity of data scientists, who spend their time troubleshooting data quality issues instead of working on building models. So, investing in the right set of technology frameworks for automated data quality checks will play a vital role in data engineering pipelines to improve speed to market, operational efficiency, and data integrity.

We’ve built an MVP solution that has helped us quickly build a configurable metadata-driven framework by detecting data quality issues early in the value chain.

Tiger’s Automated Data Quality Framework – Technical Deep Dive

Following the Fail fast design principle, we at Tiger Analytics built an MVP solution that can quickly deliver business impact by detecting the data quality issues early in the value chain. The MVP strategy has helped us to quickly build a configurable metadata-driven framework, as shown below:

[Figure: Tiger Data Quality Framework]

 Tiger’s Data Quality Framework is built to detect data quality problems and notify the business via Data Quality Dashboards and notifications so that the business and data teams get better visibility on the health of the data.

The framework is built on top of the Great Expectations Library [3], and we have abstracted some of its functionalities and made it configurable and parametrized for the calling applications.

The framework has been built keeping in mind the following principles:

  1. Scalability – Scalable framework that can cater to a high number and variety of checks
  2. Configurability – A focus on a configuration-driven framework to eliminate the need for ongoing development as use cases increase
  3. Portability – Cloud agnostic and open-source tool set to ensure plug-and-play design
  4. Flexibility – Designed to leverage a common framework for a variety of technologies – Databricks, Queries, Files, S3, RDBMS, etc.
  5. Maintainability – Modular and metadata-driven framework for ease of maintenance
  6. Integration – Ability to integrate with existing pipelines or function independently in a decoupled fashion with ad hoc invocation or event-driven patterns

The framework helps us identify and fix issues in data and not let it propagate to other pipelines. Using a CI/CD workflow, rules can be added to the rule repository. The rule repository consists of user configurations such as source details, columns to verify, and exceptions to run. The framework uses Spark and Great Expectations under the hood to generate logic to validate datasets. The framework relies on Spark to filter rules from the rule repository and to read the source data. After which we leverage Great Expectations to validate data. The result, which is in JSON format, is written out to S3. The result is processed and summarized in a Data quality dashboard built using Amazon-managed Grafana, as shown below.

[Figure: Data Quality Dashboard]
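As a rough illustration of the validate-and-persist pattern described above, the sketch below reads a source with Spark, applies a few rules with Great Expectations, and writes the JSON result to S3 for the dashboard to pick up. It uses the legacy SparkDFDataset API (newer Great Expectations releases expose a different interface), and the paths, bucket, and column names are illustrative placeholders rather than the framework’s actual configuration.

# Minimal sketch: validate a Spark DataFrame with Great Expectations and persist
# the JSON result to S3. Paths, bucket, and column names are placeholders.
import json
import boto3
from pyspark.sql import SparkSession
from great_expectations.dataset import SparkDFDataset  # legacy GE API

spark = SparkSession.builder.appName("dq-checks").getOrCreate()
orders = spark.read.parquet("s3://data-lake/curated/orders/")

ge_orders = SparkDFDataset(orders)
ge_orders.expect_column_values_to_not_be_null("order_id")
ge_orders.expect_column_values_to_be_unique("order_id")
ge_orders.expect_column_values_to_be_between("quantity", min_value=0, max_value=10000)

results = ge_orders.validate()  # serializable validation result

boto3.client("s3").put_object(
    Bucket="dq-results-bucket",
    Key="orders/validation_result.json",
    Body=json.dumps(results.to_json_dict()).encode("utf-8"),
)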

Tiger’s Automated Data Quality Framework – Custom Anomaly Detection Models

In a world of ever-growing data, data quality issues like inconsistent data, schema changes, and data drift have become more prevalent, affecting Data Science models’ performance and productivity. The use of statistical techniques could help detect outliers or anomalies. A handful of statistical DQ metrics to determine the anomalies are explained below:

In all the examples below, population data is the historical or archived data stored in the database, while sample data is the incremental or additional data generated. For easier understanding, all the examples mentioned below are with respect to the Retail Sales Data.

1. Suspicious volume indicator:

A moving average of data is calculated over the last quarter and compared with the actual data; if the data point is above or below the moving average by a threshold percentage, it is detected as an anomaly.

Example: If historical data is 10000 over the past x quarters, but the current data point is more than 12000 or less than 8000 (which is +/- 20% of 10000), then it is an anomaly. (With an assumption that the threshold is set to 20%)

2. Median absolute deviation (MAD):

The median absolute deviation of the data is calculated, and data points beyond +/- 3 times the MAD from the median are detected as anomalies.

Example: If the median revenue is 50 and Median Absolute Deviation (MAD) is 10 for the laundry category of sales data, then any data point greater than 80 (50 + 3*10) and lesser than 20 (50 – 3*10) would be detected as an anomaly

3. Inter-Quartile Range (IQR):

IQR is defined as the range between the 25th percentile (Q1) and 75th percentile (Q3) of the dataset.

Let’s consider revenue for the laundry category,

  • The revenue data points which are below Q1 – 1.5 times IQR are considered anomalies
  • The revenue data points which are above Q3 + 1.5 times IQR are considered anomalies

Example: If the revenue of the laundry category has an IQR of 20 (assume Q1 is 40 and Q3 is 60), then the revenue higher than 90 (60 + 1.5*20) and lower than 10 (40 – 1.5*20) would be detected as an anomaly.
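Before moving on to the hypothesis-test-based checks, here is a minimal pandas sketch of the first three checks (moving-average threshold, MAD, and IQR) on a hypothetical revenue series; the thresholds mirror the examples above, and the data is made up for illustration.

# Minimal sketch of checks 1-3 on a hypothetical revenue series.
import pandas as pd

revenue = pd.Series([10000, 10200, 9800, 10100, 15000, 9900, 4000], dtype=float)

# 1. Suspicious volume indicator: +/- 20% around the trailing moving average.
moving_avg = revenue.rolling(window=4, min_periods=1).mean().shift(1)
volume_anomaly = (revenue > 1.2 * moving_avg) | (revenue < 0.8 * moving_avg)

# 2. Median absolute deviation: flag points beyond median +/- 3 * MAD.
median = revenue.median()
mad = (revenue - median).abs().median()
mad_anomaly = (revenue > median + 3 * mad) | (revenue < median - 3 * mad)

# 3. Inter-quartile range: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = revenue.quantile([0.25, 0.75])
iqr = q3 - q1
iqr_anomaly = (revenue < q1 - 1.5 * iqr) | (revenue > q3 + 1.5 * iqr)

print(pd.DataFrame({
    "revenue": revenue,
    "volume_anomaly": volume_anomaly,
    "mad_anomaly": mad_anomaly,
    "iqr_anomaly": iqr_anomaly,
}))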

4. Test of proportions (Independent Samples):

One sample Z-test:

To determine whether there is a significant difference between the population proportion and the proportion of data under observation.

Example: Historically, the revenue proportion of the laundry category in the month of January was 0.5. The latest data has a proportion of 0.6, so one sample z-test is performed to check whether there is any significant difference between the proportions.

Two sample Z-test (Independent Samples):

To test for significant differences between the proportions of two samples, or both the proportions are from the same populations.

Example: (Success/ Failure), (Yes/No), (A/B testing).

The proportion of the revenue for the laundry category is 0.2 when it is marketed through digital marketing and 0.17 when in-store, so a z-test for two proportions is used to check for significant differences between the two marketing types.

5. Repeated measures of ANOVA (Dependent Samples):

To check whether there is any significant difference between more than two sample means of the same subjects under consideration over time. This test would let us know whether all the means are the same or at least one mean is different from the other means.

Example: Let’s consider the revenue of the laundry, cooktop, and refrigerator categories over the month of January, February, and March individually. The repeated measure of ANOVA will test for any significant difference between the revenues of the above-mentioned categories over the period.

6. Signed Rank test – Wilcoxon Signed rank test:

It’s a repeated measures test/ test of related samples. For one sample, it’s an independent test, and for the two-sample data, it compares the data under observation under different conditions for the same subjects and checks whether there is a significant difference between them. This helps in evaluating the time course effect.

Examples: To test for any significant difference between the revenue of the refrigerator category in the months of winter and summer seasons, a signed rank test is used.

7. Shift indicator:

a. Benford Law/Law of first numbers:

The pre-defined theoretical percentage distribution of the first digits in Benford law is compared against the percentage of the first digits of data under observation, and if it’s beyond a certain percentage threshold, then it’s detected as an anomaly.

Example: Pre-defined theoretical percentage distribution of digit 9 in Benford’s law or the law of first numbers is 4.6%, and if the first digit percentage distribution of digit 9 for the revenue of the laundry category over the month of January is 6%, then it’s an anomaly. (Assume the threshold is set to 10%)

b. Kolmogorov-Smirnov (KS) test – One sample:

Theoretical Benford percentage distribution is compared with the first digit percentage of data under observation. The confidence interval is defined to check whether there is a significant difference between the distributions.

Example: Consider the revenue data for the refrigerator category for the month of January. The percentage distribution of the first digit (1-9) of the revenue data is compared with the Benford law’s theoretical distribution. If there is any significant difference between them, then it’s an anomaly.

c. Kolmogorov-Smirnov (KS) test – Two samples:

In KS two-sample test, the first digit percentage distribution of two samples under observation is compared to check for significant differences between them.

Example: Consider the revenue data for the refrigerator category for the month of January and February. The percentage distribution of the first digits (0-9) for revenue data for the months of January and February are compared, and if any significant difference is observed, then it’s an anomaly.

8. Quantile-Quantile (Q-Q) plot (Graphical Approach):

This approach checks whether the data under observation follows the intended theoretical distribution (Gaussian, Poisson, Uniform, Exponential, etc.) or whether the data is skewed. It also helps in identifying the outliers.

Example: Plotting revenue data for the month of January to check for any outliers and data distribution.

9. Correspondence analysis (Graphical Approach):

Correspondence analysis helps in identifying the relative association between and within two groups based on their variables or attributes.

Examples: Refrigerators of brands A, B, and C will have certain attributes associated with them, such as price, quality, warranty, after-sales service, and rating. Correspondence analysis will help in identifying the relative association between these refrigerator brands based on the above-mentioned attributes.

Summary

Accurate data is the key to businesses making the right decisions. To set the data quality right, business leaders must meticulously define the data quality objectives and processes to profile, detect and fix data quality issues early in the BI value chain to improve operational efficiency. IT leaders must work on detailed data quality strategies, solutions, and technology to meet business objectives.

By taking an automated approach to data quality using the Tiger Data Quality framework, data platform teams can avoid the time-consuming processes of manual data quality checks. This can bring the following outcomes to the business:

  • Improves confidence in making effective decisions from good data
  • Improves customer experience
  • Reduces the cost of detecting and fixing data quality issues much late in the value chain
  • Improves productivity for data engineers and data scientists, who can focus on solving business problems over debugging data quality issues
  • Improves data compliance
Sources

1. https://www.gartner.com/smarterwithgartner/how-to-improve-your-data-quality
2. https://visit.figure-eight.com/2015-data-scientist-report
3. https://greatexpectations.io/

Maximizing Efficiency: Redefining Predictive Maintenance in Manufacturing with Digital Twins https://www.tigeranalytics.com/perspectives/blog/ml-powered-digital-twin-predictive-maintenance/ Thu, 24 Dec 2020 18:19:09 +0000 https://www.tigeranalytics.com/?p=4867 Tiger Analytics leverages ML-powered digital twins for predictive maintenance in manufacturing. By integrating sensor data and other inputs, we enable anomaly detection, forecasting, and operational insights. Our modular approach ensures scalability and self-sustainability, yielding cost-effective and efficient solutions.

Historically, manufacturing equipment maintenance has been done during scheduled service downtime. This involves periodically stopping production for carrying out routine inspections, maintenance, and repairs. Unexpected equipment breakdowns disrupt the production schedule; require expensive part replacements, and delay the resumption of operations due to long procurement lead times.

Sensors that measure and record operational parameters (temperature, pressure, vibration, RPM, etc.) have been affixed on machinery at manufacturing plants for several years. Traditionally, the data generated by these sensors was compiled, cleaned, and analyzed manually to determine failure rates and create maintenance schedules. But every equipment downtime for maintenance, whether planned or unplanned, is a source of lost revenue and increased cost. The manual process was time-consuming, tedious, and hard to handle as the volume of data rose.

The ability to predict the likelihood of a breakdown can help manufacturers take pre-emptive action to minimize downtime, keep production on track, and control maintenance spending. Recognizing this, companies are increasingly building both reactive and predicted computer-based models based on sensor data. The challenge these models face is the lack of a standard framework for creating and selecting the right one. Model effectiveness largely depends on the skill of the data scientist. Each model must be built separately; model selection is constrained by time and resources, and models must be updated regularly with fresh data to sustain their predictive value.

As more equipment types come under the analytical ambit, this approach becomes prohibitively expensive. Further, the sensor data is not always leveraged to its full potential to detect anomalies or provide early warnings about impending breakdowns.

In the last decade, the Industrial Internet of Things (IIoT) has revolutionized predictive maintenance. Sensors record operational data in real-time and transmit it to a cloud database. This dataset feeds a digital twin, a computer-generated model that mirrors the physical operation of each machine. The concept of the digital twin has enabled manufacturing companies not only to plan maintenance but to get early warnings of the likelihood of a breakdown, pinpoint the cause, and run scenario analyses in which operational parameters can be varied at will to understand their impact on equipment performance.

Several eminent ‘brand’ products exist to create these digital twins, but the software is often challenging to customize, cannot always accommodate the specific needs of each and every manufacturing environment, and significantly increases the total cost of ownership.

ML-powered digital twins can address these issues when they are purpose-built to suit each company’s specific situation. They are affordable, scalable, self-sustaining, and, with the right user interface, are extremely useful in telling machine operators the exact condition of the equipment under their care. Before embarking on the journey of leveraging ML-powered digital twins, certain critical steps must be taken:

1. Creation of an inventory of the available equipment, associated sensors and data.

2. Analysis of the inventory in consultation with plant operations teams to identify the gaps. Typical issues may include missing or insufficient data from the sensors; machinery that lacks sensors; and sensors that do not correctly or regularly send data to the database.

3. Coordination between the manufacturing operations and analytics/technology teams to address some gaps: installing sensors if lacking (‘sensorization’); ensuring that sensor readings can be and are being sent to the cloud database; and developing contingency approaches for situations in which no data is generated (e.g., equipment idle time).

4. A second readiness assessment, followed by a data quality assessment, must be performed to ensure that a strong foundation of data exists for solution development.

This creates the basis for a cloud-based, ML-powered digital twin solution for predictive maintenance. To deliver the most value, such a solution should:

  • Use sensor data in combination with other data as necessary
  • Perform root cause analyses of past breakdowns to inform predictions and risk assessments
  • Alert operators of operational anomalies
  • Provide early warnings of impending failures
  • Generate forecasts of the likely operational situation
  • Be demonstrably effective to encourage its adoption and extensive utilization
  • Be simple for operators to use, navigate and understand
  • Be flexible to fit the specific needs of the machines being managed

[Figure: Predictive maintenance cycle]

When model-building begins, the first step is to account for the input data frequency. As sensors take readings at short intervals, timestamps must be regularized and resamples taken for all connected parameters where required. At this time, data with very low variance or too few observations may be excised. Model data sets containing sensor readings (the predictors) and event data such as failures and stoppages (the outcomes) are then created for each machine using both dependent and independent variable formats.

To select the right model for anomaly detection, multiple models are tested and scored on the full data set and validated against history. To generate a short-term forecast, gaps related to machine testing or idle time must be accounted for, and a range of models evaluated to determine which one performs best.

Tiger Analytics used a similar approach when building these predictive maintenance systems for an Indian multinational steel manufacturer. Here, we found that regression was the best approach to flag anomalies. For forecasting, the accuracy of Random Forest models was higher compared to ARIMA, ARIMAX, and exponential smoothing.

[Figure: Predictive maintenance analysis flow]
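To make the model-building steps concrete, here is a minimal sketch of the pattern described above: regularize sensor timestamps, flag anomalies from regression residuals, and fit a Random Forest on lagged readings for a short-term forecast. The column names, resampling interval, and thresholds are illustrative placeholders, and this is a sketch of the general approach rather than the solution built for the client.

# Minimal sketch: resample sensor data, flag anomalies via regression residuals,
# and forecast the next interval with a Random Forest on lagged readings.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

raw = pd.read_csv("sensor_readings.csv", parse_dates=["timestamp"])  # placeholder file
df = (raw.set_index("timestamp")
         .resample("10min").mean()   # regularize irregular sensor timestamps
         .interpolate(limit=3))      # fill only short gaps
df = df.dropna(subset=["temperature", "pressure", "rpm", "vibration"])  # drop intervals still empty

# Anomaly flag: large residuals from a regression of vibration on the other parameters.
X, y = df[["temperature", "pressure", "rpm"]], df["vibration"]
residuals = y - LinearRegression().fit(X, y).predict(X)
df["anomaly"] = residuals.abs() > 3 * residuals.std()

# Short-term forecast: Random Forest on lagged vibration readings.
for lag in (1, 2, 3):
    df[f"vibration_lag{lag}"] = df["vibration"].shift(lag)
model_df = df.dropna()
lag_features = [c for c in model_df.columns if c.startswith("vibration_lag")]
forecaster = RandomForestRegressor(n_estimators=200, random_state=42)
forecaster.fit(model_df[lag_features], model_df["vibration"])
next_step = forecaster.predict(model_df[lag_features].tail(1))

print(f"Anomalous intervals: {int(df['anomaly'].sum())}, "
      f"next-step vibration forecast: {next_step[0]:.2f}")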

Using a modular paradigm to build an ML-powered digital twin makes it straightforward to implement and deploy. It does not require frequent manual recalibration to be self-sustaining, and it is scalable, so it can be implemented across a wide range of equipment with minimal additional effort and time.

Careful execution of the preparatory actions is as important as strong model-building to the success of this approach and its long-term viability. To address the challenge of low-cost, high-efficiency predictive maintenance in the manufacturing sector, employ this sustainable solution: a combination of technology, business intelligence, data science, user-centric design, and the operational expertise of the manufacturing employees.

This article was first published in Analytics India Magazine.

Automating Data Quality: Using Deequ with Apache Spark https://www.tigeranalytics.com/perspectives/blog/automating-data-quality-using-deequ-with-apache-spark/ Thu, 03 Sep 2020 10:09:56 +0000 https://www.tigeranalytics.com/blog/automating-data-quality-using-deequ-with-apache-spark/ Get to know how to automate data quality checks using Deequ with Apache Spark. Discover the benefits of integrating Deequ for data validation and the steps involved in setting up automated quality checks for improving data reliability in large-scale data processing environments.

Introduction

When dealing with data, the main factor to be considered is the quality of the data, especially in a Big Data environment, where data quality is critical. Having inaccurate or flawed data will produce incorrect results in data analytics. Many developers test the data manually before training their model with the available data, which is time-consuming and leaves room for mistakes.

Deequ

Deequ is an open-source framework for testing data quality. It is built on top of Apache Spark and is designed to scale up to large datasets. Deequ is developed and used at Amazon for verifying the quality of many large production datasets. The system computes data quality metrics regularly, verifies constraints defined by dataset producers, and publishes datasets to consumers in case of success.

Deequ allows us to calculate data quality metrics on our data set, and also allows us to define and verify data quality constraints. It also specifies what constraint checks are to be made on your data. There are three significant components in Deequ. These are Constraint Suggestion, Constraint Verification, and Metrics Computation. This is depicted in the image below.

[Figure: Deequ Components]

This blog provides a detailed explanation of these three components with the help of practical examples.

1. Constraint Verification: You can provide your own set of data quality constraints which you want to verify on your data. Deequ checks all the provided constraints and gives you the status of each check.

2. Constraint Suggestion: If you are not sure of what constraints to test on your data, you can use Deequ’s constraint suggestion. Constraint Suggestion provides a list of possible constraints you can test on your data. You can use these suggested constraints and pass it to Deequ’s Constraint Verification to check your data quality.

3. Metrics Computation: You can also compute data quality metrics regularly.

Now, let us implement these with some sample data.

For this example, we have downloaded a sample csv file with 100 records. The code is run using Intellij IDE (you can also use Spark Shell).

Add Deequ library

You can add the below dependency in your pom.xml

<dependency>
    <groupId>com.amazon.deequ</groupId>
    <artifactId>deequ</artifactId>
    <version>1.0.2</version>
</dependency>

If you are using Spark Shell, you can download the Deequ jar as shown below-

wget https://repo1.maven.org/maven2/com/amazon/deequ/deequ/1.0.2/deequ-1.0.2.jar

Now, let us start the Spark session and load the csv file to a dataframe.

import org.apache.spark.sql.SparkSession

// Start a local Spark session for the Deequ checks
val spark = SparkSession.builder()
  .master("local")
  .appName("deequ-Tests")
  .getOrCreate()

// Load the sample CSV file into a DataFrame
val data = spark.read.format("csv")
  .option("header", true)
  .option("delimiter", ",")
  .option("inferschema", true)
  .load("C:/Users/Downloads/100SalesRecords.csv")

data.show(5)

The data has now been loaded into a data frame.

data.printSchema

[Output: DataFrame schema]

Note

Ensure that there are no spaces in the column names. Deequ will throw an error if there are spaces in the column names.

Constraint Verification

Let us verify our data by defining a set of data quality constraints.

Here, we have given duplicate check(isUnique), count check (hasSize), datatype check(hasDataType), etc. for the columns we want to test.

We have to import Deequ’s verification suite and pass our data to that suite. Only then, we can specify the checks that we want to test on our data.

import com.amazon.deequ.VerificationResult.checkResultsAsDataFrame
import com.amazon.deequ.checks.{Check, CheckLevel}
import com.amazon.deequ.constraints.ConstrainableDataTypes
import com.amazon.deequ.{VerificationResult, VerificationSuite}

// Constraint verification
val verificationResult: VerificationResult = {
  VerificationSuite()
    .onData(data) // our input data to be tested
    // data quality checks
    .addCheck(
      Check(CheckLevel.Error, "Review Check")
        .isUnique("OrderID")
        .hasSize(_ == 100)
        .hasDataType("UnitPrice", ConstrainableDataTypes.String)
        .hasNumberOfDistinctValues("OrderDate", _ >= 5)
        .isNonNegative("UnitCost"))
    .run()
}

On successful execution, it displays the below result. It will show each check status and provide a message if any constraint fails.

Using the below code, we can convert the check results to a data frame.

// Converting check results to a DataFrame
val verificationResultDf = checkResultsAsDataFrame(spark, verificationResult)
verificationResultDf.show(false)

[Output: constraint verification results]

Constraint Suggestion

Deequ can provide possible data quality constraint checks to be tested on your data. For this, we need to import ConstraintSuggestionRunner and pass our data to it.

import com.amazon.deequ.checks.{Check, CheckLevel}
import com.amazon.deequ.suggestions.{ConstraintSuggestionRunner, Rules}

// Constraint suggestion
val suggestionResult = {
  ConstraintSuggestionRunner()
    .onData(data)
    .addConstraintRules(Rules.DEFAULT)
    .run()
}

We can now investigate the constraints that Deequ has suggested.

import spark.implicits._

val suggestionDataFrame = suggestionResult.constraintSuggestions.flatMap {
  case (column, suggestions) =>
    suggestions.map { constraint =>
      (column, constraint.description, constraint.codeForConstraint)
    }
}.toSeq.toDS()

On successful execution, we can see the result as shown below. It provides automated suggestions on your data.

suggestionDataFrame.show(4)

[Output: suggested constraints]

You can also pass these Deequ suggested constraints to VerificationSuite to perform all the checks provided by SuggestionRunner. This is illustrated in the following code.

val allConstraints = suggestionResult.constraintSuggestions
  .flatMap { case (_, suggestions) => suggestions.map { _.constraint } }
  .toSeq

// Passing the generated checks to VerificationSuite
val generatedCheck = Check(CheckLevel.Error, "generated constraints", allConstraints)

val verificationResult = VerificationSuite()
  .onData(data)
  .addChecks(Seq(generatedCheck))
  .run()

val resultDf = checkResultsAsDataFrame(spark, verificationResult)

Running the above code will check all the constraints that Deequ suggested and provide the status of each constraint check, as shown below.

resultDf.show(4)

Metrics Computation

You can compute metrics using Deequ. For this, you need to import AnalyzerContext.

import com.amazon.deequ.analyzers._
import com.amazon.deequ.analyzers.runners.AnalyzerContext.successMetricsAsDataFrame
import com.amazon.deequ.analyzers.runners.{AnalysisRunner, AnalyzerContext}

// Metrics computation
val analysisResult: AnalyzerContext = {
  AnalysisRunner
    // data to run the analysis on
    .onData(data)
    // define analyzers that compute metrics
    .addAnalyzers(Seq(
      Size(),
      Completeness("UnitPrice"),
      ApproxCountDistinct("Country"),
      DataType("ShipDate"),
      Maximum("TotalRevenue")))
    .run()
}

val metricsDf = successMetricsAsDataFrame(spark, analysisResult)

Once the run is successful, you can see the results as below.

metricsDf.show(false)

[Output: computed metrics]

You can also store the metrics that you computed on your data. For this, you can use Deequ’s metrics repository; refer to the Deequ documentation to learn more about the repository.

Conclusion

Overall, Deequ has many advantages. We can calculate data metrics, and define and verify data quality constraints. Even large datasets consisting of billions of rows can be easily verified using Deequ. Apart from data quality checks, Deequ also provides anomaly detection and incremental metrics computation.
