Data Management Archives - Tiger Analytics

The Data Leader’s Guide to Responsible AI: Why Strong Data Governance Is Key to Mitigating AI Risks https://www.tigeranalytics.com/perspectives/blog/the-data-leaders-guide-to-responsible-ai-why-strong-data-governance-is-key-to-mitigating-ai-risks/ Tue, 29 Apr 2025 08:24:37 +0000 https://www.tigeranalytics.com/?post_type=blog&p=24687 AI has moved from science fiction to everyday reality, but its success hinges on strong data governance. In this blog, we explore why effective governance is crucial for AI, how data leaders can build effective data governance for AI, and practical steps for aligning data governance with AI initiatives, ensuring transparency, mitigating risks, and driving better outcomes.

In 1968, HAL 9000’s “I’m sorry, Dave. I’m afraid I can’t do that” marked the beginning of a new era in entertainment. As the years passed, films like 2004’s I, Robot and 2015’s Chappie continued to explore AI’s potential – from “One day they’ll have secrets… one day they’ll have dreams” to “I am consciousness. I am alive. I am Chappie.” While these fictional portrayals pushed the boundaries of our imagination, they also laid the groundwork for the AI technologies shaping the world today – self-driving cars, consumer personalization, Generative AI, and the like.

Today, the rise of GenAI and copilots from various tool vendors and organizations has generated significant interest, driven by advancements in NLP, ML, computer vision, and other deep learning models. For CIOs, CDOs, and data leaders, this shift underscores a critical point: AI-powered technologies must be responsible, transparent, privacy-preserving, and free of bias to truly add business value.

Since AI and GenAI both depend on data for fuel, the right data must be available with the right quality, trust, and compliance. Without strong data governance, organizations risk AI models that reinforce bias, misinterpret data, or fail to meet regulatory requirements. This makes Data Governance a critical discipline that serves as a guiding light.

Hence, ‘the lighthouse remains a beacon amidst shifting tides’. In today’s context, this metaphor reflects the challenges faced by both data-driven and AI-driven enterprises. The landscape of data generation, usage, and transformation is constantly evolving, presenting new complexities for organizations to navigate. While data governance is not new, constant changes in the weather (data) patterns and the infusion of AI across industries have made it increasingly relevant, acting as the foundation on which AI can be governed and enabled.

[Image: AI and data governance]

At Tiger Analytics, we are constantly exploring new opportunities to optimize the way we work. Take, for example, enterprises where time-to-market is critical: product vendors have developed copilots using GenAI to serve them. We have also observed many initiatives among our Fortune 100 clients leveraging models and various AI elements to achieve a faster time-to-market or develop new offerings. Many of these projects are successful, scalable, and continue to drive efficiency. However, the inevitable question arises: How do we govern AI?

What are the biggest challenges in Data Governance – Answering key questions

Data governance is not just about compliance — it is essential for enhancing data quality, trustworthiness, efficiency, and scalability, and for producing better AI outcomes. Strong governance practices (process, operating model, roles and responsibilities) empower enterprises to unlock the full potential of their data assets.

Below are a few important questions that stakeholders across the enterprise, including CxOs, business leaders, Line of Business (LoB) owners, and data owners, are seeking to answer today. As organizations strive towards data literacy and ethical AI practices, these questions highlight the importance of implementing governance strategies that can support both traditional data management and emerging AI risks.

  • Who is in charge of the model or the data product that uses my model?
  • Who can control (modify/delete/archive) the dataset?
  • Who will decide how to control the data and make key decisions?
  • Who will decide what is to be controlled in the workflow or data product or model that my data is part of?
  • What are the risks to the end outcome if intelligence is augmented without audits, controls, or quality assurance?
  • Are controls for AI different from current ones or can existing ones be repurposed?
  • Which framework will guide me?
  • Is the enterprise data governance initiative flexible to accommodate my AI risks and related work?
  • With my organization becoming data literate and committed to data ethics, how can AI initiatives take advantage of this?
  • Is user consent still valid in the new AI model, and how is it protected?
  • What are the privacy issues to be addressed?

Let’s consider an example. A forecasting model is designed to help predict seasonal sales to launch a new apparel range targeted at a specific customer segment within an existing market. Now, assume the data is to be sourced from your marketplace and there are ready-made data products that can be used. How do you check the health of the data before you run a simulation? What if you face challenges such as ownership disputes, metadata inconsistencies, or data quality issues? Is there a risk of privacy breaches if, for example, someone forgets to remove sample data from the dataset?

This is why Data Governance (including data management) and AI must work in tandem, even more so when we consider the risk of non-compliance, for which the impact is far greater. Any governance approach must be closely aligned with data governance practices and effectively integrated into daily operations. There are various ways in which the larger industry and we at Tiger Analytics are addressing this. In the next section, we take a look at the key factors that can serve as the foundation for AI governance within an enterprise.

Untangling the AI knot: How to create a data governance framework for AI

At Tiger Analytics, we’ve identified seven elements that are crucial in establishing a framework for Foundational Governance for AI – we call it HEal & INtERAcT. We believe a human-centric and transparent approach is essential in governing AI assets. As AI continues to evolve and integrate into various processes within an organization, governance must remain simple.

Rather than introducing entirely new frameworks, our approach focuses on accessible AI governance in which existing data governance operations are expanded to include new dimensions, roles, processes, and standards. This creates a seamless extension rather than a separate entity, thereby eliminating the complexities of managing AI risks in silos and untangling the “AI knot” through smooth integration.

[Image: AI and data governance]

The seven elements ensure AI governance remains transparent and aligns with the larger enterprise data governance strategy, influencing processes, policies, standards, and change management. For instance, Integrity and Trustworthiness reinforce the reliability of model outputs and help ensure privacy, while Accountability and Responsibility establish clear ownership of AI-driven decisions, ensuring compliance and ethical oversight. As AI introduces new roles and responsibilities, governance frameworks are revised to cover emerging risks and complexities like cross-border data, global teams, mergers, and varying regulations.

In addition, the data lifecycle in any organization is dependent on data governance. AI cannot exist without enterprise data. Synthetic data can only mimic actual data and issues. Therefore, high-quality, fit-for-purpose data is essential to train AI models and GenAI for more accurate predictions and better content generation.

Getting started with AI governance

Here is how an enterprise can begin its AI governance journey:

  • Identify all the AI elements and list out every app and area that uses them
  • Assess what your AIOps looks like and how it is being governed
  • Identify key risks from stakeholders
  • Map them back to the principles
  • Define controls for the risks identified
  • Align framework with your larger data governance strategy
    • Enable specific processes for AI
    • Set data standards for AI
    • Tweak data policies for AI
    • Include an AI glossary for Cataloging and Lineage, providing better context
    • Data observability for AI to set up proactive detection for better model output and performance

Essentially, enterprise DG+AI principles (the framework), along with identification and mitigation strategies and risk controls, will pave the way for efficient AI governance. Given the evolving nature of this space, there is no one-size-fits-all solution. Numerous principles exist, but expert guidance and consulting are essential to navigate this complexity and implement the right approach.
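To make the steps above a little more concrete, here is a minimal, hypothetical sketch of an AI asset register that maps identified risks back to principles and controls. The asset names, principles, and control IDs are illustrative assumptions, not a prescribed standard.

from dataclasses import dataclass, field

@dataclass
class AIAssetRecord:
    """One entry in a hypothetical AI asset register (all values illustrative)."""
    asset: str                                       # app, model, or data product using AI
    owner: str                                       # accountable owner
    risks: list = field(default_factory=list)        # risks identified with stakeholders
    principles: list = field(default_factory=list)   # governance principles the risks map to
    controls: list = field(default_factory=list)     # controls defined for those risks

register = [
    AIAssetRecord(
        asset="seasonal-sales-forecast-model",
        owner="demand-planning-data-owner",
        risks=["stale training data", "unclear consent scope"],
        principles=["Integrity and Trustworthiness", "Privacy"],
        controls=["DQ-check-before-training", "consent-review-on-new-use"],
    ),
    AIAssetRecord(
        asset="genai-support-copilot",
        owner="customer-service-lob-owner",
        risks=["hallucinated answers", "PII leakage in prompts"],
        principles=["Transparency", "Privacy"],
        controls=[],  # gap: risks identified but no control defined yet
    ),
]

# Surface assets whose identified risks are not yet covered by any control.
gaps = [record.asset for record in register if record.risks and not record.controls]
print("Assets needing controls:", gaps)

Even a simple register like this makes the identify-map-control loop auditable and gives the enterprise data governance team a natural place to plug AI assets into existing cataloging and ownership processes.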

The Road Ahead

AI has moved from science fiction to everyday reality, shaping decisions, operations, and personalized customer experiences. The focus now is on ensuring it is transparent, ethical, and well-governed. For this, AI and data governance must work in tandem. From customer churn analysis and loss prevention to identifying the right business and technical metrics, and from managing consent to protecting privacy in the new era of AI regulations, AI can drive business value — but only when built on a foundation of strong data governance. A well-structured governance program ensures AI adoption is responsible and scalable, minimizing risks while maximizing impact. By applying the principles and addressing the key questions above, you can ensure a successful implementation, enabling your business to leverage AI for meaningful outcomes.

So while you ponder these insights, ’til next time — as the T-800 said, “I’ll be back!”

Building Dynamic Data Pipelines with Snowpark: Our Framework to Drive Modern Data Transformation https://www.tigeranalytics.com/perspectives/blog/building-dynamic-data-pipelines-with-snowpark-our-framework-to-drive-modern-data-transformation/ Mon, 21 Oct 2024 14:30:41 +0000 https://www.tigeranalytics.com/?post_type=blog&p=23878 Learn about the challenges of traditional data transformation methods and how a dynamic approach using metadata configuration can help address these issues. By defining transformation rules and specifications, enterprises can create flexible pipelines that adapt to their evolving data processing needs, ultimately accelerating the process of extracting insights from data.

The post Building Dynamic Data Pipelines with Snowpark: Our Framework to Drive Modern Data Transformation appeared first on Tiger Analytics.

]]>
Why is Data Transformation important?

E owns a retail business that makes daily sales through its online and brick-and-mortar stores. Every day, the business receives a huge wave of data in its systems—products viewed, orders placed, returns received, new customers, repeat customers, and so on. Data ingestion pipelines help ingest this data from disparate sources. The ingested raw data pool is large and can generate valuable insights, but how does the business harmonize it and turn it into actionable insights that boost the business?

That’s where Data transformation comes in.

Customers may have diverse data formats and reporting requirements, and that’s why data transformation forms a crucial aspect of any analytics product. If we examine traditional methods of manually building data transformation pipelines, we can see that these can be time-consuming and often delay the life cycle of deriving insights from data. One solution to this challenge is implementing dynamic transformation using metadata configuration. A flexible pipeline capable of adapting to data processing needs can be created by defining transformation rules and specifications.

Modernizing Data Platforms with Snowflake: Our Snowpark & Streamlit Solution

At Tiger Analytics, we created a custom solution with Snowpark and Streamlit to build dynamic data transformation pipelines based on a metadata-driven framework to accelerate data platform modernization into Snowflake. Once data is available in the raw layer, the current framework can be leveraged to build a data pipeline to transform and harmonize data in subsequent layers. The high-level architecture of this Snowpark framework is depicted below in Figure 1.


Figure 1: High-level architecture

Tiger’s Snowpark Data Transformation Framework Provides Five Key Functionalities:

  • Metadata Setup: Once data ingestion is complete, this Snowpark-based framework is leveraged by users to feed in Source to Target Mapping (STTM) and additional driver (technical) metadata (Incremental/Full Load, schedule timing, SCD types, etc.), which drives the entire process. The input details are parsed, validated against the Snowflake raw layer, and merged into metadata tables.

  • Dynamic Transformation Pipeline Generation: After the metadata setup is complete, the next Snowpark code reads the STTM metadata table and creates the data transformation SQL dynamically. Based on the additional driver metadata, the Snowpark code implements change data capture (CDC) based on watermark/audit columns on top of the raw layer ingested data sets, Truncate and Load, dynamic merge/insert SQL for SCD types, etc. It is purely driven by user-provided metadata input, and any changes are refactored into the pipeline during its generation. With this Snowpark framework, it takes just a few clicks to accommodate changes such as (a simplified sketch of this metadata-driven approach follows this list):
    • A new data source or logic to be included
    • Logic changes to an existing pipeline

  • Schedule and Execution: Snowflake Tasks are used to orchestrate the data transformation pipelines and are created dynamically by the Snowpark framework. Users provide their preferred cadence as input, which is used to create Tasks dynamically with a Cron schedule. Once the Tasks are created and resumed, they trigger automatically at the scheduled cadence. As this is a metadata-driven framework, any changes are accommodated in the pipeline quickly.

  • Alert & Notification: A Snowpark-based email notification framework triggers emails in Outlook and Slack notifications if any failures are observed during data transformation pipeline execution.

  • Detailed Logging: This Snowpark framework captures detailed logging of each step for ease of debugging. It also provides information about the pipeline execution status, which can be a key metric for other downstream modules like Observability.
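As a rough illustration of the metadata-driven idea above (not the actual Tiger framework), the sketch below reads a hypothetical STTM metadata table, composes a MERGE statement from it, and creates a scheduled Task. The table names, metadata columns, warehouse, and cron cadence are all assumptions for illustration.

from snowflake.snowpark import Session

def build_merge_sql(mapping_rows, target, source, keys):
    # Compose a MERGE statement from source-to-target mapping (STTM) metadata rows.
    set_clause = ", ".join(f"t.{m['TARGET_COLUMN']} = s.{m['SOURCE_EXPRESSION']}" for m in mapping_rows)
    insert_cols = ", ".join(m["TARGET_COLUMN"] for m in mapping_rows)
    insert_vals = ", ".join(f"s.{m['SOURCE_EXPRESSION']}" for m in mapping_rows)
    on_clause = " AND ".join(f"t.{k} = s.{k}" for k in keys)
    return (
        f"MERGE INTO {target} t USING {source} s ON {on_clause} "
        f"WHEN MATCHED THEN UPDATE SET {set_clause} "
        f"WHEN NOT MATCHED THEN INSERT ({insert_cols}) VALUES ({insert_vals})"
    )

def run_pipeline(session: Session, pipeline_name: str):
    # Read the STTM metadata captured during metadata setup (hypothetical table and columns).
    mapping = [
        row.as_dict()
        for row in session.table("GOVERNANCE.STTM_MAPPING")
                          .filter(f"PIPELINE_NAME = '{pipeline_name}'")
                          .collect()
    ]
    merge_sql = build_merge_sql(mapping, target="CURATED.SALES", source="RAW.SALES_STG", keys=["ORDER_ID"])
    session.sql(merge_sql).collect()

    # Create (or refresh) a Task so the pipeline runs at the user-supplied cadence.
    session.sql(f"""
        CREATE OR REPLACE TASK CURATED.{pipeline_name}_TASK
          WAREHOUSE = TRANSFORM_WH
          SCHEDULE = 'USING CRON 0 2 * * * UTC'
          AS CALL CURATED.RUN_{pipeline_name}()
    """).collect()
    session.sql(f"ALTER TASK CURATED.{pipeline_name}_TASK RESUME").collect()

In the real framework, the merge/insert logic, CDC columns, and cadence would all come from the metadata tables rather than being hard-coded, which is what allows a metadata change to update the pipeline without touching code.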

This metadata-driven Snowpark framework can be leveraged in any data transformation program on Snowflake to set up dynamic transformation pipelines. It helps accelerate the overall journey and implements business logic efficiently, reducing the code quality issues that often arise from human error.

Comparing our Snowpark Framework to Other Data Transformation Pipelines

What makes our Snowpark framework unique compared to other traditional data transformation pipelines?

  • Agility: Post finalization of the data model, this framework helps in faster implementation of transformation pipelines dynamically in Snowflake.
  • Flexibility: Allows users to make easy modifications in the logic to adapt to the specific needs of each customer without manually changing the underlying code.
  • Efficiency: Less effort is needed to implement changes; when the input metadata changes, the pipeline is updated automatically.
  • Auto alerts: An email is triggered based on Snowflake’s native email integration feature to alert users of any pipeline execution failures.
  • Multi-feature support: Provides flexibility to enable SCD-type implementations, change data capture, and e2e detailed orchestration leveraging Snowflake Tasks.
  • Pipeline Monitoring: Proper logs are maintained at each level so issues can be traced back easily. This has proven highly impactful, reducing the overall time the support team spends debugging.
  • Scalability: Data volumes keep increasing. Dynamic transformation pipelines in cloud data warehouses like Snowflake can scale with the increasing demand.
  • Ease of Maintenance: Simple to maintain and update as metadata changes are made directly from the Streamlit app without altering the pipeline manually.

Final thoughts

At Tiger Analytics, we created a flexible metadata-driven solution to generate dynamic data transformation pipelines. With the help of the Snowpark framework, business teams can now rapidly convert their Source-to-Target Mapping (STTM) logic into executable pipelines, significantly reducing the time to market.

The ability to quickly adapt transformation rules and handle changes through a simple, metadata-driven process ensures that businesses can harmonize raw data from multiple silos without delays, delivering faster insights and value.

With its scalability, adaptability, and ease of maintenance, this framework allows organizations to manage complex transformations effectively. In short, dynamic data transformation ensures that the data is always optimized for decision-making, empowering businesses to stay agile and competitive.

How to Simplify Data Profiling and Management with Snowpark and Streamlit https://www.tigeranalytics.com/perspectives/blog/how-to-simplify-data-profiling-and-management-with-snowpark-and-streamlit/ Thu, 10 Oct 2024 12:41:48 +0000 https://www.tigeranalytics.com/?post_type=blog&p=23845 Learn why data quality is one of the most overlooked aspects of data management. While all models need good quality data to generate useful insights and patterns, data quality is especially important. In this blog, we explore how data profiling can help you understand your data quality. Discover how Tiger Analytics leverages Snowpark and Streamlit to simplify data profiling and management.

The post How to Simplify Data Profiling and Management with Snowpark and Streamlit appeared first on Tiger Analytics.

]]>
The accuracy of the data-to-insights journey is underpinned by one of the most foundational yet often overlooked aspects of data management – Data Quality. While all models need good quality data to generate useful insights and patterns, data quality is especially important across industries like retail, healthcare, and finance. Inconsistent, missing, or duplicate data can impact critical operations such as customer segmentation and can even affect regulatory compliance, resulting in potential financial or reputational losses.

Let’s look at an example:

A large retail company relies on customer data from various sources, such as online orders, in-store purchases, and loyalty program interactions. Over time, inconsistencies and errors in the customer database, such as duplicate records, incorrect addresses, and missing contact details, impacted the company’s ability to deliver personalized marketing campaigns, segment customers accurately, and forecast demand.

Data Profiling Matters – Third-party or Native app? Understanding the options

Data profiling helps the organization understand the nature of its data before building data models, and ensures data quality and consistency, enabling faster decision-making and more accurate insights.

  • Improves Data Accuracy: Identifies inconsistencies, errors, and missing values.
  • Supports Better Decision-Making: Ensures reliable data for predictive analytics.
  • Enhances Efficiency: Helps detect and remove redundant data, optimizing resources and storage.

For clients using Snowflake for data management purposes, traditional data profiling tools often require moving data outside of Snowflake, creating complexity, higher costs, and security risks.

  • Data Transfer Overhead: External tools may require data to be moved out of Snowflake, increasing latency and security risks.
  • Scalability Limitations: Third-party tools may struggle with large Snowflake datasets.
  • Cost and Performance: Increased egress costs and underutilization of Snowflake’s native capabilities.
  • Integration Complexity: Complex setup and potential incompatibility with Snowflake’s governance and security features.

At Tiger Analytics, our clients faced a similar problem statement. To address these issues, we developed a Snowflake Native App utilizing Snowpark and Streamlit to perform advanced data profiling and analysis within the Snowflake ecosystem. This solution leverages Snowflake’s virtual warehouses for scalable, serverless computational power, enabling efficient profiling without external infrastructure.

How Snowpark Makes Data Profiling Simple and Effective

Snowpark efficiently manages large datasets by chunking data into smaller pieces, ensuring smooth profiling tasks. We execute YData Profiler and custom Python functions directly within Snowflake, storing results like outlier detection and statistical analysis for historical tracking.

We also created stored procedures and UDFs with Snowpark to automate daily or incremental profiling jobs. The app tracks newly ingested data, using Snowflake’s Task Scheduler to run operations automatically. Additionally, profiling outputs integrate seamlessly into data pipelines, with alerts triggered when anomalies are detected, ensuring continuous data quality monitoring.
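As an illustrative sketch (not the production app), the snippet below shows how a Snowpark stored procedure could compute simple per-column statistics for a table and persist them for historical tracking, with a Snowflake Task to run it on a schedule. The schema, table, warehouse, and task names are assumptions.

from snowflake.snowpark import Session
from snowflake.snowpark.functions import col, count_distinct

def profile_table(session: Session, table_name: str) -> str:
    # Compute simple per-column profile metrics and persist them for history.
    df = session.table(table_name)
    total = df.count()
    metrics = []
    for c in df.columns:
        non_null = df.filter(col(c).is_not_null()).count()
        distinct = df.select(count_distinct(col(c))).collect()[0][0]
        metrics.append((table_name, c, total, total - non_null, distinct))
    session.create_dataframe(
        metrics,
        schema=["TABLE_NAME", "COLUMN_NAME", "ROW_COUNT", "NULL_COUNT", "DISTINCT_COUNT"],
    ).write.mode("append").save_as_table("PROFILING.COLUMN_METRICS")
    return f"Profiled {len(metrics)} columns of {table_name}"

# Registration and scheduling (illustrative names; requires appropriate privileges):
# session.sproc.register(profile_table, name="PROFILING.PROFILE_TABLE", replace=True,
#                        packages=["snowflake-snowpark-python"])
# session.sql("""
#     CREATE OR REPLACE TASK PROFILING.DAILY_PROFILE
#       WAREHOUSE = PROFILING_WH
#       SCHEDULE = 'USING CRON 0 1 * * * UTC'
#       AS CALL PROFILING.PROFILE_TABLE('SALES.CUSTOMERS')
# """).collect()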

By keeping operations within Snowflake, Snowpark reduces data transfer, lowering latency and enhancing performance. Its native integration ensures efficient, secure, and scalable data profiling.

Let’s look at the key features of the app, built leveraging Snowpark’s capabilities.

Building a Native Data Profiling App in Snowflake – Lessons learnt:

1. Comprehensive Data Profiling

At the core of the app’s profiling capabilities are the YData Profiler or custom-built profilers – Python libraries, integrated using Snowpark. These libraries allow users to profile data directly within Snowflake by leveraging its built-in compute resources.

Key features include:

  • Column Summary Statistics: Quickly review important statistics for columns of all data types, such as string, number, and date, to understand the data at a glance.
  • Data Completeness Checks: Identify missing values and assess the completeness of your datasets to ensure no critical information is overlooked.
  • Data Consistency Checks: Detect anomalies or inconsistent data points to ensure that your data is uniform and accurate across the board.
  • Pattern Recognition and Value Distribution: Analyze data patterns and value distributions to identify trends or detect unusual values that might indicate data quality issues.
  • Overall Data Quality Checks: Review the health of your dataset by identifying potential outliers, duplicates, or incomplete data points.

2. Snowflake Compute Efficiency

The app runs entirely within Snowflake’s virtual warehouse environment. No external servers or machines are needed, as the app fully utilizes Snowflake’s built-in computing power. This reduces infrastructure complexity while ensuring top-tier performance, allowing users to profile and manage even large datasets efficiently.

3. Flexible Profiling Options

The app allows users to conduct profiling in two distinct ways—either by examining entire tables or by focusing on specific columns. This flexibility ensures that users can tailor the profiling process to their exact needs, from broad overviews to highly targeted analyses.

4. Full Data Management Capabilities

In addition to profiling, the app supports essential data management tasks. Users can insert, update, and delete records within Snowflake directly from the app, providing an all-in-one tool for both profiling and managing data.

5. Streamlit-Powered UI for Seamless Interaction

The app is built using Streamlit, which provides a clean, easy-to-use user interface. The UI allows users to interact with the app’s profiling and data management features without needing deep technical expertise. HTML-based reports generated by the app can be easily shared with stakeholders, offering clear and comprehensive data insights.
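A minimal sketch of how such a Streamlit front end might be wired up is shown below; the schema, widget labels, and the profiling stored procedure it calls are illustrative assumptions, not the actual app.

import streamlit as st
from snowflake.snowpark.context import get_active_session

# In Streamlit in Snowflake / a Native App, the Snowpark session is provided by the platform.
session = get_active_session()

st.title("Data Profiling")

# Let the user pick a table and, optionally, specific columns to profile.
tables = [row["name"] for row in session.sql("SHOW TABLES IN SCHEMA SALES").collect()]
table = st.selectbox("Table to profile", tables)
columns = st.multiselect("Columns (empty = all)", session.table(f"SALES.{table}").columns)
# (The column selection could be passed to the procedure; omitted here for brevity.)

if st.button("Run profiling"):
    # Hypothetical stored procedure registered elsewhere in the app.
    message = session.call("PROFILING.PROFILE_TABLE", f"SALES.{table}")
    st.success(message)
    st.dataframe(
        session.table("PROFILING.COLUMN_METRICS")
               .filter(f"TABLE_NAME = 'SALES.{table}'")
               .to_pandas()
    )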

6. Ease in Generating and Sharing Profiling Reports

Once the data profiling is complete, the app generates a pre-signed URL that allows users to save and share the profiling reports. Here’s how it works:

    • Generating Pre-Signed URLs: The app creates a pre-signed URL to a file on a Snowflake stage using the stage name and relative file path. This URL provides access to the generated reports without requiring direct interaction with Snowflake’s internal storage (see the sketch after this list).
  • Accessing Files: Users can access the files in the stage through several methods:
    • Navigate directly to the pre-signed URL in a web browser.
    • Retrieve the pre-signed URL within Snowsight by clicking on it in the results table.
    • Send the pre-signed URL in a request to the REST API for file support.
  • Handling External Stages: For files in external stages that reference Microsoft Azure cloud storage, the function requires Azure Active Directory authentication. This is because querying the function fails if the container is accessed using a shared access signature (SAS) token. The GET_PRESIGNED_URL function requires Azure Active Directory authentication to create a user delegation SAS token, utilizing a storage integration object that stores a generated service principal.
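For illustration, a time-limited pre-signed URL for a report file on an internal stage can be generated with Snowflake’s GET_PRESIGNED_URL function from Snowpark; the stage and file names below are assumptions.

from snowflake.snowpark import Session

def get_report_url(session: Session, stage: str, relative_path: str, expiry_seconds: int = 3600) -> str:
    # Return a time-limited pre-signed URL for a file on a Snowflake stage.
    row = session.sql(
        f"SELECT GET_PRESIGNED_URL(@{stage}, '{relative_path}', {expiry_seconds}) AS URL"
    ).collect()[0]
    return row["URL"]

# Example (hypothetical stage and file):
# url = get_report_url(session, "PROFILING.REPORTS_STAGE", "customer_profile.html")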

7. Different roles within an organization can utilize this app in various scenarios:

  • Data Analysts: Data analysts can use the app to profile datasets, identify inconsistencies, and understand data quality issues. They will analyze the patterns and relationships in the data and point out the necessary fixes to resolve any errors, such as missing values or outliers.
  • Data Stewards/Data Owners: After receiving insights from data analysts, data stewards or data owners can apply the suggested fixes to cleanse the data, ensuring it meets quality standards. They can make adjustments directly through the app by inserting, updating, or deleting records, ensuring the data is clean and accurate for downstream processes.

This collaborative approach between analysts and data stewards ensures that the data is high quality and reliable, supporting effective decision-making across the organization.

Final notes

Snowpark offers a novel approach to data profiling by bringing it into Snowflake’s native environment. This approach reduces complexity, enhances performance, and ensures security. Whether improving customer segmentation in retail, ensuring compliance in healthcare, or detecting fraud in finance, Snowflake Native Apps with Snowpark provides a timely solution for maintaining high data quality across industries.

For data engineers looking to address client pain points, this translates to:

  • Seamless Deployment: Easily deployable across teams or accounts, streamlining collaboration.
  • Dynamic UI: The Streamlit-powered UI provides an interactive dashboard, allowing users to profile data without extensive technical knowledge.
  • Flexibility: Supports profiling of both Snowflake tables and external files (e.g., CSV, JSON) in external stages like S3 or Azure Blob.

With upcoming features like AI-driven insights, anomaly detection, and hierarchical data modeling, Snowpark provides a powerful and flexible platform for maintaining data quality across industries, helping businesses make smarter decisions and drive better outcomes.

Solving Merchant Identity Extraction in Finance: Snowpark’s Data Engineering Solution https://www.tigeranalytics.com/perspectives/blog/solving-merchant-identity-extraction-in-finance-snowparks-data-engineering-solution/ Fri, 26 Jul 2024 10:07:10 +0000 https://www.tigeranalytics.com/?post_type=blog&p=22999 Learn how a fintech leader solved merchant identification challenges using Snowpark and local testing. This case study showcases Tiger Analytics' approach to complex data transformations, automated testing, and efficient development in financial data processing. Discover how these solutions enhanced fraud detection and revenue potential.

The post Solving Merchant Identity Extraction in Finance: Snowpark’s Data Engineering Solution appeared first on Tiger Analytics.

]]>
In the high-stakes world of financial technology, data is king. But what happens when that data becomes a labyrinth of inconsistencies? This was the challenge faced by a senior data engineer at a leading fintech company.

“Our merchant identification system is failing us. We’re losing millions in potential revenue and our fraud detection is compromised. We need a solution, fast.”

The issue was clear but daunting. Every day, their system processed millions of transactions, each tied to a merchant. But these merchant names were far from consistent. For instance, a well-known retail chain might be listed under its official name, a common abbreviation, a location-specific identifier, or simply a generic category. This inconsistency was wreaking havoc across the business.

Initial Approach: Snowflake SQL Procedures

Initially, the data engineer and his team developed Snowflake SQL procedures to handle this complex data transformation. While these procedures worked, the team wanted to add automated testing pipelines and quickly ran into limitations. “We need more robust regression and automated testing capabilities. And we need to implement these tests without constantly connecting to a Snowflake account.” This wasn’t possible with traditional Snowflake SQL procedures, pushing them to seek external expertise.

Enter Tiger Analytics: A New Approach with Snowpark and Local Testing Framework

After understanding the challenges, the Tiger team proposed a solution: leveraging Snowpark for complex data transformations and introducing a local testing framework. This approach aimed to solve the merchant identity issue and improve the entire data pipeline process.

To meet these requirements, the team turned to Snowpark. Snowpark enabled them to perform complex data transformations and manipulations within Snowflake, leveraging the power of Snowflake’s computational engine. However, the most crucial part was the Snowpark Python Local Testing Framework. This framework allowed the team to develop and test their Snowpark DataFrames, stored procedures, and UDFs locally, fulfilling the need for regression testing and automated testing without connecting to a Snowflake account.

Key Benefits

  • Local Development: The team could develop and test their Snowpark Python code without a Snowflake account. This reduced the barrier to entry and sped up their iteration cycles.
  • Efficient Testing: By utilizing familiar testing frameworks like PyTest, the team integrated their tests seamlessly into existing development workflows.
  • Enhanced Productivity: The team quickly iterated on their code with local feedback, enabling rapid prototyping and troubleshooting before deploying to their Snowflake environment.

Overcoming Traditional Unit Testing Limitations

In the traditional sense of unit testing, Snowpark does not support a fully isolated environment independent of a Snowflake instance. Typically, unit tests would mock a database object, but Snowpark lacks a local context for such mocks. Even using the create_dataframe method requires Snowflake connectivity.

The Solution with Local Testing Framework

Despite these limitations, the Snowpark Python Local Testing Framework enabled the team to create and manipulate DataFrames, stored procedures, and UDFs locally, which was pivotal for our use case. Here’s how the Tiger team did it:

Setting Up the Environment

First, set up a Python environment:

pip install "snowflake-snowpark-python[localtest]"
pip install pytest

Next, create a local testing session:

from snowflake.snowpark import Session
session = Session.builder.config('local_testing', True).create()

Creating Local DataFrames

The Tiger team created DataFrames from local data sources and operated on them:

table = 'example'
session.create_dataframe([[1, 2], [3, 4]], ['a', 'b']).write.save_as_table(table)

Operating on these DataFrames was straightforward:

from snowflake.snowpark.functions import col

df = session.create_dataframe([[1, 2], [3, 4]], ['a', 'b'])
res = df.select(col('a')).where(col('b') > 2).collect()
print(res)

Creating UDFs and Stored Procedures

The framework allowed the team to create and call UDFs and stored procedures locally:

from snowflake.snowpark.functions import udf, sproc, call_udf, col
from snowflake.snowpark.types import IntegerType, StringType

@udf(name='example_udf', return_type=IntegerType(), input_types=[IntegerType(), IntegerType()])
def example_udf(a, b):
    return a + b

@sproc(name='example_proc', return_type=IntegerType(), input_types=[StringType()])
def example_proc(session, table_name):
    return session.table(table_name)\
        .with_column('c', call_udf('example_udf', col('a'), col('b')))\
        .count()

# Call the stored procedure by name
output = session.call('example_proc', table)

Using PyTest for Efficient Testing

The team leveraged PyTest for efficient unit and integration testing:

PyTest Fixture

In the conftest.py file, the team created a PyTest fixture for the Session object:

import pytest
from snowflake.snowpark.session import Session

def pytest_addoption(parser):
    parser.addoption("--snowflake-session", action="store", default="live")

@pytest.fixture(scope='module')
def session(request) -> Session:
    if request.config.getoption('--snowflake-session') == 'local':
        return Session.builder.configs({'local_testing': True}).create()
    else:
        snowflake_credentials = {} # Specify Snowflake account credentials here
        return Session.builder.configs(snowflake_credentials).create()

Using the Fixture in Test Cases

from project.sproc import my_stored_proc

def test_create_fact_tables(session):
    expected_output = ...
    actual_output = my_stored_proc(session)
    assert expected_output == actual_output

Running Tests

To run the test suite locally:

pytest --snowflake-session local

To run the test suite against your Snowflake account:

pytest

Addressing Unsupported Functions

Some functions were not supported in the local testing framework. For these, the team used patch functions and MagicMock:

Patch Functions

For unsupported functions like upper(), the team used patch functions:

from unittest.mock import patch
from snowflake.snowpark import Session
from snowflake.snowpark import functions as F

session = Session.builder.config('local_testing', True).create()

# Patch the attribute on the functions module and call it via the module (F.upper);
# a name bound with "from ... import upper" would bypass the patch.
@patch('snowflake.snowpark.functions.upper')
def test_to_uppercase(mock_upper):
    # Stand in for upper() with a traceable transformation.
    mock_upper.side_effect = lambda c: F.concat(c, F.lit('_MOCKED'))
    df = session.create_dataframe([('Alice',), ('Bob',)], ['name'])
    result = df.select(F.upper(df['name']))
    collected = [row[0] for row in result.collect()]
    assert collected == ['Alice_MOCKED', 'Bob_MOCKED']
  
MagicMock

For more complex behaviors like explode(), the team used MagicMock:

from unittest.mock import MagicMock, patch
from snowflake.snowpark import Session
from snowflake.snowpark import functions as F
from snowflake.snowpark.functions import lit

session = Session.builder.config('local_testing', True).create()

def test_explode_df():
    # explode() is not supported locally, so replace it with a mock that
    # returns a recognizable literal column instead.
    mock_explode = MagicMock(side_effect=lambda c: lit('MOCKED_EXPLODE'))
    df = session.create_dataframe([([1, 2, 3],), ([4, 5, 6],)], ['data'])
    with patch('snowflake.snowpark.functions.explode', mock_explode):
        result = df.select(F.explode(df['data']))
        collected = [row[0] for row in result.collect()]
        assert collected == ['MOCKED_EXPLODE', 'MOCKED_EXPLODE']

test_explode_df()

Scheduling Procedures Limitations

While implementing these solutions, the Tiger team faced issues with scheduling procedures using serverless tasks, so they used Tasks attached to a warehouse and created a Snowpark-optimized warehouse for them. The team noted that serverless tasks cannot invoke certain object types and functions, specifically:

  • UDFs (user-defined functions) that contain Java or Python code.
  • Stored procedures written in Scala (using Snowpark), or those that call UDFs containing Java or Python code.

Turning Data Challenges into Business Insights

The journey from the initial challenge of extracting merchant identities from inconsistent transaction data to a streamlined, efficient process demonstrates the power of advanced data solutions. The Tiger team leveraged Snowpark and its Python Local Testing Framework, not only solving the immediate problem but also enhancing their overall approach to data pipeline development and testing. The combination of regex-based, narration-pattern-based, and ML-based methods enabled them to tackle the complexity of unstructured bank statement data effectively.
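As a simplified illustration of the regex-based piece of that approach (the actual rules, narration patterns, and ML models are not shown here), a Snowpark UDF could normalize raw merchant strings against a small set of assumed patterns:

import re
from snowflake.snowpark import Session
from snowflake.snowpark.functions import udf, col
from snowflake.snowpark.types import StringType

session = Session.builder.config('local_testing', True).create()

# Illustrative patterns only; a real system would maintain these as metadata.
MERCHANT_PATTERNS = [
    (re.compile(r'\bAMZN(\s+MKTP)?\b', re.I), 'Amazon'),
    (re.compile(r'\bWAL-?MART\b|\bWM\s+SUPERCENTER\b', re.I), 'Walmart'),
    (re.compile(r'\bSTARBUCKS\b|\bSBUX\b', re.I), 'Starbucks'),
]

@udf(name='normalize_merchant', return_type=StringType(), input_types=[StringType()])
def normalize_merchant(narration):
    if narration is None:
        return None
    for pattern, canonical in MERCHANT_PATTERNS:
        if pattern.search(narration):
            return canonical
    return 'UNKNOWN'  # candidates for the pattern-based / ML-based fallbacks

df = session.create_dataframe(
    [('AMZN Mktp US*1234',), ('WM SUPERCENTER #55 TX',), ("Joe's Diner",)],
    ['narration'],
)
df.select(col('narration'), normalize_merchant(col('narration')).alias('merchant')).show()

Because the UDF runs in the local testing session, rules like these can be unit-tested with PyTest exactly as described above before being deployed to a Snowflake account.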

This project’s success extends beyond merchant identification, showcasing how the right tools and methodologies can transform raw data into meaningful insights. For data engineers facing similar challenges, this case study highlights how Snowpark and local testing frameworks can significantly improve data application development, leading to more efficient, accurate, and impactful solutions.

A Comprehensive Guide to Pricing and Licensing on Microsoft Fabric https://www.tigeranalytics.com/perspectives/blog/a-comprehensive-guide-to-pricing-and-licensing-on-microsoft-fabric/ Mon, 01 Jul 2024 12:13:32 +0000 https://www.tigeranalytics.com/?post_type=blog&p=22659 This comprehensive guide explores Microsoft Fabric's pricing strategies, including capacity tiers, SKUs, and tenant hierarchy, helping organizations optimize their data management costs. It breaks down the differences between reserved and pay-as-you-go models, explaining Capacity Units (CUs) and providing detailed pricing information. By understanding these pricing intricacies, businesses can make informed decisions to fully leverage their data across various functions, leading to more efficient operations and better customer experiences.

The post A Comprehensive Guide to Pricing and Licensing on Microsoft Fabric appeared first on Tiger Analytics.

]]>
Organizations often face challenges in effectively leveraging data to streamline operations and enhance customer satisfaction. Siloed data, complexities associated with ingesting, processing, and storing data at scale, and limited collaboration across departments can hinder a company’s ability to make informed, data-driven decisions. This can result in missed opportunities, inefficiencies, and suboptimal customer experiences.

Here’s where Microsoft’s new SaaS platform “Microsoft Fabric” can give organizations a much-needed boost. By integrating data across various functions, including data science (DS), data engineering (DE), data analytics (DA), and business intelligence (BI), Microsoft Fabric enables companies to harness the full potential of their data. The goal is to enable seamless sharing of data across the organization while simplifying all the key functions of Data Engineering, Data Science, and Data Analytics to facilitate quicker and better-informed decision-making at scale.

For enterprises looking to utilize Microsoft Fabric’s full capabilities, understanding the platform’s pricing and licensing intricacies is crucial, impacting several key financial aspects of the organization:

1. Reserved vs Pay-as-you-go: Understanding pay-as-you-go versus reserved pricing helps in precise budgeting and can affect both initial and long-term operational costs.
2. Capacity Tiers: Clear knowledge of capacity tiers allows for predictable scaling of operations, facilitating smooth expansions without unexpected costs.
3. Fabric Tenant Hierarchy: It is important to understand the tenant hierarchy as this would have a bearing on the organization’s need to buy capacity based on their unique needs.
4. Existing Power BI Licenses: For customers having existing Power BI, it is important to understand how to utilize existing licenses (free/pro/premium) and how it ties in with Fabric SKU.

At Tiger Analytics, our team of seasoned SMEs have helped clients navigate the intricacies of licensing and pricing models for robust platforms like Microsoft Fabric based on their specific needs.

In this blog, we will provide insights into Microsoft Fabric’s pricing strategies to help organizations make more informed decisions when considering this platform.

Overview of Microsoft Fabric:

Microsoft Fabric offers a unified and simplified cloud SaaS platform designed around the following ‘Experiences’:

  • Data Ingestion – Data Factory
  • Data Engineering – Synapse DE
  • Data Science – Synapse DS
  • Data Warehousing – Synapse DW
  • Real-Time Analytics – Synapse RTA
  • Business Intelligence – Power BI
  • Unified storage – OneLake

A Simplified Pricing Structure

Unlike Azure, where each tool has separate pricing, Microsoft Fabric simplifies this by focusing on two primary cost factors:

1. Compute Capacity: A single compute capacity can support all functionalities concurrently, which can be shared across multiple projects and users without any limitations on the number of workspaces utilizing it. You do not need to select capacities individually for Data Factory, Synapse Data Warehousing, and other Fabric experiences.

2. Storage: Storage costs are separate yet simplified, making choices easier for the end customer.


Understanding Fabric’s Capacity Structure

To effectively navigate the pricing and licensing of Microsoft Fabric, it is crucial to understand how a Fabric Capacity is associated with Tenant and Workspaces. These three together help organize the resources within an Organization and help manage costs and operational efficiency.

1. Tenant: This represents the highest organizational level within Microsoft Fabric, and is associated with a single Microsoft Entra ID. An organization could also have multiple tenants.

2. Capacity: Under each tenant, there are one or more capacities. These represent pools of compute and storage resources that power the various Microsoft Fabric services. Capacities provide capabilities for workload execution; they are analogous to horsepower in a car engine. The more capacity you provision, the more workloads you can run, and the faster they can run.

3. Workspace: Workspaces are environments where specific projects and workflows are executed. Workspaces are assigned a capacity, which represents the computing resources it can utilize. Multiple workspaces can share the resources of a single capacity, making it a flexible way to manage different projects or departmental needs without the necessity of allocating additional resources for each new project/ department.

[Figure: Fabric tenant hierarchy]
The figure above portrays the tenant hierarchy in Fabric and how different organizations can provision capacities based on their requirements.

Understanding Capacity Levels, SKUs, and Pricing in Microsoft Fabric

Microsoft Fabric capacities are defined by a Stock Keeping Unit (SKU) that corresponds to a specific amount of compute power, measured in Capacity Units (CUs). A CU is a unit that quantifies the amount of compute power available.

Capacity Units (CUs) = Compute Power

As shown in the table below, each SKU (Fn) corresponds to a number of CUs; for example, F4 has double the capacity of F2 but half that of F8.

The breakdown below shows the SKUs available for the West Europe region, showing both Pay As You Go and Reserved (1-year) pricing options:

[Table: Fabric SKUs, CUs, associated Power BI SKU, Pay-as-you-Go and Reserved pricing for the West Europe region]

1 CU Pay-as-you-Go price in the West Europe region = $0.22/hour
1 CU PAYGO monthly rate: $0.22 × 730 = $160.60, so F2 = $160.60 × 2 = $321.20
1 CU Reserved (RI) monthly rate: Round($0.22 × (1 − 0.405) × 730 × 12, 0) / 12 ≈ $95.56, so F2 RI ≈ $95.56 × 2 ≈ $191.11
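To make the arithmetic above explicit, here is a small sketch that reproduces the West Europe monthly estimates from the $0.22 per CU-hour rate and the ~40.5% reservation discount quoted in the notes; treat the figures as indicative, since regional rates change over time.

HOURS_PER_MONTH = 730          # standard monthly hour count used in the notes above
CU_RATE_PER_HOUR = 0.22        # West Europe pay-as-you-go rate per CU
RESERVED_DISCOUNT = 0.405      # ~40.5% discount for a 1-year reservation

def monthly_paygo(cu):
    return CU_RATE_PER_HOUR * HOURS_PER_MONTH * cu

def monthly_reserved(cu):
    return CU_RATE_PER_HOUR * (1 - RESERVED_DISCOUNT) * HOURS_PER_MONTH * cu

for sku, cu in [("F2", 2), ("F4", 4), ("F64", 64)]:
    print(f"{sku}: PAYGO ~${monthly_paygo(cu):,.2f}/month, Reserved ~${monthly_reserved(cu):,.2f}/month")

# F2: PAYGO ~$321.20/month, Reserved ~$191.11/month (matches the notes above)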

Pricing Models Explained:

Pay As You Go: This flexible model allows you to pay monthly based on the SKU you select, making it ideal if your workload demands are uncertain. You can purchase more capacity or even upgrade/downgrade your capacity. You further get an option to pause your capacities to save costs.

Reserved (1 year): With this option, you commit to a one-year reservation and pay a discounted rate monthly. Reserved prices can yield savings of around 40%, but there is no option to pause, and you are billed monthly regardless of capacity usage.

Storage Costs in Microsoft Fabric (OneLake)

In Microsoft Fabric, compute capacity does not include data storage costs. This means that businesses need to budget separately for storage expenses.

  • Storage costs need to be paid for separately.
  • Storage costs in Fabric (OneLake) are similar to ADLS (Azure Data Lake Storage).
  • BCDR (Business Continuity and Disaster Recovery) charges are also included. These come into play when workspaces are deleted but some data still needs to be extracted from them.
  • Beyond this, there are costs for cache storage (for KQL DB)
  • There are also costs for transferring data between regions, known as bandwidth pricing (see Microsoft’s bandwidth pricing documentation for details).

Optimizing Resource Use in Microsoft Fabric: Understanding Bursting and Smoothing Techniques

Even after you purchase a capacity, your workloads may occasionally demand more resources than it provides.

For this, Fabric allows two methods to help with faster execution (burst) while flattening the usage over time (smooth) to maintain optimal costs.

  • Bursting: Bursting enables the use of additional compute resources beyond your existing capacity to accelerate workload execution. For instance, if a task normally takes 60 seconds using 64 CUs, bursting can allocate 256 CUs to complete the same task in just 15 seconds (see the worked example after this list).
  • Smoothing: Smoothing is applied automatically in Fabric across all capacities to manage brief spikes in resource usage. This method distributes the compute demand more evenly over time, which helps in avoiding extra costs that could occur with sudden increases in resource use.
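In the bursting example above, the arithmetic works out to the same total consumption either way; bursting spends it faster, and smoothing then spreads the billing of that consumption over time. A quick, illustrative back-of-the-envelope check:

# Bursting: same total compute in this example, delivered faster (figures from the bullet above).
normal_cu, normal_seconds = 64, 60
burst_cu, burst_seconds = 256, 15

normal_cu_seconds = normal_cu * normal_seconds   # 3,840 CU-seconds
burst_cu_seconds = burst_cu * burst_seconds      # 3,840 CU-seconds

assert normal_cu_seconds == burst_cu_seconds
print(f"Total consumption: {burst_cu_seconds} CU-seconds in both cases; "
      f"bursting finishes {normal_seconds // burst_seconds}x faster.")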

Understanding Consumption: Where do your Capacity Units (CUs) go?

(Image credit: Microsoft)

The following components in Fabric consume or utilize Capacity Units (CUs):

  • Data Factory Pipelines
  • Data Flow Gen2
  • Synapse Warehouse
  • Spark Compute
  • Event Stream
  • KQL Database
  • OneLake
  • Copilot
  • VNet Data Gateway
  • Data Activator (Reflex)
  • PowerBI

CU consumption depends on the solution implemented for a given piece of functionality. Here’s an example for better understanding:

Business Requirement: Ingest data from an on-prem data source and use it for Power BI reporting.

Solution Implemented: Data Factory pipelines with Notebooks to perform DQ checks on the ingested data. PowerBI reports were created pointing to the data in One Lake.

How are CUs consumed:

CUs would be consumed every time the data factory pipeline executes and further invokes the Notebook (Spark Compute) to perform data quality checks.

Further, CUs would be consumed whenever the data refreshes on the dashboard.

Microsoft Fabric Pricing Calculator:

Microsoft has streamlined the pricing calculation with its online calculator. By selecting your region, currency, and billing frequency (hourly or monthly), you can quickly view the pay-as-you-go rates for all SKUs. This gives you an immediate estimate of the monthly compute and storage costs for your chosen region. Additionally, links for reserved pricing and bandwidth charges are also available.

For more detailed and specific pricing analysis, Microsoft offers an advanced Fabric SKU Calculator tool through partner organizations.

Understanding Fabric Licensing: Types and Strategic Considerations

Licensing in Microsoft Fabric is essential because it legally permits and enables the use of its services within your organizational framework, ensuring compliance and tailored access to various functionalities. Licensing is distinct from pricing, as licensing outlines the terms and types of access granted, whereas pricing involves the costs associated with these licenses.

There are two types of licensing in Fabric:

  • Capacity-Based Licensing: This licensing model is required for operating Fabric’s services, where Capacity Units (CUs) define the extent of compute resources available to your organization. Different Stock Keeping Units (SKUs) are designed to accommodate varying workload demands, ranging from F2 to F2048. This flexibility allows businesses to scale their operations up or down based on their specific needs.
  • Per-User Licensing: User-based licensing was used in Power BI, and this has not changed in Fabric (for compatibility). The User accounts include:
    • Free
    • Pro
    • Premium Per User (PPU)

Each tailored to specific sets of capabilities as seen in the table below:

[Table: Power BI license capabilities by license type. Image credit: Microsoft (https://learn.microsoft.com/en-us/fabric/enterprise/licenses)]

Understanding Licensing Scenarios

To optimally select the right Fabric licensing options and understand how they can be applied in real-world scenarios, it’s helpful to look at specific use cases within an organization. These scenarios highlight the practical benefits of choosing the right license type based on individual and organizational needs.

Scenario 1: When do you merely require a Power BI Pro License?

Consider the case of Sarah, a data analyst whose role involves creating and managing Power BI dashboards used organization-wide. These dashboards are critical for providing the leadership with the data needed to make informed decisions. In such a scenario, a Pro License is best because it allows Sarah to:

  • Create and manage Power BI dashboards within a dedicated workspace.
  • Set sharing permissions to control who can access the dashboards.
  • Enable colleagues to build their visualizations and reports from her Power BI datasets, fostering a collaborative work environment.

In the above scenario, a Pro license would suffice (based on the requirements listed above).

Scenario 2: What are the Licensing Options for Small Businesses?

Consider a small business with about 60 users that wants to leverage premium Power BI features (please refer to the comparison table above, which shows the capabilities for Free, Pro, and PPU (Premium Per User) licenses) to enhance its data analysis capabilities. The company has two primary licensing options within Microsoft Fabric to accommodate its needs, each with different cost implications and service access levels.

Option 1: Premium Per User (PPU) Licensing

  • This option involves purchasing a Premium Per User license for each of the 60 users.
  • Cost Calculation: 60 users x $20 per month = $1,200 per month.
  • Note: This option does not include any Fabric services or capacities; it only covers the Power BI Premium features.

Option 2: Combining F4 Capacity with Power BI Pro Licenses

  • Alternatively, the company can opt for a combination of an F4 Fabric capacity and 60 Power BI Pro licenses.
  • Cost Calculation: F4 capacity at $525 per month + (60 Power BI Pro licenses x $10 = $600) = $1,125 per month. Additional storage and other service costs may apply.
  • Benefits: This option is not only more cost-effective compared to Option 1, but it also provides access to broader Fabric services beyond just Power BI, enhancing the organization’s overall data management capabilities.

Option 2 offers a more economical and service-inclusive approach. Furthermore, it opens up opportunities to scale up using higher Fabric capacities with reserved (1-year) pricing for even greater efficiency and cost savings in the long run.

[Table: Fabric SKUs and Power BI SKUs for reference calculations and comparisons]

Scenario 3: A medium-sized business is looking to implement analytics solutions using Fabric services, with reporting in Power BI. They also want to share Power BI content for collaborative decision-making. What are the licensing options in Fabric?

Considerations:

1. Since the organization is looking to share Power BI content, it will need Power BI Premium or equivalent Fabric capacities (F64 and above)
2. Microsoft is transitioning/enabling Power BI premium capacities to automatically be Fabric capacities – which brings more flexibility for organizations while keeping costs the same (when compared with PPU licenses)
3. It would be wise to start with F64 Pay-As-You-Go initially, check for performance and other factors such as bursting in the monthly bills, and finally decide on the final Fabric capacity with reserved pricing to avail up to 40% savings.

Scenario 4: An organization is looking to use Copilot extensively across the platform. What Fabric capacity can they start with?

Considerations: A minimum of the F64 SKU is required to be able to use Copilot.

The table above provides a reference for understanding how different SKUs align with specific user needs and organizational goals, helping to further clarify the most effective licensing strategies for various roles within a company.

Key Considerations for selecting the right Fabric SKU and License

Now that we have seen some practical scenarios related to making licensing decisions, let us list out the key considerations for selecting the optimal Fabric SKU and license:

  • Organization Size & Usage Patterns:
    • A large organization with diverse data needs will likely require a higher-capacity SKU and more user licenses. Consider a mix of per-user and capacity licenses – analyze which teams work heavily in Fabric vs. those who are light consumers.
    • If your organization already uses Power BI extensively, or it’s central to your use of Fabric, having at least one Pro or PPU license is essential.
  • Workload Types and Frequency:
    • Batch vs. real-time processing: One-time bulk data migrations might benefit from short-term bursts, while continuous streaming needs consistent capacity.
    • Complexity of transformations: Resource-intensive tasks like complex data modeling, machine learning, or large-scale Spark jobs will consume more CUs than simple data movement.
    • Frequency of Power BI Use: Frequent dataset refreshes and report queries in Power BI significantly increase compute resource consumption.
    • Content Sharing / Copilot usage: To share Power BI content freely across the organization or to use Copilot, you must be on an F64 or higher SKU.
  • Operational Time:
    • Pay-as-you-go vs. Reserved (1-year) pricing: Reserved capacity locks in a price for consistent usage, while pay-as-you-go is better for sporadic workloads. Reserved licensing provides roughly 40% savings over pay-as-you-go.
    • Pausing: You can pause your existing pay-as-you-go license when the capacity is not in use, resulting in cost savings.
    • Development vs. production: Dev environments can often use lower tiers or be paused when idle to reduce costs.
  • Region:
    • Costs vary by Azure region. Align your Fabric deployment with your primary user location to minimize data egress charges.
  • Power BI Premium: While Power BI licenses have not changed in Fabric, it is important to consider that the Power BI premium license would be merged with Fabric (F) licenses. The Free and Pro licenses would not be impacted.
  • Mixed Use: You may need to consider purchasing both Fabric (capacity) and Power BI licenses for sharing content across the organization.

How to Bring These Factors into Your Planning

Before beginning the Fabric deployment, consider these steps to ensure you choose the right SKU and licensing options:

  • Start with Baselining: Before scaling up, run pilot workloads to understand your capacity unit (CU) consumption patterns. This helps in accurately predicting resource needs and avoiding unexpected costs.
  • Estimate Growth: Project future data volumes, user counts, and evolving analytics needs. This foresight ensures that your chosen capacity can handle future demands without frequent upgrades.
  • Right-size, Don’t Overprovision: Initially, select an SKU that slightly exceeds your current needs. Microsoft Fabric’s flexibility allows you to scale up as necessary, preventing unnecessary spending on excess capacity.
  • Regularly Monitor Usage: Utilize the Capacity Metrics App to track resource usage and identify trends. This ongoing monitoring allows for timely adjustments and optimization of your resource allocation, ensuring cost-effectiveness.

Power BI Capacity Metrics App: Your Cost Control Center in Fabric

The Power BI Capacity Metrics App is an essential tool for understanding how different Microsoft Fabric components consume resources. It provides:

  • Detailed reports and visualizations of compute and storage usage.
  • Visibility into cost trends, potential overages, and optimization opportunities.
  • An early view of spend that helps you stay within your budget.

Microsoft Fabric

Microsoft Fabric has streamlined licensing and pricing options, offering significant benefits at both capacity and storage levels:

Capacity Benefits

[Image: Microsoft Fabric capacity benefits. Image credits: Microsoft]

Storage Benefits

[Image: Microsoft Fabric storage benefits]

In this blog, we've explored the intricacies of Microsoft Fabric's pricing and licensing, along with practical considerations for making informed purchase decisions. If you want to integrate Fabric into your business, you can purchase capacities and licenses from the Azure Portal, or reach out to us to discuss your use case.

The post A Comprehensive Guide to Pricing and Licensing on Microsoft Fabric appeared first on Tiger Analytics.

]]>
Advanced Data Strategies in Power BI: A Guide to Integrating Custom Partitions with Incremental Refresh https://www.tigeranalytics.com/perspectives/blog/advanced-data-strategies-in-power-bi-a-guide-to-integrating-custom-partitions-with-incremental-refresh/ Fri, 03 May 2024 05:35:55 +0000 https://www.tigeranalytics.com/?post_type=blog&p=21623 Explore advanced data management strategies in Power BI through a detailed examination of integrating Custom Partitions with Incremental Refresh to efficiently handle large datasets. Key benefits such as improved query performance, more efficient data refresh, and better data organization are outlined, along with a practical guide on implementing these strategies in Power BI environments.

The post Advanced Data Strategies in Power BI: A Guide to Integrating Custom Partitions with Incremental Refresh appeared first on Tiger Analytics.

]]>
D, a data engineer with a knack for solving complex problems, recently faced a challenging task. A client needed a smart way to manage their data in Power BI, especially after acquiring new companies. This meant separating newly acquired third-party data from their existing internal data, while also ensuring that historical data remained intact and accessible. The challenge? This process involved refreshing large data sets, sometimes as many as 25 million rows for a single year, just to incorporate a few thousand new entries. This task was not just time-consuming but would also put a strain on computational resources.

At first glance, Power BI’s Custom Partitions seemed like a promising solution. It would allow D to organize data neatly, separating third-party data from internal data as the client wanted. However, Power BI typically partitions data by date, not by the source or type of data, which made combining Custom Partitions with Incremental Refresh—a method that updates only recent changes rather than the entire dataset—a bit of a puzzle.

Limitations of Custom Partition and Incremental Refresh in Power BI

Custom Partitions offer the advantage of dividing the table into different parts based on the conditions defined, enabling selective loading of partitions during refreshes. However, Power BI’s built-in Incremental Refresh feature, while automated and convenient, has limitations in terms of customization. It primarily works on date columns, making it challenging to partition the table based on non-date columns like ‘business region’.

[Image: Partition]

Incremental Refresh Pros:

  • Creation of partitions is automated, and partitions are updated automatically based on date; no manual intervention is needed.

Incremental Refresh Cons:

  • Cannot define two separate partitioning logics, for example a flag-based split on top of the date-based one.
  • Does not support the movement of data using the Power BI Pipeline feature.

Custom Partitions Pros:

  • Partitions can be created based on our own logical conditions.
  • Supports the movement of data using the Power BI Pipeline feature.

Custom Partitions Cons:

  • All partition creation and maintenance must be done manually.

To tackle these challenges, D came up with another solution. By using custom C# scripts and Azure Functions, D found a way to integrate Custom Partitions with an Incremental Refresh in the Power BI model. This solution not only allowed for efficient management of third-party and internal data but also streamlined the refresh process. Additionally, D utilized Azure Data Factory to automate the refresh process based on specific policies, ensuring that data remained up-to-date without unnecessary manual effort.

This is how we at Tiger Analytics solved our client's problem and separated the third-party data. In this blog, we'll explore the benefits of combining Custom Partitions with Incremental Refresh and, based on our experience, show how this combination can enhance data management in Power BI and provide a more efficient, streamlined approach to data processing.

Benefits of combining Incremental Refresh with Custom Partitions in Power BI

Merging the capabilities of Incremental Refresh with Custom Partitions in Power BI offers a powerful solution to overcome the inherent limitations of each approach individually. This fusion enables businesses to fine-tune their data management processes, ensuring more efficient use of resources and a tailored fit to their specific data scenarios.

Leveraging tools like Azure Function Apps, the Table Object Model (TOM) library, and Power BI’s XMLA endpoints, automating the creation and management of Custom Partitions becomes feasible. This automation grants the flexibility to design data partitions that meet precise business needs while enjoying the streamlined management and automated updates provided by Power BI.

[Image: Fact Sale table partitions]

Optimizing Query Performance:

  • Custom Partitions improve query performance by dividing data into logical segments based on specific criteria, such as a flag column.
  • When combined with an Incremental Refresh, only the partitioned data that has been modified or updated needs to be processed during queries.
  • This combined approach reduces the amount of data accessed, leading to faster query response times and improved overall performance.

Efficient Data Refresh:

  • Incremental Refresh allows Power BI to refresh only the recently modified or added data, reducing the time and resources required for data refreshes.
  • When paired with Custom Partitions, the refresh process can be targeted to specific partitions, rather than refreshing the entire dataset.
  • This targeted approach ensures that only the necessary partitions are refreshed, minimizing processing time and optimizing resource utilization.

Enhanced Data Organization and Analysis:

  • Custom Partitions provide a logical division of data, improving data organization and making it easier to navigate and analyze within the data model.
  • With Incremental Refresh, analysts can focus on the most recent data changes, allowing for more accurate and up-to-date analysis.
  • The combination of Custom Partitions and Incremental Refresh enables more efficient data exploration and enhances the overall data analysis process.

Scalability for Large Datasets:

  • Large datasets can benefit significantly from combining Custom Partitions and Incremental Refresh.
  • Custom Partitions allow for efficient querying of specific data segments, reducing the strain on system resources when dealing with large volumes of data.
  • Incremental Refresh enables faster and more manageable updates to large datasets by focusing on the incremental changes, rather than refreshing the entire dataset.

Implementation Considerations:

  • Combining Custom Partitions and Incremental Refresh may require a workaround, such as using calculated tables and parameters.
  • Careful planning is necessary to establish relationships between the partition table, data tables, and Incremental Refresh configuration.
  • Proper documentation and communication of the combined approach are essential to ensure understanding and maintainability of the solution.

How to implement Incremental Refresh and Custom Partitions: A step-by-step guide

Prerequisites:

Power BI Premium Capacity or PPU License: The use of XMLA endpoints, which are necessary for managing Custom Partitions, is limited to Power BI Premium capacities. Alternatively, you can utilize Power BI premium-per-user (PPU) licensing to access these capabilities.
PPU: https://learn.microsoft.com/en-us/power-bi/enterprise/service-premium-per-user-faq
XMLA reference: https://learn.microsoft.com/en-us/power-bi/enterprise/service-premium-connect-tools

Dataset Published to Premium Workspace: The dataset for implementing Custom Partitions and Incremental Refresh should be published to a Power BI Premium workspace.

Permissions for Azure Functions and Power BI Admin Portal: To automate the creation and management of Custom Partitions, you need the appropriate permissions. This includes the ability to create and manage Azure Functions and the necessary rights to modify settings in Power BI’s Admin portal.

  • In the Function App, navigate to Settings -> Identity and turn on the system-assigned managed identity.
  • Next, create a security group in Azure and add the Function App as a member.
  • Go to Power BI, navigate to the Admin portal, and add the security group to the Admin API setting that allows service principals to use Power BI APIs.
  • Go to the workspace, open its access settings, and add the Function App as a member of the workspace.

Check Incremental Refresh Policy: The incremental refresh policy on the table needs to be set to false so that partitions can be created on the table through code.

[Screenshot: Refresh Policy]

Fulfilling these prerequisites will enable effective utilization of Custom Partitions and Incremental Refresh in Power BI.

Implementation at a glance:

Create an Azure Function with .NET as the Runtime Stack: Begin by adding the necessary DLL files for Power BI model creation and modification to the Azure Function console.

Connect to the Power BI Server Using C# Code: Establish a connection by passing the required connection parameters, such as the connection string and the table name where partitions need to be implemented. (C# code and additional details are available in the GitHub link provided in the note section).

Develop Code for Creating Partitions: Utilize the inbuilt functions from the imported DLL files to create partitions within the Power BI server.

Implement Code for Multiple Partitions: Use a combination of for-loop and if-conditions to develop code capable of handling multiple partitions.
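To make these steps more tangible, here is a rough sketch of what the partition-creation logic in run.csx could look like, assuming the Microsoft.AnalysisServices.Tabular (TOM) DLLs have been added as described and using placeholder names for the workspace, dataset, table, and M expression (the full, parameterized version is available in the GitHub repository linked in the note below):

using Microsoft.AnalysisServices.Tabular;   // TOM library from the DLLs placed in the bin folder

// Placeholder values - in the actual function these arrive via the request body.
string workspaceConnection = "Data Source=powerbi://api.powerbi.com/v1.0/myorg/WorkspaceName";
string datasetName = "DatasetName";
string tableName = "Fact Sale";
string partitionName = "202402";

var server = new Server();
server.Connect(workspaceConnection);        // XMLA endpoint; credentials omitted for brevity

Database database = server.Databases.GetByName(datasetName);
Table table = database.Model.Tables.Find(tableName);

if (table.Partitions.Find(partitionName) == null)   // create only if the partition is missing
{
    table.Partitions.Add(new Partition
    {
        Name = partitionName,
        Source = new MPartitionSource
        {
            // Placeholder M expression that filters the source view to a single month
            Expression = "let Source = Sql.Database(\"server\", \"db\"), "
                       + "Data = Source{[Schema=\"dbo\", Item=\"vw_FactSale\"]}[Data], "
                       + "Filtered = Table.SelectRows(Data, each [YearMonth] = 202402) in Filtered"
        }
    });
    database.Model.SaveChanges();           // push the new partition definition to the service
}
server.Disconnect();

The same pattern extends to the flag-based ABC partition: check for its existence first, and only then loop over the date-based partitions.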

There are two types of data partitions to consider based on the Flag value:

  • Flag Value ‘Y’ will be stored in a single partition, referred to as the ABC Partition.
  • Flag Value ‘N’ will be partitioned based on the date column, adhering to the incremental logic implemented. (Examples of partition naming include 2020, 2021, 2022, 202301, 202302, 202303, etc., up to 202312, 202401, 202402).

Check and Create ABC Partition if it does not exist: The initial step in the logic involves verifying the existence of the ABC Partition. If it does not exist, the system should create it.

Implement Logic Within the Looping Block (a sketch of this partition-window logic follows the list below):

  • The first action is to verify the existence of yearly partitions for the last three years. If any are missing, they should be created.
  • Next, combine all partitions from the previous year into a single-year partition.
  • Subsequently, create new partitions for the upcoming year until April.
  • Any partitions outside the required date range should be deleted.
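The exact windowing rules depend on the business calendar, but as a hypothetical illustration of the logic described above (yearly history partitions, monthly partitions for the previous and current year up to April, everything else deleted), the set of expected partition names could be derived like this:

using System;
using System.Collections.Generic;

class PartitionWindow
{
    // Hypothetical interpretation of the windowing rules - adjust to your own calendar.
    public static List<string> ExpectedPartitions(DateTime runDate, int historyYears = 3)
    {
        var names = new List<string>();

        // Yearly partitions for the older history years, e.g. 2020, 2021, 2022.
        for (int i = historyYears + 1; i >= 2; i--)
            names.Add((runDate.Year - i).ToString());

        // Monthly partitions from January of the previous year up to April of the
        // current year, e.g. 202301 ... 202312, 202401 ... 202404.
        var month = new DateTime(runDate.Year - 1, 1, 1);
        var end = new DateTime(runDate.Year, 4, 1);
        while (month <= end)
        {
            names.Add(month.ToString("yyyyMM"));
            month = month.AddMonths(1);
        }

        // Anything the model contains that is not in this list is a candidate for deletion.
        return names;
    }
}

Comparing this list against the table's existing partitions then tells the function which partitions to create, which to merge into a yearly partition, and which to delete.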

Automate Partition Refresh with a Pipeline: Establish a pipeline designed to trigger the Azure function on the 1st of April of every year, aligning with the business logic.

Partition Logical flow:

[Diagram: Partition logical flow]

Step-by-Step implementation:

  • From the home dashboard, search for and select the Function App service. Enter all necessary details, review the configuration, and click ‘Create’.

    [Screenshot: Create Function App]

  • Configure the function’s runtime settings.

    [Screenshot: Create Function App]

  • Check that the required DLLs are in place

    [Screenshot: Required DLLs]

  • Navigate to Development Tools -> Advanced Tools -> Go

    [Screenshots: Advanced Tools and Kudu Plus]

  • In the Kudu console, go to CMD -> site -> wwwroot -> <new function name> -> bin and paste all the DLLs

    [Screenshot: DLLs]

  • The primary coding work, including the creation of partitions, is done in the run.csx file. This is where you’ll write the C# code.

    [Screenshot: pbi-dataset-parttition]

  • The Partitions should be as below:

    [Screenshots: Fact Sale partitions and pbi-dataset-parttition]
    Input body to the function:

    {
        "connectionstring":"Connection details",
        "datasource": "datasource name",
        "workspace": "workspace name",
        "dataset": "dataset name",
        "table": "Table name",
        "partition": "Y",
        "sourceschema": "Schema name",
        "sourceobject": "Source object Name table or view name",
        "partitionstatement": "Year",
        "history": "2"
    }
    

Refresh the selected partition using Azure Pipeline:

[Diagram: Azure Pipeline]

  • Create an Azure pipeline that uses Web activities to call the REST API refresh method on the Power BI model.
  • The first step for using the pipeline is to have an app registered with access to the Power BI workspace and model.
  • Then, with the app, get the AAD token for authentication.
  • With the AAD token, use the built-in refresh POST method of the REST API to refresh the required table and partitions (a minimal C# sketch of these REST calls appears after the activity descriptions below).
  • To make the pipeline wait until the refresh is complete, use the built-in refreshes GET method of the REST API. Implementing the GET method within the pipeline to monitor the refresh status ensures the process completes successfully.
  • The pipeline is built in a modular way, where the workspace ID, dataset ID, table name, and partition name are passed in as parameters.
  • The pipeline can trigger any model refresh, as long as the app used in the pipeline has access to the model and workspace.
    [Diagram: Pipeline Flow]

    What each activity in the pipeline does:
    • Get Secret from AKV: This block accepts the Key Vault URL and the name of the secret that holds the app secret used to access Power BI. Its output is the secret value.
    • Get AAD Token: This block accepts the tenant ID, the app ID, and the output of Get Secret from AKV, and returns a token that enables access to the Power BI model.
    • Get Dataset Refresh: This block accepts the workspace ID, dataset ID, request body, and the token from the previous block, and triggers the refresh of the tables and partitions passed in the body. It follows the POST method.

    Until Refresh Complete:

    • Wait: To ensure the refresh completes, this block checks every 10 seconds.
    • Get Dataset: This involves inputting the workspace ID, dataset ID, and request body, following the GET method. The output is a list of ongoing refreshes on the model.
    • Set Dataset: Assigns the output of the previous block to a variable. The loop runs until this variable is no longer equal to 'Unknown'.
    • If Condition: This step checks if the refresh process has failed. If so, the pipeline’s execution is considered unsuccessful.
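For readers who prefer code to pipeline JSON, here is a minimal C# sketch of the refresh call the pipeline wraps, using the enhanced (asynchronous) refresh endpoint to target specific partitions. The workspace ID, dataset ID, table, and partition names are placeholders, and the AAD token is assumed to have been acquired already, exactly as the pipeline does via the registered app and Key Vault secret:

using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;
using System.Threading.Tasks;

class DatasetRefreshClient
{
    // Triggers an asynchronous (enhanced) refresh of specific partitions.
    public static async Task TriggerPartitionRefreshAsync(string aadToken, string workspaceId, string datasetId)
    {
        using var client = new HttpClient();
        client.DefaultRequestHeaders.Authorization = new AuthenticationHeaderValue("Bearer", aadToken);

        // Refresh only the listed partitions of the listed table.
        string body = @"{
            ""type"": ""full"",
            ""objects"": [
                { ""table"": ""Fact Sale"", ""partition"": ""202402"" },
                { ""table"": ""Fact Sale"", ""partition"": ""202403"" }
            ]
        }";

        string url = $"https://api.powerbi.com/v1.0/myorg/groups/{workspaceId}/datasets/{datasetId}/refreshes";
        HttpResponseMessage response =
            await client.PostAsync(url, new StringContent(body, Encoding.UTF8, "application/json"));

        // 202 Accepted: the refresh runs asynchronously, so the caller (or the pipeline's
        // Until loop) polls the corresponding GET endpoint until the refresh completes.
        response.EnsureSuccessStatusCode();
    }
}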

Refresh the selected partition using Azure Function:

  • Follow the same steps 1-6 as above to create the Azure Function for the refresh.
  • In the Code + Test pane, add the C# code shared in the GitHub repository (a minimal sketch of the refresh logic appears after the input body below).

[Screenshot: Model Table Refresh]

Input body to the function:

{
    "workspace": "Workspace Name",
    "dataset": "Semantic Model Name",
    "tables": [
        {
            "name": "Table1RefreshallPartitions",
            "refreshPartitions": false
        },
        {
            "name": "Table2Refreshselectedpartitions",
            "refreshPartitions": true,
            "partitions": [
                "202402",
                "202403",
                "202404"
            ]
        }
    ]
}
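Inside the function, the refresh itself can again be issued through the TOM library rather than the REST API. As a rough sketch, with placeholder names taken from the input body above (the complete code is in the GitHub repository):

using Microsoft.AnalysisServices.Tabular;

// Placeholder names - in the function they come from the JSON body shown above.
var server = new Server();
server.Connect("Data Source=powerbi://api.powerbi.com/v1.0/myorg/Workspace Name");

Database database = server.Databases.GetByName("Semantic Model Name");
Table table = database.Model.Tables.Find("Table2Refreshselectedpartitions");

foreach (string partitionName in new[] { "202402", "202403", "202404" })
{
    Partition partition = table.Partitions.Find(partitionName);
    partition?.RequestRefresh(RefreshType.Full);   // queue a refresh for each selected partition
}

database.Model.SaveChanges();                      // SaveChanges executes the queued refresh requests
server.Disconnect();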

Both Incremental Refresh and Custom Partitions in Power BI are essential for efficiently managing data susceptible to change within a large fact table. They allow you to optimize resource utilization, reduce unnecessary processing, and maintain control over partition design to align with your business needs. By combining these features, you can overcome the limitations of each approach and ensure a streamlined and effective data management solution.

References:

https://www.tackytech.blog/how-to-automate-the-management-of-custom-partitions-for-power-bi-datasets/

Note: The following GitHub link contains the Azure Function code, the request bodies passed to the functions, and the pipeline JSON files. Copy the JSON file inside the pipeline folder into an ADF pipeline, renaming the pipeline as mentioned in the file, to recreate the pipeline.

https://github.com/TirumalaBabu2000/Incremental_Refresh_and_Custom_Partition_Pipeline.git

The post Advanced Data Strategies in Power BI: A Guide to Integrating Custom Partitions with Incremental Refresh appeared first on Tiger Analytics.

]]>
Data-Driven Disruption? How Analytics is Shifting Gears in the Auto Market https://www.tigeranalytics.com/perspectives/blog/data-analytics-led-disruption-boon-automotive-market/ https://www.tigeranalytics.com/perspectives/blog/data-analytics-led-disruption-boon-automotive-market/#comments Thu, 24 Mar 2022 12:43:31 +0000 https://www.tigeranalytics.com/?p=7314 The presence of legacy systems, regulatory compliance issues and sudden growth of the BEV/PHEV market are all challenges the automotive industry must face. Explore how Analytics can help future-proof their growth plans.

The post Data-Driven Disruption? How Analytics is Shifting Gears in the Auto Market appeared first on Tiger Analytics.

]]>
In an age when data dictates decision-making, from cubicles to boardrooms, many auto dealers worldwide continue to draw insights from past experiences. However, the automotive market is ripe with opportunities to leverage data science to improve operational efficiency, workforce productivity, and consequently – customer loyalty.

Data challenges faced by automotive dealers

There are many reasons why auto dealers still struggle to collect and use data. The biggest one is the presence of legacy systems that bring entangled processes with disparate data touchpoints. This makes it difficult to consolidate information and extract clean, structured data – especially when there are multiple repositories. More importantly, dealers are unable to derive and harness actionable insights to improve their decision-making, and end up relying on gut instinct instead.

In addition, the sudden growth of the BEV/PHEV market has proven to complicate matters – with increasing pressure on regulatory compliance.

But the reality is that future-ready data management is a must-have strategy – not just to thrive but even to survive today’s automotive market. The OEMs are applying market pressure on one side of the spectrum – expecting more cost-effective vehicle pricing models to establish footprints in smaller or hyper-competitive markets. On the other side, modern customers are making it abundantly clear that they will no longer tolerate broken, inefficient, or repetitive experiences. And if you have brands operating in different parts of the world, data management can be a nightmarishly time-consuming and complex journey.

Future-proofing the data management strategy

Now, it’s easier said than done for the automotive players to go all-in on adopting a company-wide data mindset. It is pertinent to create an incremental data-driven approach to digital transformation that looks to modernize in phases. Walking away from legacy systems with entangled databases means that you must be assured of hassle-free deployment and scalability. It can greatly help to prioritize which markets/OEMs/geographies you want to target first, with data science by your side.

Hence, the initial step is to assess the current gaps and challenges to have a clear picture of what needs to be fixed on priority and where to go from thereon. Another key step in the early phase should be to bring in the right skill sets to build a future-proofed infrastructure and start streamlining the overall flow of data.

It is also important to establish a CoE model to globalize data management from day zero. In the process, a scalable data pipeline should be built to consolidate information from all touchpoints across all markets and geographies. This is a practical way to ensure that you have an integrated source of truth that churns out actionable insights based on clean data.

You also need to create a roadmap so that key use cases can be detected with specific markets identified for initial deployment. But first, you must be aware of the measurable benefits that can be unlocked by tapping into the power of data.

  • Better lead scoring: Identify the leads most likely to purchase a vehicle and ensure targeted messaging.
  • Smarter churn prediction: Identify aftersales customers with high churn propensity and send tactical offers.
  • Accurate demand forecasting: Reduce inventory days, avoid out-of-stock items, and minimize promotional costs.
  • After-sales engagement: Engage customers even after the initial servicing warranty is over regarding repairs, upgrades, etc., as well as an effective parts pricing strategy.
  • Sales promo assessment: Analyze historical sales data, seasonality/trends, competitors, etc., to recommend the best-fit promo.
  • Personalized customer engagement: Customize interactions with customers based on data-rich actionable intelligence instead of unreliable human instincts.

How we helped Inchcape disrupt the automotive industry

When Tiger Analytics began the journey with Inchcape, a leading global automotive distributor, we knew that it was going to disrupt how the industry tapped into data. Fast-forward to a year later, we were thrilled to recently take home Microsoft’s ‘Partner of the Year 2021’ award in the Data & AI category. What started as a small-scale project grew into one of the largest APAC-based AI and Advanced Analytics projects. We believe that this project has been a milestone moment for the automotive industry at large. If you’re interested in finding out how our approach raised the bar in a market notorious for low data adoption, please read our full case study.

The post Data-Driven Disruption? How Analytics is Shifting Gears in the Auto Market appeared first on Tiger Analytics.

]]>
https://www.tigeranalytics.com/perspectives/blog/data-analytics-led-disruption-boon-automotive-market/feed/ 1
When Opportunity Calls: Unlocking the Power of Analytics in the BPO Industry https://www.tigeranalytics.com/perspectives/blog/role-data-analytics-bpo-industry/ https://www.tigeranalytics.com/perspectives/blog/role-data-analytics-bpo-industry/#comments Thu, 27 Jan 2022 10:26:37 +0000 https://www.tigeranalytics.com/?p=6933 The BPO industry has embraced analytics to optimize profitability, efficiency, and customer satisfaction. This blog delves into the specifics of data utilization, unique challenges, and key business areas where analytics can make a difference.

The post When Opportunity Calls: Unlocking the Power of Analytics in the BPO Industry appeared first on Tiger Analytics.

]]>
Around 1981, the term outsourcing entered our lexicons. Two decades later, we had the BPO boom in India, China, and the Philippines with every street corner magically sprouting call centers. Now, in 2022, the industry is transitioning into an era of analytics, aiming to harness its sea of data for profitability, efficiency, and improved customer experience.

In this blog, we delve into details of what this data is, the unique challenges it poses, and the key business areas that can benefit from the use of analytics. We also share our experiences in developing these tools and how they have helped our clients in the BPO industry.

The Information Ocean

The interaction between BPO agents and customers generates huge volumes of both structured and unstructured (text, audio) data. On the one hand, you have the call data that measures metrics such as the number of incoming calls, time taken to address issues, service levels, and the ratio of handled vs abandoned calls. On the other hand, you have customer data measuring satisfaction levels and sentiment.

Insights from this data can help deliver significant value for your business – whether it's improved call resolution, reduced call time & volume, agent & customer satisfaction, operational cost reduction, growth opportunities through cross-selling & upselling, or increased customer delight.
The trick is to find the balance between demand (customer calls) and supply (agents). An imbalance can lead to revenue losses and unnecessary costs – a dynamic that needs to be supported by the right processes and technology.

Challenges of Handling Data

When you are handling such large volumes of data, the challenges can be myriad.
Our clients wage a daily battle with managing these vast volumes, harmonizing internal and external data, and driving value from them. For those that have already embarked on their analytical journey, the primary goals are finding the relevance of what they built, driving scalability, and leveraging new-age predictive tools to drive ROI.

Delivering Business Value

Based on our experience, the business value delivered from advanced Analytics in the BPO industry is unquestionable, exhaustive and primarily influences these key aspects:

1) Call Management

Planning agent resources based on demand (peak and off-peak) and skillsets, while accounting for how long agents take to resolve issues, can significantly impact business costs. AI can help automate this process to optimize costs. We have built an automated, real-time scheduling and resource optimization tool that has helped one of our BPO clients reduce costs by 15%.

2) Customer Experience

Call center analytics give agents access to critical data and insights to work faster and smarter, improve customer relationships and drive growth. Analytics can help understand the past behavior of a customer/similar customers and recommend products or services that will be most relevant, instead of generic offers. It can also predict which customers are likely to need proactive management. Our real-time cross-selling analytics has led to a 20% increase in revenue.

3) Issue Resolution

First-call resolution refers to the percentage of cases that are resolved during the first call between the customer and the call center. Analytics can help automate the categorization process of contact center data by building a predictive model. This can help with a better customer servicing model achieved by appropriately capturing the nuances of customer chats with contact centers. This metric is extremely important as it helps in reducing the customer churn rate.

4) Agent Performance

Analytics on call-center agents can assist in segmenting those who had a low-resolution rate or were spending too much time on minor issues, compared with top-performing agents. This helps the call center resolve gaps or systemic issues, identify agents with leadership potential, and create a developmental plan to reduce attrition and increase productivity.

5) Call Routing

Analytics-based call routing is based on the premise that records of a customer’s call history or demographic profile can provide insight into which call center agent(s) has the right personality, conversational style, or combination of other soft skills to best meet their needs.

6) Speech Analytics

Detecting trends in customer interactions and analyzing audio patterns to read emotions and stress in a speaker's voice can help reduce customer churn, boost contact center productivity, improve agent performance, and reduce costs by 25%. Our tools have helped clients predict member dissatisfaction, achieving a 10% reduction in first complaints and a 20% reduction in repeat complaints.

7) Chatbots and Automation

Thanks to the wonders of automation, we can now enhance the user experience by providing personalized attention to customers, available 24/7/365. Reduced average call duration and wage costs improve profitability. Self-service channels such as the help center, FAQ page, and customer portals empower customers to resolve simple issues on their own while deflecting more cases for the company. Our AI-enabled chatbots helped strengthen engagement and deliver quicker resolutions for 80% of user queries.

Lessons from The Philippines

Recently, in collaboration with Microsoft, we conducted a six-week Data & Analytics Assessment for a technology-enabled outsourcing firm in the Philippines. The client was encumbered by complex ETL processes, resource bottlenecks on legacy servers, and a lack of UI for troubleshooting leading to delays in resolution and latency issues. They engaged Tiger Analytics to assess their data landscape.

We recommended an Enterprise Data Warehouse modernization approach to deliver improved scalability & elasticity, strengthened data governance & security, and improved operational efficiency.

We did an in-depth assessment to understand the client’s ecosystem, key challenges faced, data sources, and their current state architecture. Through interactions with IT and business stakeholders, we built a roadmap for a future state data infrastructure that would enable efficiency, scalability, and modernization. We also built a strategic roadmap of 20+ analytics use cases with potential ROI across HR and contact center functions.

The New Era

Today, the Philippines is recognized as the BPO capital of the world. Competition will toughen, from both new players and existing ones. A digital transformation is underway in the BPO industry. Success in this competitive space lies with companies that can turn the huge volumes of data they hold into meaningful and actionable change.

The post When Opportunity Calls: Unlocking the Power of Analytics in the BPO Industry appeared first on Tiger Analytics.

]]>
https://www.tigeranalytics.com/perspectives/blog/role-data-analytics-bpo-industry/feed/ 187
CECL Loss Forecasting: Navigating Challenges and Seizing Opportunities https://www.tigeranalytics.com/perspectives/blog/cecl-loss-forecasting-navigating-challenges-and-seizing-opportunities/ Thu, 11 Jun 2020 18:58:56 +0000 https://www.tigeranalytics.com/blog/cecl-loss-forecasting-navigating-challenges-and-seizing-opportunities/ Address the complexities of CECL (Current Expected Credit Loss) forecasting, including the hurdles financial institutions face in implementing CECL standards. Understand the opportunities for improving risk assessment and decision-making processes with best practices for overcoming challenges to enhance financial stability and compliance.

The post CECL Loss Forecasting: Navigating Challenges and Seizing Opportunities appeared first on Tiger Analytics.

]]>
Introduction

One of the key lessons the financial world took away from the Great Recession of 2008 was that the existing, traditional GAAP approach of incurred-loss methodology for recognizing credit losses needed a rehaul – one that helps Financial Institutions (FIs) maintain a sufficient loss reserve to ensure smooth sailing (or, at the very least, avoid sinking) in times of financial crisis.

In this two-part series, we will explore various aspects of Current Expected Credit Loss (or “CECL”) – a recently implemented vital component of the current Loss Forecasting approach for FIs in the US.

In the first part, we will introduce the basics of CECL, the challenges it poses – especially for unsecured revolving portfolios such as credit cards, and discuss approaches for implementing CECL.

The second part will focus on how we developed and implemented CECL into the loss forecasting model for the credit card portfolio in a mid-sized US bank.

The American Bankers Association has called CECL "the most sweeping change to bank accounting ever" with good reason – the implications are huge. This change is expected to lead to an increase of 35-50% in loss reserve levels. While CECL was introduced by the FASB in 2016, it was to be implemented from January 2020 for large publicly traded banks and January 2023 for smaller reporting companies. However, in light of the COVID-19 pandemic, the stimulus package signed by US President Donald Trump on 27th March allows lenders to delay implementing CECL till 31st December or till health officials declare the end of the national emergency (whichever comes first).

The road to CECL

Even as the broader economic conditions started worsening in 2006, the loss forecasts by lenders raised no red flags. Banks are no strangers to risks or losses, and all of them had established loss forecasting models in place. Why, then, did their balance sheets fail to show any indication of the upcoming calamity? This can be attributed to the delayed identification of credit losses. Under the traditional method, FIs delayed recognition until it was probable that a loss had been incurred. This model restricted FIs’ ability to record credit losses that are expected, but do not yet meet the “probable” threshold. Under the new paradigm, FIs could then make estimates of “expected credit losses” using forward-looking information.

This led to the Comprehensive Capital Analysis and Review (CCAR) test being introduced by the Federal Reserve in 2011 for Bank Holding Companies with total assets of $50 billion or more. Composed of multiple stress tests, CCAR serves to ensure that banks have enough reserve capital in place to protect themselves in adverse financial conditions or market downturns.

However, this still was based on incurred loss. Cue: CECL. Introduced by the Financial Accounting Standards Board (FASB) in 2016, CECL, or Current Expected Credit Loss, transforms how FIs estimate losses. Under CECL, the expected lifetime credit loss (potential losses over the entire life of the loan) needs to be estimated at the time of origination. The value of the reserve needs to be based on the past, existing, and future information, taking into account the inherent risk for each loan as well as the economic factors.

Challenges

While CECL does promise to provide a more accurate estimate of expected loss, it comes with a set of implementation challenges:

  • CECL requires the credit loss allowance to be established for the entire foreseeable life of the loan, at the time of loan origination. This necessitates changes to the underlying credit loss models that currently estimate potential losses over a 12-month horizon.
  • The Life of Loan forecast is not straightforward, especially for revolving lines of credit, like credit cards – where borrowers have the option to make minimum, partial or full payments at any given point in the monthly billing cycle. For products like mortgages, banks must keep in mind the possibilities of loan modifications and prepayments.
  • Additional analysis is needed for economic risk factors – to assess their potential impact on credit losses better. The macroeconomic forecasts cannot merely be based on deviations in the past. CECL requires additional data granularity for macroeconomic variables, like real estate prices and unemployment rates.
  • Since the FASB has not specified guidelines on the calculation methodology, banks need increased transparency with regard to the modeling assumptions. The justifications provided will have to be quantitatively supportable.
  • The models will be audited internally and externally by federal supervisory bodies for CECL standards, and the level of disclosure will increase substantially. This necessitates comprehensive documentation that captures the entire process.
  • In addition to this come the structural challenges – the CECL compliant forecasts need to be integrated with the representation of losses in the general ledger, the firm’s prevalent accounting practices, and financial reporting systems.

CECL modeling approaches

Since CECL poses unique challenges for portfolios such as credit cards, we examine here the different approaches one can take to tackle this head-on. Both bottom-up and top-down regression approaches will help meet the CECL requirements.

Let’s discuss these approaches in detail:

I. Bottom-Up Approach:

This is used for building models for the most relevant segments. The credit loss model is built at the loan-level portfolio, using granular, geography-specific macroeconomic indicators. The granular data provides a better estimate for more complex portfolios, evaluating the changes over time on loan attributes and the performance for each account.

II. Top-Down Approach:

A vintage-based approach can be utilized for portfolios or segments with well-defined delinquency behavior. Typically, these segments (such as deceased accounts, transactors, etc.) have a short life of loan, so loan-level data may not possess sufficient discriminatory power. Overall segment-level balance information can be used for these segments.

Bottom-up loss forecasting approaches have gained increased favor in recent years. Our experience with both methods inclines us to believe that the added value from the extra complexity outweighs the drawbacks.

Segmentation analysis, a crucial part of CECL models, needs to be performed separately for each of the dependent events, i.e., charge-off and prepay. The initial segments may reflect the delinquency status, transition status, payment activity, and tenure at observation points. The segment distribution can be examined at different snapshots to ensure that the segments are stable across time. A loan-level "bottom-up" modeling approach can be utilized for the major segments.

Life of Loan Estimation

CECL standard defines lifetime as the life of the balance as of the evaluation date (not the account opening date). According to the Transition Resource Group for Credit Losses (TRG), unused credit lines for a credit card can be treated as off-balance-sheet credit exposures that are unconditionally cancellable by the issuer. Since CECL does not require an allowance for available credit, it is not to be evaluated; determining the life of loan, thus, is the critical factor in the expected loss estimation. TRG discussed two views to determine the ‘estimated life’ of a credit card receivable, first-in-first-out (FIFO), and CARD Act method. Apart from this, there is a sophisticated hybrid approach combining both. Subsequently, TRG concluded that both methods are applicable, though FIFO is easier to implement. The amount of expected future payments estimated can be either of the following:

 View A: All payments expected to be collected from the borrower
 
 View B: Only a portion of payments expected to be collected from the borrower.

After reviewing various scenarios, considering consistency, simplicity, and justifiability, we are inclined towards using View A and FIFO methodology for future payments. This approach sets the path for effective loss calculation. A detailed monthly paydown curve is created to study the life of loan for an account.

Components of Loss Forecasting Models

CECL is calculated using a multiplicative framework. This requires the estimation of the following parameters describing the risk exposure.

i) Probability of Default (PD) – gives the average percentage of accounts that experience a default event.

The performance duration is defined as the status of the account until one of the following “terminal events” happens after any given observation month.

1. Charged-off – Where the default is 180+ days past due

2. Charged-off due to bankruptcy

3. The balance is paid in full – the account has prepaid the outstanding balance at the time of evaluation

The terminal events are modeled using a competing-risks hazard framework. We advocate estimating the polychotomous logistic regression parameters using individualized regressions.

ii) Loss given default (LGD) – the percentage of monetary exposure if the borrower defaults

For an unsecured portfolio, the LGD can be safely taken as 100% for all the accounts, or other values based upon historical loss experience.

iii) Exposure at Default (EAD)- the outstanding amount (drawn amounts) in case the borrower defaults.

EAD can be defined based on balance paydown curves using the FIFO payment hierarchy.

To obtain the final Expected Credit Loss (ECL) for an account, the following equation is used:

ECL = PD × LGD × EAD
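As a purely illustrative numerical check (the component values below are hypothetical placeholders, not calibrated estimates), the account-level calculation is a straightforward multiplication:

// Hypothetical values for a single credit card account - for illustration only.
double pd  = 0.04;     // 4% lifetime probability of default for the account's segment
double lgd = 1.00;     // unsecured portfolio, loss given default assumed at 100%
double ead = 2500.00;  // expected drawn balance at default, taken from the paydown curve

double ecl = pd * lgd * ead;   // 0.04 x 1.00 x 2,500 = 100 reserved for this account

Summing this account-level ECL across the portfolio (or segment) gives the total allowance to be reserved.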

Conclusion

On balance, we have observed that the implementation of CECL standards for credit card portfolios is very complex compared to the previously used incurred-loss methodology. At the same time, the added value from the extra complexity outweighs the drawbacks. This new regulation prepares banks for any adverse financial situations which could pop up in the future. This is best observed in the early adopters of CECL, who appear to be better positioned to weather the headwinds brought on by the global slowdown due to the pandemic.

This is the first part of a two-part series on CECL. In the next part of the blog [CECL modeling & Implementation – Practical approaches], we will discuss how we assisted a midsize bank in the US to comply with CECL regulatory requirements.

To know more about how we can support your CECL journey and to schedule a 1:1 discussion with our specialists, send an email to info@tigeranalytics.com

References

https://www.fasb.org/jsp/FASB/Document_C/DocumentPage&cid=1176169103350
https://www.fasb.org/jsp/FASB/FASBContent_C/ActionAlertPage&cid=1176164349105
https://www.aba.com/advocacy/our-issues/implementation-challenges

The post CECL Loss Forecasting: Navigating Challenges and Seizing Opportunities appeared first on Tiger Analytics.

]]>
Managing Parallel Threads: Techniques with Apache NiFi https://www.tigeranalytics.com/perspectives/blog/managing-parallel-threads-techniques-with-apache-nifi/ Thu, 24 Oct 2019 16:33:39 +0000 https://www.tigeranalytics.com/blog/managing-parallel-threads-techniques-with-apache-nifi/ Control parallel thread execution via Apache NiFi for efficient data flow management using new methods to optimize performance, handle concurrent tasks, and ensure system stability. Be equipped to enhance your data processing workflows with advanced threading strategies.

The post Managing Parallel Threads: Techniques with Apache NiFi appeared first on Tiger Analytics.

]]>
As a data engineering enthusiast, you must be aware that Apache NiFi is designed to automate the data flow between multiple software systems. NiFi makes it possible to understand quickly the various dataflow operations that would otherwise take a significant amount of time.

In this blog, we deal with a specific problem encountered while dealing with NiFi. It has a feature to control the number of concurrent threads at an individual processor level. But there is no direct approach to control the number of concurrent threads at a process group level. We provide you with an approach to help resolve this challenge.

A Look at the Existing Concurrency Feature in NiFi

As mentioned, NiFi provides concurrency at an individual processor level. For most processors, this option is available in the Scheduling tab as "Concurrent Tasks". At the same time, there are certain types of single-threaded processors that do not allow concurrency.

Concurrency set to 5 on Processor Level

This option allows the processor to run concurrent threads by using system resources at a higher priority compared to other processors. In addition to this processor-level concurrency setting, NiFi has global maximum timer-driven and event-driven thread settings, with default values of 10 and 5 respectively. These control the maximum number of threads NiFi can request from the server to fulfil concurrent task requests from NiFi processor components. The global values can be adjusted in Controller Settings (located via the hamburger menu in the upper-right corner of the NiFi UI).

Controller Settings

NiFi sets the Max Timer Thread Counts relatively low to support operating on commodity hardware. This default setting can limit performance when there is a very large and high volume data flow that must perform a lot of concurrent processing. The general guidance for setting this value is two to four times the number of cores available to the hardware on which the NiFi service is running.

NOTE: The thread count configured within the NiFi UI applies to every node in a NiFi cluster. The cluster UI can be used to see how the total active threads are being used per node.

Custom Approach to Set the Concurrency at Processor-group Level

To customize and control the number of concurrent threads to be executed within a processor group, use the NiFi Wait and Notify processors. Using notify signals created by the Notify processor, the number of threads to be executed within a processor group can be controlled.

Here is a sample use case – create a NiFi flow that processes input flow files in batches. For example, process five inputs at a time within a processor group and always keep those five concurrent threads busy until the queue is drained. As soon as one or more flow files finish processing, the freed slots should be used for the queued input flow files. To explain further, if there are 40 tables to be processed and only five threads can run in parallel within the actual processor group, then initially 5 tables run concurrently, taking 5 threads from the system resources, and the remaining 35 tables get a chance only when one of the previously running threads completes.

Now, to set the concurrency, design two processor groups PG-1 and PG-2,

Thread Controlling Processor Group (PG-1): This is the core controller that manages the number of concurrent threads. It decides how many concurrent threads can run within the processor group PG-2.

Actual Processor Group (PG-2): This is the processor group that performs the functionality that we want to parallelize. For example, it can be a cleansing/transformation operation that runs on all input tables.

Mechanism to control the concurrency

Code base (XML file) for this NiFi template is available in GitHub — https://github.com/karthikeyan9475/Control-Number-of-Threads-In-Processor-Group-Level-in-NiFi

How does this Work?

As mentioned, this functionality is achieved using the Wait and Notify NiFi processors. PG-1 controls the number of flow files that get into PG-2 using gates (Wait processors), which are opened and closed by signals created by Notify processors. In this NiFi template, there are three gates, and you will see how they work below.

PG-1 triggers 3 flow files via the input port, and each of them performs certain actions.

1st flow file: It triggers the Notify processor to open Gate-1 (a Wait processor), allows 5 input flow files through (configurable), and then triggers the Notify processor to close Gate-1.

2nd flow file: It sends 5 input flow files to Gate-1, which was opened in the previous step, and they proceed to PG-2.

3rd flow file: It sends any remaining input flow files to Gate-2 (a Wait processor), which acts as a queue. Gate-2 receives a notification whenever one of the previously triggered flow files completes its processing in PG-2, and releases the remaining input files one at a time for each Notify signal it receives.

Initially, 5 flow files are released from Gate-1, leaving 35 flow files queued in Gate-2. When a flow file completes processing, a notification signal is sent to Gate-2, and each Notify signal releases one flow file from Gate-2. This way, only 5 concurrent threads are ever running for PG-2.

Screenshot of NiFi Flow that Shows Controls for Concurrency:

Complete flow to control the threads inside the actual processor group

Screenshot that Shows Concurrent Threads on Top of PG-2:

Concurrency in the actual processor group

Handling Signals Created by Last Few Inputs

Once the last few input flow files (36, 37, 38, 39 and 40 in the above example) are processed by PG-2, they still trigger signals to Gate-2 (the queue) even though there are no further input files left to process. These additional signals could lead to extra, unintended runs within PG-2 when a new set of inputs arrives. This is avoided using a special neutralization gate that by-passes these additional signals so they never reach Gate-2.

Enhancing this Solution to Multiple Applications

The above example covered just one requirement, i.e., processing various input tables that are received concurrently. What if, at an enterprise level, this cleansing process starts getting recognized and source tables from various applications are asked to be processed in this processor group PG-2?

Let us say, App 1 is for processing sales tables and App 2 is for processing finance tables. How do we achieve 5 concurrent threads maintained within PG-2 for each application?

All one would need is a small modification to the Wait/Notify release signal identifier: making the signal names attribute-driven instead of hardcoded. The property supports the NiFi Expression Language, so, as a hypothetical example, the release signal identifier could be set to an expression such as ${application.name}_gate so that each application gets its own gate. By making the release signal identifier attribute-driven, one can control threads within a processor group at an application-level granularity.

Attribute driven signal identifier in Wait Processor Properties

By default, the expiration duration is 10 minutes, which means queued flow files will wait in the queue for a Notify signal for up to 10 minutes and will then expire. The expiration duration needs to be configured (increased) to avoid flow files expiring while they wait.

Attribute-driven signal identifier in Notify Processor Properties

Debug vs Actual Execution Mode:

During development, there will be situations where signal files are simulated using GenerateFlowFile processors, and that can leave orphan signals waiting for processing when there are no input flow files. In a subsequent run, input flow files may then get processed using orphan signals left over from the debugging stage.

Debugging

Debugging

In the above scenario, if any of the flow files go into the success queue, there are unused signals from the Notify processor that were mistakenly stored in the Wait processor during development or testing. In the above image, it is clear that 18 Notify signals were triggered by mistake. To make sure the Wait processor is working as expected, all the simulated flow files should go to the wait queue, not the success queue. If this debugging step is skipped, it may lead to a higher number of parallel threads running inside the actual processor group than expected.

Conclusion

By using Wait/Notify processors, the number of concurrent threads can be controlled inside the Processor Group in NiFi. This will help one to build a complex data flow pipeline with controlled concurrent threads. As you will certainly know by now, Apache NiFi is a powerful tool for end-to-end data management. It can be used easily and quickly to build advanced applications with flexible architectures and advanced features.

You can read about real-time NiFi Alerts and log processing here.

The post Managing Parallel Threads: Techniques with Apache NiFi appeared first on Tiger Analytics.

]]>